never vice versa (if a worse solution is allowed then so is a better
one).
+If an object (cluster, node group or instance) carries multiple
+``ganeti:watcher:autorepair:<type>`` tags, the least destructive tag
+takes precedence. When tags are set on objects at different levels, the
+tag nearest to the instance wins. For example, in a cluster with two
+instances, *I1* and *I2*, where *I1* has ``failover`` and the cluster
+itself has both ``fix-storage`` and ``reinstall``, *I1* will end up
+with ``failover`` and *I2* with ``fix-storage``.
+
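The precedence rule above can be sketched in a few lines of Python. This
is an illustration only, not the implementation: the helper names are
hypothetical, and the ordering of repair types (least destructive first,
including ``migrate``) is an assumption taken from the wider design.

```python
PREFIX = "ganeti:watcher:autorepair:"

# Assumed ordering of repair types, least destructive first.
REPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]


def least_destructive(tags):
    """Return the least destructive repair type among one object's tags.

    Tags that are not ``autorepair:<type>`` tags (for example suspend,
    pending or result tags) are ignored. Returns None if none is found.
    """
    found = [t[len(PREFIX):] for t in tags if t.startswith(PREFIX)]
    found = [t for t in found if t in REPAIR_TYPES]
    return min(found, key=REPAIR_TYPES.index) if found else None


def effective_type(instance_tags, group_tags, cluster_tags):
    """Nearest object carrying an autorepair type tag wins."""
    for tags in (instance_tags, group_tags, cluster_tags):
        chosen = least_destructive(tags)
        if chosen is not None:
            return chosen
    return None


# The example from the text: I1 has failover, the cluster has
# fix-storage and reinstall, I2 has no tags of its own.
cluster = [PREFIX + "fix-storage", PREFIX + "reinstall"]
i1 = effective_type([PREFIX + "failover"], [], cluster)
i2 = effective_type([], [], cluster)
```

Here ``i1`` resolves to ``failover`` and ``i2`` to ``fix-storage``.
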
ganeti:watcher:autorepair:suspend[:<timestamp>]
+++++++++++++++++++++++++++++++++++++++++++++++
instances matching a filter on some charateristic. But again, this
wouldn't be specific to this tag.
-ganeti:watcher:repair:pending:<type>:<id>:<timestamp>:<jobs>
-++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+If an object has multiple
+``ganeti:watcher:autorepair:suspend[:<timestamp>]`` tags, the form
+without timestamp takes precedence (permanent suspension); if all of
+them have a timestamp, the one with the highest timestamp wins. When
+tags are set on objects at different levels, the nearest tag wins, as
+above. This makes it possible to suspend cluster-enabled repairs with a
+single tag in the cluster object, or to suspend them only for a certain
+node group or instance. At the same time, it is possible to re-enable
+cluster-suspended repairs in a particular instance or group by applying
+an enable tag to them.
+
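A minimal sketch of the suspend-tag resolution described above. The
function names are hypothetical, timestamps are assumed to be integers
(for example seconds since the epoch), and the interaction with enable
tags is deliberately not modelled.

```python
SUSPEND = "ganeti:watcher:autorepair:suspend"


def suspension(tags):
    """Resolve the suspend tags of a single object.

    Returns "forever" if a tag without timestamp is present, the highest
    timestamp if only timestamped tags are present, and None otherwise.
    """
    stamps = []
    for tag in tags:
        if tag == SUSPEND:
            return "forever"  # permanent suspension always wins
        if tag.startswith(SUSPEND + ":"):
            stamps.append(int(tag[len(SUSPEND) + 1:]))
    return max(stamps) if stamps else None


def nearest_suspension(instance_tags, group_tags, cluster_tags):
    """Nearest object carrying a suspend tag wins."""
    for tags in (instance_tags, group_tags, cluster_tags):
        found = suspension(tags)
        if found is not None:
            return found
    return None
```

For instance, an instance tagged ``...:suspend:50`` in a cluster tagged
with a bare ``...:suspend`` resolves to ``50``, because the instance tag
is nearer.
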
+ganeti:watcher:autorepair:pending:<type>:<id>:<timestamp>:<jobs>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(instance)
If this tag is present a repair of type ``type`` is pending on the
the timestamp will never change for the same repair)
``jobs`` is the list of jobs already run or being run to repair the
-instance. If the instance has just been put in pending state but no job
-has run yet, this list is empty.
+instance (separated by a plus sign, *+*). If the instance has just
+been put in pending state but no job has run yet, this list is empty.
This tag will be set by ganeti if an equivalent autorepair tag is
present and a a repair is needed, or can be set by an external tool to
If multiple instances of this tag are present they will be handled in
order of timestamp.
-ganeti:watcher:repair:result:<type>:<id>:<timestamp>:<result>:<jobs>
-++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ganeti:watcher:autorepair:result:<type>:<id>:<timestamp>:<result>:<jobs>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(instance)
If this tag is present a repair of type ``type`` has been performed on
the instance and has been completed by ``timestamp``. The result is
-either ``success``, ``failure`` or ``enoperm``, and jobs is a comma
-separated list of jobs that were executed for this repair.
+either ``success``, ``failure`` or ``enoperm``, and jobs is a
+*+*-separated list of jobs that were executed for this repair.
An ``enoperm`` result is returned when the repair was brought on until
possible, but the repair type doesn't consent to proceed further.
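As an illustration of the two tag layouts, the following sketch splits a
``pending`` or ``result`` tag into its fields. The function name and the
returned dictionary are hypothetical, and it assumes that ``<id>`` and
``<timestamp>`` contain no colons; both are kept as plain strings.

```python
PREFIX = "ganeti:watcher:autorepair:"
RESULTS = ("success", "failure", "enoperm")


def parse_repair_tag(tag):
    """Parse a pending or result tag; return None for anything else."""
    if not tag.startswith(PREFIX):
        return None
    fields = tag[len(PREFIX):].split(":")
    kind = fields[0]
    if kind == "pending" and len(fields) == 5:
        _, rtype, rid, timestamp, jobs = fields
        result = None
    elif kind == "result" and len(fields) == 6:
        _, rtype, rid, timestamp, result, jobs = fields
        if result not in RESULTS:
            return None
    else:
        return None
    return {
        "kind": kind,
        "type": rtype,
        "id": rid,
        "timestamp": timestamp,
        "result": result,
        # Jobs are separated by a plus sign; an empty field means no jobs.
        "jobs": jobs.split("+") if jobs else [],
    }


pending = parse_repair_tag(PREFIX + "pending:failover:abc:1357000000:")
done = parse_repair_tag(PREFIX + "result:failover:abc:1357000500:success:12+13")
```

With these made-up values, ``pending["jobs"]`` is an empty list and
``done["jobs"]`` holds the two job identifiers.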
temporarily) to mark the instance as "not touch" when we think a human
needs to look at it. To be decided).
+A graph with the possible transitions follows. Note that in the graph,
+following the implementation, the two ``Needs repair`` states have been
+coalesced into one, and the ``Suspended`` state disappears because it
+becomes an attribute of the instance object (its auto-repair policy).
+
+.. digraph:: "auto-repair-states"
+
+ node [shape=circle, style=filled, fillcolor="#BEDEF1",
+ width=2, fixedsize=true];
+ healthy [label="Healthy"];
+ needsrep [label="Needs repair"];
+ pendrep [label="Pending repair"];
+ failed [label="Failed repair"];
+ disabled [label="(no state)", width=1.25];
+
+ {rank=same; needsrep}
+ {rank=same; healthy}
+ {rank=same; pendrep}
+ {rank=same; failed}
+ {rank=same; disabled}
+
+ // These nodes are needed to be the "origin" of the "initial state" arrows.
+ node [width=.5, label="", style=invis];
+ inih;
+ inin;
+ inip;
+ inif;
+ inix;
+
+ edge [fontsize=10, fontname="Arial Bold", fontcolor=blue]
+
+ inih -> healthy [label="No tags or\nresult:success"];
+ inip -> pendrep [label="Tag:\nautorepair:pending"];
+ inif -> failed [label="Tag:\nresult:failure"];
+ inix -> disabled [fontcolor=black, label="ArNotEnabled"];
+
+ edge [fontcolor="orange"];
+
+ healthy -> healthy [label="No problems\ndetected"];
+
+ healthy -> needsrep [
+ label="Brokenness\ndetected in\nfirst half of\nthe tool run"];
+
+ pendrep -> healthy [
+ label="All jobs\ncompleted\nsuccessfully /\ninstance healthy"];
+
+ pendrep -> failed [label="Some job(s)\nfailed"];
+
+ edge [fontcolor="red"];
+
+ needsrep -> pendrep [
+ label="Repair\nallowed and\ninitial job(s)\nsubmitted"];
+
+ needsrep -> needsrep [
+ label="Repairs suspended\n(no-op) or enabled\nbut not powerful enough\n(result: enoperm)"];
+
+ pendrep -> pendrep [label="More jobs\nsubmitted"];
+
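The edges of the graph can also be read as a transition table. The
sketch below mirrors them one to one; the state names reuse the node
identifiers from the graph, while the event names are hypothetical
labels invented for this example.

```python
# (state, event) -> next state, one entry per edge in the graph.
TRANSITIONS = {
    ("healthy", "no_problems"): "healthy",
    ("healthy", "brokenness_detected"): "needsrep",
    ("needsrep", "jobs_submitted"): "pendrep",
    ("needsrep", "suspended_or_enoperm"): "needsrep",
    ("pendrep", "more_jobs_submitted"): "pendrep",
    ("pendrep", "all_jobs_succeeded"): "healthy",
    ("pendrep", "some_job_failed"): "failed",
}


def step(state, event):
    """Return the next state, or raise ValueError if the graph has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError("no transition from %r on %r" % (state, event))


# Walk one successful repair: detection, submission, completion.
state = "healthy"
for event in ("brokenness_detected", "jobs_submitted", "all_jobs_succeeded"):
    state = step(state, event)
```

Note that ``failed`` has no outgoing edges in the graph, so any event
from that state raises ``ValueError`` in this sketch.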
+
Repair operation
----------------