X-Git-Url: https://code.grnet.gr/git/ganeti-local/blobdiff_plain/68640987f07ae5991b9b284f823629c402fa7a03..99c7cd5be025e86745aa46003ca0962609e0b4e2:/doc/design-autorepair.rst

diff --git a/doc/design-autorepair.rst b/doc/design-autorepair.rst
index 480979b..5ab446b 100644
--- a/doc/design-autorepair.rst
+++ b/doc/design-autorepair.rst
@@ -81,6 +81,14 @@ error condition that requires a more risky or drastic solution, but never
 vice versa (if a worse solution is allowed then so is a better one).
 
+If there are multiple ``ganeti:watcher:autorepair:`` tags in an
+object (cluster, node group or instance), the least destructive tag
+takes precedence. When multiplicity happens across objects, the nearest
+tag wins. For example, if in a cluster with two instances, *I1* and
+*I2*, *I1* has ``failover``, and the cluster itself has both
+``fix-storage`` and ``reinstall``, *I1* will end up with ``failover``
+and *I2* with ``fix-storage``.
+
 ganeti:watcher:autorepair:suspend[:]
 +++++++++++++++++++++++++++++++++++++++++++++++
 
@@ -102,8 +110,19 @@ It might also be useful to easily have an operation that tags all
 instances matching a filter on some characteristic. But again, this
 wouldn't be specific to this tag.
 
-ganeti:watcher:repair:pending::::
-++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+If there are multiple
+``ganeti:watcher:autorepair:suspend[:]`` tags in an object,
+the form without timestamp takes precedence (permanent suspension); or,
+if all object tags have a timestamp, the one with the highest timestamp.
+When multiplicity happens across objects, the nearest tag wins, as
+above. This makes it possible to suspend cluster-enabled repairs with a
+single tag in the cluster object; or to suspend them only for a certain
+node group or instance. At the same time, it is possible to re-enable
+cluster-suspended repairs in a particular instance or group by applying
+an enable tag to them.
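The precedence rules described in the two added paragraphs can be sketched in Python. This is only a minimal illustration under stated assumptions, not Ganeti's implementation: the function names are hypothetical, the ordering in ``SEVERITY`` (including the ``migrate`` type) is assumed, and timestamps are assumed to be plain integers.

```python
from typing import Iterable, List, Optional, Sequence, Union

PREFIX = "ganeti:watcher:autorepair:"
SUSPEND = PREFIX + "suspend"

# Assumed ordering of repair types, least destructive first.
SEVERITY = ["fix-storage", "migrate", "failover", "reinstall"]


def least_destructive(tags: Iterable[str]) -> Optional[str]:
    """Pick the least destructive repair type among one object's tags."""
    types = [t[len(PREFIX):] for t in tags if t.startswith(PREFIX)]
    # Ignore suspend/pending/result tags and anything else unknown.
    known = [t for t in types if t in SEVERITY]
    return min(known, key=SEVERITY.index) if known else None


def effective_repair_type(tag_sets: Sequence[Iterable[str]]) -> Optional[str]:
    """Resolve across objects; tag_sets is ordered nearest first
    (instance, then node group, then cluster)."""
    for tags in tag_sets:
        choice = least_destructive(tags)
        if choice is not None:
            return choice  # the nearest object carrying a tag wins
    return None


def suspension(tags: Iterable[str]) -> Union[str, int, None]:
    """Resolve multiple suspend tags on a single object.

    Returns "permanent" if a tag without timestamp is present, otherwise
    the highest timestamp, or None when there is no suspend tag.
    """
    stamps: List[int] = []
    for tag in tags:
        if tag == SUSPEND:
            return "permanent"
        if tag.startswith(SUSPEND + ":"):
            stamps.append(int(tag[len(SUSPEND) + 1:]))
    return max(stamps) if stamps else None


# The example from the text: I1 has failover, the cluster has both
# fix-storage and reinstall, and I2 has no tags of its own.
cluster = [PREFIX + "fix-storage", PREFIX + "reinstall"]
group: List[str] = []
i1 = [PREFIX + "failover"]
i2: List[str] = []

print(effective_repair_type([i1, group, cluster]))  # failover
print(effective_repair_type([i2, group, cluster]))  # fix-storage
```

Passing the tag lists nearest first keeps the "nearest tag wins" rule a simple first-match loop, while the "least destructive" rule is applied only within a single object.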
+
+ganeti:watcher:autorepair:pending::::
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 (instance)
 
 If this tag is present a repair of type ``type`` is pending on the
@@ -115,8 +134,8 @@ to this instance for this ``id`` (we will "update" the tag by adding a
 the timestamp will never change for the same repair)
 
 ``jobs`` is the list of jobs already run or being run to repair the
-instance. If the instance has just been put in pending state but no job
-has run yet, this list is empty.
+instance (separated by a plus sign, *+*). If the instance has just
+been put in pending state but no job has run yet, this list is empty.
 
 This tag will be set by ganeti if an equivalent autorepair tag is
 present and a repair is needed, or can be set by an external tool to
@@ -125,14 +144,14 @@ request a repair as a "once off".
 
 If multiple instances of this tag are present they will be handled in
 order of timestamp.
 
-ganeti:watcher:repair:result:::::
-++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ganeti:watcher:autorepair:result:::::
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 (instance)
 
 If this tag is present a repair of type ``type`` has been performed on
 the instance and has been completed by ``timestamp``. The result is
-either ``success``, ``failure`` or ``enoperm``, and jobs is a comma
-separated list of jobs that were executed for this repair.
+either ``success``, ``failure`` or ``enoperm``, and jobs is a
+*+*-separated list of jobs that were executed for this repair.
 
 An ``enoperm`` result is returned when the repair was brought on until
 possible, but the repair type doesn't consent to proceed further.
@@ -252,6 +271,65 @@ and safe to turn back to the normal autorepair system.
 temporarily) to mark the instance as "not touch" when we think a human
 needs to look at it. To be decided).
 
+A graph with the possible transitions follows; note that in the graph,
+following the implementation, the two ``Needs repair`` states have been
+coalesced into one; and the ``Suspended`` state disappears, for it
+becomes an attribute of the instance object (its auto-repair policy).
+
+.. digraph:: "auto-repair-states"
+
+  node     [shape=circle, style=filled, fillcolor="#BEDEF1",
+            width=2, fixedsize=true];
+  healthy  [label="Healthy"];
+  needsrep [label="Needs repair"];
+  pendrep  [label="Pending repair"];
+  failed   [label="Failed repair"];
+  disabled [label="(no state)", width=1.25];
+
+  {rank=same; needsrep}
+  {rank=same; healthy}
+  {rank=same; pendrep}
+  {rank=same; failed}
+  {rank=same; disabled}
+
+  // These nodes are needed to be the "origin" of the "initial state" arrows.
+  node [width=.5, label="", style=invis];
+  inih;
+  inin;
+  inip;
+  inif;
+  inix;
+
+  edge [fontsize=10, fontname="Arial Bold", fontcolor=blue]
+
+  inih -> healthy  [label="No tags or\nresult:success"];
+  inip -> pendrep  [label="Tag:\nautorepair:pending"];
+  inif -> failed   [label="Tag:\nresult:failure"];
+  inix -> disabled [fontcolor=black, label="ArNotEnabled"];
+
+  edge [fontcolor="orange"];
+
+  healthy -> healthy [label="No problems\ndetected"];
+
+  healthy -> needsrep [
+             label="Brokenness\ndetected in\nfirst half of\nthe tool run"];
+
+  pendrep -> healthy [
+             label="All jobs\ncompleted\nsuccessfully /\ninstance healthy"];
+
+  pendrep -> failed [label="Some job(s)\nfailed"];
+
+  edge [fontcolor="red"];
+
+  needsrep -> pendrep [
+              label="Repair\nallowed and\ninitial job(s)\nsubmitted"];
+
+  needsrep -> needsrep [
+              label="Repairs suspended\n(no-op) or enabled\nbut not powerful enough\n(result: enoperm)"];
+
+  pendrep -> pendrep [label="More jobs\nsubmitted"];
+
+
 Repair operation
 ----------------