====================
Instance auto-repair
====================

.. contents:: :depth: 4

This is a design document detailing the implementation of self-repair and
recreation of instances in Ganeti. It also discusses ideas that might be useful
for future self-repair situations.

Current state and shortcomings
==============================

Ganeti currently doesn't do any sort of self-repair or self-recreate of
instances:

- If a drbd instance is broken (its primary or secondary nodes go
  offline or need to be drained) an admin or an external tool must fail
  it over if necessary, and then trigger a disk replacement.
- If a plain instance is broken (or both nodes of a drbd instance are)
  an admin or an external tool must recreate its disk and reinstall it.

Moreover, in an oversubscribed cluster the operations mentioned above
might fail for lack of capacity until a node is repaired or a new one is
added. In this case an external tool would also need to go through any
"pending-recreate" or "pending-repair" instances and fix them.

Proposed changes
================

We'd like to increase the self-repair capabilities of Ganeti, at least
with regard to instances. In order to do so we plan to add mechanisms
to mark an instance as "due for being repaired", and then to perform the
relevant repair as soon as the cluster makes it possible.

The self-repair will be written as part of ganeti-watcher or as an extra
watcher component that is called less often.

In the first version we'll only handle the case in which an instance
lives on an offline or drained node. In the future we may add more
self-repair capabilities for errors Ganeti can detect.

New attributes (or tags)
------------------------

In order to know when to perform a self-repair operation we need to know
whether it is allowed by the cluster administrator.

This can be implemented as either new attributes or tags. Tags could be
acceptable as they would only be read and interpreted by the self-repair tool
(part of the watcher), and not by the Ganeti core opcodes and node RPCs. The
following tags would be needed:

ganeti:watcher:autorepair:<type>
++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
Allow repairs to happen on an instance that has the tag, or that lives
in a cluster or nodegroup which does. Types of repair are in order of
perceived risk, lower to higher, and each type includes allowing the
operations in the lower ones:

- ``fix-storage`` allows a disk replacement or another operation that
  fixes the instance backend storage without affecting the instance
  itself. This can for example recover from a broken drbd secondary, but
  risks data loss if something is wrong on the primary but the secondary
  was somehow recoverable.
- ``migrate`` allows an instance migration. This can recover from a
  drained primary, but can cause an instance crash in some cases (bugs).
- ``failover`` allows instance reboot on the secondary. This can recover
  from an offline primary, but the instance will lose its running state.
- ``reinstall`` allows disks to be recreated and an instance to be
  reinstalled. This can recover from primary and secondary both being
  offline, or from an offline primary in the case of non-redundant
  instances. It causes data loss.

Each repair type allows all the operations in the previous types, in the
order above, in order to ensure a repair can be completed fully. As such
a repair of a lower type might not be able to proceed if it detects an
error condition that requires a more risky or drastic solution, but
never vice versa (if a worse solution is allowed then so is a better
one).
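
As a minimal sketch of this ordering (not part of Ganeti: the function
names are hypothetical, and treating the most permissive tag as the
effective one when several are found is an assumption), a permission
check could look like this, with ``tags`` being the union of the
instance, nodegroup and cluster tags:

.. code-block:: python

   # Repair types in increasing order of perceived risk, as listed above.
   REPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]
   AUTOREPAIR_PREFIX = "ganeti:watcher:autorepair:"


   def allowed_type(tags):
       """Return the most permissive repair type granted by tags, or None."""
       granted = [
           tag[len(AUTOREPAIR_PREFIX):]
           for tag in tags
           if tag.startswith(AUTOREPAIR_PREFIX)
       ]
       # Suffixes that are not repair types (e.g. "suspend") are ignored.
       ranks = [REPAIR_TYPES.index(t) for t in granted if t in REPAIR_TYPES]
       return REPAIR_TYPES[max(ranks)] if ranks else None


   def is_allowed(tags, needed):
       """True if tags permit an operation of repair type ``needed``."""
       best = allowed_type(tags)
       if best is None:
           return False
       # A riskier type implies permission for every less risky one.
       return REPAIR_TYPES.index(needed) <= REPAIR_TYPES.index(best)

For example, with only ``ganeti:watcher:autorepair:failover`` set,
``is_allowed(tags, "migrate")`` is true while
``is_allowed(tags, "reinstall")`` is false.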

ganeti:watcher:autorepair:suspend[:<timestamp>]
+++++++++++++++++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
If this tag is encountered no autorepair operations will start for the
instance (or for any instance, if present at the cluster or group
level). Any job which has already started will be allowed to finish, but
then the autorepair system will not proceed further until this tag is
removed, or the timestamp passes (in which case the tag will be removed
automatically by the watcher).

Note that depending on how this tag is used there might still be race
conditions related to it for an external tool that uses it
programmatically, as no "lock tag" or tag "test-and-set" operation is
present at this time. While this is known we won't solve these race
conditions in the first version.

It might also be useful to have an operation that tags all instances
matching a filter on some characteristic. But again, this wouldn't be
specific to this tag.
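
The suspension check described above could be sketched as follows. This
is only an illustration: the function name is hypothetical, and the
timestamp is assumed to be an integer number of seconds since the Unix
epoch, which this document does not specify.

.. code-block:: python

   import time

   SUSPEND_TAG = "ganeti:watcher:autorepair:suspend"


   def suspend_status(tags, now=None):
       """Return (suspended, expired_tags) for the tags applying to an instance."""
       now = time.time() if now is None else now
       suspended = False
       expired = []
       for tag in tags:
           if tag == SUSPEND_TAG:
               # No timestamp: suspended until someone removes the tag.
               suspended = True
           elif tag.startswith(SUSPEND_TAG + ":"):
               # Assumption: the timestamp is integer Unix seconds.
               until = int(tag[len(SUSPEND_TAG) + 1:])
               if until > now:
                   suspended = True
               else:
                   expired.append(tag)
       return suspended, expired

Expired tags are returned rather than removed, so that the caller can
delete them with whatever tag-removal mechanism it uses.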

ganeti:watcher:repair:pending:<type>:<id>:<timestamp>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` is pending on the
target instance. This means that either jobs are being run, or it's
waiting for resource availability. ``id`` is the unique id identifying
this repair, ``timestamp`` is the time when this tag was first applied
to this instance for this ``id`` (we will "update" the tag by adding a
"new copy" of it and removing the old version as we run more jobs, but
the timestamp will never change for the same repair).

``jobs`` is the list of jobs already run or being run to repair the
instance. If the instance has just been put in pending state but no job
has run yet, this list is empty.

This tag will be set by Ganeti if an equivalent autorepair tag is
present and a repair is needed, or can be set by an external tool to
request a repair as a "one-off".

If multiple instances of this tag are present they will be handled in
order of timestamp.
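
A small sketch of how such a tag might be parsed, rebuilt and ordered is
given below. The helper names are hypothetical. It assumes that ``id``
contains no colon, that ``timestamp`` is an integer, and that ``jobs`` is
a comma-separated list of integer job ids, as stated for the result tag.

.. code-block:: python

   from collections import namedtuple

   PENDING_PREFIX = "ganeti:watcher:repair:pending:"

   PendingRepair = namedtuple("PendingRepair", "rtype rid timestamp jobs")


   def parse_pending(tag):
       """Parse a pending tag; return None if ``tag`` is not one."""
       if not tag.startswith(PENDING_PREFIX):
           return None
       fields = tag[len(PENDING_PREFIX):].split(":")
       if len(fields) != 4:
           return None
       rtype, rid, timestamp, jobs = fields
       # An empty jobs field means that no job has been submitted yet.
       job_ids = [int(j) for j in jobs.split(",") if j]
       return PendingRepair(rtype, rid, int(timestamp), job_ids)


   def format_pending(repair):
       """Build the tag string for ``repair`` (used when "updating" the tag)."""
       jobs = ",".join(str(j) for j in repair.jobs)
       return "%s%s:%s:%d:%s" % (
           PENDING_PREFIX, repair.rtype, repair.rid, repair.timestamp, jobs)


   def pending_in_order(tags):
       """Return the pending repairs found in ``tags``, oldest first."""
       parsed = [p for p in map(parse_pending, tags) if p is not None]
       return sorted(parsed, key=lambda p: p.timestamp)

"Updating" the tag after a new job is submitted then means formatting a
copy with the job id appended and the timestamp unchanged, adding that
tag, and removing the old one.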

ganeti:watcher:repair:result:<type>:<id>:<timestamp>:<result>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` has been performed on
the instance and has been completed by ``timestamp``. The result is
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
comma-separated list of jobs that were executed for this repair.

An ``enoperm`` result is returned when the repair was carried out as far
as possible, but the repair type doesn't permit proceeding further.
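
As an illustration only, a result tag could be built and inspected as
below. The helper names are hypothetical, and an integer timestamp and an
``id`` without colons are assumed.

.. code-block:: python

   RESULT_PREFIX = "ganeti:watcher:repair:result:"
   RESULTS = ("success", "failure", "enoperm")


   def format_result(rtype, rid, timestamp, result, jobs):
       """Build a result tag, rejecting results not listed above."""
       if result not in RESULTS:
           raise ValueError("unknown repair result: %r" % (result,))
       job_list = ",".join(str(j) for j in jobs)
       return "%s%s:%s:%d:%s:%s" % (
           RESULT_PREFIX, rtype, rid, timestamp, result, job_list)


   def has_failed(tags):
       """True if any result tag in ``tags`` records a failure."""
       for tag in tags:
           if tag.startswith(RESULT_PREFIX):
               fields = tag[len(RESULT_PREFIX):].split(":")
               # Fields are: type, id, timestamp, result, jobs.
               if len(fields) == 5 and fields[3] == "failure":
                   return True
       return False

A check like ``has_failed`` is one way to detect instances that must be
left alone until a failure has been investigated.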

Possible states, and transitions
--------------------------------

At any point an instance can be in one of the following health states:

Healthy
+++++++

The instance lives on only online nodes. The autorepair system will
never touch these instances. Any ``repair:pending`` tags will be removed
and marked ``success`` with no jobs attached to them.

This state can transition to:

- Needs-repair, repair disallowed (node offlined or drained, no
  autorepair tag)
- Needs-repair, repair allowed always (node offlined or drained,
  autorepair tag present)
- Suspended (a suspend tag is added)

Suspended
+++++++++

Whenever an ``autorepair:suspend`` tag is added the autorepair code
won't touch the instance until the tag is removed or its timestamp, if
present, has passed. A tag whose timestamp has passed will be removed
(and the instance will transition to its correct state, depending on its
health and other tags).

Note that when an instance is suspended any pending repair is
interrupted, but jobs which were submitted before the suspension are
allowed to finish.

Needs-repair, repair disallowed
+++++++++++++++++++++++++++++++

The instance lives on an offline or drained node, but no autorepair tag
is set, or the autorepair tag set is of a type not powerful enough to
finish the repair. The autorepair system will never touch these
instances, and they can transition to:

- Healthy (manual repair)
- Pending repair (a ``repair:pending`` tag is added)
- Needs-repair, repair allowed always (an autorepair tag of a sufficient
  type is added)
- Suspended (a suspend tag is added)

Needs-repair, repair allowed always
+++++++++++++++++++++++++++++++++++

A ``repair:pending`` tag is added, and the instance transitions to the
Pending repair state. The autorepair tag is preserved.

Of course if an ``autorepair:suspend`` tag is found no pending tag will
be added, and the instance will instead transition to the Suspended
state.

Pending repair
++++++++++++++

When an instance is in this state the following will happen:

If an ``autorepair:suspend`` tag is found the instance won't be touched
and will be moved to the Suspended state. Any jobs which were already
running will be left untouched.

If there are still jobs running related to the instance and scheduled by
this repair they will be given more time to run, and the instance will
be checked again later. The state transitions to itself.

If no jobs are running and the instance is detected to be healthy, the
``repair:result`` tag will be added, and the current active
``repair:pending`` tag will be removed. It will then transition to the
Healthy state if there are no ``repair:pending`` tags, or to the Pending
repair state otherwise: there, the instance being healthy, those tags
will be resolved without any operation as well (note that this is the
same as transitioning to the Healthy state, where ``repair:pending``
tags would also be resolved).

If no jobs are running and the instance still has issues:

- if the last job(s) failed it can either be retried a few times, if
  deemed to be safe, or the repair can transition to the Failed state.
  The ``repair:result`` tag will be added, and the active
  ``repair:pending`` tag will be removed (further ``repair:pending``
  tags will not be able to proceed, as explained by the Failed state,
  until the failure state is cleared).
- if the last job(s) succeeded but there are not enough resources to
  proceed, the state will transition to itself and no jobs are
  scheduled. The tag is left untouched (and later checked again). This
  basically just delays any repairs: the current ``pending`` tag stays
  active, and any others are untouched.
- if the last job(s) succeeded but the repair type does not allow
  proceeding any further, the ``repair:result`` tag is added with an
  ``enoperm`` result, and the current ``repair:pending`` tag is removed.
  The instance is now back to "Needs-repair, repair disallowed",
  "Needs-repair, repair allowed always", or "Pending repair" if there is
  already a future tag that can repair the instance.
- if the last job(s) succeeded and the repair can continue, new job(s)
  can be submitted, and the ``repair:pending`` tag can be updated.
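
The decisions above can be summarised as one function evaluated on each
watcher pass. This is a sketch under assumptions: the function name, its
boolean inputs and the returned action labels are hypothetical, the
retry option for failed jobs is left out, and checking resources before
the repair type is a choice this document does not make explicit.

.. code-block:: python

   def next_step(suspended, jobs_running, healthy, last_jobs_ok,
                 has_resources, type_permits):
       """Return the action for one pass over an instance in Pending repair."""
       if suspended:
           return "suspend"          # running jobs are left alone
       if jobs_running:
           return "wait"             # check again on a later pass
       if healthy:
           return "result:success"   # add result tag, drop the pending tag
       if not last_jobs_ok:
           return "result:failure"   # the instance enters the Failed state
       if not has_resources:
           return "wait"             # keep the pending tag, submit nothing
       if not type_permits:
           return "result:enoperm"   # repair type too weak to continue
       return "submit-jobs"          # submit job(s), update the pending tag

When no job has run yet, ``last_jobs_ok`` would be passed as true so
that the first job(s) can be submitted.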

Failed
++++++

If repairing an instance has failed a ``repair:result`` tag with a
``failure`` result is added. The presence of this tag is used to detect
that an instance is in this state, and it will not be touched until the
failure is investigated and the tag is removed.

An external tool or person needs to investigate the state of the
instance and remove this tag once they are sure the instance is repaired
and safe to return to the normal autorepair system.

(Alternatively we can use the suspended state (indefinitely or
temporarily) to mark the instance as "do not touch" when we think a
human needs to look at it. To be decided).

Repair operation
----------------

Possible repairs are:

- Replace-disks (drbd, if the secondary is down), or other
  storage-specific fixes
- Migrate (shared storage, rbd, drbd, if the primary is drained)
- Failover (shared storage, rbd, drbd, if the primary is down)
- Recreate disks + reinstall (all nodes down, plain, files or drbd)

Note that more than one of these operations may need to happen before a
full repair is completed (e.g. if a drbd primary goes offline, first a
failover will happen, then a replace-disks).
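
A simplified sketch of choosing the next operation from the list above
follows. The function name, the template labels (``"shared"`` and
``"file"`` in particular), the node state strings and the
``"unhandled"`` return value are all hypothetical. Only the cases named
in the list are covered.

.. code-block:: python

   # Templates whose instances can be moved to another node.
   MOVABLE = ("drbd", "rbd", "shared")
   # Templates for which recreating disks and reinstalling is listed.
   RECREATABLE = ("plain", "file", "drbd")


   def first_operation(template, primary, secondary=None):
       """Return the next repair operation, or None for a healthy instance.

       Node states are "online", "drained" or "offline".
       """
       nodes = [primary] if secondary is None else [primary, secondary]
       if all(state == "online" for state in nodes):
           return None
       if template in RECREATABLE and all(s == "offline" for s in nodes):
           return "reinstall"
       if template == "drbd" and secondary == "offline":
           return "replace-disks"
       if template in MOVABLE and primary == "offline":
           return "failover"
       if template in MOVABLE and primary == "drained":
           return "migrate"
       return "unhandled"  # not covered by the list above

For a drbd instance whose primary goes offline this returns
``"failover"``. Once the roles have swapped, calling it again with an
online primary and an offline secondary returns ``"replace-disks"``,
which matches the sequence in the note above.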

The self-repair tool will first take care of all needs-repair instances
that can be brought into ``pending`` state, and transition them as
described above.

Then it will go through any ``repair:pending`` instances and handle them
as described above.

Note that the repair tool MAY "group" instances by performing common
repair jobs for them (e.g. node evacuate).

Staging of work
---------------

- First version: recreate-disks + reinstall (2.6.1)
- Second version: failover and migrate repairs (2.7)
- Third version: replace disks repair (2.7 or 2.8)

Future work
===========

One important piece of work will be reporting what the autorepair system
is "thinking" and exporting this in a form that can be read by an
outside user or system. In order to do this we need a better
communication system than embedding this information into tags. This
should be designed in an extensible way that can be used in general for
Ganeti to provide "advisory" information about entities it manages, and
for an external system to "advise" Ganeti about what it can do, but in a
less direct manner than submitting individual jobs.

Note that cluster verify checks some errors that are actually
instance-specific (e.g. a missing backend disk on a drbd node) or
node-specific (e.g. an extra lvm device). If we were to split these into
"instance verify", "node verify" and "cluster verify", then we could
easily use this tool to perform some of those repairs as well.

Finally, self-repairs could also be extended to the cluster level, for
example to concepts like "N+1 failures" or missing master candidates, or
to the node level for some specific types of errors.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: