====================
Instance auto-repair
====================

.. contents:: :depth: 4

This is a design document detailing the implementation of self-repair and
recreation of instances in Ganeti. It also discusses ideas that might be
useful for future self-repair situations.

Current state and shortcomings
==============================

Ganeti currently doesn't do any sort of self-repair or self-recreate of
instances:

- If a drbd instance is broken (its primary or secondary node goes
  offline or needs to be drained) an admin or an external tool must
  fail it over if necessary, and then trigger a disk replacement.
- If a plain instance is broken (or both nodes of a drbd instance are)
  an admin or an external tool must recreate its disks and reinstall it.

Moreover, in an oversubscribed cluster the operations mentioned above
might fail for lack of capacity until a node is repaired or a new one
is added. In this case an external tool would also need to go through
any "pending-recreate" or "pending-repair" instances and fix them.

Proposed changes
================

We'd like to increase the self-repair capabilities of Ganeti, at least
with regards to instances. In order to do so we plan to add mechanisms
to mark an instance as "due for being repaired", and then to have the
relevant repair performed as soon as possible on the cluster.

The self-repair code will be written as part of ganeti-watcher or as an
extra watcher component that is called less often.

In the first version we'll only handle the case in which an instance
lives on an offline or drained node. In the future we may add more
self-repair capabilities for errors Ganeti can detect.

New attributes (or tags)
------------------------

In order to know when to perform a self-repair operation we need to know
whether such operations are allowed by the cluster administrator.

This can be implemented as either new attributes or tags. Tags could be
acceptable, as they would only be read and interpreted by the
self-repair tool (part of the watcher), and not by the ganeti core
opcodes and node RPCs. The following tags would be needed:

ganeti:watcher:autorepair:<type>
++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
Allow repairs to happen on an instance that has the tag, or that lives
in a cluster or nodegroup which does. Types of repair are in order of
perceived risk, lower to higher, and each type includes allowing the
operations in the lower ones:

- ``fix-storage`` allows a disk replacement or another operation that
  fixes the instance backend storage without affecting the instance
  itself. This can for example recover from a broken drbd secondary, but
  risks data loss if something is wrong on the primary but the secondary
  was somehow recoverable.
- ``migrate`` allows an instance migration. This can recover from a
  drained primary, but can cause an instance crash in some cases (bugs).
- ``failover`` allows instance reboot on the secondary. This can recover
  from an offline primary, but the instance will lose its running state.
- ``reinstall`` allows disks to be recreated and an instance to be
  reinstalled. This can recover from primary and secondary both being
  offline, or from an offline primary in the case of non-redundant
  instances. It causes data loss.

Each repair type allows all the operations of the lower-risk types, in
the order above, to ensure a repair can be completed fully. As such a
repair of a lower type might not be able to proceed if it detects an
error condition that requires a riskier or more drastic solution, but
never vice versa (if a worse solution is allowed then so is a better
one).
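
As an illustration only, a minimal Python sketch (the helper name is
ours, not part of the design or the watcher API) of how a tool could
check whether the allowed repair type covers the operation a detected
problem requires, using the risk ordering above::

  AUTOREPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]

  def repair_allowed(allowed_type, required_type):
    """Return True if allowing allowed_type also permits required_type.

    Each type implicitly allows everything of lower perceived risk, so
    the check reduces to comparing positions in the ordered list.
    """
    return (AUTOREPAIR_TYPES.index(required_type)
            <= AUTOREPAIR_TYPES.index(allowed_type))

  # Example: a "failover" tag allows a migration, but not a reinstall.
  assert repair_allowed("failover", "migrate")
  assert not repair_allowed("failover", "reinstall")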

ganeti:watcher:autorepair:suspend[:<timestamp>]
+++++++++++++++++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
If this tag is encountered no autorepair operations will start for the
instance (or for any instance, if present at the cluster or group
level). Any job which already started will be allowed to finish, but
then the autorepair system will not proceed further until this tag is
removed, or the timestamp passes (in which case the tag will be removed
automatically by the watcher).
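
A minimal sketch, in plain Python with illustrative names (not the
actual watcher code), of how the suspension check could work::

  import time

  SUSPEND_PREFIX = "ganeti:watcher:autorepair:suspend"

  def suspension_active(tags, now=None):
    """Return True if a suspend tag forbids starting new repair jobs."""
    now = time.time() if now is None else now
    for tag in tags:
      if not tag.startswith(SUSPEND_PREFIX):
        continue
      rest = tag[len(SUSPEND_PREFIX):].lstrip(":")
      if not rest:
        return True  # no timestamp: suspended until the tag is removed
      if now < int(rest):
        return True  # timestamp not reached yet: still suspended
      # otherwise the tag has expired; the watcher would remove it here
    return False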

Note that depending on how this tag is used there might still be race
conditions related to it for an external tool that uses it
programmatically, as no "lock tag" or tag "test-and-set" operation is
present at this time. While this is a known limitation, we won't solve
these race conditions in the first version.

It might also be useful to have an operation that easily tags all
instances matching a filter on some characteristic. But again, this
wouldn't be specific to this tag.

ganeti:watcher:autorepair:pending:<type>:<id>:<timestamp>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` is pending on the
target instance. This means that either jobs are being run, or it's
waiting for resource availability. ``id`` is the unique id identifying
this repair, ``timestamp`` is the time when this tag was first applied
to this instance for this ``id`` (we will "update" the tag by adding a
"new copy" of it and removing the old version as we run more jobs, but
the timestamp will never change for the same repair).

``jobs`` is the list of jobs already run or being run to repair the
instance. If the instance has just been put in pending state but no job
has run yet, this list is empty.
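
A minimal sketch, again with illustrative names only, of how the new
copy of the tag could be built when more jobs are submitted (removing
the old copy would be a separate tag operation)::

  PENDING_PREFIX = "ganeti:watcher:autorepair:pending:"

  def updated_pending_tag(old_tag, new_job_ids):
    """Return a new copy of a pending tag with extra job ids appended.

    Assumes the jobs field is present (possibly empty) and
    comma-separated; the exact separator is not fixed by this design.
    """
    rtype, rid, timestamp, jobs = old_tag[len(PENDING_PREFIX):].split(":", 3)
    old_jobs = [j for j in jobs.split(",") if j]
    new_jobs = old_jobs + [str(j) for j in new_job_ids]
    return PENDING_PREFIX + ":".join([rtype, rid, timestamp,
                                      ",".join(new_jobs)])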

This tag will be set by ganeti if an equivalent autorepair tag is
present and a repair is needed, or it can be set by an external tool to
request a repair as a one-off.

If multiple instances of this tag are present they will be handled in
order of timestamp.
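
For illustration, a sketch (hypothetical helper, plain Python) of how
pending tags could be parsed and ordered by timestamp for processing::

  from collections import namedtuple

  PendingRepair = namedtuple("PendingRepair",
                             ["rtype", "rid", "timestamp", "jobs"])

  def parse_pending_tags(tags):
    """Extract pending repairs from a tag list, oldest timestamp first."""
    prefix = "ganeti:watcher:autorepair:pending:"
    pending = []
    for tag in tags:
      if not tag.startswith(prefix):
        continue
      fields = tag[len(prefix):].split(":")
      rtype, rid, timestamp = fields[0], fields[1], int(fields[2])
      jobs = [j for j in ":".join(fields[3:]).split(",") if j]
      pending.append(PendingRepair(rtype, rid, timestamp, jobs))
    return sorted(pending, key=lambda p: p.timestamp)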

ganeti:watcher:autorepair:result:<type>:<id>:<timestamp>:<result>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` has been performed on
the instance and has been completed by ``timestamp``. The result is
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
comma-separated list of the jobs that were executed for this repair.

An ``enoperm`` result is returned when the repair was carried on as far
as possible, but the allowed repair type does not permit proceeding any
further.
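
A small sketch of how a finished repair could be turned into such a
result tag (illustrative helper, not the actual tool code)::

  import time

  RESULT_PREFIX = "ganeti:watcher:autorepair:result:"

  def make_result_tag(rtype, rid, result, jobs, timestamp=None):
    """Build a result tag; result is 'success', 'failure' or 'enoperm'."""
    assert result in ("success", "failure", "enoperm")
    timestamp = int(time.time()) if timestamp is None else timestamp
    return RESULT_PREFIX + ":".join([rtype, rid, str(timestamp), result,
                                     ",".join(jobs)])

  # e.g. make_result_tag("fix-storage", "1234", "success", ["421", "470"])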

Possible states and transitions
-------------------------------

At any point an instance can be in one of the following health states:

Healthy
+++++++

The instance lives on only online nodes. The autorepair system will
never touch these instances. Any ``repair:pending`` tags will be removed
and marked ``success`` with no jobs attached to them.

This state can transition to:

- Needs-repair, repair disallowed (node offlined or drained, no
  autorepair tag)
- Needs-repair, autorepair allowed (node offlined or drained, autorepair
  tag present)
- Suspended (a suspend tag is added)

Suspended
+++++++++

Whenever a ``repair:suspend`` tag is added the autorepair code won't
touch the instance until the timestamp on the tag has passed, if
present. The tag will be removed afterwards (and the instance will
transition to its correct state, depending on its health and other
tags).

Note that when an instance is suspended any pending repair is
interrupted, but jobs which were submitted before the suspension are
allowed to finish.

Needs-repair, repair disallowed
+++++++++++++++++++++++++++++++

The instance lives on an offline or drained node, but no autorepair tag
is set, or the autorepair tag set is of a type not powerful enough to
finish the repair. The autorepair system will never touch these
instances, and they can transition to:

- Healthy (manual repair)
- Pending repair (a ``repair:pending`` tag is added)
- Needs-repair, repair allowed always (an autorepair always tag is added)
- Suspended (a suspend tag is added)

Needs-repair, repair allowed always
+++++++++++++++++++++++++++++++++++

A ``repair:pending`` tag is added, and the instance transitions to the
Pending repair state. The autorepair tag is preserved.

Of course if a ``repair:suspend`` tag is found no pending tag will be
added, and the instance will instead transition to the Suspended state.
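
For illustration, the initial pending tag for such a new repair could be
built along these lines (hypothetical helper; the use of a UUID as the
repair id is our assumption, not part of this design)::

  import time
  import uuid

  def initial_pending_tag(rtype):
    """Build the first repair:pending tag for a new repair (no jobs)."""
    return "ganeti:watcher:autorepair:pending:%s:%s:%d:" % (
        rtype, uuid.uuid4(), int(time.time()))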

Pending repair
++++++++++++++

When an instance is in this state the following will happen:

If a ``repair:suspend`` tag is found the instance won't be touched and
will be moved to the Suspended state. Any jobs which were already
running will be left untouched.

If there are still jobs running related to the instance and scheduled by
this repair they will be given more time to run, and the instance will
be checked again later. The state transitions to itself.

If no jobs are running and the instance is detected to be healthy, the
``repair:result`` tag will be added, and the current active
``repair:pending`` tag will be removed. It will then transition to the
Healthy state if there are no ``repair:pending`` tags, or to the Pending
state otherwise: there, the instance being healthy, those tags will be
resolved without any operation as well (note that this is the same as
transitioning to the Healthy state, where ``repair:pending`` tags would
also be resolved).

If no jobs are running and the instance still has issues (the overall
decision is sketched after this list):

- if the last job(s) failed it can either be retried a few times, if
  deemed to be safe, or the repair can transition to the Failed state.
  The ``repair:result`` tag will be added, and the active
  ``repair:pending`` tag will be removed (further ``repair:pending``
  tags will not be able to proceed, as explained by the Failed state,
  until the failure state is cleared).
- if the last job(s) succeeded but there are not enough resources to
  proceed, the state will transition to itself and no jobs are
  scheduled. The tag is left untouched (and later checked again). This
  basically just delays any repairs; the current ``pending`` tag stays
  active, and any others are untouched.
- if the last job(s) succeeded but the allowed repair type does not
  permit proceeding any further, the ``repair:result`` tag is added with
  an ``enoperm`` result, and the current ``repair:pending`` tag is
  removed. The instance is now back to "Needs-repair, repair
  disallowed", "Needs-repair, autorepair allowed", or "Pending" if there
  is already a future tag that can repair the instance.
- if the last job(s) succeeded and the repair can continue, new job(s)
  can be submitted, and the ``repair:pending`` tag can be updated.
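
A minimal sketch of the overall decision taken in this state, with the
checks the watcher would perform passed in as booleans (all names are
illustrative, not an actual API)::

  def next_pending_action(suspended, jobs_running, healthy,
                          last_jobs_failed, resources_available,
                          type_allows_next_step):
    """Return the action to take for an instance in Pending repair."""
    if suspended:
      return "move-to-suspended"    # running jobs are left alone
    if jobs_running:
      return "wait"                 # re-check on the next watcher run
    if healthy:
      return "tag-result-success"   # remove pending tag, add result tag
    if last_jobs_failed:
      return "tag-result-failure"   # the instance moves to Failed
    if not resources_available:
      return "wait"                 # delay; the pending tag stays active
    if not type_allows_next_step:
      return "tag-result-enoperm"   # back to a Needs-repair state
    return "submit-more-jobs"       # and update the pending tag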

Failed
++++++

If repairing an instance has failed a ``repair:result:failure`` tag is
added. The presence of this tag is used to detect that an instance is in
this state, and it will not be touched until the failure is investigated
and the tag is removed.

An external tool or person needs to investigate the state of the
instance and remove this tag when they are sure the instance is repaired
and safe to return to the normal autorepair system.

(Alternatively we can use the suspended state (indefinitely or
temporarily) to mark the instance as "do not touch" when we think a
human needs to look at it. To be decided.)

A graph with the possible transitions follows; note that in the graph,
following the implementation, the two ``Needs repair`` states have been
coalesced into one, and the ``Suspended`` state disappears, for it
becomes an attribute of the instance object (its auto-repair policy).

.. digraph:: "auto-repair-states"

  node     [shape=circle, style=filled, fillcolor="#BEDEF1",
            width=2, fixedsize=true];
  healthy  [label="Healthy"];
  needsrep [label="Needs repair"];
  pendrep  [label="Pending repair"];
  failed   [label="Failed repair"];
  disabled [label="(no state)", width=1.25];

  {rank=same; needsrep}
  {rank=same; healthy}
  {rank=same; pendrep}
  {rank=same; failed}
  {rank=same; disabled}

  // These nodes are needed to be the "origin" of the "initial state" arrows.
  node [width=.5, label="", style=invis];
  inih;
  inin;
  inip;
  inif;
  inix;

  edge [fontsize=10, fontname="Arial Bold", fontcolor=blue]

  inih -> healthy  [label="No tags or\nresult:success"];
  inip -> pendrep  [label="Tag:\nautorepair:pending"];
  inif -> failed   [label="Tag:\nresult:failure"];
  inix -> disabled [fontcolor=black, label="ArNotEnabled"];

  edge [fontcolor="orange"];

  healthy -> healthy [label="No problems\ndetected"];

  healthy -> needsrep [
             label="Brokenness\ndetected in\nfirst half of\nthe tool run"];

  pendrep -> healthy [
             label="All jobs\ncompleted\nsuccessfully /\ninstance healthy"];

  pendrep -> failed [label="Some job(s)\nfailed"];

  edge [fontcolor="red"];

  needsrep -> pendrep [
              label="Repair\nallowed and\ninitial job(s)\nsubmitted"];

  needsrep -> needsrep [
              label="Repairs suspended\n(no-op) or enabled\nbut not powerful enough\n(result: enoperm)"];

  pendrep -> pendrep [label="More jobs\nsubmitted"];

Repair operation
----------------

Possible repairs are:

- Replace-disks (drbd, if the secondary is down), or other
  storage-specific fixes
- Migrate (shared storage, rbd, drbd, if the primary is drained)
- Failover (shared storage, rbd, drbd, if the primary is down)
- Recreate disks + reinstall (all nodes down; plain, file or drbd)

Note that more than one of these operations may need to happen before a
full repair is completed (e.g. if a drbd primary goes offline, first a
failover will happen, then a replace-disks).
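
As an illustration of how detected problems map to repairs, a sketch
choosing the operation from the disk template and node health (the
template names and returned strings are illustrative only)::

  def choose_repair(disk_template, primary_ok, secondary_ok):
    """Pick the repair operation for an instance, or None if healthy."""
    if disk_template == "drbd":
      if primary_ok and not secondary_ok:
        return "replace-disks"
      if not primary_ok and secondary_ok:
        # A drained primary allows a migration, an offline one only a
        # failover; the distinction is omitted here for brevity.
        return "failover-or-migrate"
      if not (primary_ok or secondary_ok):
        return "recreate-disks+reinstall"
    elif disk_template in ("plain", "file"):
      if not primary_ok:
        return "recreate-disks+reinstall"
    elif disk_template in ("sharedfile", "rbd"):
      if not primary_ok:
        return "failover-or-migrate"
    return None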

The self-repair tool will first take care of all needs-repair instances
that can be brought into ``pending`` state, and transition them as
described above.

Then it will go through any ``repair:pending`` instances and handle them
as described above.

Note that the repair tool MAY "group" instances by performing common
repair jobs for them (e.g. node evacuation).

Staging of work
---------------

- First version: recreate-disks + reinstall (2.6.1)
- Second version: failover and migrate repairs (2.7)
- Third version: replace-disks repair (2.7 or 2.8)

Future work
===========

One important piece of work will be reporting what the autorepair system
is "thinking" and exporting this in a form that can be read by an
outside user or system. In order to do this we need a better
communication system than embedding this information into tags. This
should be thought out in an extensible way that can be used in general
for Ganeti to provide "advisory" information about entities it manages,
and for an external system to "advise" ganeti over what it can do, but
in a less direct manner than submitting individual jobs.

Note that cluster verify checks some errors that are actually
instance-specific (e.g. a missing backend disk on a drbd node) or
node-specific (e.g. an extra LVM device). If we were to split these into
"instance verify", "node verify" and "cluster verify", then we could
easily use this tool to perform some of those repairs as well.

Finally, self-repairs could also be extended to the cluster level, for
example for concepts like "N+1 failures" or missing master candidates,
or to the node level for some specific types of errors.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: