====================
Instance auto-repair
====================

.. contents:: :depth: 4

This is a design document detailing the implementation of self-repair
and recreation of instances in Ganeti. It also discusses ideas that
might be useful for future self-repair situations.

Current state and shortcomings
==============================

Ganeti currently doesn't do any sort of self-repair or self-recreate of
instances:

- If a drbd instance is broken (its primary or secondary node goes
  offline or needs to be drained) an admin or an external tool must
  fail it over if necessary, and then trigger a disk replacement.
- If a plain instance is broken (or both nodes of a drbd instance are)
  an admin or an external tool must recreate its disks and reinstall
  it.

Moreover, in an oversubscribed cluster the operations mentioned above
might fail for lack of capacity until a node is repaired or a new one
is added. In this case an external tool would also need to go through
any "pending-recreate" or "pending-repair" instances and fix them.

Proposed changes
================

We'd like to increase the self-repair capabilities of Ganeti, at least
with regard to instances. In order to do so we plan to add mechanisms
to mark an instance as "due for being repaired" and to have the
relevant repair performed as soon as possible on the cluster.

The self-repair will be written as part of ganeti-watcher or as an
extra watcher component that is called less often.

As a first version we'll only handle the case in which an instance
lives on an offline or drained node. In the future we may add more
self-repair capabilities for errors Ganeti can detect.

New attributes (or tags)
------------------------

In order to know when to perform a self-repair operation we need to
know whether such operations are allowed by the cluster administrator.

This can be implemented as either new attributes or tags. Tags could be
acceptable as they would only be read and interpreted by the
self-repair tool (part of the watcher), and not by the Ganeti core
opcodes and node RPCs. The following tags would be needed:

ganeti:watcher:autorepair:<type>
++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
Allow repairs to happen on an instance that has the tag, or that lives
in a cluster or nodegroup which does. Types of repair are in order of
perceived risk, lower to higher, and each type includes allowing the
operations in the lower ones:

- ``fix-storage`` allows a disk replacement or another operation that
  fixes the instance backend storage without affecting the instance
  itself. This can for example recover from a broken drbd secondary,
  but risks data loss if something is wrong on the primary but the
  secondary was somehow recoverable.
- ``migrate`` allows an instance migration. This can recover from a
  drained primary, but can cause an instance crash in some cases
  (bugs).
- ``failover`` allows instance reboot on the secondary. This can
  recover from an offline primary, but the instance will lose its
  running state.
- ``reinstall`` allows disks to be recreated and an instance to be
  reinstalled. This can recover from primary&secondary both being
  offline, or from an offline primary in the case of non-redundant
  instances. It causes data loss.

Each repair type allows all the operations in the previous types, in
the order above, in order to ensure a repair can be completed fully. As
such a repair of a lower type might not be able to proceed if it
detects an error condition that requires a more risky or drastic
solution, but never vice versa (if a worse solution is allowed then so
is a better one).

If there are multiple ``ganeti:watcher:autorepair:<type>`` tags in an
object (cluster, node group or instance), the least destructive tag
takes precedence. When multiplicity happens across objects, the nearest
tag wins. For example, if in a cluster with two instances, *I1* and
*I2*, *I1* has ``failover``, and the cluster itself has both
``fix-storage`` and ``reinstall``, *I1* will end up with ``failover``
and *I2* with ``fix-storage``.

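The precedence rules above could be resolved along the lines of the
sketch below; this is only an illustration, and the helper names are
not part of any existing Ganeti module.

.. code-block:: python

  # Sketch of resolving the effective repair type; illustrative only.
  AUTOREPAIR_PREFIX = "ganeti:watcher:autorepair:"

  # Repair types ordered from least to most destructive.
  AUTOREPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]


  def _ObjectRepairType(tags):
    """Return the least destructive repair type tagged on one object."""
    found = [tag[len(AUTOREPAIR_PREFIX):] for tag in tags
             if tag.startswith(AUTOREPAIR_PREFIX)
             and tag[len(AUTOREPAIR_PREFIX):] in AUTOREPAIR_TYPES]
    if not found:
      return None
    # Within a single object the least destructive tag takes precedence.
    return min(found, key=AUTOREPAIR_TYPES.index)


  def GetEffectiveRepairType(instance_tags, group_tags, cluster_tags):
    """Resolve an instance's repair type: the nearest tagged object wins."""
    for tags in (instance_tags, group_tags, cluster_tags):
      rtype = _ObjectRepairType(tags)
      if rtype is not None:
        return rtype
    return None  # no autorepair tag anywhere: repairs are not allowed


  # The example from the text: I1 carries "failover", the cluster both
  # "fix-storage" and "reinstall", I2 carries nothing of its own.
  cluster = ["ganeti:watcher:autorepair:fix-storage",
             "ganeti:watcher:autorepair:reinstall"]
  assert GetEffectiveRepairType(["ganeti:watcher:autorepair:failover"],
                                [], cluster) == "failover"
  assert GetEffectiveRepairType([], [], cluster) == "fix-storage"
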

ganeti:watcher:autorepair:suspend[:<timestamp>]
+++++++++++++++++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
If this tag is encountered no autorepair operations will start for the
instance (or for any instance, if present at the cluster or group
level). Any job which already started will be allowed to finish, but
then the autorepair system will not proceed further until this tag is
removed, or the timestamp passes (in which case the tag will be removed
automatically by the watcher).

Note that depending on how this tag is used there might still be race
conditions related to it for an external tool that uses it
programmatically, as no "lock tag" or tag "test-and-set" operation is
present at this time. While this is known, we won't solve these race
conditions in the first version.

It might also be useful to easily have an operation that tags all
instances matching a filter on some characteristic. But again, this
wouldn't be specific to this tag.

If there are multiple
``ganeti:watcher:autorepair:suspend[:<timestamp>]`` tags in an object,
the form without timestamp takes precedence (permanent suspension); or,
if all object tags have a timestamp, the one with the highest timestamp
does. When multiplicity happens across objects, the nearest tag wins,
as above. This makes it possible to suspend cluster-enabled repairs
with a single tag in the cluster object; or to suspend them only for a
certain node group or instance. At the same time, it is possible to
re-enable cluster-suspended repairs in a particular instance or group
by applying an enable tag to them.

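For illustration, evaluating the suspension tags of a single object
could look like the sketch below; it assumes the timestamp is encoded
as a Unix epoch value, which this design does not prescribe, and the
helper name is illustrative.

.. code-block:: python

  # Sketch only; assumes <timestamp> is a Unix epoch value.
  import time

  SUSPEND_PREFIX = "ganeti:watcher:autorepair:suspend"


  def IsSuspended(tags, now=None):
    """Check whether an object's tags currently suspend autorepairs.

    The bare "suspend" form wins over any timestamped form; among the
    timestamped forms the highest timestamp is the effective one.

    """
    if now is None:
      now = time.time()
    deadlines = []
    for tag in tags:
      if tag == SUSPEND_PREFIX:
        return True  # permanent suspension takes precedence
      if tag.startswith(SUSPEND_PREFIX + ":"):
        deadlines.append(int(tag[len(SUSPEND_PREFIX) + 1:]))
    # Suspended only until the latest deadline has passed.
    return bool(deadlines) and max(deadlines) > now
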

ganeti:watcher:autorepair:pending:<type>:<id>:<timestamp>:<jobs>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` is pending on the
target instance. This means that either jobs are being run, or it's
waiting for resource availability. ``id`` is the unique id identifying
this repair, ``timestamp`` is the time when this tag was first applied
to this instance for this ``id`` (we will "update" the tag by adding a
"new copy" of it and removing the old version as we run more jobs, but
the timestamp will never change for the same repair).

``jobs`` is the list of jobs already run or being run to repair the
instance (separated by a plus sign, *+*). If the instance has just
been put in pending state but no job has run yet, this list is empty.

This tag will be set by Ganeti if an equivalent autorepair tag is
present and a repair is needed, or it can be set by an external tool to
request a repair as a "one-off".

If multiple instances of this tag are present they will be handled in
order of timestamp.

ganeti:watcher:autorepair:result:<type>:<id>:<timestamp>:<result>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` has been performed on
the instance and has been completed by ``timestamp``. The result is
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
*+*-separated list of jobs that were executed for this repair.

An ``enoperm`` result is returned when the repair was carried on as far
as possible, but the allowed repair type does not permit it to proceed
any further.

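For illustration, the two tag formats above could be parsed along the
lines of the sketch below; the field values used in the example (in
particular the timestamp encoding) are only assumptions.

.. code-block:: python

  # Sketch of parsing the pending/result tag formats described above;
  # the names are illustrative, not an existing Ganeti interface.
  from collections import namedtuple

  PENDING_PREFIX = "ganeti:watcher:autorepair:pending:"
  RESULT_PREFIX = "ganeti:watcher:autorepair:result:"

  PendingRepair = namedtuple("PendingRepair", "rtype uuid timestamp jobs")
  RepairResult = namedtuple("RepairResult",
                            "rtype uuid timestamp result jobs")


  def ParsePendingTag(tag):
    """Split a pending tag into its fields; jobs are '+'-separated."""
    rtype, uuid, timestamp, jobs = tag[len(PENDING_PREFIX):].split(":", 3)
    return PendingRepair(rtype, uuid, timestamp,
                         jobs.split("+") if jobs else [])


  def ParseResultTag(tag):
    """Split a result tag into its fields."""
    rtype, uuid, timestamp, result, jobs = \
        tag[len(RESULT_PREFIX):].split(":", 4)
    return RepairResult(rtype, uuid, timestamp, result,
                        jobs.split("+") if jobs else [])


  # Example: a pending failover repair with two jobs already submitted.
  tag = "ganeti:watcher:autorepair:pending:failover:1234:1366119355:7+9"
  assert ParsePendingTag(tag).jobs == ["7", "9"]
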

Possible states, and transitions
--------------------------------

At any point an instance can be in one of the following health states:

Healthy
+++++++

The instance lives on only online nodes. The autorepair system will
never touch these instances. Any ``repair:pending`` tags will be
removed and marked ``success`` with no jobs attached to them.

This state can transition to:

- Needs-repair, repair disallowed (node offlined or drained, no
  autorepair tag)
- Needs-repair, autorepair allowed (node offlined or drained,
  autorepair tag present)
- Suspended (a suspend tag is added)

Suspended
+++++++++

Whenever a ``repair:suspend`` tag is added the autorepair code won't
touch the instance until the timestamp on the tag has passed, if
present. The tag will be removed afterwards (and the instance will
transition to its correct state, depending on its health and other
tags).

Note that when an instance is suspended any pending repair is
interrupted, but jobs which were submitted before the suspension are
allowed to finish.

Needs-repair, repair disallowed
+++++++++++++++++++++++++++++++

The instance lives on an offline or drained node, but no autorepair tag
is set, or the autorepair tag set is of a type not powerful enough to
finish the repair. The autorepair system will never touch these
instances, and they can transition to:

- Healthy (manual repair)
- Pending repair (a ``repair:pending`` tag is added)
- Needs-repair, repair allowed always (an autorepair always tag is
  added)
- Suspended (a suspend tag is added)

Needs-repair, repair allowed always
+++++++++++++++++++++++++++++++++++

A ``repair:pending`` tag is added, and the instance transitions to the
Pending repair state. The autorepair tag is preserved.

Of course if a ``repair:suspended`` tag is found no pending tag will be
added, and the instance will instead transition to the Suspended state.

Pending repair
++++++++++++++

When an instance is in this state the following will happen:

If a ``repair:suspended`` tag is found the instance won't be touched,
and it will be moved to the Suspended state. Any jobs which were
already running will be left untouched.

If there are still jobs running related to the instance and scheduled
by this repair they will be given more time to run, and the instance
will be checked again later. The state transitions to itself.

If no jobs are running and the instance is detected to be healthy, the
``repair:result`` tag will be added, and the current active
``repair:pending`` tag will be removed. It will then transition to the
Healthy state if there are no ``repair:pending`` tags, or to the
Pending repair state otherwise: there, the instance being healthy,
those tags will be resolved without any operation as well (note that
this is the same as transitioning to the Healthy state, where
``repair:pending`` tags would also be resolved).

If no jobs are running and the instance still has issues, one of the
following applies (a sketch of this decision logic is given after the
list):

- if the last job(s) failed it can either be retried a few times, if
  deemed to be safe, or the repair can transition to the Failed state.
  The ``repair:result`` tag will be added, and the active
  ``repair:pending`` tag will be removed (further ``repair:pending``
  tags will not be able to proceed, as explained by the Failed state,
  until the failure state is cleared)
- if the last job(s) succeeded but there are not enough resources to
  proceed, the state will transition to itself and no jobs are
  scheduled. The tag is left untouched (and later checked again). This
  basically just delays any repairs: the current ``pending`` tag stays
  active, and any others are untouched.
- if the last job(s) succeeded but the repair type does not allow it to
  proceed any further, the ``repair:result`` tag is added with an
  ``enoperm`` result, and the current ``repair:pending`` tag is
  removed. The instance is now back to "Needs-repair, repair
  disallowed", "Needs-repair, repair allowed always", or "Pending
  repair" if there is already a future tag that can repair the
  instance.
- if the last job(s) succeeded and the repair can continue, new job(s)
  can be submitted, and the ``repair:pending`` tag can be updated.

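A minimal sketch of that decision logic, reduced to a pure function
over boolean summaries that the repair tool would have to compute from
the job queue and from the instance's health (the parameter and state
names here are illustrative):

.. code-block:: python

  def PendingRepairNextState(jobs_running, jobs_failed, healthy,
                             can_proceed, have_resources):
    """Return the next state for an instance in the Pending repair state."""
    if jobs_running:
      return "pending"       # give the jobs more time, check again later
    if healthy:
      return "healthy"       # tag result:success, drop the pending tag
    if jobs_failed:
      return "failed"        # retry, or tag result:failure and stop
    if not can_proceed:
      return "needs-repair"  # tag result:enoperm, drop the pending tag
    if not have_resources:
      return "pending"       # just delay: leave the pending tag untouched
    return "pending"         # submit more jobs, update the pending tag
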

Failed
++++++

If repairing an instance has failed a ``repair:result:failure`` tag is
added. The presence of this tag is used to detect that an instance is
in this state, and it will not be touched until the failure is
investigated and the tag is removed.

An external tool or person needs to investigate the state of the
instance and remove this tag when they are sure the instance is
repaired and safe to be turned back over to the normal autorepair
system.

(Alternatively we can use the suspended state (indefinitely or
temporarily) to mark the instance as "do not touch" when we think a
human needs to look at it. To be decided.)

A graph with the possible transitions follows; note that in the graph,
following the implementation, the two ``Needs repair`` states have been
coalesced into one, and the ``Suspended`` state disappears, for it
becomes an attribute of the instance object (its auto-repair policy).

.. digraph:: "auto-repair-states"

  node     [shape=circle, style=filled, fillcolor="#BEDEF1",
            width=2, fixedsize=true];
  healthy  [label="Healthy"];
  needsrep [label="Needs repair"];
  pendrep  [label="Pending repair"];
  failed   [label="Failed repair"];
  disabled [label="(no state)", width=1.25];

  {rank=same; needsrep}
  {rank=same; healthy}
  {rank=same; pendrep}
  {rank=same; failed}
  {rank=same; disabled}

  // These nodes are needed to be the "origin" of the "initial state" arrows.
  node [width=.5, label="", style=invis];
  inih;
  inin;
  inip;
  inif;
  inix;

  edge [fontsize=10, fontname="Arial Bold", fontcolor=blue]

  inih -> healthy  [label="No tags or\nresult:success"];
  inip -> pendrep  [label="Tag:\nautorepair:pending"];
  inif -> failed   [label="Tag:\nresult:failure"];
  inix -> disabled [fontcolor=black, label="ArNotEnabled"];

  edge [fontcolor="orange"];

  healthy -> healthy [label="No problems\ndetected"];

  healthy -> needsrep [
    label="Brokenness\ndetected in\nfirst half of\nthe tool run"];

  pendrep -> healthy [
    label="All jobs\ncompleted\nsuccessfully /\ninstance healthy"];

  pendrep -> failed [label="Some job(s)\nfailed"];

  edge [fontcolor="red"];

  needsrep -> pendrep [
    label="Repair\nallowed and\ninitial job(s)\nsubmitted"];

  needsrep -> needsrep [
    label="Repairs suspended\n(no-op) or enabled\nbut not powerful enough\n(result: enoperm)"];

  pendrep -> pendrep [label="More jobs\nsubmitted"];

Repair operation
----------------

Possible repairs are:

- Replace-disks (drbd, if the secondary is down), or other
  storage-specific fixes
- Migrate (shared storage, rbd, drbd, if the primary is drained)
- Failover (shared storage, rbd, drbd, if the primary is down)
- Recreate disks + reinstall (all nodes down, plain, file or drbd)

Note that more than one of these operations may need to happen before a
full repair is completed (e.g. if a drbd primary goes offline first a
failover will happen, then a replace-disks).

The self-repair tool will first take care of all needs-repair instances
that can be brought into ``pending`` state, and transition them as
described above.

Then it will go through any ``repair:pending`` instances and handle
them as described above.

Note that the repair tool MAY "group" instances by performing common
repair jobs for them (e.g. node evacuate).

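As an illustration, the choice of the first repair step could look
roughly like the sketch below; the disk template names and the returned
operation names simply mirror the list above, and the function is not
an existing Ganeti interface. The allowed repair type (see the tag
section above) would still gate whether the chosen step may actually be
submitted.

.. code-block:: python

  # Illustrative mapping from an instance's situation to the first
  # repair operation; the real tool would submit the matching jobs.
  def FirstRepairStep(disk_template, primary_offline, primary_drained,
                      secondary_offline):
    """Return the name of the first repair step, or None."""
    drbd = disk_template == "drbd"
    shared = disk_template in ("sharedfile", "rbd")
    if primary_offline:
      if drbd and secondary_offline:
        return "recreate-disks+reinstall"  # both nodes are down
      if drbd or shared:
        return "failover"    # drbd may need a replace-disks afterwards
      return "recreate-disks+reinstall"    # plain or file instance
    if primary_drained and (drbd or shared):
      return "migrate"
    if drbd and secondary_offline:
      return "replace-disks"
    return None  # nothing to repair, or no repair available
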

Staging of work
---------------

- First version: recreate-disks + reinstall (2.6.1)
- Second version: failover and migrate repairs (2.7)
- Third version: replace disks repair (2.7 or 2.8)

Future work
===========

One important piece of work will be reporting what the autorepair
system is "thinking" and exporting this in a form that can be read by
an outside user or system. In order to do this we need a better
communication system than embedding this information into tags. This
should be designed in an extensible way that can be used in general for
Ganeti to provide "advisory" information about entities it manages, and
for an external system to "advise" Ganeti over what it can do, but in a
less direct manner than submitting individual jobs.

Note that cluster verify checks for some errors that are actually
instance-specific (e.g. a missing backend disk on a drbd node) or
node-specific (e.g. an extra lvm device). If we were to split these
into "instance verify", "node verify" and "cluster verify", then we
could easily use this tool to perform some of those repairs as well.

Finally, self-repairs could also be extended to the cluster level (for
example concepts like "N+1 failures" or missing master candidates) or
to the node level for some specific types of errors.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: