====================
Instance auto-repair
====================

.. contents:: :depth: 4

This is a design document detailing the implementation of self-repair
and recreation of instances in Ganeti. It also discusses ideas that
might be useful for future self-repair situations.

Current state and shortcomings
==============================

Ganeti currently doesn't do any sort of self-repair or self-recreation
of instances:

- If a drbd instance is broken (its primary or secondary node goes
  offline or needs to be drained) an admin or an external tool must fail
  it over if necessary, and then trigger a disk replacement.
- If a plain instance is broken (or both nodes of a drbd instance are)
  an admin or an external tool must recreate its disks and reinstall it.

Moreover, in an oversubscribed cluster the operations mentioned above
might fail for lack of capacity until a node is repaired or a new one is
added. In this case an external tool would also need to go through any
"pending-recreate" or "pending-repair" instances and fix them.

Proposed changes
================

We'd like to increase the self-repair capabilities of Ganeti, at least
with regards to instances. In order to do so we plan to add mechanisms
to mark an instance as "due for being repaired", and then have the
relevant repair performed as soon as it's possible on the cluster.

The self-repair will be written as part of ganeti-watcher, or as an
extra watcher component that is called less often.

As a first version we'll only handle the case in which an instance
lives on an offline or drained node. In the future we may add more
self-repair capabilities for errors Ganeti can detect.

New attributes (or tags)
------------------------

In order to know when to perform a self-repair operation we need to know
whether it is allowed by the cluster administrator.

This can be implemented as either new attributes or tags. Tags would be
acceptable, as they would only be read and interpreted by the
self-repair tool (part of the watcher), and not by the Ganeti core
opcodes and node RPCs. The following tags would be needed:

ganeti:watcher:autorepair:<type>
++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
Allow repairs to happen on an instance that has the tag, or that lives
in a cluster or nodegroup which does. Types of repair are listed in
order of perceived risk, lower to higher, and each type includes
allowing the operations of the lower ones:

- ``fix-storage`` allows a disk replacement or another operation that
  fixes the instance backend storage without affecting the instance
  itself. This can for example recover from a broken drbd secondary, but
  risks data loss if something is wrong on the primary but the secondary
  was somehow recoverable.
- ``migrate`` allows an instance migration. This can recover from a
  drained primary, but can cause an instance crash in some cases (bugs).
- ``failover`` allows instance reboot on the secondary. This can recover
  from an offline primary, but the instance will lose its running state.
- ``reinstall`` allows disks to be recreated and an instance to be
  reinstalled. This can recover from both the primary and secondary
  being offline, or from an offline primary in the case of non-redundant
  instances. It causes data loss.

Each repair type allows all the operations of the previous types, in the
order above, in order to ensure a repair can be completed fully. As such
a repair of a lower type might not be able to proceed if it detects an
error condition that requires a more risky or drastic solution, but
never vice versa (if a worse solution is allowed then so is a better
one).

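As an illustration, the risk ordering above can be captured as a small
helper. This is a hypothetical sketch, not Ganeti's actual code; the
type names follow the tags described here:

```python
# Repair types from the tag above, ordered from lowest to highest
# perceived risk. The helper name is illustrative, not Ganeti's.
AUTO_REPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]

def repair_allowed(allowed_type, needed_type):
    """Return True if the tagged repair type permits the needed repair.

    A tag of a given type implicitly allows every lower-risk type
    before it in the list, but never a higher-risk one.
    """
    return (AUTO_REPAIR_TYPES.index(needed_type)
            <= AUTO_REPAIR_TYPES.index(allowed_type))
```

For example, a ``failover`` tag would also permit a ``migrate`` or
``fix-storage`` repair, while a ``migrate`` tag would not permit
``reinstall``.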

ganeti:watcher:autorepair:suspend[:<timestamp>]
+++++++++++++++++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
If this tag is encountered no autorepair operations will start for the
instance (or for any instance, if present at the cluster or group
level). Any job which has already started will be allowed to finish, but
then the autorepair system will not proceed further until this tag is
removed, or the timestamp passes (in which case the tag will be removed
automatically by the watcher).

Note that depending on how this tag is used there might still be race
conditions related to it for an external tool that uses it
programmatically, as no "lock tag" or tag "test-and-set" operation is
present at this time. While this is known, we won't solve these race
conditions in the first version.

It might also be useful to easily have an operation that tags all
instances matching a filter on some characteristic. But again, this
wouldn't be specific to this tag.

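The suspend check could look roughly like the following sketch (the
helper and its return shape are hypothetical, and the timestamp is
assumed to be a Unix time):

```python
import time

SUSPEND_PREFIX = "ganeti:watcher:autorepair:suspend"

def suspend_active(tags, now=None):
    """Check whether autorepair is suspended by any of the given tags.

    Returns (active, expired): active is True if a suspend tag with no
    timestamp is present, or one whose timestamp has not yet passed;
    expired lists tags the watcher should now remove automatically.
    """
    now = time.time() if now is None else now
    active, expired = False, []
    for tag in tags:
        if tag == SUSPEND_PREFIX:
            active = True  # suspended indefinitely
        elif tag.startswith(SUSPEND_PREFIX + ":"):
            deadline = float(tag.rsplit(":", 1)[1])
            if deadline > now:
                active = True
            else:
                expired.append(tag)  # timestamp passed, remove the tag
    return active, expired
```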

ganeti:watcher:autorepair:pending:<type>:<id>:<timestamp>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` is pending on the
target instance. This means that either jobs are being run, or it's
waiting for resource availability. ``id`` is the unique id identifying
this repair, ``timestamp`` is the time when this tag was first applied
to this instance for this ``id`` (we will "update" the tag by adding a
"new copy" of it and removing the old version as we run more jobs, but
the timestamp will never change for the same repair).

``jobs`` is the list of jobs already run or being run to repair the
instance. If the instance has just been put in pending state but no job
has run yet, this list is empty.

This tag will be set by Ganeti if an equivalent autorepair tag is
present and a repair is needed, or it can be set by an external tool to
request a repair as a "once off".

If multiple instances of this tag are present they will be handled in
order of timestamp.

ganeti:watcher:autorepair:result:<type>:<id>:<timestamp>:<result>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` has been performed on
the instance and has been completed by ``timestamp``. The result is
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
comma-separated list of jobs that were executed for this repair.

An ``enoperm`` result is returned when the repair was carried on as far
as possible, but the allowed repair type doesn't permit proceeding any
further.

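Both the ``pending`` and ``result`` tags share this colon-separated
layout. A sketch of how the watcher might parse ``pending`` tags and
order multiple ones by timestamp (hypothetical helper code, assuming
ids contain no colons):

```python
from collections import namedtuple

# Field names follow the tag layout in this document; the parser itself
# is an illustration, not Ganeti's implementation.
PendingRepair = namedtuple("PendingRepair", "rtype id timestamp jobs")
PENDING_PREFIX = "ganeti:watcher:autorepair:pending:"

def parse_pending(tag):
    """Parse a pending:<type>:<id>:<timestamp>:<jobs> tag."""
    body = tag[len(PENDING_PREFIX):]
    rtype, rid, ts, jobs = body.split(":", 3)
    # <jobs> is comma separated and may be empty if nothing ran yet
    joblist = [int(j) for j in jobs.split(",") if j]
    return PendingRepair(rtype, rid, float(ts), joblist)

def pending_in_order(tags):
    """Return all pending repairs, oldest timestamp first."""
    found = [parse_pending(t) for t in tags if t.startswith(PENDING_PREFIX)]
    return sorted(found, key=lambda r: r.timestamp)
```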

Possible states, and transitions
--------------------------------

At any point an instance can be in one of the following health states:

Healthy
+++++++

The instance lives only on online nodes. The autorepair system will
never touch these instances. Any ``repair:pending`` tags will be removed
and marked ``success`` with no jobs attached to them.

This state can transition to:

- Needs-repair, repair disallowed (node offlined or drained, no
  autorepair tag)
- Needs-repair, repair allowed always (node offlined or drained,
  autorepair tag present)
- Suspended (a suspend tag is added)

Suspended
+++++++++

Whenever a ``repair:suspend`` tag is added the autorepair code won't
touch the instance until the timestamp on the tag, if present, has
passed. The tag will be removed afterwards (and the instance will
transition to its correct state, depending on its health and other
tags).

Note that when an instance is suspended any pending repair is
interrupted, but jobs which were submitted before the suspension are
allowed to finish.

Needs-repair, repair disallowed
+++++++++++++++++++++++++++++++

The instance lives on an offline or drained node, but no autorepair tag
is set, or the autorepair tag set is of a type not powerful enough to
finish the repair. The autorepair system will never touch these
instances, and they can transition to:

- Healthy (manual repair)
- Pending repair (a ``repair:pending`` tag is added)
- Needs-repair, repair allowed always (an autorepair always tag is added)
- Suspended (a suspend tag is added)

Needs-repair, repair allowed always
+++++++++++++++++++++++++++++++++++

A ``repair:pending`` tag is added, and the instance transitions to the
Pending repair state. The autorepair tag is preserved.

Of course if a ``repair:suspend`` tag is found no pending tag will be
added, and the instance will instead transition to the Suspended state.

Pending repair
++++++++++++++

When an instance is in this state the following will happen:

If a ``repair:suspend`` tag is found the instance won't be touched, and
it will be moved to the Suspended state. Any jobs which were already
running will be left untouched.

If there are still jobs running related to the instance and scheduled by
this repair they will be given more time to run, and the instance will
be checked again later. The state transitions to itself.

If no jobs are running and the instance is detected to be healthy, the
``repair:result`` tag will be added, and the current active
``repair:pending`` tag will be removed. It will then transition to the
Healthy state if there are no further ``repair:pending`` tags, or to the
Pending repair state otherwise: there, the instance being healthy, those
tags will be resolved without any operation as well (note that this is
the same as transitioning to the Healthy state, where ``repair:pending``
tags would also be resolved).

If no jobs are running and the instance still has issues:

- if the last job(s) failed it can either be retried a few times, if
  deemed to be safe, or the repair can transition to the Failed state.
  The ``repair:result`` tag will be added, and the active
  ``repair:pending`` tag will be removed (further ``repair:pending``
  tags will not be able to proceed, as explained by the Failed state,
  until the failure state is cleared)
- if the last job(s) succeeded but there are not enough resources to
  proceed, the state will transition to itself and no jobs are
  scheduled. The tag is left untouched (and later checked again). This
  basically just delays any repairs: the current ``pending`` tag stays
  active, and any others are untouched.
- if the last job(s) succeeded but the repair type doesn't allow
  proceeding any further, the ``repair:result`` tag is added with an
  ``enoperm`` result, and the current ``repair:pending`` tag is removed.
  The instance is now back to "Needs-repair, repair disallowed",
  "Needs-repair, repair allowed always", or "Pending repair" if there is
  already a future tag that can repair the instance.
- if the last job(s) succeeded and the repair can continue, new job(s)
  can be submitted, and the ``repair:pending`` tag can be updated.

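The branches above can be summarized as a pure decision function. This
is only a sketch: in the real watcher each branch would also add or
remove the relevant ``repair:result``/``repair:pending`` tags and
submit jobs as side effects, and retrying failed jobs is elided.

```python
def next_state(suspended, jobs_running, healthy, last_jobs_failed,
               resources_available, type_allows_more):
    """Decide what happens to an instance in the Pending repair state."""
    if suspended:
        return "Suspended"       # running jobs are left to finish
    if jobs_running:
        return "Pending"         # give the jobs more time, recheck later
    if healthy:
        return "Healthy"         # result: success, pending tag removed
    if last_jobs_failed:
        return "Failed"          # result: failure
    if not resources_available:
        return "Pending"         # delay, pending tag left untouched
    if not type_allows_more:
        return "Needs-repair"    # result: enoperm, pending tag removed
    return "Pending"             # submit new jobs, update pending tag
```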

Failed
++++++

If repairing an instance has failed a ``repair:result:failure`` tag is
added. The presence of this tag is used to detect that an instance is in
this state, and it will not be touched until the failure is investigated
and the tag is removed.

An external tool or person needs to investigate the state of the
instance, and remove this tag once they are sure the instance is
repaired and safe to be handed back to the normal autorepair system.

(Alternatively we could use the suspended state (indefinitely or
temporarily) to mark the instance as "do not touch" when we think a
human needs to look at it. To be decided.)

Repair operation
----------------

Possible repairs are:

- Replace-disks (drbd, if the secondary is down), or other storage
  specific fixes
- Migrate (shared storage, rbd, drbd, if the primary is drained)
- Failover (shared storage, rbd, drbd, if the primary is down)
- Recreate disks + reinstall (all nodes down; plain, file or drbd)

Note that more than one of these operations may need to happen before a
full repair is completed (eg. if a drbd primary goes offline first a
failover will happen, then a replace-disks).

The self-repair tool will first take care of all needs-repair instances
that can be brought into ``pending`` state, and transition them as
described above.

Then it will go through any ``repair:pending`` instances and handle them
as described above.

Note that the repair tool MAY "group" instances by performing common
repair jobs for them (eg: node evacuate).

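As a sketch, the choice of the least risky repair for a broken instance
could be expressed as follows. This is illustrative only: the template
names, node-state strings and the ``REDUNDANT`` set are assumptions
about how the watcher would see things, not Ganeti constants.

```python
# Disk templates assumed here to support restarting on another node.
REDUNDANT = frozenset(["drbd", "rbd", "sharedfile"])

def needed_repair(template, primary, secondary="online"):
    """Return the minimal repair type needed, or None if healthy."""
    if primary == "offline":
        if template == "drbd" and secondary == "offline":
            return "reinstall"    # both nodes down: recreate + reinstall
        if template in REDUNDANT:
            return "failover"     # reboot on the secondary/shared storage
        return "reinstall"        # non-redundant instance, data is lost
    if primary == "drained" and template in REDUNDANT:
        return "migrate"          # live migration away from the node
    if template == "drbd" and secondary in ("offline", "drained"):
        return "fix-storage"      # e.g. replace-disks
    return None                   # nothing to repair
```

This also shows why several operations may chain: after the failover
above, the drbd instance still needs a ``fix-storage``-level
replace-disks against its (now former) primary.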

Staging of work
---------------

- First version: recreate-disks + reinstall (2.6.1)
- Second version: failover and migrate repairs (2.7)
- Third version: replace disks repair (2.7 or 2.8)

Future work
===========

One important piece of work will be reporting what the autorepair system
is "thinking", and exporting this in a form that can be read by an
outside user or system. In order to do this we need a better
communication system than embedding this information into tags. This
should be designed in an extensible way that can be used in general for
Ganeti to provide "advisory" information about entities it manages, and
for an external system to "advise" Ganeti over what it can do, but in a
less direct manner than submitting individual jobs.

Note that cluster verify checks some errors that are actually instance
specific (eg. a missing backend disk on a drbd node) or node-specific
(eg. an extra lvm device). If we were to split these into "instance
verify", "node verify" and "cluster verify", then we could easily use
this tool to perform some of those repairs as well.

Finally, self-repairs could also be extended to the cluster level, for
example for concepts like "N+1 failures", missing master candidates,
etc., or to the node level for some specific types of errors.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: