====================
Instance auto-repair
====================

.. contents:: :depth: 4

This is a design document detailing the implementation of self-repair and
recreation of instances in Ganeti. It also discusses ideas that might be useful
for future self-repair situations.

Current state and shortcomings
==============================

Ganeti currently doesn't do any sort of self-repair or self-recreate of
instances:

- If a drbd instance is broken (its primary or secondary node goes
  offline or needs to be drained) an admin or an external tool must fail
  it over if necessary, and then trigger a disk replacement.
- If a plain instance is broken (or both nodes of a drbd instance are)
  an admin or an external tool must recreate its disks and reinstall it.

Moreover, in an oversubscribed cluster the operations mentioned above
might fail for lack of capacity until a node is repaired or a new one is
added. In this case an external tool would also need to go through any
"pending-recreate" or "pending-repair" instances and fix them.

Proposed changes
================

We'd like to increase the self-repair capabilities of Ganeti, at least
with regard to instances. In order to do so we plan to add mechanisms
to mark an instance as "due for being repaired" and then have the
relevant repair performed as soon as it is possible on the cluster.

The self-repair will be written as part of ganeti-watcher or as an extra
watcher component that is called less often.

In the first version we'll only handle the case in which an instance
lives on an offline or drained node. In the future we may add more
self-repair capabilities for errors Ganeti can detect.

New attributes (or tags)
------------------------

In order to know when to perform a self-repair operation we need to know
whether it is allowed by the cluster administrator.

This can be implemented as either new attributes or tags. Tags could be
acceptable as they would only be read and interpreted by the self-repair tool
(part of the watcher), and not by the Ganeti core opcodes and node RPCs. The
following tags would be needed:

ganeti:watcher:autorepair:<type>
++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
Allow repairs to happen on an instance that has the tag, or that lives
in a cluster or nodegroup which does. Types of repair are in order of
perceived risk, lower to higher, and each type includes allowing the
operations in the lower ones:

- ``fix-storage`` allows a disk replacement or another operation that
  fixes the instance backend storage without affecting the instance
  itself. This can for example recover from a broken drbd secondary, but
  risks data loss if something is wrong on the primary but the secondary
  was somehow recoverable.
- ``migrate`` allows an instance migration. This can recover from a
  drained primary, but can cause an instance crash in some cases (bugs).
- ``failover`` allows instance reboot on the secondary. This can recover
  from an offline primary, but the instance will lose its running state.
- ``reinstall`` allows disks to be recreated and an instance to be
  reinstalled. This can recover from primary and secondary both being
  offline, or from an offline primary in the case of non-redundant
  instances. It causes data loss.

Each repair type allows all the operations in the previous types, in the
order above, in order to ensure a repair can be completed fully. As such
a repair of a lower type might not be able to proceed if it detects an
error condition that requires a more risky or drastic solution, but
never vice versa (if a worse solution is allowed then so is a better
one).

If there are multiple ``ganeti:watcher:autorepair:<type>`` tags in an
object (cluster, node group or instance), the least destructive tag
takes precedence. When multiplicity happens across objects, the nearest
tag wins. For example, if in a cluster with two instances, *I1* and
*I2*, *I1* has ``failover``, and the cluster itself has both
``fix-storage`` and ``reinstall``, *I1* will end up with ``failover``
and *I2* with ``fix-storage``.
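
The precedence rules above can be illustrated with a short Python
sketch. It is not Ganeti code: the function name
``effective_repair_type`` and the representation of tags as plain lists
of strings are assumptions made for this example, and suspend tags are
deliberately ignored.

.. code-block:: python

  # Hypothetical helper, not part of Ganeti.
  PREFIX = "ganeti:watcher:autorepair:"
  # Repair types in order of perceived risk, lowest first.
  REPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]


  def effective_repair_type(instance_tags, group_tags, cluster_tags):
      """Return the repair type in effect for an instance, or None."""
      # Nearest object first: instance, then node group, then cluster.
      for tags in (instance_tags, group_tags, cluster_tags):
          found = [t[len(PREFIX):] for t in tags if t.startswith(PREFIX)]
          types = [t for t in found if t in REPAIR_TYPES]
          if types:
              # Within one object the least destructive type wins.
              return min(types, key=REPAIR_TYPES.index)
      return None


  # The example from the text: I1 is tagged, I2 inherits from the cluster.
  cluster = [PREFIX + "fix-storage", PREFIX + "reinstall"]
  i1 = effective_repair_type([PREFIX + "failover"], [], cluster)
  i2 = effective_repair_type([], [], cluster)

With these inputs ``i1`` is ``failover`` and ``i2`` is ``fix-storage``,
matching the example in the paragraph above.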

ganeti:watcher:autorepair:suspend[:<timestamp>]
+++++++++++++++++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
If this tag is encountered no autorepair operations will start for the
instance (or for any instance, if present at the cluster or group
level). Any job which already started will be allowed to finish, but
then the autorepair system will not proceed further until this tag is
removed, or the timestamp passes (in which case the tag will be removed
automatically by the watcher).

Note that depending on how this tag is used there might still be race
conditions related to it for an external tool that uses it
programmatically, as no "lock tag" or tag "test-and-set" operation is
present at this time. While this is known we won't solve these race
conditions in the first version.

It might also be useful to have an operation that easily tags all
instances matching a filter on some characteristic. But again, this
wouldn't be specific to this tag.

If there are multiple
``ganeti:watcher:autorepair:suspend[:<timestamp>]`` tags in an object,
the form without timestamp takes precedence (permanent suspension); or,
if all object tags have a timestamp, the one with the highest timestamp.
When multiplicity happens across objects, the nearest tag wins, as
above. This makes it possible to suspend cluster-enabled repairs with a
single tag in the cluster object; or to suspend them only for a certain
node group or instance. At the same time, it is possible to re-enable
cluster-suspended repairs in a particular instance or group by applying
an enable tag to them.
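
A minimal sketch of how these suspension rules could be evaluated
follows. Several points are assumptions rather than part of the design:
the helper names, the timestamp being an integer number of seconds,
expired tags being treated as already removed, and a suspend tag
winning over an enable tag set on the same object (the text does not
cover that case).

.. code-block:: python

  import math

  # Hypothetical helpers, not part of Ganeti.
  PREFIX = "ganeti:watcher:autorepair:"
  SUSPEND = PREFIX + "suspend"
  REPAIR_TYPES = ("fix-storage", "migrate", "failover", "reinstall")


  def suspended_until(tags, now):
      """Return when the suspension set on one object ends.

      None means no active suspend tag, math.inf a permanent one.
      """
      expiry = None
      for tag in tags:
          if tag == SUSPEND:
              return math.inf  # the form without timestamp always wins
          if tag.startswith(SUSPEND + ":"):
              stamp = int(tag[len(SUSPEND) + 1:])  # assumed: integer seconds
              if stamp > now:  # expired tags count as already removed
                  expiry = stamp if expiry is None else max(expiry, stamp)
      return expiry


  def is_suspended(now, instance_tags, group_tags, cluster_tags):
      """Apply the "nearest tag wins" rule across the three objects."""
      for tags in (instance_tags, group_tags, cluster_tags):
          if suspended_until(tags, now) is not None:
              return True
          if any(t == PREFIX + r for t in tags for r in REPAIR_TYPES):
              return False  # a nearer enable tag re-enables repairs
      return False

For instance, an instance carrying a ``failover`` enable tag is not
suspended by a bare suspend tag on the cluster, while an untagged
instance in the same cluster is.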

ganeti:watcher:autorepair:pending:<type>:<id>:<timestamp>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` is pending on the
target instance. This means that either jobs are being run, or it's
waiting for resource availability. ``id`` is the unique id identifying
this repair, ``timestamp`` is the time when this tag was first applied
to this instance for this ``id`` (we will "update" the tag by adding a
"new copy" of it and removing the old version as we run more jobs, but
the timestamp will never change for the same repair).

``jobs`` is the list of jobs already run or being run to repair the
instance (separated by a plus sign, *+*). If the instance has just
been put in pending state but no job has run yet, this list is empty.

This tag will be set by Ganeti if an equivalent autorepair tag is
present and a repair is needed, or can be set by an external tool to
request a repair as a "once off".

If multiple instances of this tag are present they will be handled in
order of timestamp.
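
As an illustration of the tag layout, the following sketch parses a
pending tag, produces the "new copy" written when another job is
submitted, and orders several tags by timestamp. The helper names are
hypothetical, and the sketch assumes that no field contains a colon and
that the timestamp is an integer; the design does not fix either.

.. code-block:: python

  from collections import namedtuple

  # Hypothetical helpers, not part of Ganeti.
  PENDING = "ganeti:watcher:autorepair:pending:"
  PendingRepair = namedtuple("PendingRepair", "rtype rid timestamp jobs")


  def parse_pending(tag):
      """Split a pending tag into fields; assumes no field contains ':'."""
      if not tag.startswith(PENDING):
          raise ValueError("not a pending tag: %r" % tag)
      rtype, rid, timestamp, jobs = tag[len(PENDING):].split(":")
      return PendingRepair(rtype, rid, timestamp,
                           jobs.split("+") if jobs else [])


  def format_pending(repair):
      """Inverse of parse_pending()."""
      fields = [repair.rtype, repair.rid, repair.timestamp,
                "+".join(repair.jobs)]
      return PENDING + ":".join(fields)


  def add_job(tag, job_id):
      """Return the "new copy" of the tag; id and timestamp are unchanged."""
      repair = parse_pending(tag)
      jobs = repair.jobs + [str(job_id)]
      return format_pending(repair._replace(jobs=jobs))


  def in_handling_order(tags):
      """Sort pending tags by timestamp (assumed to be an integer)."""
      return sorted(tags, key=lambda tag: int(parse_pending(tag).timestamp))


  first = PENDING + "failover:abc123:1357000000:"  # no job has run yet
  second = add_job(first, 4711)
  third = add_job(second, 4712)

Here ``third`` ends in ``:1357000000:4711+4712``: the job list grows
while the id and the timestamp of the repair stay the same.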

ganeti:watcher:autorepair:result:<type>:<id>:<timestamp>:<result>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` has been performed on
the instance and has been completed by ``timestamp``. The result is
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
*+*-separated list of jobs that were executed for this repair.

An ``enoperm`` result is returned when the repair was carried out as far
as possible, but the repair type does not permit proceeding further.
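
For illustration only, a result tag could be assembled as in the sketch
below. The function name ``result_tag`` and its argument types are
assumptions; only the field order and the three result values come from
the description above.

.. code-block:: python

  # Hypothetical helper, not part of Ganeti.
  PREFIX = "ganeti:watcher:autorepair:"
  RESULTS = ("success", "failure", "enoperm")


  def result_tag(rtype, rid, timestamp, result, jobs):
      """Build a result tag; ``timestamp`` is the completion time."""
      if result not in RESULTS:
          raise ValueError("unknown result: %r" % result)
      fields = [rtype, rid, str(timestamp), result,
                "+".join(str(job) for job in jobs)]
      return PREFIX + "result:" + ":".join(fields)


  done = result_tag("failover", "abc123", 1357000600, "success", [4711, 4712])
  # A pending tag resolved on an already healthy instance has no jobs.
  noop = result_tag("failover", "abc124", 1357000700, "success", [])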

Possible states, and transitions
--------------------------------

At any point an instance can be in one of the following health states:

Healthy
+++++++

The instance lives only on online nodes. The autorepair system will
never touch these instances. Any ``repair:pending`` tags will be removed
and marked ``success`` with no jobs attached to them.

This state can transition to:

- Needs-repair, repair disallowed (node offlined or drained, no
  autorepair tag)
- Needs-repair, autorepair allowed (node offlined or drained, autorepair
  tag present)
- Suspended (a suspend tag is added)

Suspended
+++++++++

Whenever a ``repair:suspend`` tag is added the autorepair code won't
touch the instance until the timestamp on the tag has passed, if
present. The tag will be removed afterwards (and the instance will
transition to its correct state, depending on its health and other
tags).

Note that when an instance is suspended any pending repair is
interrupted, but jobs which were submitted before the suspension are
allowed to finish.

Needs-repair, repair disallowed
+++++++++++++++++++++++++++++++

The instance lives on an offline or drained node, but no autorepair tag
is set, or the autorepair tag set is of a type not powerful enough to
finish the repair. The autorepair system will never touch these
instances, and they can transition to:

- Healthy (manual repair)
- Pending repair (a ``repair:pending`` tag is added)
- Needs-repair, repair allowed always (an autorepair always tag is added)
- Suspended (a suspend tag is added)

Needs-repair, repair allowed always
+++++++++++++++++++++++++++++++++++

A ``repair:pending`` tag is added, and the instance transitions to the
Pending Repair state. The autorepair tag is preserved.

Of course if a ``repair:suspend`` tag is found no pending tag will be
added, and the instance will instead transition to the Suspended state.

Pending repair
++++++++++++++

When an instance is in this stage the following will happen:

If a ``repair:suspend`` tag is found the instance won't be touched and
will be moved to the Suspended state. Any jobs which were already
running will be left untouched.

If there are still jobs running related to the instance and scheduled by
this repair they will be given more time to run, and the instance will
be checked again later. The state transitions to itself.

If no jobs are running and the instance is detected to be healthy, the
``repair:result`` tag will be added, and the current active
``repair:pending`` tag will be removed. It will then transition to the
Healthy state if there are no further ``repair:pending`` tags, or to the
Pending state otherwise: there, the instance being healthy, those tags
will be resolved without any operation as well (note that this is the
same as transitioning to the Healthy state, where ``repair:pending``
tags would also be resolved).

If no jobs are running and the instance still has issues:

- if the last job(s) failed it can either be retried a few times, if
  deemed to be safe, or the repair can transition to the Failed state.
  The ``repair:result`` tag will be added, and the active
  ``repair:pending`` tag will be removed (further ``repair:pending``
  tags will not be able to proceed, as explained by the Failed state,
  until the failure state is cleared)
- if the last job(s) succeeded but there are not enough resources to
  proceed, the state will transition to itself and no jobs are
  scheduled. The tag is left untouched (and later checked again). This
  basically just delays any repairs: the current ``pending`` tag stays
  active, and any others are untouched.
- if the last job(s) succeeded but the repair type does not allow
  proceeding any further, the ``repair:result`` tag is added with an
  ``enoperm`` result, and the current ``repair:pending`` tag is removed.
  The instance is now back to "Needs-repair, repair disallowed",
  "Needs-repair, autorepair allowed", or "Pending" if there is already a
  future tag that can repair the instance.
- if the last job(s) succeeded and the repair can continue, new job(s)
  can be submitted, and the ``repair:pending`` tag can be updated.
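
The checks described for the Pending repair state can be read as a
decision procedure, sketched below. The function, its boolean
parameters and the returned labels are all hypothetical, and the
optional retry of failed jobs is left out for brevity.

.. code-block:: python

  # Hypothetical decision function, not part of Ganeti.
  def next_pending_action(suspended, jobs_running, healthy,
                          last_jobs_failed, has_resources, type_sufficient):
      """Return what to do with an instance in the Pending repair state."""
      if suspended:
          return "suspend"          # running jobs are left untouched
      if jobs_running:
          return "wait"             # check the instance again later
      if healthy:
          return "result:success"   # pending tag replaced by a result tag
      if last_jobs_failed:
          return "result:failure"   # instance moves to the Failed state
      if not has_resources:
          return "wait"             # pending tag is left untouched
      if not type_sufficient:
          return "result:enoperm"   # a riskier repair type would be needed
      return "submit-jobs"          # pending tag is updated with new jobs

The order of the checks follows the text: suspension is looked at first,
then running jobs, and only then the health of the instance.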

Failed
++++++

If repairing an instance has failed a ``repair:result:failure`` tag is
added. The presence of this tag is used to detect that an instance is in
this state, and it will not be touched until the failure is investigated
and the tag is removed.

An external tool or person needs to investigate the state of the
instance and remove this tag once sure the instance is repaired
and safe to turn back to the normal autorepair system.

(Alternatively we can use the suspended state (indefinitely or
temporarily) to mark the instance as "do not touch" when we think a human
needs to look at it. To be decided).

A graph with the possible transitions follows; note that in the graph,
following the implementation, the two ``Needs repair`` states have been
coalesced into one; and the ``Suspended`` state disappears, for it
becomes an attribute of the instance object (its auto-repair policy).

.. digraph:: "auto-repair-states"

  node     [shape=circle, style=filled, fillcolor="#BEDEF1",
            width=2, fixedsize=true];
  healthy  [label="Healthy"];
  needsrep [label="Needs repair"];
  pendrep  [label="Pending repair"];
  failed   [label="Failed repair"];
  disabled [label="(no state)", width=1.25];

  {rank=same; needsrep}
  {rank=same; healthy}
  {rank=same; pendrep}
  {rank=same; failed}
  {rank=same; disabled}

  // These nodes are needed to be the "origin" of the "initial state" arrows.
  node [width=.5, label="", style=invis];
  inih;
  inin;
  inip;
  inif;
  inix;

  edge [fontsize=10, fontname="Arial Bold", fontcolor=blue]

  inih -> healthy  [label="No tags or\nresult:success"];
  inip -> pendrep  [label="Tag:\nautorepair:pending"];
  inif -> failed   [label="Tag:\nresult:failure"];
  inix -> disabled [fontcolor=black, label="ArNotEnabled"];

  edge [fontcolor="orange"];

  healthy -> healthy [label="No problems\ndetected"];

  healthy -> needsrep [
             label="Brokenness\ndetected in\nfirst half of\nthe tool run"];

  pendrep -> healthy [
             label="All jobs\ncompleted\nsuccessfully /\ninstance healthy"];

  pendrep -> failed [label="Some job(s)\nfailed"];

  edge [fontcolor="red"];

  needsrep -> pendrep [
              label="Repair\nallowed and\ninitial job(s)\nsubmitted"];

  needsrep -> needsrep [
              label="Repairs suspended\n(no-op) or enabled\nbut not powerful enough\n(result: enoperm)"];

  pendrep -> pendrep [label="More jobs\nsubmitted"];
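
The "initial state" arrows and the edges of the graph can be written
down as a small table, as in the following sketch. The state labels,
the ``enabled`` flag standing in for ``ArNotEnabled``, and the function
name are illustrative assumptions. A failure result is checked before
pending tags because, as described for the Failed state, pending tags
cannot proceed until the failure is cleared.

.. code-block:: python

  # Hypothetical sketch, not part of Ganeti.
  PREFIX = "ganeti:watcher:autorepair:"

  # Edges of the graph above, keyed by source state.
  TRANSITIONS = {
      "healthy": {"healthy", "needs-repair"},
      "needs-repair": {"needs-repair", "pending-repair"},
      "pending-repair": {"pending-repair", "healthy", "failed"},
      "failed": set(),  # left only when the failure tag is removed by hand
  }


  def initial_state(tags, enabled=True):
      """Classify an instance from its tags at the start of a tool run."""
      if not enabled:
          return None  # the "(no state)" node
      results = [t.split(":") for t in tags
                 if t.startswith(PREFIX + "result:")]
      # Field 7 of a result tag split on ':' is the <result> value.
      if any(len(f) > 7 and f[7] == "failure" for f in results):
          return "failed"
      if any(t.startswith(PREFIX + "pending:") for t in tags):
          return "pending-repair"
      return "healthy"  # no tags, or only result:success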

Repair operation
----------------

Possible repairs are:

- Replace-disks (drbd, if the secondary is down), (or other storage
  specific fixes)
- Migrate (shared storage, rbd, drbd, if the primary is drained)
- Failover (shared storage, rbd, drbd, if the primary is down)
- Recreate disks + reinstall (all nodes down, plain, files or drbd)

Note that more than one of these operations may need to happen before a
full repair is completed (e.g. if a drbd primary goes offline, first a
failover will happen, then a replace-disks).
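
To show how a multi-step repair interacts with the allowed repair type,
the sketch below checks a planned sequence of operations against a
type. The operation labels, the mapping from operation to the type that
permits it (read off the tag descriptions earlier in this document) and
the function name are assumptions made for the example.

.. code-block:: python

  # Hypothetical sketch, not part of Ganeti.
  # Repair types in order of perceived risk, lowest first.
  REPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]
  # Lowest repair type that permits each operation.
  REQUIRED_TYPE = {
      "replace-disks": "fix-storage",
      "migrate": "migrate",
      "failover": "failover",
      "recreate-disks+reinstall": "reinstall",
  }


  def allowed_steps(operations, allowed_type):
      """Return (permitted operations, enoperm flag) for a planned repair."""
      limit = REPAIR_TYPES.index(allowed_type)
      permitted = []
      for op in operations:
          if REPAIR_TYPES.index(REQUIRED_TYPE[op]) > limit:
              return permitted, True  # the repair stops with "enoperm"
          permitted.append(op)
      return permitted, False


  # The example from the text: a drbd instance whose primary went offline.
  plan = ["failover", "replace-disks"]
  full = allowed_steps(plan, "failover")
  blocked = allowed_steps(plan, "fix-storage")

With the ``failover`` type both steps are permitted, whereas with
``fix-storage`` the repair cannot even start and would end with an
``enoperm`` result.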

The self-repair tool will first take care of all needs-repair instances
that can be brought into ``pending`` state, and transition them as
described above.

Then it will go through any ``repair:pending`` instances and handle them
as described above.

Note that the repair tool MAY "group" instances by performing common
repair jobs for them (e.g. node evacuate).

Staging of work
---------------

- First version: recreate-disks + reinstall (2.6.1)
- Second version: failover and migrate repairs (2.7)
- Third version: replace disks repair (2.7 or 2.8)

Future work
===========

One important piece of work will be reporting what the autorepair system
is "thinking" and exporting this in a form that can be read by an
outside user or system. In order to do this we need a better
communication system than embedding this information into tags. This
should be designed in an extensible way that can be used in general for
Ganeti to provide "advisory" information about entities it manages, and
for an external system to "advise" Ganeti over what it can do, but in a
less direct manner than submitting individual jobs.

Note that cluster verify checks some errors that are actually
instance-specific (e.g. a missing backend disk on a drbd node) or
node-specific (e.g. an extra lvm device). If we were to split these into
"instance verify", "node verify" and "cluster verify", then we could
easily use this tool to perform some of those repairs as well.

Finally self-repairs could also be extended to the cluster level, for
example concepts like "N+1 failures", missing master candidates, etc. or
node level for some specific types of errors.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: