.. doc/design-autorepair.rst @ d3b06210 (annotated: commit 68640987, Guido Trotter)

====================
Instance auto-repair
====================

.. contents:: :depth: 4

This is a design document detailing the implementation of self-repair and
recreation of instances in Ganeti. It also discusses ideas that might be
useful for further self-repair situations in the future.

Current state and shortcomings
==============================

Ganeti currently doesn't do any sort of self-repair or self-recreation of
instances:

- If a drbd instance is broken (its primary or secondary node goes
  offline or needs to be drained), an admin or an external tool must fail
  it over if necessary, and then trigger a disk replacement.
- If a plain instance is broken (or both nodes of a drbd instance are),
  an admin or an external tool must recreate its disks and reinstall it.

Moreover, in an oversubscribed cluster the operations mentioned above
might fail for lack of capacity until a node is repaired or a new one is
added. In this case an external tool would also need to go through any
"pending-recreate" or "pending-repair" instances and fix them.

Proposed changes
================

We'd like to increase the self-repair capabilities of Ganeti, at least
with regard to instances. In order to do so we plan to add mechanisms
to mark an instance as "due for being repaired" and then have the
relevant repair performed as soon as the cluster allows it.

The self-repair will be written as part of ganeti-watcher or as an extra
watcher component that is called less often.

In the first version we'll only handle the case in which an instance
lives on an offline or drained node. In the future we may add more
self-repair capabilities for errors Ganeti can detect.

New attributes (or tags)
------------------------

In order to know when to perform a self-repair operation we need to know
whether it is allowed by the cluster administrator.

This can be implemented as either new attributes or tags. Tags could be
acceptable as they would only be read and interpreted by the self-repair
tool (part of the watcher), and not by the Ganeti core opcodes and node
RPCs. The following tags would be needed:

ganeti:watcher:autorepair:<type>
++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
Allow repairs to happen on an instance that has the tag, or that lives
in a cluster or nodegroup which does. Types of repair are listed in
order of perceived risk, lower to higher, and each type also allows the
operations of the lower ones:

- ``fix-storage`` allows a disk replacement or another operation that
  fixes the instance backend storage without affecting the instance
  itself. This can for example recover from a broken drbd secondary, but
  risks data loss if something is wrong on the primary but the secondary
  was somehow recoverable.
- ``migrate`` allows an instance migration. This can recover from a
  drained primary, but can cause an instance crash in some cases (bugs).
- ``failover`` allows instance reboot on the secondary. This can recover
  from an offline primary, but the instance will lose its running state.
- ``reinstall`` allows disks to be recreated and an instance to be
  reinstalled. This can recover from the primary and secondary both
  being offline, or from an offline primary in the case of non-redundant
  instances. It causes data loss.

Each repair type allows all the operations in the previous types, in the
order above, in order to ensure a repair can be completed fully. As such
a repair of a lower type might not be able to proceed if it detects an
error condition that requires a more risky or drastic solution, but
never vice versa (if a worse solution is allowed then so is a better
one).
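
The risk ordering described above can be sketched in Python. This is only
an illustration under stated assumptions, not the watcher's code: the
names ``RepairType``, ``allowed_type`` and ``permits`` are hypothetical,
the caller is assumed to have already merged the instance, nodegroup and
cluster tags into one list, and the riskiest type found is assumed to win.

```python
from enum import IntEnum
from typing import Iterable, Optional

AUTOREPAIR_PREFIX = "ganeti:watcher:autorepair:"


class RepairType(IntEnum):
    """Repair types ordered by perceived risk, lowest first."""
    FIX_STORAGE = 1
    MIGRATE = 2
    FAILOVER = 3
    REINSTALL = 4


_NAMES = {
    "fix-storage": RepairType.FIX_STORAGE,
    "migrate": RepairType.MIGRATE,
    "failover": RepairType.FAILOVER,
    "reinstall": RepairType.REINSTALL,
}


def allowed_type(tags: Iterable[str]) -> Optional[RepairType]:
    """Return the riskiest repair type granted by the tags, or None.

    Assumption: `tags` is the union of the instance, nodegroup and
    cluster tags that apply to one instance. Tags that are not one of
    the four known types (for example a suspend tag) are ignored.
    """
    granted = [
        _NAMES[tag[len(AUTOREPAIR_PREFIX):]]
        for tag in tags
        if tag.startswith(AUTOREPAIR_PREFIX)
        and tag[len(AUTOREPAIR_PREFIX):] in _NAMES
    ]
    return max(granted, default=None)


def permits(tags: Iterable[str], needed: RepairType) -> bool:
    """True if the tags allow an operation of type `needed`."""
    granted = allowed_type(tags)
    return granted is not None and needed <= granted
```

With a ``failover`` tag present, ``permits`` accepts ``FIX_STORAGE``,
``MIGRATE`` and ``FAILOVER`` but rejects ``REINSTALL``, which matches the
rule that a riskier type includes the safer ones but never the reverse.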

ganeti:watcher:autorepair:suspend[:<timestamp>]
+++++++++++++++++++++++++++++++++++++++++++++++

(instance/nodegroup/cluster)
If this tag is encountered no autorepair operations will start for the
instance (or for any instance, if present at the cluster or group
level). Any job which has already started will be allowed to finish, but
then the autorepair system will not proceed further until this tag is
removed, or the timestamp passes (in which case the tag will be removed
automatically by the watcher).
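
A minimal sketch of how the optional timestamp could be evaluated is
shown here. The timestamp encoding is not specified in this document, so
the sketch assumes Unix epoch seconds; ``suspend_expiry`` and
``split_suspend_tags`` are hypothetical helper names.

```python
import time
from typing import Iterable, List, Optional, Tuple

SUSPEND_TAG = "ganeti:watcher:autorepair:suspend"


def suspend_expiry(tag: str) -> Optional[float]:
    """Return the expiry of a suspend tag, or None if it never expires.

    Assumption: the timestamp is expressed in Unix epoch seconds.
    Raises ValueError if `tag` is not a well-formed suspend tag.
    """
    if tag == SUSPEND_TAG:
        return None  # no timestamp: suspended until manually removed
    prefix = SUSPEND_TAG + ":"
    if not tag.startswith(prefix):
        raise ValueError("not a suspend tag: %r" % tag)
    return float(tag[len(prefix):])


def split_suspend_tags(
    tags: Iterable[str], now: Optional[float] = None
) -> Tuple[List[str], List[str]]:
    """Split the suspend tags found in `tags` into (active, expired)."""
    now = time.time() if now is None else now
    active, expired = [], []
    for tag in tags:
        if tag != SUSPEND_TAG and not tag.startswith(SUSPEND_TAG + ":"):
            continue  # some other tag, not our concern here
        expiry = suspend_expiry(tag)
        if expiry is not None and expiry <= now:
            expired.append(tag)   # the watcher would remove these
        else:
            active.append(tag)    # still blocks new autorepair jobs
    return active, expired
```

An instance (or group, or cluster) counts as suspended while ``active``
is non-empty; the tags in ``expired`` are the ones the watcher would
remove automatically.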

Note that depending on how this tag is used there might still be race
conditions related to it for an external tool that uses it
programmatically, as no "lock tag" or tag "test-and-set" operation is
present at this time. While this is known, we won't solve these race
conditions in the first version.

It might also be useful to have an operation that easily tags all
instances matching a filter on some characteristic. But again, this
wouldn't be specific to this tag.

ganeti:watcher:repair:pending:<type>:<id>:<timestamp>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` is pending on the
target instance. This means that either jobs are being run, or it's
waiting for resource availability. ``id`` is the unique id identifying
this repair, ``timestamp`` is the time when this tag was first applied
to this instance for this ``id`` (we will "update" the tag by adding a
"new copy" of it and removing the old version as we run more jobs, but
the timestamp will never change for the same repair).

``jobs`` is the list of jobs already run or being run to repair the
instance. If the instance has just been put in pending state but no job
has run yet, this list is empty.

This tag will be set by Ganeti if an equivalent autorepair tag is
present and a repair is needed, or can be set by an external tool to
request a repair as a one-off.

If multiple instances of this tag are present they will be handled in
order of timestamp.
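
The tag layout and the "new copy" update can be illustrated with a short
Python sketch. Several details are assumptions made for the example: the
``id`` contains no colon, the timestamp is an integer number of epoch
seconds, and ``jobs`` is a comma-separated list of integer job ids. The
names ``PendingRepair``, ``parse_pending``, ``with_job`` and
``pending_in_order`` are hypothetical.

```python
from typing import Iterable, List, NamedTuple

PENDING_PREFIX = "ganeti:watcher:repair:pending:"


class PendingRepair(NamedTuple):
    rtype: str        # repair type, e.g. "failover"
    rid: str          # unique id of this repair
    timestamp: int    # when the tag was first applied (assumed epoch seconds)
    jobs: List[int]   # jobs already run or running for this repair

    def to_tag(self) -> str:
        jobs = ",".join(str(job) for job in self.jobs)
        return "%s%s:%s:%d:%s" % (
            PENDING_PREFIX, self.rtype, self.rid, self.timestamp, jobs)


def parse_pending(tag: str) -> PendingRepair:
    """Parse one pending tag; raises ValueError on a malformed tag."""
    if not tag.startswith(PENDING_PREFIX):
        raise ValueError("not a pending tag: %r" % tag)
    rtype, rid, timestamp, jobs = tag[len(PENDING_PREFIX):].split(":")
    return PendingRepair(
        rtype, rid, int(timestamp), [int(j) for j in jobs.split(",") if j])


def with_job(repair: PendingRepair, job_id: int) -> PendingRepair:
    """Return the 'new copy': same id and timestamp, one more job."""
    return repair._replace(jobs=repair.jobs + [job_id])


def pending_in_order(tags: Iterable[str]) -> List[PendingRepair]:
    """Return all pending repairs found in `tags`, oldest first."""
    repairs = [parse_pending(t) for t in tags if t.startswith(PENDING_PREFIX)]
    return sorted(repairs, key=lambda repair: repair.timestamp)
```

Updating a repair then amounts to adding ``with_job(old, job).to_tag()``
and removing ``old.to_tag()``; the timestamp is carried over unchanged.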

ganeti:watcher:repair:result:<type>:<id>:<timestamp>:<result>:<jobs>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

(instance)
If this tag is present a repair of type ``type`` has been performed on
the instance and has been completed by ``timestamp``. The result is
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
comma-separated list of jobs that were executed for this repair.

An ``enoperm`` result is returned when the repair was carried out as far
as possible, but the repair type doesn't permit proceeding further.
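
As a hedged illustration, a result tag could be built and read back as
follows. The helper names ``result_tag`` and ``parse_result`` are
hypothetical, and the sketch assumes an integer epoch timestamp and an
``id`` without colons.

```python
from typing import List, Tuple

RESULT_PREFIX = "ganeti:watcher:repair:result:"
RESULTS = ("success", "failure", "enoperm")


def result_tag(rtype: str, rid: str, finished_at: int,
               result: str, jobs: List[int]) -> str:
    """Build a result tag; rejects results other than the three defined."""
    if result not in RESULTS:
        raise ValueError("unknown result: %r" % result)
    return "%s%s:%s:%d:%s:%s" % (
        RESULT_PREFIX, rtype, rid, finished_at, result,
        ",".join(str(job) for job in jobs))


def parse_result(tag: str) -> Tuple[str, str, int, str, List[int]]:
    """Split a result tag into (type, id, timestamp, result, jobs)."""
    if not tag.startswith(RESULT_PREFIX):
        raise ValueError("not a result tag: %r" % tag)
    rtype, rid, finished_at, result, jobs = tag[len(RESULT_PREFIX):].split(":")
    if result not in RESULTS:
        raise ValueError("unknown result: %r" % result)
    return (rtype, rid, int(finished_at), result,
            [int(j) for j in jobs.split(",") if j])
```

For example, a ``failover`` repair that ran jobs 42 and 43 and then
stopped for lack of permission would be recorded with an ``enoperm``
result and the job list ``42,43``.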

Possible states, and transitions
--------------------------------

At any point an instance can be in one of the following health states:

Healthy
+++++++

The instance lives only on online nodes. The autorepair system will
never touch these instances. Any ``repair:pending`` tags will be removed
and marked ``success`` with no jobs attached to them.

This state can transition to:

- Needs-repair, repair disallowed (node offlined or drained, no
  autorepair tag)
- Needs-repair, repair allowed always (node offlined or drained,
  autorepair tag present)
- Suspended (a suspend tag is added)

Suspended
+++++++++

Whenever an ``autorepair:suspend`` tag is added the autorepair code
won't touch the instance until the timestamp on the tag, if present, has
passed. The tag will be removed afterwards (and the instance will
transition to its correct state, depending on its health and other
tags).

Note that when an instance is suspended any pending repair is
interrupted, but jobs which were submitted before the suspension are
allowed to finish.

Needs-repair, repair disallowed
+++++++++++++++++++++++++++++++

The instance lives on an offline or drained node, but no autorepair tag
is set, or the autorepair tag set is of a type not powerful enough to
finish the repair. The autorepair system will never touch these
instances, and they can transition to:

- Healthy (manual repair)
- Pending repair (a ``repair:pending`` tag is added)
- Needs-repair, repair allowed always (an autorepair tag of a sufficient
  type is added)
- Suspended (a suspend tag is added)

Needs-repair, repair allowed always
+++++++++++++++++++++++++++++++++++

A ``repair:pending`` tag is added, and the instance transitions to the
Pending repair state. The autorepair tag is preserved.

Of course if an ``autorepair:suspend`` tag is found no pending tag will
be added, and the instance will instead transition to the Suspended
state.
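
The states described so far can be summarised as a small classification
function. This is a sketch only: ``Health`` and ``classify`` are
hypothetical names, ``needed`` stands for the riskiest repair type the
tool believes is required, and checking the suspend tag before anything
else is an assumption about precedence.

```python
from enum import Enum
from typing import Optional

# Repair types in order of perceived risk, lowest first.
RISK = {"fix-storage": 1, "migrate": 2, "failover": 3, "reinstall": 4}


class Health(Enum):
    HEALTHY = "Healthy"
    SUSPENDED = "Suspended"
    DISALLOWED = "Needs-repair, repair disallowed"
    ALLOWED = "Needs-repair, repair allowed always"


def classify(nodes_ok: bool, suspended: bool,
             granted: Optional[str], needed: Optional[str] = None) -> Health:
    """Classify one instance.

    nodes_ok:  all nodes of the instance are online and not drained
    suspended: an active suspend tag applies to the instance
    granted:   riskiest repair type allowed by the autorepair tags, if any
    needed:    riskiest repair type required (hypothetical input)
    """
    if suspended:
        return Health.SUSPENDED
    if nodes_ok:
        return Health.HEALTHY
    if needed is None:
        raise ValueError("'needed' is required when the instance is broken")
    if granted is None or RISK[granted] < RISK[needed]:
        return Health.DISALLOWED
    return Health.ALLOWED
```

An instance classified as ``ALLOWED`` is the one that receives a
``repair:pending`` tag on the next run of the tool.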

Pending repair
++++++++++++++

When an instance is in this state the following will happen:

If an ``autorepair:suspend`` tag is found the instance won't be touched
and will be moved to the Suspended state. Any jobs which were already
running will be left untouched.

If there are still jobs running related to the instance and scheduled by
this repair they will be given more time to run, and the instance will
be checked again later. The state transitions to itself.

If no jobs are running and the instance is detected to be healthy, the
``repair:result`` tag will be added, and the current active
``repair:pending`` tag will be removed. It will then transition to the
Healthy state if there are no further ``repair:pending`` tags, or to the
Pending repair state otherwise: there, the instance being healthy, those
tags will be resolved without any operation as well (note that this is
the same as transitioning to the Healthy state, where ``repair:pending``
tags would also be resolved).

If no jobs are running and the instance still has issues:

- if the last job(s) failed they can either be retried a few times, if
  deemed to be safe, or the repair can transition to the Failed state.
  In the latter case the ``repair:result`` tag will be added, and the
  active ``repair:pending`` tag will be removed (further
  ``repair:pending`` tags will not be able to proceed, as explained by
  the Failed state, until the failure state is cleared).
- if the last job(s) succeeded but there are not enough resources to
  proceed, the state will transition to itself and no jobs are
  scheduled. The tag is left untouched (and later checked again). This
  basically just delays any repairs: the current ``pending`` tag stays
  active, and any others are untouched.
- if the last job(s) succeeded but the repair type does not allow
  proceeding any further, the ``repair:result`` tag is added with an
  ``enoperm`` result, and the current ``repair:pending`` tag is removed.
  The instance is now back to "Needs-repair, repair disallowed",
  "Needs-repair, repair allowed always", or "Pending repair" if there is
  already a future tag that can repair the instance.
- if the last job(s) succeeded and the repair can continue, new job(s)
  can be submitted, and the ``repair:pending`` tag can be updated.
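
The decision made for an instance in the Pending repair state can be
sketched as a single function. ``Observation``, ``Action`` and
``next_action`` are hypothetical names, and the order in which the
resource and permission checks are made simply follows the order of the
list above; the design does not mandate it.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SUSPEND = "move to Suspended, leave running jobs alone"
    WAIT_FOR_JOBS = "jobs still running, check again later"
    MARK_SUCCESS = "add result:success, remove pending tag"
    RETRY = "resubmit the failed job(s)"
    MARK_FAILURE = "add result:failure, remove pending tag"
    WAIT_FOR_RESOURCES = "keep pending tag, check again later"
    MARK_ENOPERM = "add result:enoperm, remove pending tag"
    SUBMIT_NEXT = "submit new job(s), update pending tag"


@dataclass
class Observation:
    """What the tool knows about one pending repair (all hypothetical)."""
    suspended: bool          # an active suspend tag applies
    jobs_running: bool       # jobs of this repair are still running
    healthy: bool            # the instance no longer needs repair
    last_jobs_failed: bool   # the most recent job(s) failed
    retry_is_safe: bool      # retrying is deemed safe and not exhausted
    has_resources: bool      # the cluster has capacity for the next step
    type_permits_next: bool  # the repair type allows the next step


def next_action(obs: Observation) -> Action:
    if obs.suspended:
        return Action.SUSPEND
    if obs.jobs_running:
        return Action.WAIT_FOR_JOBS
    if obs.healthy:
        return Action.MARK_SUCCESS
    if obs.last_jobs_failed:
        return Action.RETRY if obs.retry_is_safe else Action.MARK_FAILURE
    if not obs.has_resources:
        return Action.WAIT_FOR_RESOURCES
    if not obs.type_permits_next:
        return Action.MARK_ENOPERM
    return Action.SUBMIT_NEXT
```

Only ``MARK_SUCCESS``, ``MARK_FAILURE`` and ``MARK_ENOPERM`` end the
current repair; every other action leaves the ``repair:pending`` tag in
place for the next run.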

Failed
++++++

If repairing an instance has failed a ``repair:result`` tag with a
``failure`` result is added. The presence of this tag is used to detect
that an instance is in this state, and it will not be touched until the
failure is investigated and the tag is removed.

An external tool or person needs to investigate the state of the
instance and remove this tag once it is certain that the instance is
repaired and safe to hand back to the normal autorepair system.

(Alternatively we can use the Suspended state (indefinitely or
temporarily) to mark the instance as "do not touch" when we think a
human needs to look at it. To be decided.)

Repair operation
----------------

Possible repairs are:

- Replace-disks (drbd, if the secondary is down), or other
  storage-specific fixes
- Migrate (shared storage, rbd, drbd, if the primary is drained)
- Failover (shared storage, rbd, drbd, if the primary is down)
- Recreate disks + reinstall (all nodes down, plain, file or drbd)

Note that more than one of these operations may need to happen before a
full repair is completed (e.g. if a drbd primary goes offline, first a
failover will happen, then a replace-disks).
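
How a sequence of operations might be chosen is sketched below. This is
an illustration, not the planned algorithm: ``plan_repair`` is a
hypothetical function, the template names ``shared``, ``rbd``, ``plain``
and ``file`` are placeholders for the storage kinds listed above, and
the drained-primary drbd case (migrate, then replace-disks) is an
assumption made by analogy with the failover example. Combinations the
text does not cover yield ``None``.

```python
from typing import List, Optional

STATES = ("online", "drained", "offline")
EXTERNAL = ("shared", "rbd")   # storage not tied to the primary node
LOCAL = ("plain", "file")      # non-redundant, node-local storage


def plan_repair(template: str, primary: str,
                secondary: Optional[str] = None) -> Optional[List[str]]:
    """Return the ordered operations for one instance, or None if the
    combination is not covered by this sketch."""
    if primary not in STATES:
        raise ValueError("bad primary state: %r" % primary)
    if template == "drbd":
        if secondary not in STATES:
            raise ValueError("drbd needs a valid secondary state")
        if primary == "online":
            # Only the secondary can be the problem.
            return [] if secondary == "online" else ["replace-disks"]
        if secondary != "online":
            # Both sides unusable; only "all nodes down" is described.
            if primary == "offline" and secondary == "offline":
                return ["recreate-disks", "reinstall"]
            return None
        # Move to the healthy secondary first, then rebuild redundancy.
        first = "migrate" if primary == "drained" else "failover"
        return [first, "replace-disks"]
    if template in EXTERNAL:
        return {"online": [], "drained": ["migrate"],
                "offline": ["failover"]}[primary]
    if template in LOCAL:
        return {"online": [],
                "offline": ["recreate-disks", "reinstall"]}.get(primary)
    return None
```

The riskiest operation in the returned list is what has to be compared
against the autorepair type granted to the instance.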

The self-repair tool will first take care of all needs-repair instances
that can be brought into the ``pending`` state, and transition them as
described above.

Then it will go through any ``repair:pending`` instances and handle them
as described above.

Note that the repair tool MAY "group" instances by performing common
repair jobs for them (e.g. node evacuate).
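
One possible way to find such groups is to bucket the instances by the
node they have to leave, so that a single job per node can be submitted.
The function name ``group_by_node`` and the input shape are assumptions
made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_node(needs_move: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (instance, node) pairs by the node each instance must leave."""
    groups = defaultdict(list)
    for instance, node in needs_move:
        groups[node].append(instance)
    # Sort the names so that the result does not depend on input order.
    return {node: sorted(names) for node, names in groups.items()}
```

Each key of the result is then a candidate for one common job covering
all the instances listed under it.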

Staging of work
---------------

- First version: recreate-disks + reinstall (2.6.1)
- Second version: failover and migrate repairs (2.7)
- Third version: replace disks repair (2.7 or 2.8)

Future work
===========

One important piece of work will be reporting what the autorepair system
is "thinking" and exporting this in a form that can be read by an
outside user or system. In order to do this we need a better
communication system than embedding this information into tags. This
should be designed in an extensible way that can be used in general for
Ganeti to provide "advisory" information about entities it manages, and
for an external system to "advise" Ganeti over what it can do, but in a
less direct manner than submitting individual jobs.

Note that cluster verify checks some errors that are actually
instance-specific (e.g. a missing backend disk on a drbd node) or
node-specific (e.g. an extra lvm device). If we were to split these into
"instance verify", "node verify" and "cluster verify", then we could
easily use this tool to perform some of those repairs as well.

Finally, self-repairs could also be extended to the cluster level, for
example for concepts like "N+1 failures", missing master candidates,
etc., or to the node level for some specific types of errors.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: