Revision fb4b885a
b/doc/design-hroller.rst

Proposed changes
================

New options
-----------

- HRoller should be able to operate on single nodegroups (-G flag) or
  select its target nodes through some other means, e.g. via a tag or a
  regexp; a selection sketch along these lines follows this list. (Note
  that individual node selection is already possible via the -O flag,
  which makes hroller ignore a node altogether).
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to select its behavior between
  "it's ok to reboot a node when a non-redundant instance is on it" or
  "skip nodes with non-redundant instances". This will only be
  selectable globally, and not per instance.
- Hroller will make sure to keep any instance which is up in its
  current state, via live migrations, unless explicitly overridden. The
  algorithm that will be used to calculate the rolling reboot with live
  migrations is described below, and any override on considering the
  instance status will only be possible on the whole run, and not
  per-instance.
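
As an illustration of the proposed selection behavior, the following
sketch filters a node list by nodegroup, tag or name regexp. It is only
a sketch: the function, its parameters and the node attributes
(``group``, ``tags``, ``name``) are hypothetical and not taken from the
actual hroller code.

.. code-block:: python

  import re

  def select_target_nodes(nodes, group=None, tag=None, name_re=None):
      """Select nodes by nodegroup, tag or name regexp (illustration only)."""
      selected = list(nodes)
      if group is not None:
          selected = [n for n in selected if n.group == group]
      if tag is not None:
          selected = [n for n in selected if tag in n.tags]
      if name_re is not None:
          pattern = re.compile(name_re)
          selected = [n for n in selected if pattern.search(n.name)]
      return selected

Nodes ignored via the -O flag could simply be dropped from ``nodes``
before such a selection is applied.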

Calculating rolling maintenances
--------------------------------

...

Down instances
++++++++++++++

If an instance was shut down when the maintenance started it will be
considered for avoiding a simultaneous reboot of its primary and
secondary nodes, but will *not* be considered as a target for the node
evacuation. This allows avoiding needlessly moving its primary around,
since it won't suffer a downtime anyway.

Note that a node with non-redundant instances will only ever be
considered good for rolling-reboot if these are down (or the checking
of status is overridden) *and* an explicit option to allow it is set.
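
The rules above can be expressed as small helper predicates. The sketch
below is purely illustrative: the option names (``allow_non_redundant``,
``ignore_status``) and the node/instance attributes are hypothetical,
not actual hroller options.

.. code-block:: python

  def node_ok_for_reboot(node, allow_non_redundant, ignore_status):
      """Apply the non-redundant instance rule described above."""
      non_redundant = [inst for inst in node.primary_instances
                       if inst.disk_template in ("plain", "file")]
      if not non_redundant:
          return True
      # An explicit option must allow rebooting nodes with non-redundant
      # instances, *and* those instances must be down (unless the status
      # check is overridden for the whole run).
      return allow_non_redundant and all(
          ignore_status or inst.status != "running"
          for inst in non_redundant)

  def instances_to_evacuate(node):
      """Instances that were already down are left where they are."""
      return [inst for inst in node.primary_instances
              if inst.status == "running"]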

DRBD
++++

...

In order to do that we can use the following algorithm (an illustrative
sketch follows the list):

1) Compute node sets that don't contain both the primary and the
   secondary for any instance. This can be done already by the current
   hroller graph coloring algorithm: nodes are in the same set (color)
   if and only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any secondary
   node in common (this can be done by creating a graph of nodes that
   are connected if and only if an instance on both has the same
   secondary node, and coloring that graph).
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, and then reboot/perform maintenance on them, and
   migrate back their original primaries, which allows the computation
   above to be reused for each following subset without N+1 failures
   being triggered, if none were present before. See below about the
   actual execution of the maintenance.
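
The following sketch illustrates the two coloring passes of steps 1 and
2 on a simplified data model, where each DRBD instance is reduced to a
``(primary, secondary)`` pair of node names. It is an illustration
under simplifying assumptions, not the actual hroller code.

.. code-block:: python

  from collections import defaultdict

  def greedy_coloring(nodes, edges):
      """Group nodes so that no two connected nodes share a group."""
      adj = defaultdict(set)
      for a, b in edges:
          adj[a].add(b)
          adj[b].add(a)
      color = {}
      for node in nodes:
          used = {color[n] for n in adj[node] if n in color}
          color[node] = next(c for c in range(len(nodes)) if c not in used)
      groups = defaultdict(list)
      for node, c in color.items():
          groups[c].append(node)
      return list(groups.values())

  def rolling_sets(nodes, instances):
      """instances: list of (primary, secondary) node-name pairs."""
      # Step 1: a primary and its secondary must never be in the same set.
      step1 = greedy_coloring(nodes, list(instances))
      result = []
      for node_set in step1:
          # Step 2: within a set, connect two nodes whose instances share
          # a secondary, so no secondary has to receive several migrated
          # primaries at once.
          by_secondary = defaultdict(set)
          for primary, secondary in instances:
              if primary in node_set:
                  by_secondary[secondary].add(primary)
          edges = [(a, b) for prims in by_secondary.values()
                   for a in prims for b in prims if a < b]
          result.extend(greedy_coloring(node_set, edges))
      return result

For example, with nodes ``n1`` to ``n4`` and two instances
``("n1", "n2")`` and ``("n3", "n4")``, the sketch would allow ``n1`` and
``n3`` (and, in a later step, ``n2`` and ``n4``) to undergo maintenance
in parallel.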

Non-DRBD
++++++++

...

managing better RBD pools, if those are implemented on top of node
storage, rather than on dedicated storage machines.

Future work
===========

Hroller should become able to execute rolling maintenances, rather than
just calculate them. For this to succeed properly, one of the following
must happen:

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at each
  step
- HRoller can selectively drain the cluster, so it is sure that only
  the rolling maintenance is going on

DRBD nodes' ``replace-disks`` functionality should be implemented. Note
that once we support a DRBD version that allows multiple secondaries,
this can be done safely, without losing replication at any time, by
adding a temporary secondary and dropping the previous one only when
the sync is finished.

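The ordering matters; the sketch below spells it out with an invented
``cluster`` API, used here purely for illustration (none of these calls
exist in Ganeti or hroller).

.. code-block:: python

  def replace_secondary(cluster, instance, old_secondary, new_secondary):
      # Add the new secondary first, so the instance is temporarily
      # replicated to both secondaries.
      cluster.add_secondary(instance, new_secondary)
      # Only drop the old secondary once the new copy is fully synced;
      # this way replication is never lost at any point.
      cluster.wait_until_synced(instance, new_secondary)
      cluster.remove_secondary(instance, old_secondary)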

Non-redundant (plain or file) instances should have a way to be moved
off as well via plain storage live migration or ``gnt-instance move``
(which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.