============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots, a new htool called HRoller
was added to Ganeti starting from version 2.7. This tool helps
parallelize offline cluster maintenance by calculating which nodes do
not host both the primary and the secondary of any DRBD instance, and
thus can be rebooted at the same time when all instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances
off the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started, it will be
ignored. This avoids needlessly moving its primary around, since the
instance won't suffer any downtime anyway.


DRBD
++++

Each node must migrate all its instances off to their secondaries;
then it can either be rebooted, or the secondaries can be evacuated as
well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it is not any safer than temporarily rebooting a node with secondaries
on it (citation needed). As such we'll implement just the
"migrate+reboot" mode for now, and focus on replace-disks later.

In order to do that we can use the following algorithm (a sketch of
the two-level coloring follows the list):

1) Compute node sets that don't contain both the primary and the
   secondary of any instance. This can already be done by the current
   hroller graph coloring algorithm: nodes are in the same set (color)
   if and only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set, calculate subsets that don't have any
   secondary node in common (this can be done by creating a graph of
   nodes that are connected if and only if an instance on both has the
   same secondary node, and coloring that graph).
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, reboot or perform maintenance on them, and then
   migrate their original primaries back. This allows the computation
   above to be reused for each following subset without triggering N+1
   failures, provided none were present before. See below about the
   actual execution of the maintenance.

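The following is a minimal sketch of this two-level coloring, assuming
a simple greedy coloring and a hypothetical representation of
instances as (primary, secondary) node pairs; it is illustrative only,
not the actual hroller implementation:

.. code-block:: python

  from collections import defaultdict

  def greedy_coloring(nodes, edges):
      """Give each node the smallest color unused by its neighbors."""
      adjacent = defaultdict(set)
      for a, b in edges:
          adjacent[a].add(b)
          adjacent[b].add(a)
      color = {}
      for node in nodes:
          taken = {color[n] for n in adjacent[node] if n in color}
          color[node] = next(c for c in range(len(nodes) + 1)
                             if c not in taken)
      groups = defaultdict(list)
      for node, c in color.items():
          groups[c].append(node)
      return list(groups.values())

  def reboot_subsets(nodes, instances):
      """instances: list of (primary, secondary) node pairs."""
      # Step 1: the primary and secondary of an instance must not be
      # rebooted together, so instances are the conflict edges.
      for node_set in greedy_coloring(nodes, instances):
          # Step 2: two primaries in the same set conflict if any of
          # their instances share a secondary node.
          primaries_of = defaultdict(set)
          for p, s in instances:
              if p in node_set:
                  primaries_of[s].add(p)
          conflicts = [(a, b)
                       for ps in primaries_of.values()
                       for a in ps for b in ps if a < b]
          # Step 3: each yielded subset can be migrated, rebooted and
          # migrated back in parallel before moving to the next one.
          yield from greedy_coloring(node_set, conflicts)
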
Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such, instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then automatically be targeted
  again to host instances, as hail chooses targets for the instances
  between all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for
  the first node to come back before proceeding. This allows us to
  continue, restricting the cluster, until no more capacity is
  available in the nodegroup, and then to wait for some nodes to come
  back so that capacity is available again for the last few nodes (see
  the sketch after this list).
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, using the
  migrate-back approach discussed for DRBD so that the calculation is
  performed only once.

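A minimal sketch of the second, pipelined variant, assuming
hypothetical helpers for tracking nodegroup capacity, draining a node,
rebooting it asynchronously and waiting for nodes to rejoin (none of
these are real Ganeti APIs):

.. code-block:: python

  def pipelined_reboots(nodes, has_capacity, drain, reboot_async,
                        wait_one_back):
      """Drain nodes without waiting for earlier ones to return.

      Assumed interfaces:
      - has_capacity(): True while the nodegroup can absorb one more
        node's worth of instances,
      - drain(node): migrate all primary instances off the node,
      - reboot_async(node): start the maintenance without blocking,
      - wait_one_back(pending): block until one pending node rejoins,
        and return it.
      """
      pending = set()
      for node in nodes:
          # When capacity runs out, wait for a rebooted node to come
          # back before taking another one offline.
          while pending and not has_capacity():
              pending.discard(wait_one_back(pending))
          drain(node)
          reboot_async(node)
          pending.add(node)
      while pending:
          pending.discard(wait_one_back(pending))
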
Note that for non-DRBD disks that still use local storage (e.g. RBD
and plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This is perhaps a good reason to consider
better management of RBD pools, if those are implemented on top of the
nodes' storage rather than on dedicated storage machines.

Executing rolling maintenances
------------------------------

Hroller accepts commands to run in order to perform maintenance
automatically. These are run on the machine hroller runs on, and take
a node name as input. They then have to gain access to the target node
(via ssh, restricted commands, or some other means) and perform their
duty. A sketch of the resulting driver loop is shown after the
following list.

1) A command (``--check-cmd``) will be called on all selected online
   nodes to check whether a node needs maintenance. Hroller will
   proceed only on nodes that respond positively to this invocation.
   FIXME: decide about -D
2) Hroller will evacuate the node of all primary instances.
3) A command (``--maint-cmd``) will be called on a node to do the
   actual maintenance operation. It should perform any operation
   needed for the maintenance, including triggering the actual reboot.
4) A command (``--verify-cmd``) will be called to check that the
   operation was successful. It has to wait until the target node is
   back up (and decide after how long it should give up) and perform
   the verification. If it's not successful, hroller will stop and not
   proceed with other nodes.
5) The master node will be kept last, but will not otherwise be
   treated specially. If hroller was running on the master node, care
   must be taken, as its maintenance will have interrupted the
   software itself, and as such the verification step will not happen.
   This will not be handled automatically in the first version. An
   additional flag to simply skip the master node will be present as
   well, in case that's preferred.

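A minimal sketch of this driver loop, assuming the three hooks above
and using ``gnt-node migrate`` for the evacuation step (the hook
protocol shown here is an illustration, not the final interface):

.. code-block:: python

  import subprocess

  def run_hook(cmd, node):
      """Run a user-supplied hook with the node name as argument."""
      return subprocess.run([cmd, node]).returncode == 0

  def evacuate_primaries(node):
      # Migrate all primary instances off the node; "-f" skips the
      # interactive confirmation.
      subprocess.run(["gnt-node", "migrate", "-f", node], check=True)

  def rolling_maintenance(nodes, check_cmd, maint_cmd, verify_cmd):
      # The caller is expected to order the master node last.
      for node in nodes:
          # 1) Skip nodes that don't need maintenance.
          if not run_hook(check_cmd, node):
              continue
          # 2) Evacuate all primary instances.
          evacuate_primaries(node)
          # 3) Trigger the maintenance, including the reboot itself.
          run_hook(maint_cmd, node)
          # 4) The verify hook waits for the node to come back up and
          #    decides when to give up; on failure, stop entirely.
          if not run_hook(verify_cmd, node):
              break

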
Future work
===========

The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only when the sync has finished.

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated from a node as well before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by
hroller. This requires RPC/RAPI support for master failover. Hroller
should also be modified to better support running on the master itself
and continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: