============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots a new htool, called HRoller,
was added to Ganeti starting from version 2.7. This tool helps
parallelize offline cluster maintenance by calculating which nodes are
not both primary and secondary for any DRBD instance, and thus can be
rebooted at the same time, while all instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances
off the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started, it will be
ignored. This avoids needlessly moving its primary around, since the
instance won't suffer any downtime anyway.
43 |
|
44 |
|
45 |
DRBD |
46 |
++++ |
47 |
|
48 |
Each node must migrate all instances off to their secondaries, and then |
49 |
can either be rebooted, or the secondaries can be evacuated as well. |
50 |
|
51 |
Since currently doing a ``replace-disks`` on DRBD breaks redundancy, |
52 |
it's not any safer than temporarily rebooting a node with secondaries on |
53 |
them (citation needed). As such we'll implement for now just the |
54 |
"migrate+reboot" mode, and focus later on replace-disks as well. |
55 |
|
56 |
In order to do that we can use the following algorithm: |
57 |
|
58 |
1) Compute node sets that don't contain both the primary and the |
59 |
secondary for any instance. This can be done already by the current |
60 |
hroller graph coloring algorithm: nodes are in the same set (color) if |
61 |
and only if no edge (instance) exists between them (see the |
62 |
:manpage:`hroller(1)` manpage for more details). |
63 |
2) Inside each node set calculate subsets that don't have any secondary |
64 |
node in common (this can be done by creating a graph of nodes that are |
65 |
connected if and only if an instance on both has the same secondary |
66 |
node, and coloring that graph) |
67 |
3) It is then possible to migrate in parallel all nodes in a subset |
68 |
created at step 2, and then reboot/perform maintenance on them, and |
69 |
migrate back their original primaries, which allows the computation |
70 |
above to be reused for each following subset without N+1 failures being |
71 |
triggered, if none were present before. See below about the actual |
72 |
execution of the maintenance. |
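
The node-set computation of step 1 can be sketched as a greedy graph
coloring (a simplification for illustration only: hroller itself is a
Haskell htool, and its coloring heuristics differ; node names and the
instance representation here are made up):

.. code:: python

  # Greedy coloring of the node graph: an edge connects the primary and
  # secondary node of each DRBD instance, so nodes sharing a color hold
  # no instance in common and can be rebooted in the same round.
  from collections import defaultdict

  def color_node_sets(instances):
      """instances: list of (primary, secondary) node-name pairs.
      Returns a dict mapping color -> set of node names."""
      adjacency = defaultdict(set)
      for primary, secondary in instances:
          adjacency[primary].add(secondary)
          adjacency[secondary].add(primary)
      colors = {}
      # Visit higher-degree nodes first (a common greedy heuristic).
      for node in sorted(adjacency, key=lambda n: -len(adjacency[n])):
          used = {colors[m] for m in adjacency[node] if m in colors}
          colors[node] = next(c for c in range(len(adjacency))
                              if c not in used)
      groups = defaultdict(set)
      for node, color in colors.items():
          groups[color].add(node)
      return dict(groups)

For example, two instances forming a chain ``n1-n2-n3`` yield two
rounds: ``{n2}`` alone, and ``{n1, n3}`` together.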

Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  between all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  restricting the cluster, until no more capacity is available in the
  nodegroup, at which point we have to wait for some nodes to come back
  so that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, using the
  migrate-back approach discussed for DRBD to perform the calculation
  only once.

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This is perhaps a good reason to consider
better management of RBD pools, if those are implemented on top of node
storage rather than on dedicated storage machines.
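
The greedy pre-calculation from the third option might look as follows.
This is only a sketch under an assumed capacity model (per-node
used/total memory); the real placement decisions belong to hail, not to
this document:

.. code:: python

  # Greedily build reboot sets: keep adding nodes to a set as long as
  # the nodes outside the set have enough spare memory to absorb all
  # instances migrated off the set.
  def greedy_reboot_sets(nodes):
      """nodes: dict name -> (used_mem, total_mem).
      Returns a list of node groups that can be rebooted together."""
      remaining = set(nodes)
      groups = []
      while remaining:
          group = []
          for node in sorted(remaining, key=lambda n: nodes[n][0]):
              trial = group + [node]
              moved = sum(nodes[n][0] for n in trial)
              spare = sum(t - u for name, (u, t) in nodes.items()
                          if name not in trial)
              if moved <= spare:    # the others can absorb the load
                  group = trial
          if not group:             # nothing fits: one node at a time
              group = [min(remaining, key=lambda n: nodes[n][0])]
          groups.append(group)
          remaining -= set(group)
      return groups

Note that later groups are sized against the full node list, matching
the migrate-back approach: earlier groups are assumed to be back online
before the next one starts.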

Executing rolling maintenances
------------------------------

Hroller accepts commands to run to perform maintenance automatically.
These are going to be run on the machine hroller runs on, and take a
node name as input. They then have to gain access to the target node
(via ssh, restricted commands, or some other means) and perform their
duty.

1) A command (--check-cmd) will be called on all selected online nodes
   to check whether a node needs maintenance. Hroller will proceed only
   on nodes that respond positively to this invocation.
   FIXME: decide about -D
2) Hroller will evacuate the node of all primary instances.
3) A command (--maint-cmd) will be called on a node to do the actual
   maintenance operation. It should do any operation needed to perform
   the maintenance, including triggering the actual reboot.
4) A command (--verify-cmd) will be called to check that the operation
   was successful. It has to wait until the target node is back up (and
   decide after how long it should give up) and perform the
   verification. If it's not successful, hroller will stop and not
   proceed with other nodes.
5) The master node will be kept last, but will not otherwise be treated
   specially. If hroller was running on the master node, care must be
   exercised, as its maintenance will have interrupted the software
   itself, and as such the verification step will not happen. This will
   not automatically be taken care of in the first version. An
   additional flag to just skip the master node will be present as well,
   in case that's preferred.
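
The per-node flow of the steps above can be sketched as follows. This
is illustrative pseudologic, not hroller's implementation (hroller is a
Haskell htool); ``check_cmd``, ``maint_cmd`` and ``verify_cmd`` stand
in for the --check-cmd/--maint-cmd/--verify-cmd hooks, and ``evacuate``
is a hypothetical placeholder for the primary-instance evacuation:

.. code:: python

  def run_maintenance(nodes, master, check_cmd, evacuate, maint_cmd,
                      verify_cmd):
      """Process nodes in order, master last; stop on verify failure."""
      ordered = ([n for n in nodes if n != master] +
                 ([master] if master in nodes else []))
      done = []
      for node in ordered:
          if not check_cmd(node):        # step 1: skip healthy nodes
              continue
          evacuate(node)                 # step 2: move primaries away
          maint_cmd(node)                # step 3: maintenance + reboot
          if not verify_cmd(node):       # step 4: wait and verify
              break                      # don't touch further nodes
          done.append(node)
      return done

A failing --verify-cmd thus leaves all remaining nodes (including the
master) untouched, as required by step 4.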


Future work
===========

The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only when the sync is finished.

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is also evacuated from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by hroller.
This requires RPC/RAPI support for master failover. Hroller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: