============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots, a new htool, called HRoller,
was added to Ganeti starting from version 2.7. This tool helps
parallelize cluster offline maintenance by calculating which nodes are
not both primary and secondary for any DRBD instance, and can thus be
rebooted at the same time, when all instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================

New options
-----------

- HRoller should be able to operate on single nodegroups (-G flag) or
  select its target nodes through some other means (e.g. via a tag or
  a regexp). (Note that individual node selection is already possible
  via the -O flag, which makes hroller ignore a node altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to choose between "it's ok to
  reboot a node when a non-redundant instance is on it" and "skip
  nodes with non-redundant instances". This will only be selectable
  globally, and not per instance.
- HRoller will make sure to keep any instance which is up in its
  current state, via live migrations, unless explicitly overridden.
  The algorithm that will be used to calculate the rolling reboot with
  live migrations is described below, and any override on considering
  the instance status will only be possible for the whole run, and not
  per-instance.


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances
off the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started, it will be
taken into account to avoid rebooting its primary and secondary nodes
at the same time, but it will *not* be considered as a target for the
node evacuation. This avoids needlessly moving its primary around,
since the instance won't suffer any downtime anyway.

Note that a node with non-redundant instances will only ever be
considered good for rolling reboot if these are down (or the checking
of their status is overridden) *and* an explicit option to allow it is
set.

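This eligibility rule could be sketched as follows in Python; all
names here are hypothetical and do not correspond to hroller's actual
code::

  def node_reboot_ok(node, allow_non_redundant, ignore_status):
      """A node with non-redundant instances is only eligible when the
      operator explicitly allowed it and those instances are down
      (unless status checking is overridden)."""
      non_redundant = [i for i in node.instances if not i.is_redundant]
      if not non_redundant:
          return True
      if not allow_non_redundant:
          return False
      return ignore_status or all(not i.is_up for i in non_redundant)
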
DRBD
++++

Each node must migrate all instances off to their secondaries, and
then can either be rebooted, or the secondaries can be evacuated as
well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries
on it (citation needed). As such we'll implement just the
"migrate+reboot" mode for now, and focus on replace-disks later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary of any instance. This can already be done by the current
   hroller graph coloring algorithm: nodes are in the same set (color)
   if and only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any
   secondary node in common (this can be done by creating a graph of
   nodes that are connected if and only if an instance on both has the
   same secondary node, and coloring that graph).
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, then reboot/perform maintenance on them, and
   then migrate their original primaries back. This allows the
   computation above to be reused for each following subset without
   triggering N+1 failures, if none were present before. See below
   about the actual execution of the maintenance.

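The two-level grouping can be illustrated with a small Python sketch
based on a naive greedy coloring. This is only an illustration of the
idea, not hroller's actual (Haskell) implementation; ``instances`` is
assumed to map each instance name to its ``(primary, secondary)`` node
pair::

  from collections import defaultdict

  def greedy_coloring(nodes, edges):
      """Give each node the smallest color unused by its neighbours."""
      adjacency = defaultdict(set)
      for a, b in edges:
          adjacency[a].add(b)
          adjacency[b].add(a)
      colors = {}
      for node in sorted(nodes):
          used = {colors[n] for n in adjacency[node] if n in colors}
          colors[node] = min(set(range(len(colors) + 1)) - used)
      groups = defaultdict(list)
      for node, color in colors.items():
          groups[color].append(node)
      return list(groups.values())

  def rolling_reboot_sets(instances):
      """Step 1: no set holds both ends of an instance; step 2: split
      each set so that parallel migrations never share a secondary."""
      nodes = {n for pair in instances.values() for n in pair}
      step1 = greedy_coloring(nodes, instances.values())
      secondaries_of = defaultdict(set)
      for primary, secondary in instances.values():
          secondaries_of[primary].add(secondary)
      result = []
      for node_set in step1:
          conflicts = [(a, b)
                       for i, a in enumerate(node_set)
                       for b in node_set[i + 1:]
                       if secondaries_of[a] & secondaries_of[b]]
          result.extend(greedy_coloring(node_set, conflicts))
      return result

Each subset returned by ``rolling_reboot_sets`` can then be evacuated,
rebooted and repopulated in parallel. The real implementation may well
use different coloring heuristics; the sketch is only meant to show
the two-pass structure described above.
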
Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then automatically be targeted
  again to host instances, as hail chooses targets for the instances
  among all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for
  the first node to come back before proceeding. This allows us to
  continue, progressively restricting the cluster, until no more
  capacity is available in the nodegroup, and then to wait for some
  nodes to come back so that capacity is available again for the last
  few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm, as in the sketch after this list) and
  parallelize between them, with the migrate-back approach discussed
  for DRBD, so that the calculation is performed only once.

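A rough sketch of such a greedy pre-calculation follows; the
``can_absorb`` capacity check is hypothetical (think of a hail/hspace
style simulation) and is not an existing Ganeti API::

  def greedy_reboot_sets(nodes, can_absorb):
      """Partition ``nodes`` into sets whose instances can be migrated
      in parallel onto the nodes left out of the set."""
      remaining = list(nodes)
      sets = []
      while remaining:
          current = []
          for node in remaining:
              candidate = current + [node]
              rest = [n for n in remaining if n not in candidate]
              # Grow the set only while the rest of the nodegroup can
              # still host every instance of the candidate set.
              if rest and can_absorb(rest, candidate):
                  current.append(node)
          if not current:
              # Nothing fits: fall back to one node at a time.
              current = [remaining[0]]
          sets.append(current)
          remaining = [n for n in remaining if n not in current]
      return sets
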
Note that for non-DRBD disks that still use local storage (e.g. RBD
and plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to
consider better management of RBD pools, if those are implemented on
top of the nodes' storage rather than on dedicated storage machines.

Future work
===========

Hroller should become able to execute rolling maintenances, rather
than just calculate them. For this to succeed properly, one of the
following must happen:

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at
  each step (see the sketch after this list)
- HRoller can selectively drain the cluster, so it is sure that only
  the rolling maintenance is going on

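A very rough sketch of the first option (recalculating at each step)
follows; every function used here is hypothetical, since hroller
currently only calculates maintenances and does not execute them::

  def run_rolling_maintenance(get_cluster_state, compute_reboot_sets,
                              evacuate_and_reboot):
      """Re-plan after every reboot set, so that jobs submitted by
      other users in the meantime are taken into account."""
      done = set()
      while True:
          state = get_cluster_state()     # re-read live cluster data
          plan = compute_reboot_sets(state, exclude=done)
          if not plan:
              break
          node_set = plan[0]              # only trust the first step
          evacuate_and_reboot(node_set)   # migrate, reboot, migrate back
          done.update(node_set)
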
The ``replace-disks`` functionality for DRBD nodes should be
implemented as well. Note that once we support a DRBD version that
allows multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only when the sync has finished.

Non-redundant (plain or file) instances should also get a way to be
moved off, either via plain storage live migration or via
``gnt-instance move`` (which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated from a node as well before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by
hroller. This requires RPC/RAPI support for master failover. Hroller
should also be modified to better support running on the master itself
and continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: