root / doc / design-hroller.rst @ 0102e732
1 | 09208925 | Guido Trotter | ============ |
---|---|---|---|
2 | 09208925 | Guido Trotter | HRoller tool |
3 | 09208925 | Guido Trotter | ============ |
4 | 09208925 | Guido Trotter | |
5 | 09208925 | Guido Trotter | .. contents:: :depth: 4 |
6 | 09208925 | Guido Trotter | |
7 | 09208925 | Guido Trotter | This is a design document detailing the cluster maintenance scheduler, |
8 | 09208925 | Guido Trotter | HRoller. |
9 | 09208925 | Guido Trotter | |
10 | 09208925 | Guido Trotter | |
11 | 09208925 | Guido Trotter | Current state and shortcomings |
12 | 09208925 | Guido Trotter | ============================== |
13 | 09208925 | Guido Trotter | |
14 | 09208925 | Guido Trotter | To enable automating cluster-wide reboots a new htool, called HRoller, |
15 | 09208925 | Guido Trotter | was added to Ganeti starting from version 2.7. This tool helps |
16 | 09208925 | Guido Trotter | parallelize cluster offline maintenances by calculating which nodes |
17 | 09208925 | Guido Trotter | are not both primary and secondary for a DRBD instance, and thus can be |
18 | 09208925 | Guido Trotter | rebooted at the same time, when all instances are down. |
19 | 09208925 | Guido Trotter | |
20 | 09208925 | Guido Trotter | The way this is done is documented in the :manpage:`hroller(1)` manpage. |
21 | 09208925 | Guido Trotter | |
22 | 09208925 | Guido Trotter | We would now like to perform online maintenance on the cluster by |
23 | 09208925 | Guido Trotter | rebooting nodes after evacuating their primary instances (rolling |
24 | 09208925 | Guido Trotter | reboots). |
25 | 09208925 | Guido Trotter | |
26 | 09208925 | Guido Trotter | Proposed changes |
27 | 09208925 | Guido Trotter | ================ |
28 | 09208925 | Guido Trotter | |
29 | fb4b885a | Guido Trotter | New options |
30 | fb4b885a | Guido Trotter | ----------- |
31 | fb4b885a | Guido Trotter | |
32 | fb4b885a | Guido Trotter | - HRoller should be able to operate on single nodegroups (-G flag) or |
33 | fb4b885a | Guido Trotter | select its target nodes through some other means (e.g. via a tag, or a |
34 | fb4b885a | Guido Trotter | regexp). (Note that individual node selection is already possible via |
35 | fb4b885a | Guido Trotter | the -O flag, which makes hroller ignore a node altogether). |
36 | fb4b885a | Guido Trotter | - HRoller should handle non-redundant instances: currently these are |
37 | fb4b885a | Guido Trotter | ignored but there should be a way to select its behavior between "it's |
38 | fb4b885a | Guido Trotter | ok to reboot a node when a non-redundant instance is on it" or "skip |
39 | fb4b885a | Guido Trotter | nodes with non-redundant instances". This will only be selectable |
40 | fb4b885a | Guido Trotter | globally, and not per instance. |
41 | fb4b885a | Guido Trotter | - HRoller will make sure to keep any instance which is up in its current |
42 | fb4b885a | Guido Trotter | state, via live migrations, unless explicitly overridden. The |
43 | fb4b885a | Guido Trotter | algorithm that will be used to calculate the rolling reboot with live |
44 | fb4b885a | Guido Trotter | migrations is described below, and any override on considering the |
45 | fb4b885a | Guido Trotter | instance status will only be possible on the whole run, and not |
46 | fb4b885a | Guido Trotter | per-instance. |
47 | fb4b885a | Guido Trotter | |
48 | 09208925 | Guido Trotter | |
49 | 09208925 | Guido Trotter | Calculating rolling maintenances |
50 | 09208925 | Guido Trotter | -------------------------------- |
51 | 09208925 | Guido Trotter | |
52 | 09208925 | Guido Trotter | In order to perform rolling maintenance we need to migrate instances off |
53 | 09208925 | Guido Trotter | the nodes before a reboot. How this can be done depends on the |
54 | 09208925 | Guido Trotter | instance's disk template and status: |
55 | 09208925 | Guido Trotter | |
56 | 09208925 | Guido Trotter | Down instances |
57 | 09208925 | Guido Trotter | ++++++++++++++ |
58 | 09208925 | Guido Trotter | |
59 | 09208925 | Guido Trotter | If an instance was shut down when the maintenance started it will be |
60 | fb4b885a | Guido Trotter | considered for avoiding simultaneous reboot of its primary and secondary |
61 | fb4b885a | Guido Trotter | nodes, but will *not* be considered as a target for the node evacuation. |
62 | fb4b885a | Guido Trotter | This allows avoiding needlessly moving its primary around, since it |
63 | fb4b885a | Guido Trotter | won't suffer a downtime anyway. |
64 | 09208925 | Guido Trotter | |
65 | fb4b885a | Guido Trotter | Note that a node with non-redundant instances will only ever be |
66 | fb4b885a | Guido Trotter | considered good for rolling-reboot if these are down (or the checking of |
67 | fb4b885a | Guido Trotter | status is overridden) *and* an explicit option to allow it is set. |
68 | 09208925 | Guido Trotter | |
69 | 09208925 | Guido Trotter | DRBD |
70 | 09208925 | Guido Trotter | ++++ |
71 | 09208925 | Guido Trotter | |
72 | 09208925 | Guido Trotter | Each node must migrate all instances off to their secondaries, and then |
73 | 09208925 | Guido Trotter | can either be rebooted, or the secondaries can be evacuated as well. |
74 | 09208925 | Guido Trotter | |
75 | 09208925 | Guido Trotter | Since currently doing a ``replace-disks`` on DRBD breaks redundancy, |
76 | 09208925 | Guido Trotter | it's not any safer than temporarily rebooting a node with secondaries on |
77 | 09208925 | Guido Trotter | it (citation needed). As such we'll implement for now just the |
78 | 09208925 | Guido Trotter | "migrate+reboot" mode, and focus later on replace-disks as well. |
79 | 09208925 | Guido Trotter | |
80 | 09208925 | Guido Trotter | In order to do that we can use the following algorithm: |
81 | 09208925 | Guido Trotter | |
82 | 09208925 | Guido Trotter | 1) Compute node sets that don't contain both the primary and the |
83 | 4a4697de | Klaus Aehlig | secondary of any instance, and also don't contain the primary |
84 | 4a4697de | Klaus Aehlig | nodes of two instances that have the same node as secondary. These |
85 | 4a4697de | Klaus Aehlig | can be obtained by computing a coloring of the graph with nodes |
86 | 4a4697de | Klaus Aehlig | as vertices and an edge between two nodes if either condition |
87 | 4a4697de | Klaus Aehlig | prevents simultaneous maintenance. (This is the current algorithm of |
88 | 4a4697de | Klaus Aehlig | :manpage:`hroller(1)` with the extension that the graph to be colored |
89 | 4a4697de | Klaus Aehlig | has additional edges between the primary nodes of two instances sharing |
90 | 4a4697de | Klaus Aehlig | their secondary node.) |
91 | 4a4697de | Klaus Aehlig | 2) It is then possible to migrate in parallel the instances off all nodes in a set |
92 | 4a4697de | Klaus Aehlig | created at step 1, and then reboot/perform maintenance on them, and |
93 | fb4b885a | Guido Trotter | migrate back their original primaries, which allows the computation |
94 | 4a4697de | Klaus Aehlig | above to be reused for each following set without N+1 failures |
95 | fb4b885a | Guido Trotter | being triggered, if none were present before. See below about the |
96 | fb4b885a | Guido Trotter | actual execution of the maintenance. |
97 | 09208925 | Guido Trotter | |
98 | 09208925 | Guido Trotter | Non-DRBD |
99 | 09208925 | Guido Trotter | ++++++++ |
100 | 09208925 | Guido Trotter | |
101 | 09208925 | Guido Trotter | All non-DRBD disk templates that can be migrated have no "secondary" |
102 | 09208925 | Guido Trotter | concept. As such instances can be migrated to any node (in the same |
103 | 09208925 | Guido Trotter | nodegroup). In order to do the job we can either: |
104 | 09208925 | Guido Trotter | |
105 | 09208925 | Guido Trotter | - Perform migrations on one node at a time, perform the maintenance on |
106 | 09208925 | Guido Trotter | that node, and proceed (the node will then be targeted again to host |
107 | 09208925 | Guido Trotter | instances automatically, as hail chooses targets for the instances |
108 | 09208925 | Guido Trotter | among all nodes in a group). Nodes in different nodegroups can be |
109 | 09208925 | Guido Trotter | handled in parallel. |
110 | 09208925 | Guido Trotter | - Perform migrations on one node at a time, but without waiting for the |
111 | 09208925 | Guido Trotter | first node to come back before proceeding. This allows us to continue, |
112 | 09208925 | Guido Trotter | restricting the cluster, until no more capacity in the nodegroup is |
113 | 09208925 | Guido Trotter | available, and then having to wait for some nodes to come back so that |
114 | 09208925 | Guido Trotter | capacity is available again for the last few nodes. |
115 | 09208925 | Guido Trotter | - Pre-calculate sets of nodes that can be migrated together (probably |
116 | 09208925 | Guido Trotter | with a greedy algorithm) and parallelize between them, with the |
117 | 09208925 | Guido Trotter | migrate-back approach discussed for DRBD to perform the calculation |
118 | 09208925 | Guido Trotter | only once. |
119 | 09208925 | Guido Trotter | |
120 | 09208925 | Guido Trotter | Note that for non-DRBD disks that still use local storage (e.g. RBD and |
121 | 09208925 | Guido Trotter | plain) redundancy might break anyway, and nothing except the first |
122 | 09208925 | Guido Trotter | algorithm might be safe. This perhaps would be a good reason to consider |
123 | 09208925 | Guido Trotter | managing RBD pools better, if those are implemented on top of node |
124 | 09208925 | Guido Trotter | storage, rather than on dedicated storage machines. |
125 | 09208925 | Guido Trotter | |
126 | 0102e732 | Klaus Aehlig | Full-Evacuation |
127 | 0102e732 | Klaus Aehlig | +++++++++++++++ |
128 | 0102e732 | Klaus Aehlig | |
129 | 0102e732 | Klaus Aehlig | If full evacuation of the nodes to be rebooted is desired, a simple |
130 | 0102e732 | Klaus Aehlig | migration is not enough for the DRBD instances. To keep the number of |
131 | 0102e732 | Klaus Aehlig | disk operations small, we restrict moves to ``migrate, replace-secondary``. |
132 | 0102e732 | Klaus Aehlig | That is, after migrating instances out of the nodes to be rebooted, |
133 | 0102e732 | Klaus Aehlig | replacement secondaries are sought for all instances that have |
134 | 0102e732 | Klaus Aehlig | their new secondary on one of the nodes to be rebooted. This is done by a |
135 | 0102e732 | Klaus Aehlig | greedy algorithm, refining the initial reboot partition, if necessary. |
136 | 0102e732 | Klaus Aehlig | |
137 | 09208925 | Guido Trotter | Future work |
138 | 09208925 | Guido Trotter | =========== |
139 | 09208925 | Guido Trotter | |
140 | fb4b885a | Guido Trotter | HRoller should become able to execute rolling maintenances, rather than |
141 | fb4b885a | Guido Trotter | just calculate them. For this to succeed properly one of the following |
142 | fb4b885a | Guido Trotter | must happen: |
143 | fb4b885a | Guido Trotter | |
144 | fb4b885a | Guido Trotter | - HRoller handles rolling maintenances that happen at the same time as |
145 | fb4b885a | Guido Trotter | unrelated cluster jobs, and thus recalculates the maintenance at each |
146 | fb4b885a | Guido Trotter | step |
147 | fb4b885a | Guido Trotter | - HRoller can selectively drain the cluster so it's sure that only the |
148 | fb4b885a | Guido Trotter | rolling maintenance can be going on |
149 | fb4b885a | Guido Trotter | |
150 | 09208925 | Guido Trotter | The ``replace-disks`` functionality for DRBD nodes should be implemented. Note |
151 | 09208925 | Guido Trotter | that once we support a DRBD version that allows multi-secondary |
152 | 09208925 | Guido Trotter | this can be done safely, without losing replication at any time, by |
153 | 09208925 | Guido Trotter | adding a temporary secondary and only when the sync is finished dropping |
154 | 09208925 | Guido Trotter | the previous one. |
155 | 09208925 | Guido Trotter | |
156 | fb4b885a | Guido Trotter | Non-redundant (plain or file) instances should have a way to be moved |
157 | fb4b885a | Guido Trotter | off as well via plain storage live migration or ``gnt-instance move`` |
158 | fb4b885a | Guido Trotter | (which requires downtime). |
159 | fb4b885a | Guido Trotter | |
160 | 09208925 | Guido Trotter | If/when RBD pools can be managed inside Ganeti, care can be taken so |
161 | 09208925 | Guido Trotter | that the pool is evacuated as well from a node before it's put into |
162 | 09208925 | Guido Trotter | maintenance. This is equivalent to evacuating DRBD secondaries. |
163 | 09208925 | Guido Trotter | |
164 | 09208925 | Guido Trotter | Master failovers during the maintenance should be performed by hroller. |
165 | 09208925 | Guido Trotter | This requires RPC/RAPI support for master failover. Hroller should also |
166 | 09208925 | Guido Trotter | be modified to better support running on the master itself and |
167 | 09208925 | Guido Trotter | continuing on the new master. |
168 | 09208925 | Guido Trotter | |
169 | 09208925 | Guido Trotter | .. vim: set textwidth=72 : |
170 | 09208925 | Guido Trotter | .. Local Variables: |
171 | 09208925 | Guido Trotter | .. mode: rst |
172 | 09208925 | Guido Trotter | .. fill-column: 72 |
173 | 09208925 | Guido Trotter | .. End: |
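
The node-set computation described in step 1 of the DRBD algorithm can be illustrated with a short Python sketch. This is not HRoller's actual implementation; the function names ``build_conflict_graph`` and ``greedy_coloring``, the ``(primary, secondary)`` tuple input, and the highest-degree-first greedy heuristic are all assumptions chosen for illustration. It only shows how the two conflict conditions translate into graph edges and how a coloring yields sets of nodes that could be maintained together.

```python
from collections import defaultdict
from itertools import combinations


def build_conflict_graph(nodes, instances):
    """Return {node: set(neighbours)} for nodes that must not go down together.

    `instances` is a list of (primary, secondary) node-name pairs, one per
    DRBD instance. This input format is hypothetical, chosen for the sketch.
    """
    graph = {node: set() for node in nodes}
    primaries_by_secondary = defaultdict(set)
    for primary, secondary in instances:
        # Condition 1: primary and secondary of the same instance conflict.
        graph[primary].add(secondary)
        graph[secondary].add(primary)
        primaries_by_secondary[secondary].add(primary)
    # Condition 2: primaries of instances sharing a secondary conflict.
    for primaries in primaries_by_secondary.values():
        for a, b in combinations(sorted(primaries), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Greedily colour the graph, visiting highest-degree nodes first.

    Returns a list of node sets, one per colour. Greedy colouring is an
    assumed heuristic here and does not guarantee a minimal number of sets.
    """
    groups = []
    order = sorted(graph, key=lambda n: (-len(graph[n]), n))
    for node in order:
        for group in groups:
            # A node may join a group only if none of its neighbours is in it.
            if graph[node].isdisjoint(group):
                group.add(node)
                break
        else:
            groups.append({node})
    return groups


nodes = ["node1", "node2", "node3", "node4"]
instances = [("node1", "node2"), ("node3", "node2"), ("node4", "node1")]
reboot_groups = greedy_coloring(build_conflict_graph(nodes, instances))
```

In this example ``node1`` and ``node3`` end up in different sets even though no instance spans them, because both are primaries of instances whose secondary is ``node2``; migrating both at once would send all of those instances to ``node2`` at the same time.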