.. Source: doc/design-hroller.rst @ 09208925 (annotated to Guido Trotter)

============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To automate cluster-wide reboots, a new htool called HRoller was added
to Ganeti in version 2.7. This tool helps parallelize offline cluster
maintenance by computing sets of nodes in which no two nodes are the
primary and the secondary of the same DRBD instance. The nodes of such
a set can be rebooted at the same time, once all instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

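As an illustration of the grouping described above, here is a minimal
Python sketch. It is not hroller's actual implementation (see the
manpage for that). It assumes a hypothetical ``instances`` mapping from
instance name to its (primary, secondary) node pair and colors the node
graph greedily. The function name ``reboot_groups`` and the
highest-degree-first ordering are choices made for this example only.

```python
from collections import defaultdict


def reboot_groups(nodes, instances):
    """Greedily color nodes so that no DRBD instance has both of its
    nodes in the same group.

    `instances` maps an instance name to its (primary, secondary) pair
    (a hypothetical input format chosen for this sketch).
    """
    # Build the conflict graph: one edge per DRBD instance.
    neighbours = defaultdict(set)
    for primary, secondary in instances.values():
        neighbours[primary].add(secondary)
        neighbours[secondary].add(primary)

    color_of = {}
    # Color nodes with many neighbours first (a common greedy heuristic).
    for node in sorted(nodes, key=lambda n: (-len(neighbours[n]), n)):
        used = {color_of[m] for m in neighbours[node] if m in color_of}
        color = 0
        while color in used:
            color += 1
        color_of[node] = color

    # Collect the nodes of each color into one group.
    groups = defaultdict(list)
    for node in sorted(nodes):
        groups[color_of[node]].append(node)
    return [groups[c] for c in sorted(groups)]


instances = {
    "inst1": ("node1", "node2"),
    "inst2": ("node2", "node3"),
    "inst3": ("node3", "node4"),
}
groups = reboot_groups(["node1", "node2", "node3", "node4"], instances)
# groups == [["node2", "node4"], ["node1", "node3"]]
```

Each resulting group can be rebooted in one go. A greedy coloring does
not guarantee the minimum number of groups; it only guarantees that the
groups are valid.
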
We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was already shut down when the maintenance started, it
will be ignored. This avoids needlessly moving its primary around, since
it won't suffer any downtime anyway.


DRBD
++++

Each node must migrate all its instances off to their secondaries. It
can then either be rebooted right away, or its secondary instances can
be evacuated as well.

Since doing a ``replace-disks`` on DRBD currently breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries on
it (citation needed). As such we'll implement just the "migrate+reboot"
mode for now, and focus on ``replace-disks`` later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary for any instance. This can already be done by the current
   hroller graph coloring algorithm: nodes can be in the same set
   (color) only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets whose nodes don't have any
   secondary node in common. This can be done by creating a graph of
   nodes that are connected if and only if they each host an instance
   with the same secondary node, and coloring that graph.
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, reboot/perform maintenance on them, and then
   migrate their original primaries back. Migrating back allows the
   computation above to be reused for each following subset without
   triggering N+1 failures, if none were present before. See below
   about the actual execution of the maintenance.

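Step 2 of the algorithm above can be sketched as follows in Python.
This is only an illustration under assumptions: ``instances`` is a
hypothetical mapping from instance name to its (primary, secondary)
pair, ``node_set`` is one of the sets produced by step 1, and the
coloring is a simple greedy one rather than whatever hroller ends up
using.

```python
from collections import defaultdict
from itertools import combinations


def migration_subsets(node_set, instances):
    """Split one node set into subsets whose primaries share no
    secondary node, so each subset can be migrated in parallel."""
    # Secondary nodes used by the primary instances of each node.
    secondaries = defaultdict(set)
    for primary, secondary in instances.values():
        if primary in node_set:
            secondaries[primary].add(secondary)

    # Two nodes conflict if a secondary would receive instances from both.
    conflicts = defaultdict(set)
    for a, b in combinations(sorted(node_set), 2):
        if secondaries[a] & secondaries[b]:
            conflicts[a].add(b)
            conflicts[b].add(a)

    # Greedy coloring of the conflict graph.
    color_of = {}
    for node in sorted(node_set, key=lambda n: (-len(conflicts[n]), n)):
        used = {color_of[m] for m in conflicts[node] if m in color_of}
        color = 0
        while color in used:
            color += 1
        color_of[node] = color

    subsets = defaultdict(list)
    for node in sorted(node_set):
        subsets[color_of[node]].append(node)
    return [subsets[c] for c in sorted(subsets)]


instances = {
    "inst1": ("node1", "node2"),
    "inst2": ("node3", "node2"),
    "inst3": ("node5", "node4"),
    "inst4": ("node2", "node1"),  # primary outside the set: ignored
}
subsets = migration_subsets({"node1", "node3", "node5"}, instances)
# subsets == [["node1", "node5"], ["node3"]]
```

Here node1 and node3 both migrate to node2 and therefore end up in
different subsets, while node5 can go together with node1.
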
Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  among all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to
  continue, shrinking the cluster, until no more capacity is available
  in the nodegroup. We then have to wait for some nodes to come back so
  that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, using the
  migrate-back approach discussed for DRBD to perform the calculation
  only once.

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to consider
managing RBD pools better, if those are implemented on top of node
storage rather than on dedicated storage machines.

Executing rolling maintenances
------------------------------

Hroller accepts commands to run in order to do maintenance
automatically. These are going to be run on the machine hroller runs on,
and take a node name as input. They then have to gain access to the
target node (via ssh, restricted commands, or some other means) and
perform their duty.

1) A command (``--check-cmd``) will be called for each selected online
   node to check whether that node needs maintenance. Hroller will
   proceed only on nodes that respond positively to this invocation.
   FIXME: decide about -D
2) Hroller will evacuate the node of all primary instances.
3) A command (``--maint-cmd``) will be called for a node to do the
   actual maintenance operation. It should do any operation needed to
   perform the maintenance, including triggering the actual reboot.
4) A command (``--verify-cmd``) will be called to check that the
   operation was successful. It has to wait until the target node is
   back up (and decide after how long it should give up) and perform the
   verification. If it's not successful hroller will stop and not
   proceed with other nodes.
5) The master node will be kept last, but will not otherwise be treated
   specially. If hroller is running on the master node, care must be
   exercised, as the master's maintenance will interrupt hroller itself
   and the verification step will therefore not happen. This will not
   be taken care of automatically in the first version. An additional
   flag to just skip the master node will be present as well, in case
   that's preferred.


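The sequence above can be sketched as a small Python driver. Several
points are assumptions made for the example and not decided by this
design: each command receives the node name as its only argument, an
exit status of 0 counts as a positive answer, the result of the
maintenance command itself is ignored in favour of the verification,
and nodes are handled strictly one at a time. ``evacuate`` is a
placeholder callable standing in for the evacuation of primary
instances.

```python
import subprocess


def run_hook(command, node):
    """Run one command on the local machine, passing the node name as
    its only argument. Assumption: exit status 0 means yes/success."""
    return subprocess.run([command, node]).returncode == 0


def roll(nodes, master, check_cmd, maint_cmd, verify_cmd, evacuate,
         run=run_hook, skip_master=False):
    """Maintain `nodes` one at a time and return those that were done."""
    # The master goes last, or is left out entirely if requested.
    ordered = [n for n in nodes if n != master]
    if master in nodes and not skip_master:
        ordered.append(master)

    # Check: only nodes that respond positively are processed.
    pending = [n for n in ordered if run(check_cmd, n)]

    done = []
    for node in pending:
        evacuate(node)                 # move primary instances away
        run(maint_cmd, node)           # maintenance, including the reboot
        if not run(verify_cmd, node):  # verification: stop on failure
            raise RuntimeError("verification failed on %s" % node)
        done.append(node)
    return done


# Dry run with stand-ins for the commands and for the evacuation.
log = []
done = roll(["node1", "master", "node2"], "master",
            "check", "maint", "verify",
            evacuate=lambda node: log.append(("evacuate", node)),
            run=lambda cmd, node: log.append((cmd, node)) or True)
# done == ["node1", "node2", "master"]
```

Passing the runner in as a parameter keeps the sketch testable without
touching real nodes. Note that it does not address the case discussed
above where the driver itself runs on the master.
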
Future work
===========

The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only when the sync is finished.

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by hroller.
This requires RPC/RAPI support for master failover. Hroller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: