============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots, a new htool called HRoller
was added to Ganeti starting from version 2.7. This tool helps
parallelize offline cluster maintenance by calculating sets of nodes
in which no two nodes are the primary and the secondary of the same
DRBD instance, and which can thus be rebooted at the same time when all
instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================

New options
-----------

- HRoller should be able to operate on single nodegroups (-G flag) or
  select its target nodes through some other means (e.g. via a tag, or
  a regexp). (Note that individual node selection is already possible
  via the -O flag, which makes hroller ignore a node altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to select its behavior between
  "it's ok to reboot a node when a non-redundant instance is on it" and
  "skip nodes with non-redundant instances". This will only be
  selectable globally, and not per instance.
- HRoller will make sure to keep any instance which is up in its current
  state, via live migrations, unless explicitly overridden. The
  algorithm that will be used to calculate the rolling reboot with live
  migrations is described below, and any override on considering the
  instance status will only be possible on the whole run, and not
  per instance.

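As a rough illustration of the node selection described above, the
following Python sketch filters candidate nodes by nodegroup, tag and
name regexp, and drops explicitly excluded nodes. The ``Node`` record,
the ``select_nodes`` helper and all of its parameters are hypothetical
names invented for this example; they do not describe hroller's actual
interface.

```python
import re
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical node record, used only for this illustration."""
    name: str
    group: str
    tags: FrozenSet[str] = field(default_factory=frozenset)


def select_nodes(nodes: Iterable[Node],
                 group: Optional[str] = None,
                 tag: Optional[str] = None,
                 name_regexp: Optional[str] = None,
                 excluded: Iterable[str] = ()) -> List[Node]:
    """Return the nodes matching every criterion that was given.

    A criterion left as None is not applied. Names listed in
    `excluded` are always dropped (assumed to mirror the -O flag).
    """
    pattern = re.compile(name_regexp) if name_regexp else None
    skip = set(excluded)
    selected = []
    for node in nodes:
        if node.name in skip:
            continue
        if group is not None and node.group != group:
            continue
        if tag is not None and tag not in node.tags:
            continue
        if pattern is not None and not pattern.search(node.name):
            continue
        selected.append(node)
    return selected
```

All criteria are combined with a logical "and" here; whether tag and
regexp selection should combine that way is a design choice this
document leaves open.
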

Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started, it will be
considered for avoiding simultaneous reboots of its primary and
secondary nodes, but will *not* be considered as a target for the node
evacuation. This avoids needlessly moving its primary around, since it
won't suffer any downtime anyway.

Note that a node with non-redundant instances will only ever be
considered good for rolling-reboot if these are down (or the checking of
status is overridden) *and* an explicit option to allow it is set.

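The rules above can be sketched as two small predicates. This is only
an illustration: the ``Instance`` record and the ``ignore_status`` and
``allow_non_redundant`` parameters are hypothetical names, and the
sketch assumes that overriding the status check means treating every
instance as if it were down.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Instance:
    """Hypothetical instance record, used only for this illustration."""
    name: str
    primary: str
    secondary: Optional[str]  # None for a non-redundant instance
    running: bool


def instances_to_evacuate(node: str, instances: Iterable[Instance],
                          ignore_status: bool = False) -> List[Instance]:
    """Redundant instances that must be migrated off `node` first.

    Down instances are never evacuation targets; with `ignore_status`
    set, running instances are treated like down ones (assumption).
    """
    return [inst for inst in instances
            if inst.primary == node
            and inst.secondary is not None
            and inst.running and not ignore_status]


def node_is_rebootable(node: str, instances: Iterable[Instance],
                       allow_non_redundant: bool = False,
                       ignore_status: bool = False) -> bool:
    """Apply the non-redundant instance rule to a single node."""
    for inst in instances:
        if inst.primary != node or inst.secondary is not None:
            continue
        # A non-redundant instance lives here: the explicit option
        # must be set *and* the instance must be down (or its status
        # ignored) for the node to qualify.
        if not allow_non_redundant:
            return False
        if inst.running and not ignore_status:
            return False
    return True
```

Down instances still contribute their primary/secondary pair to the
reboot constraints; that part is handled by the node set computation,
not by these predicates.
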
DRBD
++++

Each node must migrate all instances off to their secondaries, and then
can either be rebooted, or the secondaries can be evacuated as well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries on
it (citation needed). As such we'll implement for now just the
"migrate+reboot" mode, and focus later on replace-disks as well.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary for any instance. This can be done already by the current
   hroller graph coloring algorithm: nodes can be in the same set
   (color) only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any secondary
   node in common (this can be done by creating a graph in which two
   nodes are connected if and only if they are primaries of instances
   that share the same secondary node, and coloring that graph).
3) It is then possible to migrate in parallel the instances off all
   nodes in a subset created at step 2, then reboot/perform maintenance
   on those nodes, and migrate the instances back to their original
   primaries. This allows the computation above to be reused for each
   following subset without N+1 failures being triggered, if none were
   present before. See below about the actual execution of the
   maintenance.

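Steps 1 and 2 above can be sketched as two applications of the same
coloring routine. The code below is a minimal illustration, not
hroller's implementation: it uses a plain greedy coloring (which is
valid but not necessarily minimal), represents each DRBD instance as a
hypothetical ``(primary, secondary)`` pair, and all function names are
invented for this example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def greedy_coloring(vertices: Sequence[str],
                    edges: Iterable[Edge]) -> List[List[str]]:
    """Group vertices so that no two in a group share an edge."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    groups: List[List[str]] = []
    for vertex in vertices:
        for group in groups:
            # First group with no neighbour of this vertex wins.
            if neighbours[vertex].isdisjoint(group):
                group.append(vertex)
                break
        else:
            groups.append([vertex])
    return groups


def reboot_sets(nodes: Sequence[str],
                instances: Iterable[Edge]) -> List[List[str]]:
    """Step 1: an instance is an edge between primary and secondary."""
    return greedy_coloring(nodes, instances)


def migration_subsets(node_set: Sequence[str],
                      instances: Iterable[Edge]) -> List[List[str]]:
    """Step 2: split a node set so no two nodes share a secondary."""
    members = set(node_set)
    primaries_by_secondary: Dict[str, Set[str]] = defaultdict(set)
    for primary, secondary in instances:
        if primary in members:
            primaries_by_secondary[secondary].add(primary)
    # Two primaries conflict when they would migrate to the same node.
    edges = [pair
             for primaries in primaries_by_secondary.values()
             for pair in combinations(sorted(primaries), 2)]
    return greedy_coloring(node_set, edges)
```

For example, with nodes ``n1`` to ``n4`` and instances
``(n1, n2)``, ``(n3, n2)`` and ``(n4, n1)``, step 1 yields the sets
``[n1, n3]`` and ``[n2, n4]``. Step 2 then splits the first set into
``[n1]`` and ``[n3]``, since both nodes would migrate their instances
to ``n2``, while the second set stays whole.
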
Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  between all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  progressively reducing the cluster's capacity, until no more capacity
  is available in the nodegroup; we then have to wait for some nodes to
  come back so that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, with the
  migrate-back approach discussed for DRBD to perform the calculation
  only once.

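The third option could look roughly like the sketch below. It is an
assumption-laden illustration rather than a proposed implementation:
memory is the only resource considered, instances are placed with a
simple first-fit rule instead of asking an allocator such as hail, and
the data layout (a mapping from node name to free memory and a list of
instance memory sizes) is hypothetical. Because instances are migrated
back after each batch, every batch is checked against the original
cluster state.

```python
from typing import Dict, List, Sequence, Tuple

# node name -> (free memory, memory sizes of the instances it hosts)
Cluster = Dict[str, Tuple[int, List[int]]]


def batch_fits(batch: Sequence[str], cluster: Cluster) -> bool:
    """Check that all instances of `batch` fit on the other nodes."""
    free = {name: spec[0] for name, spec in cluster.items()
            if name not in batch}
    sizes = sorted((size for name in batch for size in cluster[name][1]),
                   reverse=True)
    for size in sizes:
        # First-fit: take the first remaining node with enough room.
        target = next((n for n, f in free.items() if f >= size), None)
        if target is None:
            return False
        free[target] -= size
    return True


def greedy_batches(cluster: Cluster) -> List[List[str]]:
    """Greedily build batches of nodes that can be evacuated together."""
    remaining = list(cluster)
    batches: List[List[str]] = []
    while remaining:
        batch: List[str] = []
        for name in remaining:
            if batch_fits(batch + [name], cluster):
                batch.append(name)
        if not batch:
            raise ValueError("no remaining node can be evacuated")
        batches.append(batch)
        remaining = [name for name in remaining if name not in batch]
    return batches
```

With four nodes that each have 4 units of free memory and host one
instance of size 2, this produces two batches of two nodes each. A
greedy first-fit can miss feasible groupings, so the result is a safe
but not necessarily optimal schedule.
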
Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This perhaps would be a good reason to consider
managing RBD pools better, if those are implemented on top of the
nodes' storage rather than on dedicated storage machines.

Future work
===========

HRoller should become able to execute rolling maintenances, rather than
just calculate them. For this to succeed properly one of the following
must happen:

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at each
  step.
- HRoller can selectively drain the cluster so it's sure that only the
  rolling maintenance can be going on.

``replace-disks`` functionality for DRBD nodes should be implemented.
Note that once we support a DRBD version that allows multiple
secondaries, this can be done safely, without losing replication at any
time, by adding a temporary secondary and dropping the previous one only
when the sync is finished.

Non-redundant (plain or file) instances should have a way to be moved
off as well, via plain storage live migration or ``gnt-instance move``
(which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by HRoller.
This requires RPC/RAPI support for master failover. HRoller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: