.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,

Current state and shortcomings
==============================

To enable automated cluster-wide reboots, a new htool called HRoller
was added to Ganeti in version 2.7. This tool helps parallelize offline
cluster maintenance by calculating which nodes do not act as primary
and secondary of the same DRBD instance, and thus can be rebooted at
the same time when all instances are down.

How this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling

- HRoller should be able to operate on single nodegroups (``-G`` flag) or
  select its target nodes through some other means (e.g. via a tag or a
  regexp). (Note that individual node selection is already possible via
  the ``-O`` flag, which makes hroller ignore a node altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to choose between "it's ok to
  reboot a node when a non-redundant instance is on it" and "skip nodes
  with non-redundant instances". This will only be selectable globally,
  and not per instance.
- HRoller will make sure to keep any instance which is up in its current
  state, via live migrations, unless explicitly overridden. The
  algorithm that will be used to calculate the rolling reboot with live
  migrations is described below, and any override on considering the
  instance status will only be possible on the whole run, and not

Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

If an instance was shut down when the maintenance started, it will still
be taken into account to avoid rebooting its primary and secondary nodes
at the same time, but it will *not* be considered as a target for the
node evacuation. This avoids needlessly moving its primary around, since
it won't suffer any downtime anyway.

Note that a node with non-redundant instances will only ever be
considered good for a rolling reboot if these are down (or the checking
of status is overridden) *and* an explicit option to allow it is set.

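
The rule for nodes hosting non-redundant instances can be written as a
small predicate. This is only an illustrative sketch: the ``Instance``
record and the ``allow_non_redundant`` and ``ignore_status`` parameters
are hypothetical names, not existing hroller options.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    redundant: bool  # True for e.g. DRBD, False for e.g. plain or file
    running: bool


def node_is_candidate(instances, allow_non_redundant=False,
                      ignore_status=False):
    """Decide whether a node may be part of a rolling reboot.

    `instances` are the instances whose primary is on the node.
    Both option names are assumptions made for this sketch.
    """
    non_redundant = [i for i in instances if not i.redundant]
    if not non_redundant:
        return True
    # The explicit option is always required ...
    if not allow_non_redundant:
        return False
    # ... and the instances must be down, unless status is ignored.
    return ignore_status or all(not i.running for i in non_redundant)
```

With this shape, setting only the status override is never enough: the
explicit option must be set as well.
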
Each node must migrate all instances off to their secondaries, and then
can either be rebooted, or the secondaries can be evacuated as well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries on
it (citation needed). As such, we'll implement just the "migrate+reboot"
mode for now, and focus on replace-disks later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary of any instance, and also don't contain the primary
   nodes of two instances that have the same node as secondary. These
   can be obtained by computing a coloring of the graph with nodes
   as vertices and an edge between two nodes if either condition
   prevents simultaneous maintenance. (This is the current algorithm of
   :manpage:`hroller(1)` with the extension that the graph to be colored
   has additional edges between the primary nodes of two instances sharing
   their secondary node.)
2) It is then possible to migrate instances off all nodes in a set
   created at step 1 in parallel, then reboot/perform maintenance on
   them, and migrate back their original primaries. This allows the
   computation above to be reused for each following set without N+1
   failures being triggered, if none were present before. See below
   about the actual execution of the maintenance.

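
Step 1 can be sketched as follows. This is a minimal illustration and
not the hroller implementation: it uses a plain largest-degree-first
greedy coloring, and the function names and the ``(primary, secondary)``
pair representation are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def conflict_graph(nodes, instances):
    """Build the graph to be colored.

    `instances` is an iterable of (primary, secondary) node pairs.
    """
    graph = {n: set() for n in nodes}
    by_secondary = defaultdict(set)
    for primary, secondary in instances:
        # Primary and secondary of one instance must not go down together.
        graph[primary].add(secondary)
        graph[secondary].add(primary)
        by_secondary[secondary].add(primary)
    # Extra edges: primaries of two instances sharing their secondary.
    for primaries in by_secondary.values():
        for a, b in combinations(sorted(primaries), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def reboot_groups(graph):
    """Greedy coloring; each color class is one maintenance group."""
    color = {}
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {color[m] for m in graph[node] if m in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    groups = defaultdict(list)
    for node, c in color.items():
        groups[c].append(node)
    return [sorted(groups[c]) for c in sorted(groups)]
```

A greedy coloring does not guarantee the minimum number of groups, but
every group it returns satisfies both conditions of step 1.
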
All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such, instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  among all nodes in a group. Nodes in different nodegroups can be
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  restricting the cluster, until no more capacity in the nodegroup is
  available, and then having to wait for some nodes to come back so that
  capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, with the
  migrate-back approach discussed for DRBD to perform the calculation

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to consider
better management of RBD pools, if those are implemented on top of the
nodes' storage rather than on dedicated storage machines.

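
The idea of pre-calculating sets of nodes with a greedy algorithm can be
sketched as below. This is only a rough illustration under strong
assumptions: memory is the only resource considered, placement is
first-fit rather than delegated to hail, and instances are migrated back
after each batch so that every batch is planned against the original
layout. All names are hypothetical.

```python
def fits(batch, capacity, load):
    """Check whether all instances on `batch` fit on the other nodes.

    capacity: node -> total memory
    load:     node -> list of instance memory sizes on that node
    """
    free = {n: capacity[n] - sum(load[n])
            for n in capacity if n not in batch}
    sizes = sorted((s for n in batch for s in load[n]), reverse=True)
    for size in sizes:
        # First-fit decreasing: take the first node with enough room.
        target = next((n for n in sorted(free) if free[n] >= size), None)
        if target is None:
            return False
        free[target] -= size
    return True


def plan_batches(capacity, load):
    """Greedily group nodes that can be evacuated at the same time."""
    pending = sorted(capacity)
    batches = []
    while pending:
        batch = []
        for node in pending:
            if fits(batch + [node], capacity, load):
                batch.append(node)
        if not batch:
            raise ValueError("cannot evacuate any of: %s" % pending)
        batches.append(batch)
        pending = [n for n in pending if n not in batch]
    return batches
```

Because the plan only reasons about capacity, it says nothing about the
redundancy concerns raised above for local-storage templates.
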
If full evacuation of the nodes to be rebooted is desired, a simple
migration is not enough for the DRBD instances. To keep the number of
disk operations small, we restrict moves to ``migrate, replace-secondary``.
That is, after migrating instances out of the nodes to be rebooted, we
search for replacement secondaries for all instances whose secondary is
then on one of the nodes to be rebooted. This is done by a greedy
algorithm, refining the initial reboot partition if necessary.

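
A possible shape for this greedy search is sketched below. It is an
illustration only: disk space is the sole resource checked, targets are
picked first-fit, and each subgroup is assumed to be moved back before
the next one starts. The function names and the data layout are
assumptions made for the example.

```python
def evacuate_group(group, instances, free_disk):
    """Plan ``migrate, replace-secondary`` moves for one reboot group.

    instances: name -> (primary, secondary, disk_size)
    free_disk: node -> free disk space
    Returns a list of moves, or None if the group cannot be evacuated.
    """
    group = set(group)
    free = dict(free_disk)
    moves = []
    for name in sorted(instances):
        primary, secondary, size = instances[name]
        if primary in group:
            if secondary in group:
                return None  # excluded by the initial partition
            moves.append(("migrate", name, secondary))
            primary, secondary = secondary, primary
        if secondary in group:
            target = next((n for n in sorted(free)
                           if n not in group and n != primary
                           and free[n] >= size), None)
            if target is None:
                return None
            free[target] -= size
            moves.append(("replace-secondary", name, target))
    return moves


def refine(group, instances, free_disk):
    """Split a reboot group into subgroups that can each be evacuated."""
    result = []
    pending = sorted(group)
    while pending:
        sub = []
        for node in pending:
            if evacuate_group(sub + [node], instances, free_disk) is not None:
                sub.append(node)
        if not sub:
            raise ValueError("cannot evacuate any of: %s" % pending)
        result.append(sub)
        pending = [n for n in pending if n not in sub]
    return result
```
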
HRoller should become able to execute rolling maintenances, rather than
just calculate them. For this to succeed properly one of the following

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at each
- HRoller can selectively drain the cluster so it's sure that only the
  rolling maintenance can be going on

``replace-disks`` functionality for DRBD nodes should be implemented.
Note that once we support a DRBD version that allows multiple
secondaries, this can be done safely, without losing replication at any
time, by adding a temporary secondary and only when the sync is finished
dropping

Non-redundant (plain or file) instances should have a way to be moved
off as well, via plain storage live migration or ``gnt-instance move``
(which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by hroller.
This requires RPC/RAPI support for master failover. HRoller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :