============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To automate cluster-wide reboots, a new htool called HRoller was added
in Ganeti 2.7. The tool helps parallelize offline cluster maintenance
by computing sets of nodes in which no node is the primary of a DRBD
instance whose secondary is also in the set. All nodes in such a set
can be rebooted at the same time, while all instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started, it will be
ignored. This avoids needlessly moving its primary around, since it
won't suffer any downtime anyway.


DRBD
++++

Each node must migrate all instances off to their secondaries, and then
can either be rebooted, or the secondaries can be evacuated as well.

Since doing a ``replace-disks`` on DRBD currently breaks redundancy,
it's not any safer than temporarily rebooting a node that has
secondaries on it (citation needed). As such, for now we'll implement
just the "migrate+reboot" mode, and focus on ``replace-disks`` later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary for any instance. This can already be done by the current
   hroller graph coloring algorithm: nodes can be in the same set
   (color) only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set, calculate subsets whose nodes don't have any
   secondary node in common. This can be done by creating a graph of
   nodes that are connected if and only if they each host a primary
   instance with the same secondary node, and coloring that graph.
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, reboot/perform maintenance on them, and then
   migrate their original primaries back. This allows the computation
   above to be reused for each following subset without N+1 failures
   being triggered, if none were present before. See below about the
   actual execution of the maintenance.
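
As a rough illustration, steps 1 and 2 above can be sketched in Python.
This is only a sketch under stated assumptions, not hroller's actual
implementation: the function names (``greedy_coloring``,
``rolling_reboot_groups``), the input format (a mapping from instance
name to a ``(primary, secondary)`` pair) and the simple
highest-degree-first greedy coloring are all hypothetical choices made
for this example.

```python
from collections import defaultdict
from itertools import combinations


def greedy_coloring(nodes, edges):
    """Split nodes into classes so that no edge joins two nodes of a class.

    Hypothetical helper: a plain greedy coloring, highest degree first.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    color_of = {}
    # Ties are broken by node name so that the result is deterministic.
    for node in sorted(nodes, key=lambda n: (-len(neighbours[n]), n)):
        used = {color_of[m] for m in neighbours[node] if m in color_of}
        color = 0
        while color in used:
            color += 1
        color_of[node] = color
    classes = defaultdict(list)
    for node in sorted(nodes):
        classes[color_of[node]].append(node)
    return [classes[c] for c in sorted(classes)]


def rolling_reboot_groups(nodes, instances):
    """Return groups of nodes that can be migrated and rebooted together.

    instances maps an instance name to its (primary, secondary) nodes;
    this input format is an assumption made for the sketch.
    """
    groups = []
    # Step 1: an instance is an edge between its primary and secondary.
    for node_set in greedy_coloring(nodes, list(instances.values())):
        # Step 2: inside a set, two nodes conflict when primaries hosted
        # on them would migrate to the same secondary node.
        secondaries = {node: set() for node in node_set}
        for primary, secondary in instances.values():
            if primary in secondaries:
                secondaries[primary].add(secondary)
        conflicts = [(a, b) for a, b in combinations(node_set, 2)
                     if secondaries[a] & secondaries[b]]
        groups.extend(greedy_coloring(node_set, conflicts))
    return groups


if __name__ == "__main__":
    example = {"i1": ("n1", "n2"), "i2": ("n3", "n2"), "i3": ("n4", "n1")}
    print(rolling_reboot_groups(["n1", "n2", "n3", "n4"], example))
```

Each returned group can then be handled as described in step 3. A
greedy coloring does not promise the smallest possible number of
groups; it only ensures that both constraints hold inside every group.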

Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such, instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then automatically be targeted
  again to host instances, as hail chooses targets for the instances
  among all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  progressively reducing the cluster's capacity, until no more capacity
  is available in the nodegroup; we then have to wait for some nodes to
  come back so that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, using the
  migrate-back approach discussed for DRBD so that the calculation is
  performed only once.
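
The third option above could look roughly like the following Python
sketch. It rests on a deliberately simplified, hypothetical capacity
model that is not part of this design: every node of a group is
described only by the memory used by its instances and by its free
memory, and a batch is accepted when the memory to be moved fits in the
total free memory of the nodes that stay online. Real placement would
be decided by hail and has to consider each instance and resource
individually; the function name ``greedy_evacuation_batches`` is made up
for the example.

```python
def greedy_evacuation_batches(nodes):
    """Greedily split one node group into batches evacuated together.

    nodes maps a node name to (used_mem, free_mem). Because instances
    are migrated back after each batch, every batch is computed against
    the same original state (the migrate-back approach).
    """
    # Try the least loaded nodes first; ties are broken by name.
    remaining = sorted(nodes, key=lambda n: (nodes[n][0], n))
    batches = []
    while remaining:
        batch = []
        load = 0
        for name in remaining:
            used, _free = nodes[name]
            candidate = batch + [name]
            # Free memory on all nodes that would stay online.
            spare = sum(free for n, (_used, free) in nodes.items()
                        if n not in candidate)
            if load + used <= spare:
                batch.append(name)
                load += used
        if not batch:
            raise ValueError("not enough spare capacity to evacuate a node")
        batches.append(batch)
        remaining = [n for n in remaining if n not in batch]
    return batches


if __name__ == "__main__":
    group = {"a": (4, 4), "b": (4, 4), "c": (2, 6), "d": (6, 2)}
    print(greedy_evacuation_batches(group))
```

Batches of different nodegroups are independent and could be processed
in parallel, as already noted for the first option.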

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to consider
better management of RBD pools, if those are implemented on top of node
storage rather than on dedicated storage machines.

Executing rolling maintenances
------------------------------

HRoller accepts commands that it runs to perform maintenance
automatically. These are run on the machine hroller runs on, and take a
node name as input. They then have to gain access to the target node
(via ssh, restricted commands, or some other means) and perform their
duty.

1) A command (``--check-cmd``) will be called for all selected online
   nodes to check whether a node needs maintenance. HRoller will proceed
   only on nodes that respond positively to this invocation.
   FIXME: decide about -D
2) HRoller will evacuate the node of all primary instances.
3) A command (``--maint-cmd``) will be called on a node to do the actual
   maintenance operation. It should do any operation needed to perform
   the maintenance, including triggering the actual reboot.
4) A command (``--verify-cmd``) will be called to check that the
   operation was successful. It has to wait until the target node is
   back up (and decide after how long it should give up) and then
   perform the verification. If it's not successful, hroller will stop
   and not proceed with other nodes.
5) The master node will be kept last, but will not otherwise be treated
   specially. If hroller was running on the master node, care must be
   exercised, as its maintenance will have interrupted the software
   itself, and as such the verification step will not happen. This will
   not automatically be taken care of in the first version. An
   additional flag to just skip the master node will be present as well,
   in case that's preferred.
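
The sequence above can be sketched as a small Python driver loop. It is
only an illustration under assumptions that this document does not
settle: the names ``run_hook`` and ``rolling_maintenance`` are
hypothetical, an exit status of 0 is assumed to mean a positive answer
from a hook, and the evacuation step is passed in as a plain callable
instead of being performed by hroller itself.

```python
import subprocess


def run_hook(command, node):
    """Run a hook on the local machine with the node name as argument.

    Assumption: an exit status of 0 means "yes" or "success".
    """
    return subprocess.run([command, node], check=False).returncode == 0


def rolling_maintenance(groups, master, check, evacuate, maintain, verify,
                        skip_master=False):
    """Process groups in order; return (nodes_done, failed_node).

    failed_node is None when every verification succeeded.
    """
    # Keep the master out of the regular groups and, unless it is
    # skipped, schedule it last.
    schedule = [[n for n in group if n != master] for group in groups]
    if not skip_master and any(master in group for group in groups):
        schedule.append([master])

    done = []
    for group in schedule:
        # Only nodes whose check answers positively are processed.
        todo = [node for node in group if check(node)]
        # Evacuate primary instances before touching the nodes.
        for node in todo:
            evacuate(node)
        # Perform the actual maintenance (including any reboot).
        for node in todo:
            maintain(node)
        # Stop at the first failed verification.
        for node in todo:
            if not verify(node):
                return done, node
            done.append(node)
    return done, None
```

In a real run the ``check``, ``maintain`` and ``verify`` callables would
wrap the commands given with ``--check-cmd``, ``--maint-cmd`` and
``--verify-cmd``, for instance through ``run_hook``. Note that the
sketch does not address the case where the driver itself runs on the
master node.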


Future work
===========

``replace-disks`` functionality for DRBD nodes should be implemented.
Note that once we support a DRBD version that allows multiple
secondaries, this can be done safely, without losing replication at any
time, by adding a temporary secondary and dropping the previous one only
when the sync is finished.

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by hroller.
This requires RPC/RAPI support for master failover. HRoller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: