============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots a new htool, called HRoller,
was added to Ganeti starting from version 2.7. This tool helps
parallelize cluster offline maintenance by calculating which sets of
nodes do not contain both the primary and the secondary of any DRBD
instance, and can thus be rebooted at the same time, while all
instances are down.

    
The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

    
Down instances
++++++++++++++

If an instance was shut down when the maintenance started it will be
ignored. This avoids needlessly moving its primary around, since it
won't suffer any downtime anyway.

    
DRBD
++++

Each node must first have all of its instances migrated off to their
secondaries; it can then either be rebooted, or its secondaries can be
evacuated as well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries
on it (citation needed). As such we'll implement just the
"migrate+reboot" mode for now, and focus on replace-disks later.

In order to do that we can use the following algorithm (a sketch of
the computation is given after the list):

1) Compute node sets that don't contain both the primary and the
secondary of any instance. This can already be done by the current
hroller graph coloring algorithm: nodes are in the same set (color) if
and only if no edge (instance) exists between them (see the
:manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any secondary
node in common (this can be done by creating a graph of nodes that are
connected if and only if an instance on both has the same secondary
node, and coloring that graph).
3) It is then possible to migrate in parallel all nodes in a subset
created at step 2, then reboot/perform maintenance on them, and finally
migrate their original primaries back. This allows the computation
above to be reused for each following subset without triggering N+1
failures, if none were present before. See below about the actual
execution of the maintenance.
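
The two-level coloring can be sketched as follows. This is only an
illustration, not hroller's actual implementation: representing
instances as ``(primary, secondary)`` node pairs and the helper names
are assumptions made for the example.

.. code-block:: python

   # Sketch of the two-level greedy coloring described above; the real
   # hroller uses its own graph code, this only mirrors the idea.
   from collections import defaultdict

   def greedy_coloring(nodes, edges):
       """Split nodes into sets so that no edge joins two nodes of the
       same set."""
       neighbours = defaultdict(set)
       for a, b in edges:
           neighbours[a].add(b)
           neighbours[b].add(a)
       colors = {}
       for node in sorted(nodes, key=lambda n: -len(neighbours[n])):
           used = {colors[n] for n in neighbours[node] if n in colors}
           colors[node] = next(c for c in range(len(nodes))
                               if c not in used)
       groups = defaultdict(set)
       for node, color in colors.items():
           groups[color].add(node)
       return list(groups.values())

   def reboot_groups(nodes, instances):
       """instances: list of (primary, secondary) node name pairs."""
       result = []
       # Step 1: the primary and the secondary of an instance must not
       # end up in the same set.
       for node_set in greedy_coloring(nodes, instances):
           # Step 2: nodes whose instances share a secondary must not
           # be migrated in parallel.
           secs = {n: {s for p, s in instances if p == n}
                   for n in node_set}
           conflicts = [(a, b) for a in node_set for b in node_set
                        if a < b and secs[a] & secs[b]]
           result.extend(greedy_coloring(node_set, conflicts))
       return result

For example, ``reboot_groups({"n1", "n2", "n3"}, [("n1", "n3"),
("n2", "n3")])`` keeps ``n1`` and ``n2`` in separate subsets, since
both would migrate their instances onto ``n3``.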

Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  between all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  restricting the cluster, until no more capacity is available in the
  nodegroup, and then having to wait for some nodes to come back so that
  capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm, such as the one sketched after this list) and
  parallelize between them, with the migrate-back approach discussed for
  DRBD to perform the calculation only once.
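
The greedy pre-calculation mentioned in the last item could, for
example, batch nodes as long as the instances of a batch still fit on
the nodes outside it. This is only a sketch under assumed inputs
(per-node free and used memory); it deliberately ignores disk, CPU and
N+1 constraints that a real implementation would have to respect.

.. code-block:: python

   # Sketch of a greedy batching heuristic for one nodegroup; free_mem
   # and used_mem (MiB per node) are assumed inputs, not Ganeti data
   # structures.
   def migration_batches(free_mem, used_mem):
       """Return sets of nodes whose instances fit on the other nodes."""
       all_nodes = set(free_mem)
       batches = []
       # Handle the most loaded nodes first, while there is most room.
       for node in sorted(used_mem, key=used_mem.get, reverse=True):
           for batch in batches:
               hosts = all_nodes - batch - {node}
               capacity = sum(free_mem[n] for n in hosts)
               need = sum(used_mem[n] for n in batch) + used_mem[node]
               if need <= capacity:
                   batch.add(node)
                   break
           else:
               # Assume a single node can always be evacuated alone.
               batches.append({node})
       return batches

Running the batches one after the other, with the migrate-back step in
between, then mirrors the DRBD approach above.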

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to consider
managing RBD pools better, if those are implemented on top of node
storage, rather than on dedicated storage machines.

    
Executing rolling maintenances
------------------------------

Hroller accepts commands to run to do maintenance automatically. These
are going to be run on the machine hroller runs on, and take a node
name as input. They then have to gain access to the target node (via
ssh, restricted commands, or some other means) and perform their duty.
A sketch of the resulting per-node flow is given after the list below.

1) A command (--check-cmd) will be called on all selected online nodes
to check whether a node needs maintenance. Hroller will proceed only on
nodes that respond positively to this invocation.
FIXME: decide about -D
2) Hroller will evacuate the node of all primary instances.
3) A command (--maint-cmd) will be called on a node to do the actual
maintenance operation. It should do any operation needed to perform the
maintenance, including triggering the actual reboot.
4) A command (--verify-cmd) will be called to check that the operation
was successful. It has to wait until the target node is back up (and
decide after how long it should give up) and perform the verification.
If it's not successful hroller will stop and not proceed with other
nodes.
5) The master node will be kept last, but will not otherwise be treated
specially. If hroller was running on the master node, care must be
exercised, as its maintenance will have interrupted the software
itself, and as such the verification step will not happen. This will
not automatically be taken care of in the first version. An additional
flag to just skip the master node will be present as well, in case
that's preferred.
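
As an illustration of this flow, the sketch below runs the three hooks
for a single node. The option names match the list above, but
``run_hook`` and ``evacuate_primaries`` are hypothetical stand-ins, not
Ganeti or hroller APIs.

.. code-block:: python

   # Sketch of the per-node flow; the hooks map to --check-cmd,
   # --maint-cmd and --verify-cmd, everything else is assumed.
   import subprocess

   def run_hook(cmd, node):
       """Run a hook on the hroller machine, passing the node name."""
       return subprocess.run([cmd, node]).returncode == 0

   def maintain_node(node, check_cmd, maint_cmd, verify_cmd,
                     evacuate_primaries):
       if not run_hook(check_cmd, node):
           return True              # no maintenance needed, skip node
       evacuate_primaries(node)     # step 2: migrate primaries away
       run_hook(maint_cmd, node)    # step 3: maintenance and reboot
       # Step 4: the verify command itself waits for the node to come
       # back (and decides when to give up); a failure stops the run.
       return run_hook(verify_cmd, node)

A driver would call this for every node of the sets computed earlier,
stop at the first verification failure, and keep the master node for
last (or skip it, if the skip flag is given).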

Future work
===========

The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only when the sync is finished.

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is also evacuated from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by hroller.
This requires RPC/RAPI support for master failover. Hroller should also
be modified to better support running on the master itself and
continuing on the new master.

    
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: