============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots a new htool, called HRoller,
was added to Ganeti starting from version 2.7. This tool helps
parallelize cluster offline maintenance by calculating which sets of
nodes do not contain both the primary and the secondary of any DRBD
instance, and thus can be rebooted at the same time once all instances
are down.

The way this is done is documented in the :manpage:`hroller(1)`
manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================

New options
-----------

- HRoller should be able to operate on single nodegroups (-G flag) or
  select its target nodes through some other means (e.g. via a tag or
  a regexp), as sketched below. (Note that individual node selection
  is already possible via the -O flag, which makes hroller ignore a
  node altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to choose between "it's ok to
  reboot a node when a non-redundant instance is on it" and "skip
  nodes with non-redundant instances". This will only be selectable
  globally, and not per instance.
- HRoller will make sure to keep any instance which is up in its
  current state, via live migrations, unless explicitly overridden.
  The algorithm that will be used to calculate the rolling reboot with
  live migrations is described below, and any override on considering
  the instance status will only be possible for the whole run, and not
  per instance.
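
As an illustration of the intended selection semantics, here is a
minimal Python sketch (not actual hroller code; the node attributes
and the helper name are hypothetical) of how nodegroup, tag and regexp
filters could compose::

  import re

  def select_nodes(nodes, group=None, tag=None, name_re=None):
      """Filter candidate nodes for a rolling maintenance run.

      ``nodes`` is assumed to be a list of objects with ``name``,
      ``group`` and ``tags`` attributes; nodes excluded here would be
      treated like nodes ignored via the -O flag.
      """
      selected = []
      for node in nodes:
          if group is not None and node.group != group:
              continue  # -G style nodegroup restriction
          if tag is not None and tag not in node.tags:
              continue  # hypothetical tag-based selection
          if name_re is not None and not re.search(name_re, node.name):
              continue  # hypothetical regexp-based selection
          selected.append(node)
      return selected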
Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances
off the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started it will be
taken into account to avoid rebooting its primary and secondary nodes
at the same time, but will *not* be considered as a target for the
node evacuation. This avoids needlessly moving the instance around,
since it won't suffer any downtime anyway.

Note that a node with non-redundant instances will only ever be
considered good for a rolling reboot if those instances are down (or
the status check is overridden) *and* an explicit option to allow it
is set.

DRBD
++++

Each node must have all of its primary instances migrated off to
their secondaries; then it can either be rebooted, or the secondaries
can be evacuated as well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries
on it (citation needed). As such, for now we'll implement just the
"migrate+reboot" mode, and focus on replace-disks later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary of any instance. This can be done already by the current
   hroller graph coloring algorithm: nodes are in the same set (color)
   if and only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any
   secondary node in common (this can be done by creating a graph of
   nodes that are connected if and only if they host instances sharing
   the same secondary node, and coloring that graph).
3) All nodes in a subset computed at step 2 can then be migrated off
   in parallel, rebooted/maintained, and their original primary
   instances migrated back. Migrating back allows the computation
   above to be reused for each following subset without triggering N+1
   failures, if none were present before. See below about the actual
   execution of the maintenance; a sketch of the subset computation
   follows this list.
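
To make the two-level coloring concrete, here is a minimal Python
sketch of the subset computation. It is illustrative only: hroller's
actual implementation lives in htools, and the greedy coloring and the
(primary, secondary) input format are assumptions made here::

  from collections import defaultdict

  def greedy_coloring(nodes, edges):
      """Assign each node the smallest color unused by its neighbours.

      ``edges`` is a set of two-element frozensets of node names;
      returns a dict mapping node name -> color (an int).
      """
      neighbours = defaultdict(set)
      for a, b in (tuple(e) for e in edges):
          neighbours[a].add(b)
          neighbours[b].add(a)
      colors = {}
      for node in sorted(nodes):  # any deterministic order works
          used = {colors[n] for n in neighbours[node] if n in colors}
          colors[node] = next(c for c in range(len(nodes))
                              if c not in used)
      return colors

  def reboot_subsets(instances):
      """Compute subsets of nodes maintainable in parallel.

      ``instances`` is a list of (primary, secondary) node-name
      pairs, one per DRBD instance.
      """
      nodes = {n for inst in instances for n in inst}
      # Step 1: the primary and secondary of each instance are
      # connected, so nodes of the same color share no instance.
      step1 = greedy_coloring(nodes,
                              {frozenset(inst) for inst in instances})
      subsets = []
      for color in set(step1.values()):
          node_set = {n for n, c in step1.items() if c == color}
          # Step 2: connect two nodes in the set iff they host
          # instances sharing the same secondary node.
          by_sec = defaultdict(set)
          for pri, sec in instances:
              if pri in node_set:
                  by_sec[sec].add(pri)
          edges = {frozenset((a, b))
                   for prims in by_sec.values()
                   for a in prims for b in prims if a != b}
          step2 = greedy_coloring(node_set, edges)
          for c2 in set(step2.values()):
              subsets.append({n for n, c in step2.items() if c == c2})
      return subsets

Each returned subset can be evacuated via live migration and
maintained in parallel, with instances migrated back before the next
subset is started.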
Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can use one of the following
approaches:

- Perform migrations on one node at a time, perform the maintenance
  on that node, and proceed (the node will then be targeted again to
  host instances automatically, as hail chooses targets for the
  instances among all nodes in a group). Nodes in different
  nodegroups can be handled in parallel.
- Perform migrations on one node at a time, but without waiting for
  the first node to come back before proceeding. This allows us to
  continue, shrinking the available cluster capacity, until no more
  capacity is left in the nodegroup, at which point we have to wait
  for some nodes to come back so that capacity is available again for
  the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm, sketched below) and parallelize between
  them, using the migrate-back approach discussed for DRBD to perform
  the calculation only once.
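
One possible shape for that greedy pre-calculation, under a
deliberately simplified capacity model (per-node used and free memory
as the only resource, whereas hail's real checks also cover disk, CPU
and N+1 constraints), could be::

  def greedy_parallel_sets(nodes, used_mem, free_mem):
      """Greedily pack nodes into sets that can be drained together.

      ``used_mem[n]`` is the total memory of the instances on node
      ``n`` and ``free_mem[n]`` its spare capacity; both are
      simplifying assumptions made for this sketch.
      """
      remaining = set(nodes)
      sets = []
      while remaining:
          current, load = set(), 0
          # Smallest nodes first, to fit more nodes per round.
          for node in sorted(remaining, key=used_mem.get):
              trial = load + used_mem[node]
              # Spare capacity on nodes staying up this round, plus
              # nodes already maintained and back in service.
              spare = sum(free_mem[n]
                          for n in remaining - current - {node})
              spare += sum(free_mem[n] for n in set(nodes) - remaining)
              if trial <= spare:
                  current.add(node)
                  load = trial
          if not current:
              # Nothing fits: fall back to one node at a time, as in
              # the first approach above.
              current.add(min(remaining, key=used_mem.get))
          sets.append(current)
          remaining -= current
      return sets

Each set would then be drained via live migration, maintained, and
its instances migrated back before the next set is started.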
Note that for non-DRBD disks that still use local storage (e.g. RBD
and plain) redundancy might break anyway, and nothing except the
first approach might be safe. This would perhaps be a good reason to
consider better management of RBD pools, if those are implemented on
top of node storage rather than on dedicated storage machines.

Future work
===========

HRoller should become able to execute rolling maintenances, rather
than just calculate them. For this to succeed properly one of the
following must happen:

- HRoller handles rolling maintenances that happen at the same time
  as unrelated cluster jobs, and thus recalculates the maintenance at
  each step
- HRoller can selectively drain the cluster so it can be sure that
  only the rolling maintenance is going on

The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only once the sync is finished.

Non-redundant (plain or file) instances should have a way to be moved
off as well, via plain storage live migration or
``gnt-instance move`` (which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is also evacuated from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by
HRoller. This requires RPC/RAPI support for master failover. HRoller
should also be modified to better support running on the master
itself and continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: