============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots, a new htool called HRoller
was added to Ganeti in version 2.7. This tool helps parallelize offline
cluster maintenance by calculating which sets of nodes do not contain
both the primary and the secondary node of any DRBD instance, and thus
can be rebooted at the same time when all instances are down.

How this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================

New options
-----------

- HRoller should be able to operate on single nodegroups (-G flag) or
  select its target nodes through some other means (e.g. via a tag or a
  regexp). (Note that individual node selection is already possible via
  the -O flag, which makes hroller ignore a node altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to choose between "it's ok to
  reboot a node when a non-redundant instance is on it" and "skip nodes
  with non-redundant instances". This will only be selectable globally,
  and not per instance.
- HRoller will make sure to keep any instance which is up in its current
  state, via live migrations, unless explicitly overridden. The
  algorithm that will be used to calculate the rolling reboot with live
  migrations is described below, and any override of the instance status
  check will only apply to the whole run, and not per instance.
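
As an illustration only, the node selection criteria listed above could
be combined as in the following sketch. The ``Node`` record and the
``select_nodes`` helper, including all of their parameter names, are
hypothetical and are not part of hroller or of the Ganeti code base; the
``excluded`` argument merely mirrors the effect described for the -O
flag.

.. code-block:: python

    import re
    from dataclasses import dataclass, field


    @dataclass(frozen=True)
    class Node:
        # Hypothetical, minimal node record used only for this sketch.
        name: str
        group: str
        tags: frozenset = field(default_factory=frozenset)


    def select_nodes(nodes, group=None, tag=None, name_regex=None,
                     excluded=()):
        """Return the nodes matching every given criterion."""
        pattern = re.compile(name_regex) if name_regex else None
        selected = []
        for node in nodes:
            if node.name in excluded:
                continue  # explicitly ignored node
            if group is not None and node.group != group:
                continue  # outside the requested nodegroup
            if tag is not None and tag not in node.tags:
                continue  # does not carry the requested tag
            if pattern is not None and not pattern.search(node.name):
                continue  # name does not match the regexp
            selected.append(node)
        return selected

Criteria that are left at ``None`` are simply not applied, so the same
helper covers selection by nodegroup only, by tag only, or by any
combination of the three.
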


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++
58 09208925 Guido Trotter
59 09208925 Guido Trotter
If an instance was shutdown when the maintenance started it will be
60 fb4b885a Guido Trotter
considered for avoiding contemporary reboot of its primary and secondary
61 fb4b885a Guido Trotter
nodes, but will *not* be considered as a target for the node evacuation.
62 fb4b885a Guido Trotter
This allows avoiding needlessly moving its primary around, since it
63 fb4b885a Guido Trotter
won't suffer a downtime anyway.
64 09208925 Guido Trotter
65 fb4b885a Guido Trotter
Note that a node with non-redundant instances will only ever be
66 fb4b885a Guido Trotter
considered good for rolling-reboot if these are down (or the checking of
67 fb4b885a Guido Trotter
status is overridden) *and* an explicit option to allow it is set.
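
The rule for nodes hosting non-redundant instances can be written down
as a small predicate. This is only a sketch of the condition stated
above: the function name, its parameters and the representation of
instances as ``(is_redundant, is_running)`` pairs are assumptions made
for the example.

.. code-block:: python

    def node_is_eligible(instances_on_node, allow_non_redundant=False,
                         ignore_status=False):
        """Decide whether a node may take part in a rolling reboot.

        instances_on_node: iterable of (is_redundant, is_running) pairs,
        one per instance that has this node as its primary.
        """
        for is_redundant, is_running in instances_on_node:
            if is_redundant:
                continue  # redundant instances are handled by migration
            if not allow_non_redundant:
                return False  # the explicit option is mandatory
            if is_running and not ignore_status:
                return False  # a running instance would suffer downtime
        return True

Both conditions have to hold for every non-redundant instance, so a
single running non-redundant instance is enough to rule the node out
unless the status check is overridden.
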

DRBD
++++

Each node must migrate all instances off to their secondaries, and can
then either be rebooted, or have its secondaries evacuated as well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries on
it (citation needed). As such, for now we'll implement just the
"migrate+reboot" mode, and focus on ``replace-disks`` later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary of any instance, and also don't contain the primary
   nodes of two instances that have the same node as secondary. These
   can be obtained by computing a coloring of the graph with nodes
   as vertices and an edge between two nodes if either condition
   prevents their simultaneous maintenance. (This is the current
   algorithm of :manpage:`hroller(1)` with the extension that the graph
   to be colored has additional edges between the primary nodes of two
   instances sharing their secondary node.)
2) It is then possible to migrate, in parallel, the instances off all
   nodes in a set created at step 1, reboot or perform maintenance on
   those nodes, and then migrate their original primaries back. This
   allows the computation above to be reused for each following set
   without triggering N+1 failures, if none were present before. See
   below about the actual execution of the maintenance.
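
Step 1 can be sketched as follows. The graph construction follows the
two conditions given above; the coloring shown is a plain greedy
largest-degree-first heuristic chosen for brevity, which is an
assumption of this example and not necessarily the coloring strategy
hroller uses. Function names and the ``(primary, secondary)`` tuple
representation are likewise hypothetical.

.. code-block:: python

    from collections import defaultdict
    from itertools import combinations


    def build_conflict_graph(nodes, instances):
        """instances: iterable of (primary, secondary) node-name pairs."""
        graph = {node: set() for node in nodes}
        primaries_by_secondary = defaultdict(set)
        for primary, secondary in instances:
            # Condition 1: primary and secondary of the same instance.
            graph[primary].add(secondary)
            graph[secondary].add(primary)
            primaries_by_secondary[secondary].add(primary)
        # Condition 2: primaries of instances sharing a secondary node.
        for primaries in primaries_by_secondary.values():
            for a, b in combinations(sorted(primaries), 2):
                graph[a].add(b)
                graph[b].add(a)
        return graph


    def greedy_coloring(graph):
        """Color highest-degree vertices first; return a list of sets."""
        colors = {}
        for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
            used = {colors[nb] for nb in graph[node] if nb in colors}
            color = 0
            while color in used:
                color += 1
            colors[node] = color
        groups = defaultdict(set)
        for node, color in colors.items():
            groups[color].add(node)
        return [groups[c] for c in sorted(groups)]

Each returned set is a group of nodes that share a color, and hence have
no edge between them, so they can go through step 2 together. A greedy
coloring does not guarantee the minimum number of groups.
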

Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such, instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  among all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  shrinking the usable cluster, until no more capacity is available in
  the nodegroup; we then have to wait for some nodes to come back so
  that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, with the
  migrate-back approach discussed for DRBD to perform the calculation
  only once.
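
The third option might look like the sketch below. It assumes a
deliberately simplified capacity model, with one number of free capacity
per node and one size per instance, and a first-fit placement check;
both are assumptions of this example, as are the function names. Because
instances are migrated back after each batch, every batch is checked
against the same initial state.

.. code-block:: python

    def can_evacuate(batch, free, load):
        """First-fit check: do all instances on ``batch`` fit elsewhere?

        free: node -> free capacity; load: node -> list of instance sizes.
        """
        remaining = {n: f for n, f in free.items() if n not in batch}
        sizes = sorted((s for n in batch for s in load[n]), reverse=True)
        for size in sizes:
            fits = [n for n in sorted(remaining) if remaining[n] >= size]
            if not fits:
                return False
            remaining[fits[0]] -= size
        return True


    def greedy_batches(free, load):
        """Grow each batch while its instances still fit on other nodes."""
        pending = sorted(free)
        batches = []
        while pending:
            batch = []
            for node in pending:
                if can_evacuate(batch + [node], free, load):
                    batch.append(node)
            if not batch:
                raise ValueError("no node can be evacuated: %s" % pending)
            batches.append(batch)
            pending = [n for n in pending if n not in batch]
        return batches

For four nodes with free capacity 4 each and a single instance of size 2
on every node, this yields two batches of two nodes. A real
implementation would have to take all resources tracked by the
allocator into account, not a single number.
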

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to consider
managing RBD pools better, if those are implemented on top of the nodes'
storage rather than on dedicated storage machines.

Full-Evacuation
+++++++++++++++

If full evacuation of the nodes to be rebooted is desired, a simple
migration is not enough for the DRBD instances. To keep the number of
disk operations small, we restrict moves to ``migrate, replace-secondary``.
That is, after migrating instances out of the nodes to be rebooted, a
replacement secondary is sought for every instance whose secondary then
resides on one of the nodes to be rebooted. This is done by a greedy
algorithm, refining the initial reboot partition if necessary.
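
One possible shape of such a greedy search is sketched below. The data
model (a count of free secondary "slots" per node), the function names,
and the choice to defer a node to a later reboot group when no
replacement secondary can be found are all assumptions of this example;
the design above does not prescribe how the partition is refined.

.. code-block:: python

    def place_secondaries(node, reboot_group, instances, slots):
        """Return (moves, new_slots) for one node, or None on failure."""
        slots = dict(slots)
        moves = {}
        for name in sorted(instances):
            primary, secondary = instances[name]
            if secondary != node:
                continue
            # Conservatively avoid every node of the original group.
            candidates = [n for n in sorted(slots)
                          if n not in reboot_group and n != primary
                          and slots[n] > 0]
            if not candidates:
                return None
            slots[candidates[0]] -= 1
            moves[name] = candidates[0]
        return moves, slots


    def plan_full_evacuation(reboot_group, instances, free_slots):
        """instances: name -> (primary, secondary) after migration."""
        slots = dict(free_slots)
        kept, moves, deferred = [], {}, []
        for node in sorted(reboot_group):
            result = place_secondaries(node, reboot_group, instances,
                                       slots)
            if result is None:
                deferred.append(node)  # refine: reboot in a later group
                continue
            node_moves, slots = result
            moves.update(node_moves)
            kept.append(node)
        return kept, moves, deferred

The nodes in ``kept`` can be fully evacuated with the returned
``replace-secondary`` moves, while the nodes in ``deferred`` would have
to be placed in another group.
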

Future work
===========

HRoller should become able to execute rolling maintenances, rather than
just calculate them. For this to succeed properly one of the following
must happen:

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at each
  step.
- HRoller can selectively drain the cluster so it's sure that only the
  rolling maintenance can be going on.

``replace-disks`` functionality for DRBD nodes should be implemented.
Note that once we support a DRBD version that allows multiple
secondaries, this can be done safely, without losing replication at any
time, by adding a temporary secondary and dropping the previous one only
when the sync is finished.

Non-redundant (plain or file) instances should have a way to be moved
off as well, via plain storage live migration or ``gnt-instance move``
(which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by HRoller.
This requires RPC/RAPI support for master failover. HRoller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: