Revision fb4b885a

b/doc/design-hroller.rst
Proposed changes
================
  
New options
-----------
  
- HRoller should be able to operate on single nodegroups (-G flag) or
  select its target nodes through some other means (e.g. via a tag, or
  a regexp); a sketch of how such selection criteria could combine is
  given after this list. (Note that individual node selection is
  already possible via the -O flag, which makes hroller ignore a node
  altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to select its behavior between
  "it's ok to reboot a node when a non-redundant instance is on it" or
  "skip nodes with non-redundant instances". This will only be
  selectable globally, and not per instance.
- Hroller will make sure to keep any instance which is up in its
  current state, via live migrations, unless explicitly overridden. The
  algorithm that will be used to calculate the rolling reboot with live
  migrations is described below, and any override on considering the
  instance status will only be possible on the whole run, and not
  per-instance.
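
The following is a minimal sketch, in Python, of how the proposed
selection criteria could combine into a single node filter. It is
purely illustrative: hroller itself is implemented in Haskell, and the
``Node`` type and ``select_nodes`` helper below are hypothetical::

  import re
  from dataclasses import dataclass, field

  @dataclass
  class Node:
      name: str
      group: str
      tags: set = field(default_factory=set)

  def select_nodes(nodes, group=None, tag=None, name_re=None, ignored=()):
      """Return the nodes hroller would consider for the rolling reboot.

      group    -- proposed -G behaviour (restrict to one nodegroup)
      tag      -- proposed tag-based selection
      name_re  -- proposed regexp-based selection
      ignored  -- node names excluded via the existing -O flag
      """
      selected = []
      for node in nodes:
          if node.name in ignored:     # -O makes hroller ignore the node
              continue
          if group is not None and node.group != group:
              continue
          if tag is not None and tag not in node.tags:
              continue
          if name_re is not None and not re.search(name_re, node.name):
              continue
          selected.append(node)
      return selected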
  
Calculating rolling maintenances
--------------------------------
......

Down instances
++++++++++++++
  
If an instance was shut down when the maintenance started it will be
considered for avoiding simultaneous reboot of its primary and
secondary nodes, but will *not* be considered as a target for the node
evacuation. This allows avoiding needlessly moving its primary around,
since it won't suffer a downtime anyway.
  
Note that a node with non-redundant instances will only ever be
considered good for a rolling reboot if these are down (or the status
check is overridden) *and* an explicit option to allow it is set.
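
As a sketch of the two rules above, the eligibility checks could look
as follows. This is illustrative Python only; the ``Instance``
attributes and the helper names are hypothetical, not part of hroller::

  def instance_needs_evacuation(instance, ignore_status=False):
      """Down instances stay in the anti-affinity computation but are
      not moved off their primary node."""
      return instance.is_up or ignore_status

  def node_ok_for_reboot(node, allow_non_redundant=False,
                         ignore_status=False):
      """Nodes with non-redundant instances are only eligible if those
      instances are down (or the status check is overridden) *and* the
      explicit option allowing it is set."""
      non_redundant = [i for i in node.instances if not i.is_redundant]
      if not non_redundant:
          return True
      if not allow_non_redundant:
          return False
      return ignore_status or all(not i.is_up for i in non_redundant)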
  
DRBD
++++
......
In order to do that we can use the following algorithm (a sketch of
these steps follows the list):
  
1) Compute node sets that don't contain both the primary and the
   secondary for any instance. This can already be done by the current
   hroller graph coloring algorithm: nodes are in the same set (color)
   if and only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any secondary
   node in common (this can be done by creating a graph of nodes that
   are connected if and only if an instance on both has the same
   secondary node, and coloring that graph).
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, and then reboot/perform maintenance on them, and
   migrate back their original primaries. This allows the computation
   above to be reused for each following subset without N+1 failures
   being triggered, if none were present before. See below about the
   actual execution of the maintenance.
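
The two coloring steps above can be sketched as follows. This is
illustrative Python (hroller's real implementation is in Haskell and
accounts for cluster details not shown here); instances are modelled
simply as ``(primary, secondary)`` node-name pairs::

  from collections import defaultdict

  def greedy_color(nodes, edges):
      """Group nodes so that no two connected nodes share a group."""
      adj = defaultdict(set)
      for a, b in edges:
          adj[a].add(b)
          adj[b].add(a)
      color = {}
      for node in nodes:
          used = {color[n] for n in adj[node] if n in color}
          color[node] = next(c for c in range(len(nodes)) if c not in used)
      groups = defaultdict(list)
      for node, c in color.items():
          groups[c].append(node)
      return list(groups.values())

  def reboot_groups(nodes, instances):
      """instances: iterable of (primary, secondary) node-name pairs."""
      instances = list(instances)
      # Step 1: the primary and secondary of the same instance must not
      # be rebooted together, so each instance contributes one edge.
      secondaries_of = defaultdict(set)
      for primary, secondary in instances:
          secondaries_of[primary].add(secondary)
      groups = []
      for node_set in greedy_color(nodes, instances):
          # Step 2: within a set, connect nodes whose instances share a
          # secondary node, and color that graph as well.
          edges = [(a, b) for a in node_set for b in node_set if a < b
                   and secondaries_of[a] & secondaries_of[b]]
          groups.extend(greedy_color(node_set, edges))
      return groups

For example, with ``nodes = ["n1", "n2", "n3"]`` and
``instances = [("n1", "n2"), ("n3", "n2")]`` step 1 puts ``n1`` and
``n3`` in one set and ``n2`` in another, and step 2 then splits ``n1``
and ``n3`` because their instances share ``n2`` as a secondary.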
  
Non-DRBD
++++++++
......
managing better RBD pools, if those are implemented on top of node
storage, rather than on dedicated storage machines.
  
Executing rolling maintenances
------------------------------
  
Hroller accepts commands to run in order to perform maintenance
automatically. These are run on the machine hroller itself runs on, and
take a node name as input. They then have to gain access to the target
node (via ssh, restricted commands, or some other means) and perform
their duty. A sketch of a driver following this sequence is given after
the list below.
  
1) A command (--check-cmd) will be called on all selected online nodes
   to check whether a node needs maintenance. Hroller will proceed only
   on nodes that respond positively to this invocation.
   FIXME: decide about -D
2) Hroller will evacuate the node of all primary instances.
3) A command (--maint-cmd) will be called on a node to do the actual
   maintenance operation. It should do any operation needed to perform
   the maintenance, including triggering the actual reboot.
4) A command (--verify-cmd) will be called to check that the operation
   was successful. It has to wait until the target node is back up (and
   decide how long to wait before giving up) and perform the
   verification. If it's not successful hroller will stop and not
   proceed with other nodes.
5) The master node will be kept last, but will not otherwise be treated
   specially. If hroller was running on the master node, care must be
   exercised as its maintenance will have interrupted the software
   itself, and as such the verification step will not happen. This will
   not automatically be taken care of in the first version. An
   additional flag to just skip the master node will be present as
   well, in case that's preferred.
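
The driver sketched below follows the sequence above. It is
illustrative Python only: ``--check-cmd``, ``--maint-cmd`` and
``--verify-cmd`` are the options proposed in this document, the
``evacuate_node`` callable stands in for the actual Ganeti evacuation,
and error handling is reduced to the bare minimum::

  import subprocess

  def run_on(cmd, node):
      """Run one of the user-supplied commands with the node name as
      its argument, on the machine hroller itself runs on."""
      return subprocess.run([cmd, node]).returncode == 0

  def rolling_maintenance(nodes, check_cmd, maint_cmd, verify_cmd,
                          evacuate_node):
      # `nodes` is assumed to be ordered with the master node last (5).
      for node in nodes:
          if not run_on(check_cmd, node):   # 1) node needs no maintenance
              continue
          evacuate_node(node)               # 2) move primary instances away
          run_on(maint_cmd, node)           # 3) trigger the maintenance
          if not run_on(verify_cmd, node):  # 4) wait for the node, verify
              # On failure, stop and do not proceed with further nodes.
              raise RuntimeError("verification failed on %s" % node)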
  
  
Future work
===========
  
Hroller should become able to execute rolling maintenances, rather than
just calculate them. For this to succeed properly, one of the following
must happen:

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at each
  step
- HRoller can selectively drain the cluster, so that it is sure only
  the rolling maintenance is going on
  
DRBD nodes' ``replace-disks`` functionality should be implemented. Note
that once we support a DRBD version that allows multi-secondary setups,
this can be done safely, without losing replication at any time, by
adding a temporary secondary and dropping the previous one only when
the sync is finished.
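
The ordering that makes this safe can be sketched as follows (purely
illustrative Python; none of these helpers are real Ganeti or DRBD
calls)::

  def replace_secondary(instance, old_secondary, new_secondary,
                        add_secondary, wait_until_synced, drop_secondary):
      # With multi-secondary support a temporary extra secondary can be
      # attached first, so the data never has fewer than two copies.
      add_secondary(instance, new_secondary)
      wait_until_synced(instance, new_secondary)
      # Only once the new copy is fully in sync is the old one dropped.
      drop_secondary(instance, old_secondary)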
  
Non-redundant (plain or file) instances should have a way to be moved
off as well via plain storage live migration or ``gnt-instance move``
(which requires downtime).
  
If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.
