============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.

Current state and shortcomings
==============================

To enable automating cluster-wide reboots, a new htool called HRoller
was added to Ganeti starting from version 2.7. This tool helps
parallelize offline cluster maintenance by calculating sets of nodes in
which no two nodes are the primary and the secondary of the same DRBD
instance, and which can therefore be rebooted at the same time when all
instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================

Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started it will be
ignored. This avoids needlessly moving its primary around, since it
won't suffer any downtime anyway.

DRBD
++++

Each node must migrate all instances off to their secondaries, and then
can either be rebooted, or the secondaries can be evacuated as well.

Since doing a ``replace-disks`` on DRBD currently breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries on
it (citation needed). As such we'll implement just the "migrate+reboot"
mode for now, and focus on ``replace-disks`` later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary for any instance. This can be done already by the current
   hroller graph coloring algorithm: nodes can be in the same set
   (color) only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any secondary
   node in common (this can be done by creating a graph of nodes that
   are connected if and only if instances having them as primaries share
   the same secondary node, and coloring that graph).
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, then reboot/perform maintenance on them, and
   finally migrate back their original primaries. This allows the
   computation above to be reused for each following subset without N+1
   failures being triggered, if none were present before. See below
   about the actual execution of the maintenance.
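The two coloring passes of the DRBD algorithm can be sketched in Python
as follows. This is only an illustration under stated assumptions, not
hroller's implementation: the ``greedy_coloring`` and
``rolling_groups`` helpers and the sample instance map are hypothetical,
and a plain greedy coloring stands in for the algorithms described in
:manpage:`hroller(1)`.

```python
from collections import defaultdict

# Hypothetical input: instance name -> (primary node, secondary node).
INSTANCES = {
    "a": ("n1", "n3"),
    "b": ("n2", "n4"),
    "c": ("n3", "n1"),
    "d": ("n4", "n2"),
    "e": ("n2", "n3"),
}


def greedy_coloring(nodes, edges):
    """Give each node the smallest color unused by its neighbours.

    Returns the color classes as a list of sorted node lists.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    colors = {}
    for node in sorted(nodes):
        used = {colors[n] for n in neighbours[node] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    groups = defaultdict(list)
    for node, color in colors.items():
        groups[color].append(node)
    return [sorted(groups[c]) for c in sorted(groups)]


def rolling_groups(instances):
    """Return node subsets that can be migrated and rebooted together."""
    nodes = {n for pair in instances.values() for n in pair}
    # First pass: one edge per instance, between primary and secondary.
    node_sets = greedy_coloring(nodes, instances.values())
    result = []
    for node_set in node_sets:
        # Second pass: connect nodes of the set whose primary instances
        # share a secondary node, then color that graph.
        by_secondary = defaultdict(set)
        for primary, secondary in instances.values():
            if primary in node_set:
                by_secondary[secondary].add(primary)
        edges = [(a, b)
                 for primaries in by_secondary.values()
                 for a in primaries for b in primaries if a < b]
        result.extend(greedy_coloring(node_set, edges))
    return result
```

For the sample map, ``rolling_groups(INSTANCES)`` returns
``[['n1'], ['n2'], ['n3', 'n4']]``: ``n1`` and ``n2`` share a color in
the first pass, but are split by the second one because both host an
instance whose secondary is ``n3``.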

Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  between all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  restricting the cluster, until no more capacity in the nodegroup is
  available, at which point we have to wait for some nodes to come back
  so that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, with the
  migrate-back approach discussed for DRBD to perform the calculation
  only once.
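As a rough illustration of the last option, the following Python sketch
groups the nodes of one nodegroup greedily. Everything in it is an
assumption made for the example: the ``fits`` and ``greedy_sets``
helpers, the node map, and the use of memory as the only resource with
a first-fit placement. A real implementation would rely on the cluster
model and on hail to choose targets.

```python
# Hypothetical nodegroup: free memory and instance sizes, in GiB.
NODES = {
    "node1": {"free": 4, "instances": [2, 2]},
    "node2": {"free": 4, "instances": [4]},
    "node3": {"free": 2, "instances": [2]},
    "node4": {"free": 8, "instances": []},
}


def fits(evacuated, nodes):
    """Check with first-fit whether the evacuated nodes can be emptied.

    Instances are placed, largest first, on the nodes that stay up.
    """
    free = {name: info["free"] for name, info in nodes.items()
            if name not in evacuated}
    demands = sorted((size for name in evacuated
                      for size in nodes[name]["instances"]),
                     reverse=True)
    for size in demands:
        target = next((t for t in sorted(free) if free[t] >= size), None)
        if target is None:
            return False
        free[target] -= size
    return True


def greedy_sets(nodes):
    """Greedily build sets of nodes that can be emptied together.

    Every set is checked against the original cluster state, which is
    what migrating instances back after each set makes possible.
    """
    remaining = sorted(nodes)
    sets = []
    while remaining:
        current = []
        for name in remaining:
            if fits(current + [name], nodes):
                current.append(name)
        if not current:
            raise ValueError("cannot evacuate: %s" % ", ".join(remaining))
        sets.append(current)
        remaining = [n for n in remaining if n not in current]
    return sets
```

With the sample data, ``greedy_sets(NODES)`` returns
``[['node1', 'node2'], ['node3', 'node4']]``.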

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to consider
managing RBD pools better, if those are implemented on top of the nodes'
storage rather than on dedicated storage machines.

Executing rolling maintenances
------------------------------

Hroller accepts commands to run to do maintenance automatically. These
are going to be run on the machine hroller runs on, and take a node name
as input. They then have to gain access to the target node (via ssh,
restricted commands, or some other means) and perform their duty.

1) A command (``--check-cmd``) will be called on all selected online
   nodes to check whether a node needs maintenance. Hroller will proceed
   only on nodes that respond positively to this invocation.
   FIXME: decide about ``-D``
2) Hroller will evacuate the node of all primary instances.
3) A command (``--maint-cmd``) will be called on a node to do the actual
   maintenance operation. It should do any operation needed to perform
   the maintenance, including triggering the actual reboot.
4) A command (``--verify-cmd``) will be called to check that the
   operation was successful. It has to wait until the target node is
   back up (and decide after how long it should give up) and perform the
   verification. If it's not successful hroller will stop and not
   proceed with other nodes.
5) The master node will be kept last, but will not otherwise be treated
   specially. If hroller is running on the master node, care must be
   exercised as its maintenance will interrupt the software itself, and
   as such the verification step will not happen. This will not
   automatically be taken care of in the first version. An additional
   flag to just skip the master node will be present as well, in case
   that's preferred.
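The sequence above could be driven by a loop like the following Python
sketch. It is a simplification under stated assumptions: nodes are
handled one at a time rather than in parallel subsets, an exit status of
zero is taken to mean a positive answer, the exit status of the
maintenance command is ignored because the reboot may interrupt it, and
the ``roll`` and ``run_cmd`` helpers as well as the ``evacuate``
callback are hypothetical names.

```python
import shlex
import subprocess


def run_cmd(cmd, node):
    """Run cmd locally with the node name appended as last argument.

    Assumption: exit status 0 means success or a positive answer.
    """
    return subprocess.call(shlex.split(cmd) + [node]) == 0


def roll(nodes, master, check_cmd, maint_cmd, verify_cmd, evacuate,
         skip_master=False, run=run_cmd):
    """Perform maintenance node by node and return the nodes completed."""
    # Keep the master last, or leave it out entirely when asked to.
    ordered = [n for n in nodes if n != master]
    if master in nodes and not skip_master:
        ordered.append(master)
    # Check phase: only nodes answering positively are processed.
    todo = [n for n in ordered if run(check_cmd, n)]
    done = []
    for node in todo:
        # Move all primary instances away from the node.
        evacuate(node)
        # Maintenance phase: the command triggers the reboot itself.
        run(maint_cmd, node)
        # Verify phase: stop at the first node that fails.
        if not run(verify_cmd, node):
            raise RuntimeError("verification failed on %s" % node)
        done.append(node)
    return done
```

Passing a different ``run`` callable makes it possible to exercise the
control flow without touching any node, for instance in a dry run.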

Future work
===========

The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries this can be done safely, without losing replication
at any time, by adding a temporary secondary and dropping the previous
one only when the sync is finished.

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated as well from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by hroller.
This requires RPC/RAPI support for master failover. Hroller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End:
