/doc/design-2.0-disk-handling.rst - Annotate - ganeti-local - Greek Research and Technology Network's projects

Iustin Pop

Ganeti 2.0 disk handling changes
================================

Objective
---------

Change the storage options available and the details of the
implementation such that we overcome some design limitations present
in Ganeti 1.x.

Background
----------

The storage options available in Ganeti 1.x were introduced based on
then-current software (DRBD 0.7 and later DRBD 8) and the estimated
usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as 'rare'
events, and that each of them needs to be manually handled in order to ensure
data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or cluster
  size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
  automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal, but our
goal is to reduce the manual work needed.

Overview
--------

We plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version (0.7),
  such that removing the support for the ``remote_raid1`` template and
  focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a static
  assignment (done at either instance creation time or change secondary time)
  will change the disk activation time from O(n) to O(1), which on big
  clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage types are
  ultimately backed by LVM volumes) by introducing file-based storage

Additionally, a number of smaller enhancements are also planned:

- support variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base design
changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical
  disks

- add support for DRBD8 authentication at handshake time in
  order to ensure each device connects to the correct peer

- remove the restriction that failover is possible only to the
  secondary, which creates very strict rules on cluster allocation

Detailed Design
---------------

DRBD minor allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if we find
one that looks similar to our parameters and is already in the desired
state or not. Since this needs external commands to be run, it is very
slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that
device.
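The static assignment can be sketched as follows. This is only an
illustration, assuming the configuration can be reduced to a mapping from
node name to the set of minors already assigned on that node; the names
``find_free_minor`` and ``allocate_minor`` are hypothetical and not taken
from the Ganeti code.

```python
def find_free_minor(assigned_minors):
    """Return the lowest minor number not present in assigned_minors.

    assigned_minors: iterable of ints, the minors recorded in the
    configuration for one node (hypothetical representation).
    """
    used = set(assigned_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor


def allocate_minor(config, node):
    """Pick a free minor for a new device on node and record it in config.

    config: dict mapping node name -> set of assigned minors (assumption).
    """
    minors = config.setdefault(node, set())
    minor = find_free_minor(minors)
    minors.add(minor)
    return minor
```

Note that the computation only reads the configuration; no external
command is run on the node, which is the point of the static model.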

At device activation, if the minor is already in use, we check if
it has our parameters; if not, we just destroy the device (if
possible, otherwise we abort) and start it with our own
parameters.
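The activation rule above can be written down as a small decision
function. This is a sketch under stated assumptions: the ``Action`` names
and the ``activation_action`` signature are hypothetical, device parameters
are compared as plain values, and the case of a free minor (not spelled out
above) is assumed to simply start the device.

```python
from enum import Enum


class Action(Enum):
    START = "start"                          # minor is free: configure it
    REUSE = "reuse"                          # minor already has our parameters
    DESTROY_AND_START = "destroy_and_start"  # foreign device, can be removed
    ABORT = "abort"                          # foreign device, cannot be removed


def activation_action(in_use, current_params, wanted_params, can_destroy):
    """Decide what to do with our statically assigned minor at activation."""
    if not in_use:
        return Action.START
    if current_params == wanted_params:
        return Action.REUSE
    if can_destroy:
        return Action.DESTROY_AND_START
    return Action.ABORT
```

Keeping the decision separate from the commands that carry it out makes
the abort path easy to test without touching a real device.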

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created DRBD minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.

Removal of obsolete device types (md, drbd7)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).

File-based storage support
~~~~~~~~~~~~~~~~~~~~~~~~~~

This is covered by a separate design doc (*Vinales*) and
would allow us to get rid of the hard requirement on LVM for testing
clusters; it would also allow people who have SAN storage to do live
failover taking advantage of their storage solution.

Variable number of disks
~~~~~~~~~~~~~~~~~~~~~~~~

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make the disk definition fully
flexible. This has less impact than it might seem at first sight:
only the instance creation has a hardcoded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code is already working with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces are not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk between read-only
and read-write afterwards.
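To make the objective concrete, a flexible disk definition could look
roughly like the sketch below. All names (``DiskSpec``, ``InstanceDisks``,
``toggle_mode``) and the ``"rw"``/``"ro"`` mode strings are assumptions made
for this example; they do not describe the actual Ganeti objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiskSpec:
    """One disk of an instance (hypothetical structure)."""
    size_mb: int
    mode: str = "rw"  # "rw" (read-write) or "ro" (read-only)

    def __post_init__(self):
        if self.mode not in ("rw", "ro"):
            raise ValueError("mode must be 'rw' or 'ro', got %r" % self.mode)


@dataclass
class InstanceDisks:
    """An arbitrary-length list of disks instead of a fixed pair."""
    disks: List[DiskSpec] = field(default_factory=list)

    def toggle_mode(self, index):
        """Switch one disk between read-only and read-write."""
        disk = self.disks[index]
        disk.mode = "ro" if disk.mode == "rw" else "rw"
        return disk.mode
```

With such a structure the high-security example from the text becomes
``InstanceDisks([DiskSpec(10240, "ro"), DiskSpec(2048)])``, where the sizes
are arbitrary.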

Better LVM allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks; however, it introduces the problem that an instance could
end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate all the storage of one instance as
close together as possible. We will still allow the logical volumes to
spill over to additional disks as needed.

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. When adding a disk
or replacing individual disks, it is not easy enough to compute the
current disk map, so we will not attempt the clustering.
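The batching idea can be illustrated with a small planning function. It
is a simplified sketch, not the intended implementation: the function name
``plan_allocation`` and its inputs are hypothetical, sizes are plain
numbers, and (unlike LVM itself) a single logical volume is never split
across physical volumes.

```python
def plan_allocation(pv_free, volumes):
    """Map each requested logical volume to a physical volume.

    pv_free: dict of PV name -> free space
    volumes: dict of LV name -> requested size, all for one instance
    Returns a dict LV name -> PV name, or None if the request cannot be met.
    """
    total = sum(volumes.values())
    # Preferred case: a single PV can hold the whole instance.
    fitting = [pv for pv, free in pv_free.items() if free >= total]
    if fitting:
        # Among the fitting PVs take the emptiest one, so that usage
        # still stays reasonably spread out.
        target = max(fitting, key=lambda pv: (pv_free[pv], pv))
        return {lv: target for lv in volumes}
    # Spill-over case: place the largest volumes first, each on the PV
    # with the most remaining free space.
    remaining = dict(pv_free)
    plan = {}
    for lv, size in sorted(volumes.items(), key=lambda kv: -kv[1]):
        pv = max(remaining, key=lambda p: (remaining[p], p), default=None)
        if pv is None or remaining[pv] < size:
            return None
        plan[lv] = pv
        remaining[pv] -= size
    return plan
```

The important property is that the whole request of an instance (for
example the data and metadata volumes of a DRBD device) is seen at once,
instead of one logical volume at a time.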

DRBD8 peer authentication at handshake
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this to prevent connecting to the wrong peer,
more than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.
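One possible way to use the feature for peer identification is to give
every DRBD device its own shared secret, derived from data both peers
already have. The sketch below is an assumption of ours, not part of the
design: ``device_secret``, ``cluster_key`` and ``device_id`` are hypothetical
names, and how the resulting string is handed to DRBD (to our knowledge via
the ``cram-hmac-alg`` and ``shared-secret`` net options) is left out.

```python
import hashlib
import hmac


def device_secret(cluster_key, node_a, node_b, device_id):
    """Derive a per-device shared secret from a cluster-wide key.

    Both peers compute the same value from the same configuration data,
    while any other node pair or device yields a different secret.
    """
    # Sort the node names so both ends build an identical message.
    nodes = ",".join(sorted([node_a, node_b]))
    message = ("%s|%s" % (nodes, device_id)).encode("utf-8")
    return hmac.new(cluster_key, message, hashlib.sha1).hexdigest()
```

A device that connects to the wrong peer would then fail the handshake,
because the two ends would not hold the same secret.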

LVM self-repair (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` step, using the following
method:

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected
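The decision table above can be expressed as a pure function, which
makes each branch easy to check. This is a sketch only: the function name
``self_repair_actions`` and the action labels are hypothetical, the actions
are returned as labels rather than executed, and the [FUTURE]
``replace-disks`` step is deliberately left out.

```python
def self_repair_actions(inconsistent_nodes, previously_consistent):
    """Return the list of (action, node) pairs for one pass of the check.

    inconsistent_nodes: names of nodes whose volume group is inconsistent
    previously_consistent: the status saved by the previous pass
    """
    if not inconsistent_nodes:
        if previously_consistent:
            return []  # yes / yes: nothing to do
        # yes / no: the cluster recovered
        return [("save-status", None), ("restart", None)]
    if not previously_consistent:
        return []  # no / no: already known, wait for the administrator
    if len(inconsistent_nodes) > 1:
        return []  # too risky to repair automatically
    node = inconsistent_nodes[0]
    return [("vgreduce-removemissing", node), ("log", node)]
```

Automatic repair is therefore limited to the single case where exactly
one node has just become inconsistent.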

Failover to any node
~~~~~~~~~~~~~~~~~~~~

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets reduced to
  a need to reserve memory anywhere on the cluster

- the need to first failover and then replace secondary for an
  instance: with failover-to-any, we can directly failover to
  another node, which also does the replace disks at the same
  step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)
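The constraints on S2 can be captured in a few lines. The sketch below
uses hypothetical names (``choose_new_secondary``,
``live_migration_possible``); the default of keeping P1 when nothing has
failed and the user expresses no preference is our assumption, chosen
because it is the only choice that keeps migration possible.

```python
def choose_new_secondary(p1, s1, failed=None, preferred=None):
    """Return the node to keep as S2, honouring the failure constraints.

    failed: P1, S1 or None if both nodes are healthy
    preferred: the user's choice for S2, if any
    """
    if failed is not None and failed not in (p1, s1):
        raise ValueError("failed node must be P1 or S1")
    candidates = [node for node in (p1, s1) if node != failed]
    if preferred is not None:
        if preferred not in candidates:
            raise ValueError("%s cannot be kept as secondary" % preferred)
        return preferred
    return candidates[0]


def live_migration_possible(p1, s2):
    """Migration needs the node running the instance (P1) to stay in the pair."""
    return s2 == p1
```

For example, with P1 failed the only candidate is S1, and
``live_migration_possible`` returns ``False`` for that choice.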

The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing between
  P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
  start resyncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has started
  but before it has finished), we can promote it to primary role (r/w)
  and start the instance on P2

- as soon as the P2⇐S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
  S2
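The ordering of these steps can be sketched as a function that only
produces a plan. This is an illustration, not the real code path:
``failover_to_any_plan`` is a hypothetical name, the steps are returned as
human-readable strings instead of being executed, and P2 is assumed to be
a node that currently holds no copy of the data.

```python
def failover_to_any_plan(p1, s1, p2, s2, s2_consistent):
    """Return the ordered steps for moving an instance to P2.

    s2_consistent: result of the consistency check on S2
    """
    if s2 not in (p1, s1):
        raise ValueError("S2 must be either P1 or S1")
    if not s2_consistent:
        raise RuntimeError("S2 has no valid data, refusing to fail over")
    # The old node that was not kept as S2 loses its copy at the end.
    dropped = s1 if s2 == p1 else p1
    return [
        "tear down DRBD pairing %s<->%s" % (p1, s1),
        "set up DRBD pairing %s<->%s" % (p2, s2),
        "wait until %s is SyncTarget" % p2,
        "promote %s to primary and start instance" % p2,
        "wait until sync %s<=%s has finished" % (p2, s2),
        "remove old data on %s" % dropped,
    ]
```

Note that the old copy is removed only as the last step, after the resync
has finished, so a good copy of the data exists on S2 throughout.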

Caveats: during the P2⇐S2 sync, a (non-transient) network error
will cause I/O errors on the instance, so (if a longer instance
downtime is acceptable) we can postpone the restart of the instance
until the resync is done. However, disk I/O errors on S2 will cause
data loss, since we don't have a good copy of the data anymore, so in
this case waiting for the sync to complete is not an option. As such,
it is recommended that this feature is used only in conjunction with
proper disk monitoring.

Live migration note: While failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
-------

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake another instance's DRBD device, since it
always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.

Security Considerations
-----------------------

The changes will not affect the security model of Ganeti.
