Ganeti 2.0 disk handling changes
================================

Objective
---------

Change the storage options available and the details of the
implementation such that we overcome some design limitations present
in Ganeti 1.x.

Background
----------

The storage options available in Ganeti 1.x were introduced based on
then-current software (DRBD 0.7 and later DRBD 8) and the estimated
usage patterns. However, experience has since shown that some of the
initial assumptions were wrong and that more flexibility is needed.

    
Two main assumptions made were that disk failures should be treated as
'rare' events, and that each of them needs to be handled manually in
order to ensure data safety; however, both of these assumptions are
false:

- disk failures can be a common occurrence, based on usage patterns or
  cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery

Note that fully-automated disk recovery is still not a goal; we only
aim to reduce the manual work needed.

    
Overview
--------

We plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), which makes it easier to remove support for the
  ``remote_raid1`` template and focus only on DRBD8

- dynamic discovery of DRBD devices is not actually needed in a
  cluster where the DRBD namespace is controlled by Ganeti; switching
  to a static assignment (done either at instance creation time or at
  change secondary time) will change the disk activation time from
  O(n) to O(1), which on big clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing
  file-based storage

Additionally, a number of smaller enhancements are planned:

- support a variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base
design changes, might include:

- enhancement of the LVM allocation method in order to try to keep all
  of an instance's virtual disks on the same physical disks

- support for DRBD8 authentication at handshake time, in order to
  ensure that each device connects to the correct peer

- removal of the restriction that failover can only be done to the
  secondary node, which imposes very strict rules on cluster
  allocation

    
Detailed Design
---------------

DRBD minor allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see whether it
can find one that matches our parameters and is already in the desired
state. Since this requires external commands to be run, it is very
slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to static.
When a new device is logically created (added to the configuration), a
free minor number is computed from the list of devices that should
exist on that node and is assigned to that device.

At device activation time, if the minor is already in use, we check
whether it has our parameters; if not, we destroy the device (if
possible, otherwise we abort) and start it with our own parameters.
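
As an illustration, under the static model a free minor can be
computed from the configuration alone, without running any external
commands on the node. A minimal sketch (with a hypothetical helper
name, not the actual Ganeti code) could look like::

  def FindUnusedMinor(assigned_minors):
    """Return the first DRBD minor not used by any configured disk.

    assigned_minors contains the minors of all disks that should exist
    on this node, as recorded in the cluster configuration.

    """
    used = frozenset(assigned_minors)
    minor = 0
    while minor in used:
      minor += 1
    return minor

  # Example: minors 0, 1 and 3 are configured, so a new disk gets 2.
  assert FindUnusedMinor([0, 1, 3]) == 2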

    
This means that we effectively take ownership of the minor space for
that device type; if there is a user-created DRBD minor, it will be
automatically removed.

This change reduces the number of external commands run per device
from a constant number times the index of the first free DRBD minor to
just a constant number.

    
Removal of obsolete device types (md, drbd7)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).

    
File-based storage support
~~~~~~~~~~~~~~~~~~~~~~~~~~

This is covered by a separate design doc (*Vinales*) and would allow
us to get rid of the hard LVM requirement for testing clusters; it
would also allow people who have SAN storage to do live failover,
taking advantage of their storage solution.

    
Variable number of disks
~~~~~~~~~~~~~~~~~~~~~~~~

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need a fully flexible disk definition. This
has less impact than it might seem at first sight: only the instance
creation code has a hardcoded number of disks, not the disk handling
code. The block device handling and most of the instance handling code
already work with "the instance's disks" as opposed to "the two disks
of the instance", but some pieces do not (e.g. import/export) and the
code needs a review to ensure safety.

The objective is to be able to specify the number of disks at instance
creation time, and to be able to toggle a disk between read-only and
read-write afterwards.
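
For illustration only, such a flexible definition could be a simple
list of per-disk dictionaries; the field names below are hypothetical
and not the final configuration format::

  # Any number of disks, each with its own size and access mode.
  disks = [
    {"size": 10240, "mode": "ro"},   # read-only sda (sizes in MiB)
    {"size": 20480, "mode": "rw"},   # read-write sdb
    {"size": 51200, "mode": "rw"},   # a third, read-write data disk
  ]

  for idx, disk in enumerate(disks):
    print("disk/%d: %d MiB, %s" % (idx, disk["size"], disk["mode"]))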

    
Better LVM allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks, but it introduces the problem that an instance could
end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch the allocations for an instance and make the
LVM handling code try to allocate all of one instance's storage as
close together as possible. We will still allow the logical volumes to
spill over to additional disks as needed.
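
A minimal sketch of the batching idea, assuming we simply prefer the
physical volume with the most free space and ignoring spill-over
handling (the helper below is hypothetical, not the actual
implementation); ``lvcreate`` restricts allocation to the physical
volumes listed at the end of its command line, which keeps e.g. a DRBD
data/metadata pair together::

  def BatchLVCreateCommands(vg_name, lvs, pv_free):
    """Return lvcreate commands placing one instance's LVs on one PV.

    lvs is a list of (lv_name, size_in_mib) tuples for one instance,
    pv_free maps PV names to their free space in MiB.

    """
    best_pv = max(pv_free, key=pv_free.get)
    return [["lvcreate", "-L", "%dm" % size, "-n", name, vg_name, best_pv]
            for (name, size) in lvs]

  # Example: the data and meta volumes of one DRBD disk share a PV.
  print(BatchLVCreateCommands("xenvg",
                              [("disk0_data", 10240), ("disk0_meta", 128)],
                              {"/dev/sda3": 80000, "/dev/sdb1": 120000}))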

    
Note that this clustered allocation can only be attempted at initial
instance creation time or at change secondary node time. When adding a
disk or replacing individual disks, it is not easy to compute the
current disk map, so we will not attempt the clustering in those
cases.

    
DRBD8 peer authentication at handshake
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this to prevent connecting to the wrong peer,
more than to secure the connection. Even though we have never had
issues with wrong connections, it would be good to implement this.
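
DRBD8 exposes this through its ``cram-hmac-alg`` and ``shared-secret``
net settings. As a rough sketch (hypothetical helper, not the actual
implementation), Ganeti would only need to generate and store a random
per-device secret and hand it to both peers at network setup time::

  import secrets

  def GenerateDRBDSecret(length=20):
    """Return a random shared secret for one DRBD device pair."""
    return secrets.token_hex(length)

  # The secret would be kept in the disk's configuration entry and
  # passed to both nodes when the DRBD network is set up.
  print(GenerateDRBDSecret())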

    
LVM self-repair (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` operation, using the
following method:

    
#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that can be
         used for monitoring
      #. [FUTURE] run ``replace-disks`` for all instances affected
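
The decision logic above could be sketched as follows (hypothetical
function, not the actual implementation)::

  def CheckVGRepair(inconsistent_nodes, previous_ok):
    """Decide whether to run "vgreduce --removemissing" anywhere.

    inconsistent_nodes lists the nodes whose volume group is currently
    not consistent; previous_ok says whether all volume groups were
    consistent at the previous check.

    Returns (node_to_repair or None, new_status).

    """
    if not inconsistent_nodes:
      # All consistent: nothing to do, just (re)save the status.
      return None, True
    if not previous_ok:
      # Already known to be broken: leave it for manual handling.
      return None, False
    if len(inconsistent_nodes) > 1:
      # More than one node became inconsistent at once: do nothing.
      return None, False
    # Exactly one node just became inconsistent: repair and log it.
    return inconsistent_nodes[0], False

  # Example: node2 just became inconsistent on a healthy cluster.
  assert CheckVGRepair(["node2"], previous_ok=True) == ("node2", False)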

    
Failover to any node
~~~~~~~~~~~~~~~~~~~~

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets
  reduced to the need to reserve memory anywhere on the cluster

- the need to first failover and then replace the secondary for an
  instance: with failover-to-any, we can directly failover to another
  node, which also replaces the disks in the same step

    
In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2 is
fixed to the node the user chooses, but the choice of S2 can be made
between P1 and S1. This choice can be constrained, depending on which
of P1 and S1 has failed:

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could
  be possible (in theory, but this is not a design goal for 2.0)

The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary)
  has valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no
  data, it will start resyncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary
  role (r/w) and start the instance on P2

- as soon as the P2⇐S2 sync has finished, we can remove the old data
  on the old node that has not been chosen for S2
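
Put together, the sequence above could be sketched as below; the
``ops`` object and its method names are hypothetical stand-ins for the
node RPC layer, not the actual Ganeti API::

  class _DemoOps(object):
    """Stand-in for the node RPC layer; just records the calls."""
    def __getattr__(self, name):
      def _call(*args, **kwargs):
        print("%s %r %r" % (name, args, kwargs))
        return True
      return _call

  def FailoverToAny(ops, instance, p2, s2, old_node):
    """Fail over 'instance' to new primary p2, keeping s2 as secondary.

    old_node is the node (P1 or S1) that was not chosen as S2 and whose
    data will be removed at the end.

    """
    if not ops.IsConsistent(instance, s2):
      raise RuntimeError("%s has no valid data for %s" % (s2, instance))
    ops.TearDownDRBD(instance, old_node)
    ops.SetupDRBDPair(instance, primary=p2, secondary=s2)
    # P2 starts resyncing from S2; once it reaches the SyncTarget state
    # it can already be promoted and the instance started on it.
    ops.WaitForDiskState(instance, p2, "SyncTarget")
    ops.PromoteToPrimary(instance, p2)
    ops.StartInstance(instance, p2)
    # Only after the resync has finished is it safe to drop the old data.
    ops.WaitForSyncCompletion(instance, p2)
    ops.RemoveDisks(instance, old_node)

  FailoverToAny(_DemoOps(), "instance1", "node3", "node1", "node2")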

    
Caveats: during the P2⇐S2 sync, a (non-transient) network error will
cause I/O errors on the instance, so (if a longer instance downtime is
acceptable) we can postpone the restart of the instance until the
resync is done. However, disk I/O errors on S2 will cause data loss,
since we no longer have a good copy of the data, so in this case
waiting for the sync to complete is not an option. As such, it is
recommended that this feature be used only in conjunction with proper
disk monitoring.

Live migration note: while failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

    
Caveats
-------

The dynamic device model, while more complex, has an advantage: it
will not mistakenly reuse another instance's DRBD device, since it
always looks for either our own or a free one.

The static one, in contrast, will assume that, given a minor number N,
it is ours and we can take it over. This needs careful implementation,
such that if the minor is in use, either we are able to cleanly shut
it down or we abort the startup. Otherwise, we could end up syncing
between two instances' disks, causing data loss.

    
Security Considerations
-----------------------

The changes will not affect the security model of Ganeti.