Ganeti 2.0 disk handling changes
================================

Objective
---------

Change the storage options available and the details of the
implementation such that we overcome some design limitations present
in Ganeti 1.x.

Background
----------

The storage options available in Ganeti 1.x were introduced based on
then-current software (DRBD 0.7 and later DRBD 8) and the estimated
usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as 'rare'
events, and that each of them needs to be manually handled in order to ensure
data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or cluster
  size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
  automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal, but our
goal is to reduce the manual work needed.

Overview
--------

We plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version (0.7),
  such that removing the support for the ``remote_raid1`` template and
  focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a static
  assignment (done at either instance creation time or change secondary time)
  will change the disk activation time from O(n) to O(1), which on big
  clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage types are
  ultimately backed by LVM volumes) by introducing file-based storage

Additionally, a number of smaller enhancements are also planned:

- support variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base design
changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical
  disks

- add support for DRBD8 authentication at handshake time in
  order to ensure each device connects to the correct peer

- remove the restriction that failover is possible only to the
  secondary, which creates very strict rules on cluster allocation

Detailed Design
---------------

DRBD minor allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if we find
one that looks similar to our parameters and is already in the desired
state or not. Since this needs external commands to be run, it is very
slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that
device.
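
As a minimal sketch of the static assignment, assume the configuration
can be viewed as a map from node name to the minors already assigned on
that node. The names ``config``, ``find_free_minor`` and ``assign_minor``
are hypothetical and only illustrate the idea; they are not Ganeti code.

```python
from typing import Dict, Iterable, List


def find_free_minor(used_minors: Iterable[int]) -> int:
    """Return the lowest non-negative minor that is not in used_minors."""
    used = set(used_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor


def assign_minor(config: Dict[str, List[int]], node: str) -> int:
    """Reserve a minor for a new device on `node` and record it.

    `config` is a hypothetical stand-in for the cluster configuration:
    it lists the minors that *should* exist on each node, so no
    external command is needed to pick a free one.
    """
    minors = config.setdefault(node, [])
    minor = find_free_minor(minors)
    minors.append(minor)
    return minor
```

Because the decision is taken from the configuration alone, its cost does
not depend on how many devices are currently active on the node.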

At device activation, if the minor is already in use, we check if
it has our parameters; if not, we just destroy the device (if
possible, otherwise we abort) and start it with our own
parameters.
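
The activation rule can be sketched as a pure decision function. This is
only an illustration: ``Action`` and ``plan_activation`` are hypothetical
names, device parameters are modelled as plain dictionaries, and reusing
a device whose parameters already match is our reading of the text above.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CREATE = "create"      # minor is free: set it up with our parameters
    REUSE = "reuse"        # minor already carries our parameters
    RECREATE = "recreate"  # foreign device: destroy it, then set up ours
    ABORT = "abort"        # foreign device that cannot be destroyed


def plan_activation(current: Optional[dict], wanted: dict,
                    can_destroy: bool) -> Action:
    """Decide what to do with our statically assigned minor.

    `current` is None when the minor is not in use, otherwise the
    parameters of whatever device occupies it right now.
    """
    if current is None:
        return Action.CREATE
    if current == wanted:
        return Action.REUSE
    return Action.RECREATE if can_destroy else Action.ABORT
```

Keeping ``ABORT`` as an explicit outcome matters: starting our device on
top of a minor that could not be shut down cleanly is exactly the case
that must never happen.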

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created drbd minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.

Removal of obsolete device types (md, drbd7)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).

File-based storage support
~~~~~~~~~~~~~~~~~~~~~~~~~~

This is covered by a separate design doc (*Vinales*) and
would allow us to get rid of the hard LVM requirement for testing
clusters; it would also allow people who have SAN storage to do live
failover taking advantage of their storage solution.

Variable number of disks
~~~~~~~~~~~~~~~~~~~~~~~~

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make a fully flexible disk
definition. This has less impact than it might look at first sight:
only the instance creation has a hardcoded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code is already working with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces are not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk from read-only to
read-write afterwards.

Better LVM allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks; however, it introduces the problem that an instance could
end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate all the storage of one instance as
close together as possible. We will still allow the logical volumes to
spill over to additional disks as needed.
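
One possible shape for such a batched allocation is sketched below. It is
an assumption-laden illustration, not the planned implementation:
``allocate_batch`` is a hypothetical helper, free space is given as a
plain map from physical volume name to megabytes, and a single logical
volume is never split across physical volumes.

```python
from typing import Dict, List, Optional


def allocate_batch(free_mb: Dict[str, int],
                   sizes_mb: List[int]) -> Optional[Dict[int, str]]:
    """Map each volume index to a PV, preferring one PV for the whole batch.

    Returns None when the batch cannot be placed at all.
    """
    total = sum(sizes_mb)
    # First choice: a single PV that can hold the whole instance.
    fitting = [pv for pv, free in free_mb.items() if free >= total]
    if fitting:
        best = max(fitting, key=lambda pv: free_mb[pv])
        return {idx: best for idx in range(len(sizes_mb))}
    # Otherwise spill over: place the largest volumes first and reuse
    # PVs already chosen for this instance whenever they still have room.
    remaining = dict(free_mb)
    used: List[str] = []
    placement: Dict[int, str] = {}
    order = sorted(range(len(sizes_mb)), key=lambda idx: -sizes_mb[idx])
    for idx in order:
        size = sizes_mb[idx]
        candidates = [pv for pv in used if remaining[pv] >= size]
        if not candidates:
            candidates = [pv for pv in remaining if remaining[pv] >= size]
        if not candidates:
            return None
        pv = max(candidates, key=lambda name: remaining[name])
        remaining[pv] -= size
        if pv not in used:
            used.append(pv)
        placement[idx] = pv
    return placement
```

For example, with 1000 MB free on ``sda`` and 5000 MB on ``sdb``, a data
volume of 2000 MB and a 128 MB metadata volume both land on ``sdb``, so a
failure of ``sda`` does not touch the instance.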

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or when replacing individual disks, it's not easy enough to compute the
current disk map so we'll not attempt the clustering.

DRBD8 peer authentication at handshake
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this to prevent connecting to the wrong peer
more than for securing the connection. Even though we never had issues
with wrong connections, it would be good to implement this.

LVM self-repair (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` using the following
method:

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected
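
The decision part of this method can be sketched as follows. The function
name ``plan_self_repair`` and its inputs are hypothetical; the sketch
only decides *whether* and *where* to run ``vgreduce --removemissing``
and leaves running the command, logging, and persisting the status to
the caller. Returning the new status for the caller to save in every
case is an assumption on our part.

```python
from typing import Dict, Optional, Tuple


def plan_self_repair(previous_ok: bool,
                     consistent: Dict[str, bool]) -> Tuple[bool, Optional[str]]:
    """Return (status_to_save, node_to_repair).

    `consistent` maps each node name to whether its volume group is
    currently consistent. A node is returned only on the transition
    from "all consistent" to "exactly one node inconsistent".
    """
    now_ok = all(consistent.values())
    node = None
    if previous_ok and not now_ok:
        broken = sorted(name for name, ok in consistent.items() if not ok)
        if len(broken) == 1:
            node = broken[0]
    return now_ok, node
```

Acting only on the yes-to-no transition keeps the repair from being
retried on every check while the cluster stays inconsistent.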

Failover to any node
~~~~~~~~~~~~~~~~~~~~

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets reduced to
  a need to reserve memory anywhere on the cluster

- the need to first failover and then replace secondary for an
  instance: with failover-to-any, we can directly failover to
  another node, which also does the replace disks at the same
  step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)
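
The constraints on S2 can be written down as a small helper. This is a
sketch under stated assumptions: ``choose_new_secondary`` and
``live_migration_possible`` are hypothetical names, nodes are identified
by plain strings, and defaulting to P1 when both nodes are healthy and
the user expresses no preference is our choice, not something the design
prescribes.

```python
from typing import Optional, Set


def choose_new_secondary(p1: str, s1: str, failed: Set[str],
                         preferred: Optional[str] = None) -> str:
    """Pick S2 among P1 and S1, excluding whichever of them has failed."""
    healthy = [node for node in (p1, s1) if node not in failed]
    if not healthy:
        raise ValueError("both P1 and S1 have failed; no usable copy of the data")
    if preferred is not None:
        if preferred not in healthy:
            raise ValueError(f"{preferred} cannot be used as S2")
        return preferred
    # Assumption: keep P1 when possible, since only that choice leaves
    # live migration open.
    return healthy[0]


def live_migration_possible(p1: str, s2: str, failed: Set[str]) -> bool:
    """Live migration requires that P1 is still alive and is kept as S2."""
    return s2 == p1 and p1 not in failed
```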

The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing between
  P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
  start resyncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has started
  but before it has finished), we can promote it to primary role (r/w)
  and start the instance on P2

- as soon as the P2⇐S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
  S2

Caveats: during the P2⇐S2 sync, a (non-transient) network error
will cause I/O errors on the instance, so (if a longer instance
downtime is acceptable) we can postpone the restart of the instance
until the resync is done. However, disk I/O errors on S2 will cause
data loss, since we don't have a good copy of the data anymore, so in
this case waiting for the sync to complete is not an option. As such,
it is recommended that this feature is used only in conjunction with
proper disk monitoring.

Live migration note: While failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
-------

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake another instance's DRBD device, since it
always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.

Security Considerations
-----------------------

The changes will not affect the security model of Ganeti.