Ganeti 2.0 disk handling changes
================================

Objective
---------

Change the storage options available and the details of the
implementation such that we overcome some design limitations present
in Ganeti 1.x.

Background
----------

The storage options available in Ganeti 1.x were introduced based on
then-current software (DRBD 0.7 and later DRBD 8) and the estimated
usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as
'rare' events, and that each of them needs to be manually handled in
order to ensure data safety; however, both these assumptions are
false:

- disk failures can be a common occurrence, based on usage patterns or
  cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery

Note that fully-automated disk recovery is still not a goal; the goal
is to reduce the manual work needed.

Overview
--------

We plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), such that removing the support for the ``remote_raid1``
  template and focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a
  cluster where the DRBD namespace is controlled by Ganeti; switching
  to a static assignment (done at either instance creation time or
  change secondary time) will change the disk activation time from
  O(n) to O(1), which on big clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing
  file-based storage

Additionally, a number of smaller enhancements are also planned:

- support a variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base
design changes, might include:

- enhancement of the LVM allocation method in order to try to keep all
  of an instance's virtual disks on the same physical disks

- add support for DRBD8 authentication at handshake time in order to
  ensure each device connects to the correct peer

- remove the restriction of failing over only to the secondary, which
  creates very strict rules on cluster allocation

Detailed Design
---------------

DRBD minor allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices to find one that matches our
parameters and is already in the desired state. Since this requires
running external commands, it is very slow when more than a few
devices are already present.

Therefore, we will change the discovery model from dynamic to static.
When a new device is logically created (added to the configuration), a
free minor number is computed from the list of devices that should
exist on that node and assigned to that device.
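
As an illustration, a minimal sketch of the free-minor computation
(the function and the ``minor`` attribute are hypothetical stand-ins
for the real configuration objects)::

  def FindUnusedMinor(node_disks):
    """Return the first DRBD minor not assigned to any configured disk.

    node_disks is the list of disk objects that the configuration
    says should exist on this node.

    """
    used = set(disk.minor for disk in node_disks)
    minor = 0
    while minor in used:
      minor += 1
    return minor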

At device activation time, if the minor is already in use, we check
whether it has our parameters; if not, we destroy the device (if
possible, otherwise we abort) and start it with our own parameters.
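
A sketch of this activation-time check, again with hypothetical
helpers standing in for the real device-handling code::

  def ActivateDrbdMinor(minor, our_params):
    """Bring up a DRBD minor, taking over the slot if necessary.

    GetDrbdDevInfo, ShutdownDrbdMinor and StartDrbdMinor are
    hypothetical helpers, not the actual Ganeti device code.

    """
    current = GetDrbdDevInfo(minor)  # None if the minor is unused
    if current is not None:
      if current == our_params:
        return  # already in the desired state, nothing to do
      if not ShutdownDrbdMinor(minor):
        # we cannot cleanly free the minor: abort rather than risk
        # syncing between two different instances' disks
        raise RuntimeError("DRBD minor %d is busy" % minor)
    StartDrbdMinor(minor, our_params)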

This means that we in effect take ownership of the minor space for
that device type; if there is a user-created DRBD minor, it will be
automatically removed.

This change reduces the number of external commands run per device
from a constant number times the index of the first free DRBD minor to
just a constant number.

Removal of obsolete device types (md, drbd7)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).

File-based storage support
~~~~~~~~~~~~~~~~~~~~~~~~~~

This is covered by a separate design doc (*Vinales*) and would allow
us to get rid of the hard LVM requirement for testing clusters; it
would also allow people who have SAN storage to do live failover,
taking advantage of their storage solution.

Variable number of disks
~~~~~~~~~~~~~~~~~~~~~~~~

In order to support high-security scenarios (for example a read-only
sda and a read-write sdb), we need a fully flexible disk definition.
This has less impact than it might seem at first sight: only the
instance creation code has a hardcoded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code already work with "the instance's disks" as opposed to
"the two disks of the instance", but some pieces do not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at instance
creation time, and to be able to toggle a disk between read-only and
read-write afterwards.
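
For illustration, a minimal sketch of what such a per-disk definition
could look like (the field names and the helper are hypothetical, not
the actual configuration objects)::

  # hypothetical configuration fragment: each instance carries a list
  # of disk definitions instead of a fixed sda/sdb pair (sizes in MiB)
  disks = [
    {"iv_name": "sda", "size": 10240, "mode": "ro"},
    {"iv_name": "sdb", "size": 4096, "mode": "rw"},
    ]

  # the handling code then iterates over however many disks the
  # instance defines, instead of assuming exactly two
  for disk in disks:
    AssembleBlockDevice(disk)  # hypothetical per-disk activation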

Better LVM allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, the LV-to-PV allocation mechanism is a very simple one: at
each new request for a logical volume, we tell LVM to allocate the
volume in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks; however, it introduces the problem that an instance
could end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate all of one instance's storage as close
together as possible. We will still allow the logical volumes to spill
over to additional disks as needed.
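
As an illustration, LVM already allows restricting an allocation to
specific physical volumes by listing them after the volume group, so a
batched allocator could pick a target PV per instance and pass it down
explicitly (``RunCmd``, the volume names, and the device path below
are hypothetical examples)::

  def AllocateInstanceVolumes(vg_name, pv_name, volumes):
    """Create all of an instance's LVs on the same physical volume.

    RunCmd is a hypothetical wrapper around command execution.

    """
    for lv_name, size_mb in volumes:
      # listing pv_name after the VG restricts LVM's allocation to it
      RunCmd(["lvcreate", "-L", "%dm" % size_mb, "-n", lv_name,
              vg_name, pv_name])

  # the data and DRBD metadata volumes for a disk end up together
  AllocateInstanceVolumes("xenvg", "/dev/sdb1",
                          [("inst1-sda-data", 10240),
                           ("inst1-sda-meta", 128)])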

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or when replacing individual disks, it is not easy enough to compute
the current disk map, so we will not attempt the clustering.

DRBD8 peer authentication at handshake
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this to prevent connecting to the wrong peer,
more than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.
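
For reference, DRBD 8 exposes this through the ``cram-hmac-alg`` and
``shared-secret`` options of the ``net`` section in its configuration
(the values below are placeholders; Ganeti would generate a per-device
secret)::

  net {
    cram-hmac-alg "sha1";
    shared-secret "per-device-secret";
  }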

LVM self-repair (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` step, using the
following method:

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that can be
         used for monitoring
      #. [FUTURE] run ``replace-disks`` for all instances affected
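
A compact sketch of this decision logic (the helpers and the way the
status is saved are hypothetical; in practice the check would be
driven by a periodic job)::

  import logging

  def MaybeRepairVG(vg_name, node_states, prev_consistent):
    """Decide whether to run vgreduce --removemissing.

    node_states maps node names to a VG-consistency boolean, and
    prev_consistent is the status saved by the previous round; the
    return value is the status to save for the next round. RunOnNode
    is a hypothetical remote-execution helper.

    """
    bad_nodes = [name for name, ok in node_states.items() if not ok]
    consistent = not bad_nodes
    if consistent == prev_consistent:
      return prev_consistent  # steps 2 and 4: no change, do nothing
    if consistent:
      return True  # step 3: save the new (recovered) status
    if len(bad_nodes) > 1:
      return prev_consistent  # more than one node is bad: do nothing
    # exactly one node just became inconsistent: safe to self-repair
    RunOnNode(bad_nodes[0], ["vgreduce", "--removemissing", vg_name])
    logging.warning("removed missing PVs from %s on node %s",
                    vg_name, bad_nodes[0])
    return False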

Failover to any node
~~~~~~~~~~~~~~~~~~~~

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this is
  reduced to the need to reserve memory anywhere on the cluster

- the need to first failover and then replace the secondary for an
  instance: with failover-to-any, we can directly failover to another
  node, which also replaces the disks in the same step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2 is
fixed to the node the user chooses, but the choice of S2 can be made
between P1 and S1. This choice can be constrained, depending on which
of P1 and S1 has failed:

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could
  be possible (in theory, but this is not a design goal for 2.0)

The algorithm for performing the failover is straightforward (see the
sketch after this list):

- verify that S2 (the node the user has chosen to keep as secondary)
  has valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (the node indicated by the user) and S2; since P2 has no
  data, it will start resyncing from S2

- as soon as P2 is in the SyncTarget state (i.e. after the resync has
  started but before it has finished), we can promote it to the
  primary role (r/w) and start the instance on P2

- as soon as the P2⇐S2 sync has finished, we can remove the old data
  on the old node that has not been chosen for S2
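
A high-level sketch of this sequence (all helper names are
illustrative placeholders, not the actual Ganeti logical units or RPC
calls)::

  def FailoverToAny(instance, p2, s2, old_node):
    """Fail over an instance to an arbitrary node P2, keeping S2.

    All helpers called here are hypothetical placeholders.

    """
    if not IsDiskConsistent(s2, instance):
      raise RuntimeError("S2 does not have valid (consistent) data")
    TearDownDrbd(instance)            # drop the old P1-S1 pairing
    SetupDrbdPair(p2, s2, instance)   # P2 starts resyncing from S2
    WaitForDrbdState(p2, instance, "SyncTarget")
    PromoteToPrimary(p2, instance)    # r/w while the resync runs
    StartInstance(p2, instance)
    WaitForSyncDone(p2, instance)
    RemoveDisks(old_node, instance)   # the node not chosen as S2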

Caveats: during the P2⇐S2 sync, a (non-transient) network error will
cause I/O errors on the instance, so (if a longer instance downtime is
acceptable) we can postpone the restart of the instance until the
resync is done. However, disk I/O errors on S2 will cause data loss,
since we no longer have a good copy of the data, so in this case
waiting for the sync to complete is not an option. As such, it is
recommended that this feature be used only in conjunction with proper
disk monitoring.

Live migration note: while failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
-------

The dynamic device model, while more complex, has an advantage: it
will not mistakenly reuse another instance's DRBD device, since it
always looks for either our own minor or a free one.

The static one, in contrast, will assume that a given minor number N
is ours and that we can take it over. This needs careful
implementation, such that if the minor is in use, either we are able
to cleanly shut it down, or we abort the startup. Otherwise, we could
start syncing between two instances' disks, causing data loss.

Security Considerations
-----------------------

The changes will not affect the security model of Ganeti.