Ganeti 2.0 disk handling changes
================================

Objective
---------

Change the storage options available and the details of the
implementation such that we overcome some design limitations present
in Ganeti 1.x.

Background
----------

The storage options available in Ganeti 1.x were introduced based on
then-current software (DRBD 0.7 and later DRBD 8) and the estimated
usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as 'rare'
events, and that each of them needs to be manually handled in order to ensure
data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or cluster
  size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
  automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal, but our
goal is to reduce the manual work needed.

Overview
--------

We plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version (0.7),
  so removing the support for the ``remote_raid1`` template and
  focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a static
  assignment (done at either instance creation time or change secondary time)
  will change the disk activation time from O(n) to O(1), which on big
  clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage types are
  ultimately backed by LVM volumes) by introducing file-based storage

Additionally, a number of smaller enhancements are also planned:

- support a variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base design
changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical
  disks

- add support for DRBD8 authentication at handshake time in
  order to ensure each device connects to the correct peer

- remove the restriction that failover is possible only to the secondary,
  which creates very strict rules on cluster allocation

Detailed Design
---------------

DRBD minor allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if we find
one that looks similar to our parameters and is already in the desired
state or not. Since this needs external commands to be run, it is very
slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that
device.

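As an illustration of the static model, the free minor can be computed
purely from the configuration, without running any external command.
The following is only a minimal sketch: the function name
``find_free_minor`` and the shape of the input (a plain list of minors
already assigned on one node) are assumptions made for this example,
not the actual Ganeti code.

```python
def find_free_minor(used_minors):
    """Return the lowest non-negative minor not present in used_minors.

    used_minors is assumed to hold the minors that the configuration
    says should exist on one node (hypothetical input format).
    """
    used = set(used_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor


# Hypothetical per-node view of the configuration.
config_minors = {"node1": [0, 1, 3], "node2": []}
for node, minors in sorted(config_minors.items()):
    print(node, find_free_minor(minors))
```

Because the lookup only consults configuration data, its cost does not
depend on how many devices are currently active on the node.
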
At device activation, if the minor is already in use, we check if
it has our parameters; if not, we just destroy the device (if
possible, otherwise we abort) and start it with our own
parameters.

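The activation rule can be summarised as a small decision function.
This is a sketch under stated assumptions: the three boolean inputs and
the returned action names are hypothetical, chosen only to make the
branches explicit.

```python
def plan_activation(minor_in_use, params_match, can_destroy):
    """Return the action to take for a statically assigned minor.

    All argument names and return values are illustrative only.
    """
    if not minor_in_use:
        return "create"              # minor is free, start our device
    if params_match:
        return "reuse"               # already configured as we want
    if can_destroy:
        return "destroy-and-create"  # take over the minor
    return "abort"                   # never proceed on a foreign device


print(plan_activation(True, False, False))
```

The ``abort`` branch is the important one: if a device with foreign
parameters cannot be shut down, activation must stop rather than reuse
it.
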
This means that we in effect take ownership of the minor space for
that device type; if there's a user-created DRBD minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.

Removal of obsolete device types (md, drbd7)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).

File-based storage support
~~~~~~~~~~~~~~~~~~~~~~~~~~

This is covered by a separate design doc (*Vinales*) and
would allow us to get rid of the hard LVM requirement for testing
clusters; it would also allow people who have SAN storage to do live
failover taking advantage of their storage solution.

Variable number of disks
~~~~~~~~~~~~~~~~~~~~~~~~

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make the disk definition fully
flexible. This has less impact than it might seem at first sight:
only the instance creation has a hardcoded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code already work with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces do not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk between read-only
and read-write afterwards.

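A flexible disk definition could look roughly like the following
sketch. The ``DiskSpec`` class, its field names and the ``ro``/``rw``
mode strings are assumptions made for illustration; this document does
not define the actual data structure.

```python
from dataclasses import dataclass


@dataclass
class DiskSpec:
    """Hypothetical per-disk definition: a size and an access mode."""
    size_mb: int
    mode: str = "rw"

    def __post_init__(self):
        if self.mode not in ("ro", "rw"):
            raise ValueError("mode must be 'ro' or 'rw'")


def toggle_mode(disk):
    """Flip a disk between read-only and read-write."""
    disk.mode = "ro" if disk.mode == "rw" else "rw"


# An instance is described by a list of disks of any length,
# here a read-only first disk and a read-write second disk.
disks = [DiskSpec(10240, "ro"), DiskSpec(4096)]
toggle_mode(disks[1])
print([d.mode for d in disks])
```
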
Better LVM allocation
~~~~~~~~~~~~~~~~~~~~~

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks. However, it means that an instance could
end up with its (currently) two drives on two different physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate all the storage of one instance as close
together as possible. We will still allow the logical volumes to spill
over to additional disks as needed.

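One possible shape for such a batched allocation is sketched below. It
is a simplification and not the planned implementation: the function
name, the input format (free space per physical volume, sizes per
logical volume) and the rule that each volume is placed whole on one
physical volume are all assumptions of this example.

```python
def allocate_clustered(free_by_pv, sizes):
    """Map each volume index to a PV name, preferring a single PV.

    free_by_pv: dict of PV name -> free space (hypothetical units).
    sizes: sizes of all volumes requested for one instance.
    """
    total = sum(sizes)
    # First choice: a single PV that can hold the whole instance.
    fitting = [pv for pv, free in free_by_pv.items() if free >= total]
    if fitting:
        best = max(fitting, key=lambda pv: free_by_pv[pv])
        return {i: best for i in range(len(sizes))}

    # Otherwise spill over, reusing PVs already chosen where possible.
    remaining = dict(free_by_pv)
    placement = {}
    for i, size in sorted(enumerate(sizes), key=lambda item: -item[1]):
        candidates = [pv for pv, free in remaining.items() if free >= size]
        if not candidates:
            raise ValueError("not enough free space for volume %d" % i)
        used = set(placement.values())
        preferred = [pv for pv in candidates if pv in used] or candidates
        pv = max(preferred, key=lambda name: remaining[name])
        placement[i] = pv
        remaining[pv] -= size
    return placement


print(allocate_clustered({"sda": 100, "sdb": 50}, [30, 40]))
```

The key point is that the request is considered as a whole (the sum of
all volumes) before any individual volume is placed.
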
Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or when replacing individual disks, it is not easy to compute the
current disk map, so we will not attempt the clustering.

DRBD8 peer authentication at handshake
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this to prevent connecting to the wrong peer,
more than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.

LVM self-repair (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` using the following
method:

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected

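The decision steps above can be sketched as a pure function that only
decides what to do and leaves execution and logging to the caller. The
function name, the action strings and the ``vg_name`` parameter are
assumptions for this example; the command line is built but not run.

```python
def self_repair_action(inconsistent_nodes, previously_consistent, vg_name):
    """Return (action, details) for one run of the self-repair check.

    inconsistent_nodes: names of nodes whose volume group is inconsistent.
    previously_consistent: status saved by the previous run.
    vg_name: volume group name (hypothetical parameter).
    """
    if not inconsistent_nodes:
        # All volume groups are consistent now.
        if previously_consistent:
            return ("nothing", None)
        return ("save-status-and-restart", None)
    if not previously_consistent:
        return ("nothing", None)
    if len(inconsistent_nodes) > 1:
        return ("nothing", None)
    node = inconsistent_nodes[0]
    argv = ["vgreduce", "--removemissing", vg_name]
    return ("repair", (node, argv))


print(self_repair_action(["node3"], True, "xenvg"))
```

Automatic repair is attempted only for the single transition from
consistent to inconsistent on exactly one node; every other combination
either does nothing or just records the new status.
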
Failover to any node
~~~~~~~~~~~~~~~~~~~~

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets reduced to
  a need to reserve memory anywhere on the cluster

- the need to first failover and then replace secondary for an
  instance: with failover-to-any, we can directly failover to
  another node, which also does the replace disks in the same
  step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2
is fixed to the node the user chooses, but S2 can be chosen
between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)

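The constraint on S2 can be written down as a short function. This is a
sketch only: the function name, the string labels ``"P1"`` and ``"S1"``
and the ``preferred`` argument (used when neither node has failed) are
assumptions of the example.

```python
def choose_secondary(failed, preferred="P1"):
    """Return (S2, live_migration_possible).

    failed: "P1", "S1" or None when both old nodes are healthy.
    preferred: which old node to keep as S2 when there is a choice.
    """
    if failed == "P1":
        return ("S1", False)
    if failed == "S1":
        # Possible in theory only; not a design goal for 2.0.
        return ("P1", True)
    if failed is None:
        if preferred not in ("P1", "S1"):
            raise ValueError("preferred must be 'P1' or 'S1'")
        # Migration requires that the old primary is kept as S2.
        return (preferred, preferred == "P1")
    raise ValueError("failed must be 'P1', 'S1' or None")


print(choose_secondary("P1"))
```
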
The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing between
  P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
  start resyncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has started
  but before it has finished), we can promote it to primary role (r/w)
  and start the instance on P2

- as soon as the P2⇐S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
  S2

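To make the ordering of these steps explicit, the sequence can be
expressed as a plan that is computed before anything is executed. The
function below is only an illustrative sketch: its name, the
``(step, node)`` tuples and the step labels are hypothetical, and no
DRBD command is actually run.

```python
def failover_plan(p1, s1, p2, s2):
    """Return the ordered steps for a failover of one instance to p2.

    p1, s1: current primary and secondary; p2: node chosen by the user;
    s2: the old node kept as the new secondary (must be p1 or s1).
    """
    if s2 not in (p1, s1):
        raise ValueError("S2 must be either P1 or S1")
    # The old node not kept as S2 holds data that becomes obsolete.
    discarded = p1 if s2 == s1 else s1
    return [
        ("verify-consistent", s2),
        ("teardown-drbd", (p1, s1)),
        ("setup-drbd", (p2, s2)),
        ("wait-sync-target", p2),
        ("promote-primary", p2),
        ("start-instance", p2),
        ("wait-sync-finished", p2),
        ("remove-old-data", discarded),
    ]


for step in failover_plan("node1", "node2", "node3", "node2"):
    print(step)
```
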
Caveats: during the P2⇐S2 sync, a (non-transient) network error
will cause I/O errors on the instance, so (if a longer instance
downtime is acceptable) we can postpone the restart of the instance
until the resync is done. However, disk I/O errors on S2 will cause
data loss, since we don't have a good copy of the data anymore, so in
this case waiting for the sync to complete is not an option. As such,
it is recommended that this feature is used only in conjunction with
proper disk monitoring.

Live migration note: While failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
-------

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake another instance's DRBD device, since it
always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.

Security Considerations
-----------------------

The changes will not affect the security model of Ganeti.