root / doc / design-move-instance-improvements.rst @ ec9c1bf8
1 | 0cb22cf2 | Hrvoje Ribicic | ======================================== |
---|---|---|---|
2 | 0cb22cf2 | Hrvoje Ribicic | Instance move improvements |
3 | 0cb22cf2 | Hrvoje Ribicic | ======================================== |
4 | 0cb22cf2 | Hrvoje Ribicic | |
5 | 0cb22cf2 | Hrvoje Ribicic | .. contents:: :depth: 3 |
6 | 0cb22cf2 | Hrvoje Ribicic | |
7 | 0cb22cf2 | Hrvoje Ribicic | Ganeti provides tools for moving instances within and between clusters. Through |
8 | 0cb22cf2 | Hrvoje Ribicic | special export and import calls, a new instance is created with the disk data of |
9 | 0cb22cf2 | Hrvoje Ribicic | the existing one. |
10 | 0cb22cf2 | Hrvoje Ribicic | |
11 | 0cb22cf2 | Hrvoje Ribicic | The tools work correctly and reliably, but depending on bandwidth and priority, |
12 | 0cb22cf2 | Hrvoje Ribicic | an instance disk of considerable size requires a long time to transfer. The |
13 | 0cb22cf2 | Hrvoje Ribicic | length of the transfer is inconvenient at best, but the problem becomes only |
14 | 0cb22cf2 | Hrvoje Ribicic | worse if excessive locking causes a move operation to be delayed for a longer |
15 | 0cb22cf2 | Hrvoje Ribicic | period of time, or to block other operations. |
16 | 0cb22cf2 | Hrvoje Ribicic | |
17 | 0cb22cf2 | Hrvoje Ribicic | The performance of moves is a complex topic, with available bandwidth, |
18 | 0cb22cf2 | Hrvoje Ribicic | compression, and encryption all being candidates for choke points that bog down |
19 | 0cb22cf2 | Hrvoje Ribicic | a transfer. Depending on the environment a move is performed in, tuning these |
20 | 0cb22cf2 | Hrvoje Ribicic | can have significant performance benefits, but Ganeti does not expose many |
21 | 0cb22cf2 | Hrvoje Ribicic | options needed for such tuning. The details of what to expose and what tradeoffs |
22 | 0cb22cf2 | Hrvoje Ribicic | can be made will be presented in this document. |
23 | 0cb22cf2 | Hrvoje Ribicic | |
24 | 0cb22cf2 | Hrvoje Ribicic | Apart from existing functionality, some beneficial features can be introduced to |
25 | 0cb22cf2 | Hrvoje Ribicic | help with instance moves. Zeroing empty space on instance disks can be useful |
26 | 0cb22cf2 | Hrvoje Ribicic | for drastically improving the qualities of compression, effectively not needing |
27 | 0cb22cf2 | Hrvoje Ribicic | to transfer unused disk space during moves. Compression itself can be improved |
28 | 0cb22cf2 | Hrvoje Ribicic | by using different tools. The encryption used can be weakened or eliminated for |
29 | 0cb22cf2 | Hrvoje Ribicic | certain moves. Using opportunistic locking during instance moves results in |
30 | 0cb22cf2 | Hrvoje Ribicic | greater parallelization. As all of these approaches aim to tackle different |
31 | 0cb22cf2 | Hrvoje Ribicic | aspects of the problem, they do not exclude each other and will be presented |
32 | 0cb22cf2 | Hrvoje Ribicic | independently. |
33 | 0cb22cf2 | Hrvoje Ribicic | |
34 | 0cb22cf2 | Hrvoje Ribicic | The performance of Ganeti moves |
35 | 0cb22cf2 | Hrvoje Ribicic | =============================== |
36 | 0cb22cf2 | Hrvoje Ribicic | |
37 | 0cb22cf2 | Hrvoje Ribicic | In the current implementation, there are three possible factors limiting the |
38 | 0cb22cf2 | Hrvoje Ribicic | speed of an instance move. The first is the network bandwidth, which Ganeti can |
39 | 0cb22cf2 | Hrvoje Ribicic | exploit better by using compression. The second is the encryption, which is |
40 | 0cb22cf2 | Hrvoje Ribicic | obligatory, and which can throttle an otherwise fast connection. The third is |
41 | 0cb22cf2 | Hrvoje Ribicic | surprisingly the compression, which can cause the connection to be |
42 | 0cb22cf2 | Hrvoje Ribicic | underutilized. |
43 | 0cb22cf2 | Hrvoje Ribicic | |
44 | 0cb22cf2 | Hrvoje Ribicic | Example 1: some numbers present during an intra-cluster instance move: |
45 | 0cb22cf2 | Hrvoje Ribicic | |
46 | 0cb22cf2 | Hrvoje Ribicic | * Network bandwidth: 105MB/s, courtesy of a gigabit switch |
47 | 0cb22cf2 | Hrvoje Ribicic | |
48 | 0cb22cf2 | Hrvoje Ribicic | * Encryption performance: 40MB/s, provided by OpenSSL |
49 | 0cb22cf2 | Hrvoje Ribicic | |
50 | 0cb22cf2 | Hrvoje Ribicic | * Compression performance: 22.3MB/s input, 7.1MB/s gzip compressed output |
51 | 0cb22cf2 | Hrvoje Ribicic | |
52 | 0cb22cf2 | Hrvoje Ribicic | As can be seen in this example, the obligatory encryption results in 62% of |
53 | 0cb22cf2 | Hrvoje Ribicic | available bandwidth being wasted, while using compression further lowers the |
54 | 0cb22cf2 | Hrvoje Ribicic | throughput to about 56% of what the encryption would allow. The following sections |
55 | 0cb22cf2 | Hrvoje Ribicic | will talk about these numbers in more detail, and suggest improvements and best |
56 | 0cb22cf2 | Hrvoje Ribicic | practices. |
57 | 0cb22cf2 | Hrvoje Ribicic | |
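The percentages quoted for Example 1 follow from simple ratios of the measured rates. The short Python sketch below reproduces them; the helper name is illustrative only and is not part of Ganeti.

```python
# Rates from Example 1, in MB/s.
NETWORK = 105.0
ENCRYPTION = 40.0
GZIP_INPUT = 22.3


def share(rate, reference):
    """Return rate as a fraction of reference."""
    return rate / reference


# Encryption caps the transfer at 40 of the 105 MB/s the network offers.
wasted_by_encryption = 1.0 - share(ENCRYPTION, NETWORK)

# gzip cannot read disk data faster than 22.3 MB/s, so the encrypted
# channel is fed less than it could carry.
gzip_vs_encryption = share(GZIP_INPUT, ENCRYPTION)

print("bandwidth unused due to encryption: %.0f%%" % (100 * wasted_by_encryption))
print("gzip throughput relative to encryption: %.0f%%" % (100 * gzip_vs_encryption))
```
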
58 | 0cb22cf2 | Hrvoje Ribicic | Encryption and Ganeti security |
59 | 0cb22cf2 | Hrvoje Ribicic | ++++++++++++++++++++++++++++++ |
60 | 0cb22cf2 | Hrvoje Ribicic | |
61 | 0cb22cf2 | Hrvoje Ribicic | Turning compression and encryption off would allow for an immediate improvement, |
62 | 0cb22cf2 | Hrvoje Ribicic | and while that is possible for compression, there are good reasons why |
63 | 0cb22cf2 | Hrvoje Ribicic | encryption is currently not a feature a user can disable. |
64 | 0cb22cf2 | Hrvoje Ribicic | |
65 | 0cb22cf2 | Hrvoje Ribicic | While it is impossible to secure instance data if an attacker gains SSH access |
66 | 0cb22cf2 | Hrvoje Ribicic | to a node, the RAPI was designed to never allow user data to be accessed through |
67 | 0cb22cf2 | Hrvoje Ribicic | it in case of being compromised. If moves could be performed unencrypted, this |
68 | 0cb22cf2 | Hrvoje Ribicic | property would be broken. Instance moves can take place in environments which |
69 | 0cb22cf2 | Hrvoje Ribicic | may be hostile, and where unencrypted traffic could be intercepted. As they can |
70 | 0cb22cf2 | Hrvoje Ribicic | be instigated through the RAPI, an attacker could access all data on all |
71 | 0cb22cf2 | Hrvoje Ribicic | instances in a cluster by moving them unencrypted and intercepting the data in |
72 | 0cb22cf2 | Hrvoje Ribicic | flight. This is one of the few situations where the current speed of instance |
73 | 0cb22cf2 | Hrvoje Ribicic | moves could be considered a perk. |
74 | 0cb22cf2 | Hrvoje Ribicic | |
75 | 0cb22cf2 | Hrvoje Ribicic | The performance of encryption can be increased by either using a less secure |
76 | 0cb22cf2 | Hrvoje Ribicic | form of encryption, including no encryption, or using a faster encryption |
77 | 0cb22cf2 | Hrvoje Ribicic | algorithm. The example listed above utilizes AES-256, one of the few ciphers |
78 | 0cb22cf2 | Hrvoje Ribicic | that Ganeti deems secure enough to use. AES-128, also allowed by Ganeti's |
79 | 0cb22cf2 | Hrvoje Ribicic | current settings, is weaker but 46% faster. A cipher that is not allowed due to |
80 | 0cb22cf2 | Hrvoje Ribicic | its flaws, such as RC4, could offer a 208% increase in speed. On the other hand, |
81 | 0cb22cf2 | Hrvoje Ribicic | using an OS capable of utilizing the AES-NI instruction set present on modern hardware |
82 | 0cb22cf2 | Hrvoje Ribicic | can double the performance of AES, making it the best tradeoff between security |
83 | 0cb22cf2 | Hrvoje Ribicic | and performance. |
84 | 0cb22cf2 | Hrvoje Ribicic | |
85 | 0cb22cf2 | Hrvoje Ribicic | Ganeti cannot and should not detect all the factors listed above, but should |
86 | 0cb22cf2 | Hrvoje Ribicic | rather give its users some leeway in what to choose. A precedent already exists, |
87 | 0cb22cf2 | Hrvoje Ribicic | as intra-cluster DRBD replication is already performed unencrypted, albeit on a |
88 | 0cb22cf2 | Hrvoje Ribicic | separate VLAN. For intra-cluster moves, Ganeti should allow its users to set |
89 | 0cb22cf2 | Hrvoje Ribicic | OpenSSL ciphers at will, while still enforcing high-security settings for moves |
90 | 0cb22cf2 | Hrvoje Ribicic | between clusters. |
91 | 0cb22cf2 | Hrvoje Ribicic | |
92 | 0cb22cf2 | Hrvoje Ribicic | Thus, two settings will be introduced: |
93 | 0cb22cf2 | Hrvoje Ribicic | |
94 | 0cb22cf2 | Hrvoje Ribicic | * a cluster-level setting called ``--allow-cipher-bypassing``, a boolean that |
95 | 0cb22cf2 | Hrvoje Ribicic | cannot be set over RAPI |
96 | 0cb22cf2 | Hrvoje Ribicic | |
97 | 0cb22cf2 | Hrvoje Ribicic | * a gnt-instance move setting called ``--ciphers-to-use``, bypassing the default |
98 | 0cb22cf2 | Hrvoje Ribicic | cipher list with given ciphers, filtered to ensure that no other OpenSSL options |
99 | 0cb22cf2 | Hrvoje Ribicic | are passed in along with them |
100 | 0cb22cf2 | Hrvoje Ribicic | |
101 | 0cb22cf2 | Hrvoje Ribicic | This change will serve to address the issues with moving non-redundant instances |
102 | 0cb22cf2 | Hrvoje Ribicic | within the cluster, while keeping Ganeti security at its current level. |
103 | 0cb22cf2 | Hrvoje Ribicic | |
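One possible shape for the filtering of ``--ciphers-to-use`` is sketched below in Python. The accepted character set, the colon separator and the function name are assumptions made for illustration; the design only requires that nothing but cipher names reaches OpenSSL.

```python
import re

# Assumption: a cipher name consists of letters, digits, dashes and
# underscores, and starts with a letter or digit so that it can never be
# mistaken for a command-line option.
_CIPHER_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def parse_cipher_list(value):
    """Split a colon-separated cipher list, rejecting anything suspicious.

    Raises ValueError if any entry is not a plain cipher name.
    """
    names = value.split(":")
    for name in names:
        if not _CIPHER_NAME.fullmatch(name):
            raise ValueError("invalid cipher name: %r" % name)
    return names
```

A stricter variant could additionally check every name against the list of ciphers that the local OpenSSL installation reports as supported.
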
104 | 0cb22cf2 | Hrvoje Ribicic | Compression |
105 | 0cb22cf2 | Hrvoje Ribicic | +++++++++++ |
106 | 0cb22cf2 | Hrvoje Ribicic | |
107 | 0cb22cf2 | Hrvoje Ribicic | Support for disk compression during instance moves was partially present before, |
108 | 0cb22cf2 | Hrvoje Ribicic | but cleaned up and unified under the ``--compress`` option only as of Ganeti |
109 | 0cb22cf2 | Hrvoje Ribicic | 2.11. The only option offered by Ganeti is gzip with no options passed to it, |
110 | 0cb22cf2 | Hrvoje Ribicic | resulting in a good compression ratio, but bad compression speed. |
111 | 0cb22cf2 | Hrvoje Ribicic | |
112 | 0cb22cf2 | Hrvoje Ribicic | As compression can affect the speed of instance moves significantly, it is |
113 | 0cb22cf2 | Hrvoje Ribicic | worthwhile to explore alternatives. To test compression tool performance, an 8GB |
114 | 0cb22cf2 | Hrvoje Ribicic | drive filled with data matching the expected usage patterns (taken from a |
115 | 0cb22cf2 | Hrvoje Ribicic | workstation) was compressed by using various tools with various settings. The |
116 | 0cb22cf2 | Hrvoje Ribicic | two top performers were ``lzop`` and, surprisingly, ``gzip``. The improvement in |
117 | 0cb22cf2 | Hrvoje Ribicic | the performance of ``gzip`` was obtained by explicitly optimizing for speed |
118 | 0cb22cf2 | Hrvoje Ribicic | rather than compression. |
119 | 0cb22cf2 | Hrvoje Ribicic | |
120 | 0cb22cf2 | Hrvoje Ribicic | * ``gzip -6``: 22.3MB/s in, 7.1MB/s out |
121 | 0cb22cf2 | Hrvoje Ribicic | * ``gzip -1``: 44.1MB/s in, 15.1MB/s out |
122 | 0cb22cf2 | Hrvoje Ribicic | * ``lzop``: 71.9MB/s in, 28.1MB/s out |
123 | 0cb22cf2 | Hrvoje Ribicic | |
124 | 0cb22cf2 | Hrvoje Ribicic | If encryption is the limiting factor, and as in the example, limits the |
125 | 0cb22cf2 | Hrvoje Ribicic | bandwidth to 40MB/s, ``lzop`` allows for an effective increase of roughly 80% in transfer |
126 | 0cb22cf2 | Hrvoje Ribicic | speed. The fast ``gzip`` would also prove to be beneficial, but much less than |
127 | 0cb22cf2 | Hrvoje Ribicic | ``lzop``. It should also be noted that as a rule of thumb, tools with a lower |
128 | 0cb22cf2 | Hrvoje Ribicic | compression ratio had a lesser workload, with ``lzop`` straining the CPU much |
129 | 0cb22cf2 | Hrvoje Ribicic | less than any of the competitors. |
130 | 0cb22cf2 | Hrvoje Ribicic | |
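The comparison between the tools can be reproduced with a small model, sketched here in Python. It assumes that compression and encryption run as a pipeline, that the compression ratio is constant over the whole disk, and that encryption caps the channel at 40MB/s as in Example 1; the function names are illustrative.

```python
# Measured rates from the list above, in MB/s: (input, output).
TOOLS = {
    "gzip -6": (22.3, 7.1),
    "gzip -1": (44.1, 15.1),
    "lzop": (71.9, 28.1),
}
ENCRYPTION_LIMIT = 40.0  # MB/s, as in Example 1


def effective_rate(tool_in, tool_out, channel_limit):
    """Rate at which disk data moves when compressed output feeds a channel.

    The tool cannot consume input faster than tool_in, and the channel
    cannot carry compressed output faster than channel_limit.
    """
    ratio = tool_in / tool_out  # bytes of input per byte of output
    return min(tool_in, channel_limit * ratio)


def gain_over_uncompressed(tool_in, tool_out, channel_limit):
    """Relative change compared to sending uncompressed data."""
    return effective_rate(tool_in, tool_out, channel_limit) / channel_limit - 1.0


for name, (rate_in, rate_out) in TOOLS.items():
    gain = gain_over_uncompressed(rate_in, rate_out, ENCRYPTION_LIMIT)
    print("%-8s %+.1f%%" % (name, 100 * gain))
```

In this model all three tools are limited by their own input rate rather than by the channel, which is why the default ``gzip`` setting comes out slower than sending the data uncompressed.
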
131 | 0cb22cf2 | Hrvoje Ribicic | With the test results present here, it is clear that ``lzop`` would be a very |
132 | 0cb22cf2 | Hrvoje Ribicic | worthwhile addition to the compression options present in Ganeti, yet the |
133 | 0cb22cf2 | Hrvoje Ribicic | problem is that it is not available by default on all distributions, contrary to |
134 | 0cb22cf2 | Hrvoje Ribicic | what the option's presence might imply. In general, Ganeti may know how to use several |
135 | 0cb22cf2 | Hrvoje Ribicic | tools, and check for their presence, but should add some way of at least hinting |
136 | 0cb22cf2 | Hrvoje Ribicic | at which tools are available. |
137 | 0cb22cf2 | Hrvoje Ribicic | |
138 | 0cb22cf2 | Hrvoje Ribicic | Additionally, the user might want to use a tool that Ganeti did not account for. |
139 | 0cb22cf2 | Hrvoje Ribicic | Allowing the tool to be named is also helpful, both for cases when multiple |
140 | 0cb22cf2 | Hrvoje Ribicic | custom tools are to be used, and for distinguishing between various tools in |
141 | 0cb22cf2 | Hrvoje Ribicic | case of e.g. inter-cluster moves. |
142 | 0cb22cf2 | Hrvoje Ribicic | |
143 | 0cb22cf2 | Hrvoje Ribicic | To this end, the ``--compression-tools`` cluster parameter will be added to |
144 | 0cb22cf2 | Hrvoje Ribicic | Ganeti. It contains a list of names of compression tools that can be supplied as |
145 | 0cb22cf2 | Hrvoje Ribicic | the parameter of ``--compress``, and by default it contains all the tools |
146 | 0cb22cf2 | Hrvoje Ribicic | Ganeti knows how to use. The user can change the list as desired, removing |
147 | 0cb22cf2 | Hrvoje Ribicic | entries that are not or should not be available on the cluster, and adding |
148 | 0cb22cf2 | Hrvoje Ribicic | custom tools. |
149 | 0cb22cf2 | Hrvoje Ribicic | |
150 | 0cb22cf2 | Hrvoje Ribicic | Every custom tool is identified by its name, and Ganeti expects the name to |
151 | 0cb22cf2 | Hrvoje Ribicic | correspond to a script invoking the compression tool. Without arguments, the |
152 | 0cb22cf2 | Hrvoje Ribicic | script compresses input on stdin, outputting it on stdout. With the ``-d`` argument, |
153 | 0cb22cf2 | Hrvoje Ribicic | the script does the same, only while decompressing. The ``-h`` argument is used to |
154 | 0cb22cf2 | Hrvoje Ribicic | check for the presence of the script, and in this case, only the exit code is |
155 | 0cb22cf2 | Hrvoje Ribicic | examined. This syntax matches the ``gzip`` syntax well, which should allow most |
156 | 0cb22cf2 | Hrvoje Ribicic | compression tools to be adapted to it easily. |
157 | 0cb22cf2 | Hrvoje Ribicic | |
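The calling convention described above can be illustrated with a minimal wrapper, written here in Python around the standard ``lzma`` module. The choice of ``lzma`` and all names in the sketch are assumptions made for the example; any tool that can be driven in the same way would do.

```python
import io
import lzma

CHUNK = 64 * 1024


def main(argv, stdin, stdout):
    """Implement the calling convention expected of a custom tool.

    argv[1:] may contain -h (presence check) or -d (decompress); with
    neither, stdin is compressed to stdout. Returns the exit code.
    """
    args = argv[1:]
    if "-h" in args:
        return 0  # only the exit code matters for the presence check
    if "-d" in args:
        codec = lzma.LZMADecompressor()
        for chunk in iter(lambda: stdin.read(CHUNK), b""):
            stdout.write(codec.decompress(chunk))
    else:
        codec = lzma.LZMACompressor()
        for chunk in iter(lambda: stdin.read(CHUNK), b""):
            stdout.write(codec.compress(chunk))
        stdout.write(codec.flush())
    return 0


def roundtrip(data):
    """Compress and decompress data in memory, as a self-check."""
    packed = io.BytesIO()
    main(["tool"], io.BytesIO(data), packed)
    unpacked = io.BytesIO()
    main(["tool", "-d"], io.BytesIO(packed.getvalue()), unpacked)
    return unpacked.getvalue()
```

Installed as an executable script, the entry point would be ``sys.exit(main(sys.argv, sys.stdin.buffer, sys.stdout.buffer))``, so that the binary streams are used and the return value becomes the exit code.
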
158 | 0cb22cf2 | Hrvoje Ribicic | Ganeti will not allow arbitrary parameters to be passed to a compression tool, |
159 | 0cb22cf2 | Hrvoje Ribicic | and will restrict the names to contain only a small but assuredly safe subset of |
160 | 0cb22cf2 | Hrvoje Ribicic | characters: letters, digits, dashes, and underscores. This minimizes the |
161 | 0cb22cf2 | Hrvoje Ribicic | risk of security issues that could arise from an attacker smuggling a malicious |
162 | 0cb22cf2 | Hrvoje Ribicic | command through RAPI. Common variations, like the speed/compression tradeoff of |
163 | 0cb22cf2 | Hrvoje Ribicic | ``gzip``, will be handled by aliases, e.g. ``gzip-fast`` or ``gzip-slow``. |
164 | 0cb22cf2 | Hrvoje Ribicic | |
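A sketch of the name check and alias handling, in Python, might look as follows. The alias names come from the text; the ``gzip`` levels attached to them, the function name and the shape of the ``allowed_tools`` argument (standing in for the ``--compression-tools`` list) are assumptions.

```python
import re

# Letters, digits, dashes and underscores only, as the text requires.
_TOOL_NAME = re.compile(r"[A-Za-z0-9_-]+")

# Hypothetical alias table: the names come from the text, the gzip levels
# attached to them are assumptions.
ALIASES = {
    "gzip-fast": ["gzip", "-1"],
    "gzip-slow": ["gzip", "-9"],
}


def compression_command(name, allowed_tools):
    """Return the argv used to compress with the named tool.

    Raises ValueError for malformed names or tools not in allowed_tools.
    """
    if not _TOOL_NAME.fullmatch(name):
        raise ValueError("invalid compression tool name: %r" % name)
    if name not in allowed_tools:
        raise ValueError("compression tool not enabled: %r" % name)
    return list(ALIASES.get(name, [name]))
```

Because the result is an argument list rather than a shell string, a name that slipped past the check would still not be interpreted by a shell.
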
165 | 0cb22cf2 | Hrvoje Ribicic | It should also be noted that for some purposes, e.g. the writing of OVF files, |
166 | 0cb22cf2 | Hrvoje Ribicic | ``gzip`` is the only allowed means of compression, and an appropriate error |
167 | 0cb22cf2 | Hrvoje Ribicic | message should be displayed if the user attempts to use one of the other |
168 | 0cb22cf2 | Hrvoje Ribicic | provided tools. |
169 | 0cb22cf2 | Hrvoje Ribicic | |
170 | 0cb22cf2 | Hrvoje Ribicic | Zeroing instance disks |
171 | 0cb22cf2 | Hrvoje Ribicic | ====================== |
172 | 0cb22cf2 | Hrvoje Ribicic | |
173 | 0cb22cf2 | Hrvoje Ribicic | While compression lowers the amount of data sent, further reductions can be |
174 | 0cb22cf2 | Hrvoje Ribicic | achieved by taking advantage of the structure of the disk - namely, sending only |
175 | 0cb22cf2 | Hrvoje Ribicic | used disk sectors. |
176 | 0cb22cf2 | Hrvoje Ribicic | |
177 | 0cb22cf2 | Hrvoje Ribicic | There is no direct way to achieve this, as it would require that the |
178 | 0cb22cf2 | Hrvoje Ribicic | move-instance tool is aware of the structure of the file system. Mounting the |
179 | 0cb22cf2 | Hrvoje Ribicic | filesystem is not an option, primarily due to security issues. A disk primed to |
180 | 0cb22cf2 | Hrvoje Ribicic | take advantage of a disk driver exploit could cause an attacker to breach |
181 | 0cb22cf2 | Hrvoje Ribicic | instance isolation and gain control of a Ganeti node. |
182 | 0cb22cf2 | Hrvoje Ribicic | |
183 | 0cb22cf2 | Hrvoje Ribicic | An indirect way for this performance gain to be achieved is the zeroing of any |
184 | 0cb22cf2 | Hrvoje Ribicic | hard disk space not in use. While this primarily means empty space, swap |
185 | 0cb22cf2 | Hrvoje Ribicic | partitions can be zeroed as well. |
186 | 0cb22cf2 | Hrvoje Ribicic | |
187 | 0cb22cf2 | Hrvoje Ribicic | Sequences of zeroes can be compressed and thus transferred very efficiently, all |
188 | 0cb22cf2 | Hrvoje Ribicic | without the host knowing that these are empty space. This approach can also be |
189 | 0cb22cf2 | Hrvoje Ribicic | dangerous if a sparse disk is zeroed in this way, causing ballooning. As Ganeti |
190 | 0cb22cf2 | Hrvoje Ribicic | does not seem to make special concessions for moving sparse disks, the only |
191 | 0cb22cf2 | Hrvoje Ribicic | difference should be the disk space utilization on the current node. |
192 | 0cb22cf2 | Hrvoje Ribicic | |
193 | 0cb22cf2 | Hrvoje Ribicic | Zeroing approaches |
194 | 0cb22cf2 | Hrvoje Ribicic | ++++++++++++++++++ |
195 | 0cb22cf2 | Hrvoje Ribicic | |
196 | 0cb22cf2 | Hrvoje Ribicic | Zeroing is a feasible approach, but the node cannot perform it as it cannot |
197 | 0cb22cf2 | Hrvoje Ribicic | mount the disk. Only virtualization-based options remain, and of those, using |
198 | 0cb22cf2 | Hrvoje Ribicic | Ganeti's own virtualization capabilities makes the most sense. There are two |
199 | 0cb22cf2 | Hrvoje Ribicic | ways of doing this - creating a new helper instance, temporary or persistent, or |
200 | 0cb22cf2 | Hrvoje Ribicic | reusing the target instance. |
201 | 0cb22cf2 | Hrvoje Ribicic | |
202 | 0cb22cf2 | Hrvoje Ribicic | Both approaches have their disadvantages. Creating a new helper instance |
203 | 0cb22cf2 | Hrvoje Ribicic | requires managing its lifecycle, taking special care to make sure no helper |
204 | 0cb22cf2 | Hrvoje Ribicic | instance remains left over due to a failed operation. Even if this were to be |
205 | 0cb22cf2 | Hrvoje Ribicic | taken care of, disks are not yet separate entities in Ganeti, making the |
206 | 0cb22cf2 | Hrvoje Ribicic | temporary transfer of disks between instances hard to implement and even harder |
207 | 0cb22cf2 | Hrvoje Ribicic | to make robust. The reuse can be done by modifying the OS running on the |
208 | 0cb22cf2 | Hrvoje Ribicic | instance to perform the zeroing itself when notified via the new instance |
209 | 0cb22cf2 | Hrvoje Ribicic | communication mechanism, but this approach is neither generic, nor particularly |
210 | 0cb22cf2 | Hrvoje Ribicic | safe. There is no guarantee that the zeroing operation will not interfere with |
211 | 0cb22cf2 | Hrvoje Ribicic | the normal operation of the instance, nor that it will be completed if a |
212 | 0cb22cf2 | Hrvoje Ribicic | user-initiated shutdown occurs. |
213 | 0cb22cf2 | Hrvoje Ribicic | |
214 | 0cb22cf2 | Hrvoje Ribicic | A better solution can be found by combining the two approaches - re-using the |
215 | 0cb22cf2 | Hrvoje Ribicic | virtualized environment, but with a specifically crafted OS image. With the |
216 | 0cb22cf2 | Hrvoje Ribicic | instance shut down as it should be in preparation for the move, it can be |
217 | 0cb22cf2 | Hrvoje Ribicic | extended with an additional disk with the OS image on it. By prepending the |
218 | 0cb22cf2 | Hrvoje Ribicic | disk and changing some instance parameters, the instance can boot from it. The |
219 | 0cb22cf2 | Hrvoje Ribicic | OS can be configured to perform the zeroing on startup, attempting to mount any |
220 | 0cb22cf2 | Hrvoje Ribicic | partitions with a filesystem present, and creating and deleting a zero-filled |
221 | 0cb22cf2 | Hrvoje Ribicic | file on them. After the zeroing is complete, the OS should shut down, and the |
222 | 0cb22cf2 | Hrvoje Ribicic | master should note the shutdown and restore the instance to its previous state. |
223 | 0cb22cf2 | Hrvoje Ribicic | |
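The zero-fill step that the helper OS performs on each mounted filesystem can be sketched in Python as below. The file name, the function name and the ``limit`` parameter (added so that the sketch can be exercised without filling a real disk) are assumptions and not part of the design.

```python
import errno
import os
import tempfile

FILL_NAME = "zero-fill.tmp"  # hypothetical file name


def zero_free_space(directory, chunk_size=1 << 20, limit=None):
    """Fill the filesystem holding directory with zeroes, then clean up.

    Writes a zero-filled file until the filesystem is full, or until
    limit bytes have been written if limit is given (useful for testing).
    Returns the number of bytes written.
    """
    path = os.path.join(directory, FILL_NAME)
    chunk = b"\0" * chunk_size
    written = 0
    try:
        # Unbuffered, so a full disk surfaces in write() and not in close().
        with open(path, "wb", buffering=0) as fill:
            while limit is None or written < limit:
                try:
                    written += fill.write(chunk)
                except OSError as err:
                    if err.errno == errno.ENOSPC:
                        break  # disk full: all free space is zeroed
                    raise
            os.fsync(fill.fileno())  # make sure the zeroes reach the disk
    finally:
        if os.path.exists(path):
            os.remove(path)
    return written


# Example run against a scratch directory, capped at three small chunks.
with tempfile.TemporaryDirectory() as scratch:
    written = zero_free_space(scratch, chunk_size=4096, limit=3 * 4096)
    leftovers = os.listdir(scratch)
```

The ``finally`` clause removes the file on ordinary errors and on interruption from the keyboard; a production script would also need explicit handlers for termination signals.
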
224 | 0cb22cf2 | Hrvoje Ribicic | Note that the requirements above are very similar to the notion of a helper VM |
225 | 0cb22cf2 | Hrvoje Ribicic | suggested in the OS install document. Some potentially unsafe actions are |
226 | 0cb22cf2 | Hrvoje Ribicic | performed within a virtualized environment, acting on disks that belong or will |
227 | 0cb22cf2 | Hrvoje Ribicic | belong to the instance. The mechanisms used will thus be developed with both |
228 | 0cb22cf2 | Hrvoje Ribicic | approaches in mind. |
229 | 0cb22cf2 | Hrvoje Ribicic | |
230 | 0cb22cf2 | Hrvoje Ribicic | Implementation |
231 | 0cb22cf2 | Hrvoje Ribicic | ++++++++++++++ |
232 | 0cb22cf2 | Hrvoje Ribicic | |
233 | 0cb22cf2 | Hrvoje Ribicic | There are two components to this solution - the Ganeti changes needed to boot |
234 | 0cb22cf2 | Hrvoje Ribicic | the OS, and the OS image used for the zeroing. Due to the variety of filesystems |
235 | 0cb22cf2 | Hrvoje Ribicic | and architectures that instances can use, no single ready-to-run disk image can |
236 | 0cb22cf2 | Hrvoje Ribicic | satisfy the needs of all the Ganeti users. Instead, the instance-debootstrap |
237 | 0cb22cf2 | Hrvoje Ribicic | scripts can be used to generate a zeroing-capable OS image. This might not be |
238 | 0cb22cf2 | Hrvoje Ribicic | ideal, as there are lightweight distributions that take up less space and boot |
239 | 0cb22cf2 | Hrvoje Ribicic | up more quickly. Generating those with the right set of drivers for the |
240 | 0cb22cf2 | Hrvoje Ribicic | virtualization platform of choice is not easy. Thus we do not provide a script |
241 | 0cb22cf2 | Hrvoje Ribicic | for this purpose, but the user is free to provide any OS image which performs |
242 | 0cb22cf2 | Hrvoje Ribicic | the necessary steps: zero out all virtualization-provided devices on startup, |
243 | 0cb22cf2 | Hrvoje Ribicic | then shut down immediately. The cluster-wide parameter controlling the image to be |
244 | 0cb22cf2 | Hrvoje Ribicic | used would be called ``--zeroing-image``. |
245 | 0cb22cf2 | Hrvoje Ribicic | |
246 | 0cb22cf2 | Hrvoje Ribicic | The modifications to Ganeti code needed are minor. The zeroing functionality |
247 | 0cb22cf2 | Hrvoje Ribicic | should be implemented as an extension of the instance export, and exposed as the |
248 | 0cb22cf2 | Hrvoje Ribicic | ``--zero-free-space`` option. Prior to beginning the export, the instance |
249 | 0cb22cf2 | Hrvoje Ribicic | configuration is temporarily extended with a new read-only disk of sufficient |
250 | 0cb22cf2 | Hrvoje Ribicic | size to host the zeroing image, and the changes necessary for the image to be |
251 | 0cb22cf2 | Hrvoje Ribicic | used as the boot drive. The temporary nature of the configuration changes |
252 | 0cb22cf2 | Hrvoje Ribicic | requires that they are not propagated to other nodes. While this would normally |
253 | 0cb22cf2 | Hrvoje Ribicic | not be feasible with an instance using a disk template offering multi-node |
254 | 0cb22cf2 | Hrvoje Ribicic | redundancy, experiments with the code have shown that the restriction on |
255 | 0cb22cf2 | Hrvoje Ribicic | diverse disk templates can be bypassed to temporarily allow a plain |
256 | 0cb22cf2 | Hrvoje Ribicic | disk-template disk to host the zeroing image. Given that one of the planned |
257 | 0cb22cf2 | Hrvoje Ribicic | changes in Ganeti is to have instance disks as separate entities, with no |
258 | 0cb22cf2 | Hrvoje Ribicic | restriction on templates, this assumption is useful rather than harmful by |
259 | 0cb22cf2 | Hrvoje Ribicic | asserting the desired behavior. The image is dumped to the disk, and the |
260 | 0cb22cf2 | Hrvoje Ribicic | instance is started up. |
261 | 0cb22cf2 | Hrvoje Ribicic | |
262 | 0cb22cf2 | Hrvoje Ribicic | Once the instance is started up, the zeroing will proceed until completion, when |
263 | 0cb22cf2 | Hrvoje Ribicic | a self-initiated shutdown will occur. The instance-shutdown detection |
264 | 0cb22cf2 | Hrvoje Ribicic | capabilities of 2.11 should prevent the watcher from restarting the instance |
265 | 0cb22cf2 | Hrvoje Ribicic | once this happens, allowing the host to take it as a sign the zeroing was |
266 | 0cb22cf2 | Hrvoje Ribicic | completed. Either way, the host waits until the instance is shut down, or a |
267 | 0cb22cf2 | Hrvoje Ribicic | timeout has been reached and the instance is forcibly shut down. As the time |
268 | 0cb22cf2 | Hrvoje Ribicic | needed to zero an instance is dependent on the size of the disk of the instance, |
269 | 0cb22cf2 | Hrvoje Ribicic | the user can provide a fixed and a per-size timeout, the latter recommended to be twice |
270 | 0cb22cf2 | Hrvoje Ribicic | the time the device hosting the instance needs to write that much data at its maximum write speed. |
271 | 0cb22cf2 | Hrvoje Ribicic | |
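The way the two timeouts combine can be written down as a short Python sketch. It assumes that the per-size timeout is expressed in seconds per MB and derived from the write speed with a safety factor of two; the function and parameter names are hypothetical.

```python
def per_mb_timeout(max_write_speed_mb_s, safety_factor=2.0):
    """Seconds allowed per MB of disk, given the device's write speed."""
    if max_write_speed_mb_s <= 0:
        raise ValueError("write speed must be positive")
    return safety_factor / max_write_speed_mb_s


def zeroing_timeout(disk_size_mb, fixed_timeout_s, per_mb_timeout_s):
    """Total time to wait before the instance is forcibly shut down."""
    return fixed_timeout_s + disk_size_mb * per_mb_timeout_s
```

For instance, ``zeroing_timeout(8192, 60, per_mb_timeout(100.0))`` gives the waiting time for an 8GB disk on a device writing 100MB/s, with one minute of fixed allowance for booting and shutting down.
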
272 | 0cb22cf2 | Hrvoje Ribicic | Better progress monitoring can be implemented with the instance-host |
273 | 0cb22cf2 | Hrvoje Ribicic | communication channel proposed by the OS install design document. The first |
274 | 0cb22cf2 | Hrvoje Ribicic | version will most likely use only the shutdown detection, and will be improved |
275 | 0cb22cf2 | Hrvoje Ribicic | to account for the available communication channel at a later time. |
276 | 0cb22cf2 | Hrvoje Ribicic | |
277 | 0cb22cf2 | Hrvoje Ribicic | After the shutdown, the temporary disk is destroyed and the instance |
278 | 0cb22cf2 | Hrvoje Ribicic | configuration is reverted to its original state. The very same action is done if |
279 | 0cb22cf2 | Hrvoje Ribicic | any error is encountered during the zeroing process. In the case that the |
280 | 0cb22cf2 | Hrvoje Ribicic | zeroing is interrupted while the zero-filled file is being written, the file may |
281 | 0cb22cf2 | Hrvoje Ribicic | remain on the disk of the instance. The script that performs the zeroing will be |
282 | 0cb22cf2 | Hrvoje Ribicic | made to react to system signals by deleting the zero-filled file, but there is |
283 | 0cb22cf2 | Hrvoje Ribicic | little else that can be done to recover. |
284 | 0cb22cf2 | Hrvoje Ribicic | |
285 | 0cb22cf2 | Hrvoje Ribicic | When to use zeroing |
286 | 0cb22cf2 | Hrvoje Ribicic | +++++++++++++++++++ |
287 | 0cb22cf2 | Hrvoje Ribicic | |
288 | 0cb22cf2 | Hrvoje Ribicic | The question of when it is useful to use zeroing is hard to answer because the |
289 | 0cb22cf2 | Hrvoje Ribicic | effectiveness of the approach depends on many factors. All compression tools |
290 | 0cb22cf2 | Hrvoje Ribicic | compress zeroes to almost nothing, but compressing them takes time. If the |
291 | 0cb22cf2 | Hrvoje Ribicic | time needed to compress zeroes were equal to zero, the approach would boil down |
292 | 0cb22cf2 | Hrvoje Ribicic | to whether it is faster to zero unused space out, performing writes to disk, or |
293 | 0cb22cf2 | Hrvoje Ribicic | to transfer it compressed. For the example used above, the average compression |
294 | 0cb22cf2 | Hrvoje Ribicic | ratio, and write speeds of current disk drives, zeroing would almost |
295 | 0cb22cf2 | Hrvoje Ribicic | always be the faster option. |
296 | 0cb22cf2 | Hrvoje Ribicic | |
297 | 0cb22cf2 | Hrvoje Ribicic | With a more realistic setup, where zeroes take time to compress, yet less time |
298 | 0cb22cf2 | Hrvoje Ribicic | than ordinary data, the gains depend on the previously mentioned tradeoff and |
299 | 0cb22cf2 | Hrvoje Ribicic | the free space available. Zeroing will definitely lessen the amount of bandwidth |
300 | 0cb22cf2 | Hrvoje Ribicic | used, but it can lead to the connection being underutilized due to the time |
301 | 0cb22cf2 | Hrvoje Ribicic | spent compressing data. It is up to the user to make these tradeoffs, but |
302 | 0cb22cf2 | Hrvoje Ribicic | zeroing should be seen primarily as a means of further reducing the amount of |
303 | 0cb22cf2 | Hrvoje Ribicic | data sent while increasing disk activity, with possible speed gains that should |
304 | 0cb22cf2 | Hrvoje Ribicic | not be relied upon. |
305 | 0cb22cf2 | Hrvoje Ribicic | |
306 | 0cb22cf2 | Hrvoje Ribicic | In the future, the VM created for zeroing could also undertake other tasks |
307 | 0cb22cf2 | Hrvoje Ribicic | related to the move, such as compression and encryption, and produce a stream |
308 | 0cb22cf2 | Hrvoje Ribicic | of data rather than just modifying the disk. This would lessen the strain on |
309 | 0cb22cf2 | Hrvoje Ribicic | the resources of the hypervisor, both disk I/O and CPU usage, and allow moves to |
310 | 0cb22cf2 | Hrvoje Ribicic | obey the resource constraints placed on the instance being moved. |
311 | 0cb22cf2 | Hrvoje Ribicic | |
312 | 0cb22cf2 | Hrvoje Ribicic | Lock reduction |
313 | 0cb22cf2 | Hrvoje Ribicic | ============== |
314 | 0cb22cf2 | Hrvoje Ribicic | |
315 | 0cb22cf2 | Hrvoje Ribicic | An instance move as executed by the move-instance tool consists of several |
316 | 0cb22cf2 | Hrvoje Ribicic | preparatory RAPI calls, leading up to two long-lasting opcodes: OpCreateInstance |
317 | 0cb22cf2 | Hrvoje Ribicic | and OpBackupExport. While OpBackupExport locks only the instance, the locks of |
318 | 0cb22cf2 | Hrvoje Ribicic | OpCreateInstance require more attention. |
319 | 0cb22cf2 | Hrvoje Ribicic | |
320 | 0cb22cf2 | Hrvoje Ribicic | When executed, this opcode attempts to lock all nodes on which the instance may |
321 | 0cb22cf2 | Hrvoje Ribicic | be created and obtain shared locks on the groups they belong to. In the case |
322 | 0cb22cf2 | Hrvoje Ribicic | that an IAllocator is used, this means all nodes must be locked. Any operation |
323 | 0cb22cf2 | Hrvoje Ribicic | that requires a node lock to be present can delay the move operation, and there |
324 | 0cb22cf2 | Hrvoje Ribicic | is no shortage of these. |
325 | 0cb22cf2 | Hrvoje Ribicic | |
326 | 0cb22cf2 | Hrvoje Ribicic | The concept of opportunistic locking has been introduced to remedy exactly this |
327 | 0cb22cf2 | Hrvoje Ribicic | situation, allowing the IAllocator to lock as many nodes as possible. Depending |
328 | 0cb22cf2 | Hrvoje Ribicic | on whether the allocation can be made on these nodes, the operation either proceeds |
329 | 0cb22cf2 | Hrvoje Ribicic | as expected, or fails noting that it is temporarily infeasible. The failure case |
330 | 0cb22cf2 | Hrvoje Ribicic | would change the semantics of the move-instance tool, which is expected to fail |
331 | 0cb22cf2 | Hrvoje Ribicic | only if the move is impossible. To reap the benefits of opportunistic locking |
332 | 0cb22cf2 | Hrvoje Ribicic | yet satisfy this constraint, the move-instance tool can be extended with the |
333 | 0cb22cf2 | Hrvoje Ribicic | ``--opportunistic-tries`` and ``--opportunistic-try-delay`` options. A number of |
334 | 0cb22cf2 | Hrvoje Ribicic | opportunistic instance creations are attempted, with a delay between attempts. |
335 | 0cb22cf2 | Hrvoje Ribicic | The delay is slightly altered every time to avoid timing issues. Should all |
336 | 0cb22cf2 | Hrvoje Ribicic | attempts fail, a normal instance creation is requested, which blocks until all |
337 | 0cb22cf2 | Hrvoje Ribicic | the locks can be acquired. |
338 | 0cb22cf2 | Hrvoje Ribicic | |
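The retry scheme described above can be sketched in Python as follows. The exception class, the callables passed in and the size of the random variation (up to 10% either way) are assumptions made for illustration; only the overall flow is taken from the text.

```python
import random
import time


class TemporarilyInfeasible(Exception):
    """Hypothetical error: not enough nodes could be locked right now."""


def create_instance(create_opportunistic, create_blocking, tries, delay,
                    sleep=time.sleep, jitter=random.random):
    """Try opportunistic creation a few times, then fall back to blocking.

    create_opportunistic and create_blocking are callables supplied by
    the caller; the former raises TemporarilyInfeasible when the
    allocation cannot be made on the nodes it managed to lock.
    """
    for attempt in range(tries):
        try:
            return create_opportunistic()
        except TemporarilyInfeasible:
            if attempt < tries - 1:
                # Vary the delay by up to +/-10% so that concurrent moves
                # do not retry in lockstep.
                sleep(delay * (0.9 + 0.2 * jitter()))
    # All opportunistic attempts failed: wait for the locks instead.
    return create_blocking()
```

Passing ``sleep`` and ``jitter`` as parameters keeps the function deterministic under test; with ``tries`` set to zero it degenerates to the current blocking behaviour.
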
339 | 0cb22cf2 | Hrvoje Ribicic | While it may seem excessive to grab so many node locks, the early release |
340 | 0cb22cf2 | Hrvoje Ribicic | mechanism is used to make the situation less dire, releasing all nodes that were |
341 | 0cb22cf2 | Hrvoje Ribicic | not chosen as candidates for allocation. This is taken to the extreme as all the |
342 | 0cb22cf2 | Hrvoje Ribicic | locks acquired are released prior to the start of the transfer, barring the |
343 | 0cb22cf2 | Hrvoje Ribicic | newly-acquired lock over the new instance. This works because all operations |
344 | 0cb22cf2 | Hrvoje Ribicic | that alter the node in a way which could affect the transfer: |
345 | 0cb22cf2 | Hrvoje Ribicic | |
346 | 0cb22cf2 | Hrvoje Ribicic | * are prevented by the instance lock or instance presence, e.g. ``gnt-node remove``, |
347 | 0cb22cf2 | Hrvoje Ribicic | ``gnt-node evacuate``, |
348 | 0cb22cf2 | Hrvoje Ribicic | |
349 | 0cb22cf2 | Hrvoje Ribicic | * do not interrupt the transfer, e.g. a PV on the node can be set as |
350 | 0cb22cf2 | Hrvoje Ribicic | unallocatable, and the transfer still proceeds as expected, |
351 | 0cb22cf2 | Hrvoje Ribicic | |
352 | 0cb22cf2 | Hrvoje Ribicic | * do not care, e.g. a ``gnt-node powercycle`` explicitly ignores all locks. |
353 | 0cb22cf2 | Hrvoje Ribicic | |
354 | 0cb22cf2 | Hrvoje Ribicic | This invariant should be kept in mind, and perhaps verified through tests. |
355 | 0cb22cf2 | Hrvoje Ribicic | |
356 | 0cb22cf2 | Hrvoje Ribicic | All in all, there is very little space to reduce the number of locks used, and |
357 | 0cb22cf2 | Hrvoje Ribicic | the only improvement that can be made is introducing opportunistic locking as an |
358 | 0cb22cf2 | Hrvoje Ribicic | option of move-instance. |
359 | 0cb22cf2 | Hrvoje Ribicic | |
360 | 0cb22cf2 | Hrvoje Ribicic | Introduction of changes |
361 | 0cb22cf2 | Hrvoje Ribicic | ======================= |
362 | 0cb22cf2 | Hrvoje Ribicic | |
363 | 0cb22cf2 | Hrvoje Ribicic | All the changes noted will be implemented in Ganeti 2.12, in the way described |
364 | 0cb22cf2 | Hrvoje Ribicic | in the previous sections. They will be implemented as separate changes, first |
365 | 0cb22cf2 | Hrvoje Ribicic | the lock reduction, then the instance zeroing, then the compression |
366 | 0cb22cf2 | Hrvoje Ribicic | improvements, and finally the encryption changes. |