========================================
Instance move improvements
========================================

.. contents:: :depth: 3

Ganeti provides tools for moving instances within and between clusters. Through
special export and import calls, a new instance is created with the disk data of
the existing one.

The tools work correctly and reliably, but depending on bandwidth and priority,
an instance disk of considerable size requires a long time to transfer. The
length of the transfer is inconvenient at best, and the problem only becomes
worse if excessive locking causes a move operation to be delayed for a longer
period of time, or to block other operations.

The performance of moves is a complex topic, with available bandwidth,
compression, and encryption all being candidates for choke points that bog down
a transfer. Depending on the environment a move is performed in, tuning these
can have significant performance benefits, but Ganeti does not expose many
options needed for such tuning. The details of what to expose and what tradeoffs
can be made will be presented in this document.

Apart from existing functionality, some beneficial features can be introduced to
help with instance moves. Zeroing empty space on instance disks can drastically
improve the effectiveness of compression, so that unused disk space effectively
does not need to be transferred during moves. Compression itself can be improved
by using different tools. The encryption used can be weakened or eliminated for
certain moves. Using opportunistic locking during instance moves results in
greater parallelization. As these approaches tackle different aspects of the
problem, they do not exclude each other and will be presented independently.

The performance of Ganeti moves
===============================

In the current implementation, there are three possible factors limiting the
speed of an instance move. The first is the network bandwidth, which Ganeti can
exploit better by using compression. The second is the encryption, which is
obligatory, and which can throttle an otherwise fast connection. The third,
surprisingly, is the compression, which can cause the connection to be
underutilized.

Example 1: some numbers observed during an intra-cluster instance move:

* Network bandwidth: 105MB/s, courtesy of a gigabit switch

* Encryption performance: 40MB/s, provided by OpenSSL

* Compression performance: 22.3MB/s input, 7.1MB/s gzip compressed output

As can be seen in this example, the obligatory encryption results in 62% of
available bandwidth being wasted, while using compression further lowers the
throughput to roughly 56% of what the encryption would allow. The following
sections will talk about these numbers in more detail, and suggest improvements
and best practices.
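
The percentages quoted for Example 1 can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic: the rates are the ones listed
in the example, while the function names are made up for illustration and are
not part of Ganeti.

```python
# Rates taken from Example 1, in MB/s.
NETWORK = 105.0
ENCRYPTION = 40.0
GZIP_INPUT = 22.3


def wasted_share(available, achieved):
    """Fraction of the `available` rate left unused when running at `achieved`."""
    return 1.0 - achieved / available


def bottleneck(stages):
    """Return the (name, rate) pair of the slowest stage."""
    return min(stages.items(), key=lambda item: item[1])


stages = {"network": NETWORK, "encryption": ENCRYPTION, "gzip input": GZIP_INPUT}
print("bandwidth wasted by encryption: %.0f%%"
      % (100 * wasted_share(NETWORK, ENCRYPTION)))
print("gzip input relative to encryption: %.0f%%"
      % (100 * GZIP_INPUT / ENCRYPTION))
print("slowest stage: %s at %.1fMB/s" % bottleneck(stages))
```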

Encryption and Ganeti security
++++++++++++++++++++++++++++++

Turning compression and encryption off would allow for an immediate improvement,
and while that is possible for compression, there are good reasons why
encryption is currently not a feature a user can disable.

While it is impossible to secure instance data if an attacker gains SSH access
to a node, the RAPI was designed to never allow user data to be accessed through
it, even if it is compromised. If moves could be performed unencrypted, this
property would be broken. Instance moves can take place in environments which
may be hostile, and where unencrypted traffic could be intercepted. As they can
be instigated through the RAPI, an attacker could access all data on all
instances in a cluster by moving them unencrypted and intercepting the data in
flight. This is one of the few situations where the current speed of instance
moves could be considered a perk.

The performance of encryption can be increased by either using a less secure
form of encryption, including no encryption, or using a faster encryption
algorithm. The example listed above utilizes AES-256, one of the few ciphers
that Ganeti deems secure enough to use. AES-128, also allowed by Ganeti's
current settings, is weaker but 46% faster. A cipher that is not allowed due to
its flaws, such as RC4, could offer a 208% increase in speed. On the other hand,
using an OS capable of utilizing the AES-NI instruction set present on modern
processors can double the performance of AES, making it the best tradeoff
between security and performance.

Ganeti cannot and should not detect all the factors listed above, but should
rather give its users some leeway in what to choose. A precedent already exists,
as intra-cluster DRBD replication is already performed unencrypted, albeit on a
separate VLAN. For intra-cluster moves, Ganeti should allow its users to set
OpenSSL ciphers at will, while still enforcing high-security settings for moves
between clusters.

Thus, two settings will be introduced:

* a cluster-level setting called ``--allow-cipher-bypassing``, a boolean that
  cannot be set over RAPI

* a gnt-instance move setting called ``--ciphers-to-use``, replacing the default
  cipher list with the given ciphers, filtered to ensure that no other OpenSSL
  options are passed in along with them

This change will serve to address the issues with moving non-redundant instances
within the cluster, while keeping Ganeti security at its current level.
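
One way the filtering of ``--ciphers-to-use`` could work is sketched below. It
is an illustration under stated assumptions rather than the planned
implementation: the function name ``filter_ciphers`` is hypothetical, the value
is assumed to be a colon-separated list of cipher names, and a name is assumed
to consist only of letters, digits, and dashes, and never to start with a dash.

```python
import re

# Assumption: a cipher name is made of letters, digits and dashes and does not
# start with a dash, so it can never be read as a command-line option.
_CIPHER_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def filter_ciphers(value):
    """Return `value` unchanged if every colon-separated entry looks like a
    plain cipher name; raise ValueError otherwise."""
    names = value.split(":")
    for name in names:
        if _CIPHER_NAME.fullmatch(name) is None:
            raise ValueError("Invalid cipher name: %r" % name)
    return ":".join(names)
```

With these rules, a value such as ``AES128-SHA:AES256-SHA`` is accepted, while
anything containing whitespace, a leading dash, or shell metacharacters is
rejected before it gets anywhere near an OpenSSL invocation.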

Compression
+++++++++++

Support for disk compression during instance moves was partially present before,
but cleaned up and unified under the ``--compress`` option only as of Ganeti
2.11. The only option offered by Ganeti is gzip with no options passed to it,
resulting in a good compression ratio, but bad compression speed.

As compression can affect the speed of instance moves significantly, it is
worthwhile to explore alternatives. To test compression tool performance, an 8GB
drive filled with data matching the expected usage patterns (taken from a
workstation) was compressed by using various tools with various settings. The
two top performers were ``lzop`` and, surprisingly, ``gzip``. The improvement in
the performance of ``gzip`` was obtained by explicitly optimizing for speed
rather than compression.

* ``gzip -6``: 22.3MB/s in, 7.1MB/s out
* ``gzip -1``: 44.1MB/s in, 15.1MB/s out
* ``lzop``: 71.9MB/s in, 28.1MB/s out

If encryption is the limiting factor and, as in the example, limits the
bandwidth to 40MB/s, ``lzop`` allows for an effective increase of roughly 80% in
transfer speed (71.9MB/s of disk data against 40MB/s). The fast ``gzip`` would
also prove to be beneficial, but much less than ``lzop``. It should also be
noted that, as a rule of thumb, tools with a lower compression ratio had a
lesser workload, with ``lzop`` straining the CPU much less than any of the
competitors.
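
The comparison against the 40MB/s encryption limit can be sketched as follows.
The model is an assumption made for illustration: compression and encryption
are treated as pipeline stages, an uncompressed transfer is taken to run at the
encryption limit, and a compressor is slowed down only if its output exceeds
that limit. The measured rates are the ones listed above.

```python
ENCRYPTION_CAP = 40.0  # MB/s, the limit from Example 1

# (input rate, output rate) in MB/s, as measured above.
TOOLS = {
    "gzip -6": (22.3, 7.1),
    "gzip -1": (44.1, 15.1),
    "lzop": (71.9, 28.1),
}


def effective_rate(in_rate, out_rate, cap):
    """Disk data moved per second when the compressed stream is capped at `cap`."""
    if out_rate <= cap:
        return in_rate               # the compressor itself is the limit
    return in_rate * cap / out_rate  # the capped stream throttles the compressor


for name, (in_rate, out_rate) in sorted(TOOLS.items()):
    rate = effective_rate(in_rate, out_rate, ENCRYPTION_CAP)
    change = 100 * (rate / ENCRYPTION_CAP - 1)
    print("%-8s %5.1fMB/s (%+.0f%% against uncompressed)" % (name, rate, change))
```

Under these assumptions none of the three tools produces output faster than the
encryption can absorb, so the input rate of each tool is also its effective
transfer rate.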

With the test results present here, it is clear that ``lzop`` would be a very
worthwhile addition to the compression options present in Ganeti. The problem
is that it is not available by default on all distributions, although the
presence of the option might suggest otherwise. In general, Ganeti may know how
to use several tools, and check for their presence, but should add some way of
at least hinting at which tools are available.

Additionally, the user might want to use a tool that Ganeti did not account for.
Allowing the tool to be named is also helpful, both for cases when multiple
custom tools are to be used, and for distinguishing between various tools in
case of e.g. inter-cluster moves.

To this end, the ``--compression-tools`` cluster parameter will be added to
Ganeti. It contains a list of names of compression tools that can be supplied as
the parameter of ``--compress``, and by default it contains all the tools
Ganeti knows how to use. The user can change the list as desired, removing
entries that are not or should not be available on the cluster, and adding
custom tools.

Every custom tool is identified by its name, and Ganeti expects the name to
correspond to a script invoking the compression tool. Without arguments, the
script compresses input on stdin, outputting it on stdout. With the ``-d``
argument, the script does the same, only while decompressing. The ``-h``
argument is used to check for the presence of the script, and in this case, only
the exit code is examined. This syntax matches the ``gzip`` syntax well, which
should allow most compression tools to be adapted to it easily.
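
A custom tool script obeying this calling convention could look roughly like
the sketch below. It is written in Python and uses the standard ``lzma`` module
purely as a stand-in compressor; the structure of ``main`` and the exit code
for unknown arguments are assumptions, not something Ganeti prescribes.

```python
import lzma
import shutil
import sys


def main(argv, stdin, stdout):
    """Implement the calling convention described above on binary streams."""
    args = argv[1:]
    if args == ["-h"]:
        # Presence check: only the exit code matters.
        return 0
    if args == ["-d"]:
        # Decompress stdin to stdout.
        with lzma.open(stdin, "rb") as source:
            shutil.copyfileobj(source, stdout)
        return 0
    if not args:
        # Compress stdin to stdout.
        with lzma.open(stdout, "wb") as sink:
            shutil.copyfileobj(stdin, sink)
        return 0
    return 2  # unknown arguments (exit code chosen arbitrarily)

# A real script would end with:
#   sys.exit(main(sys.argv, sys.stdin.buffer, sys.stdout.buffer))
```

Keeping the logic in a function that takes the streams as parameters makes the
script easy to exercise without spawning a process.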

Ganeti will not allow arbitrary parameters to be passed to a compression tool,
and will restrict the names to a small but assuredly safe subset of characters:
alphanumeric characters, dashes, and underscores. This minimizes the risk of
security issues that could arise from an attacker smuggling a malicious command
through RAPI. Common variations, like the speed/compression tradeoff of
``gzip``, will be handled by aliases, e.g. ``gzip-fast`` or ``gzip-slow``.
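
The restriction on names amounts to a simple whitelist check. A minimal sketch,
with a hypothetical function name:

```python
import re

# Only alphanumeric characters, dashes and underscores are accepted.
_TOOL_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_tool_name(name):
    """Return True if `name` contains nothing but whitelisted characters."""
    return _TOOL_NAME.fullmatch(name) is not None
```

Names such as ``gzip-fast`` or ``lzop`` pass the check, whereas anything
containing spaces, slashes, or shell metacharacters does not.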

It should also be noted that for some purposes, e.g. the writing of OVF files,
``gzip`` is the only allowed means of compression, and an appropriate error
message should be displayed if the user attempts to use one of the other
provided tools.

Zeroing instance disks
======================

While compression lowers the amount of data sent, further reductions can be
achieved by taking advantage of the structure of the disk - namely, sending only
used disk sectors.

There is no direct way to achieve this, as it would require that the
move-instance tool is aware of the structure of the file system. Mounting the
filesystem is not an option, primarily due to security issues. A disk primed to
take advantage of a disk driver exploit could allow an attacker to breach
instance isolation and gain control of a Ganeti node.

An indirect way for this performance gain to be achieved is the zeroing of any
hard disk space not in use. While this primarily means empty space, swap
partitions can be zeroed as well.

Sequences of zeroes can be compressed and thus transferred very efficiently, all
without the host knowing that these are empty space. However, this approach can
be dangerous if a sparse disk is zeroed in this way, as it causes the disk to
balloon. As Ganeti does not seem to make special concessions for moving sparse
disks, the only difference should be the disk space utilization on the current
node.

Zeroing approaches
++++++++++++++++++

Zeroing is a feasible approach, but the node cannot perform it as it cannot
mount the disk. Only virtualization-based options remain, and of those, using
Ganeti's own virtualization capabilities makes the most sense. There are two
ways of doing this - creating a new helper instance, temporary or persistent, or
reusing the target instance.

Both approaches have their disadvantages. Creating a new helper instance
requires managing its lifecycle, taking special care to make sure no helper
instance remains left over due to a failed operation. Even if this were to be
taken care of, disks are not yet separate entities in Ganeti, making the
temporary transfer of disks between instances hard to implement and even harder
to make robust. Reusing the target instance can be done by modifying the OS
running on the instance to perform the zeroing itself when notified via the new
instance communication mechanism, but this approach is neither generic, nor
particularly safe. There is no guarantee that the zeroing operation will not
interfere with the normal operation of the instance, nor that it will be
completed if a user-initiated shutdown occurs.

A better solution can be found by combining the two approaches - re-using the
virtualized environment, but with a specifically crafted OS image. With the
instance shut down as it should be in preparation for the move, it can be
extended with an additional disk with the OS image on it. By prepending the
disk and changing some instance parameters, the instance can boot from it. The
OS can be configured to perform the zeroing on startup, attempting to mount any
partitions with a filesystem present, and creating and deleting a zero-filled
file on them. After the zeroing is complete, the OS should shut down, and the
master should note the shutdown and restore the instance to its previous state.

Note that the requirements above are very similar to the notion of a helper VM
suggested in the OS install document. Some potentially unsafe actions are
performed within a virtualized environment, acting on disks that belong or will
belong to the instance. The mechanisms used will thus be developed with both
approaches in mind.

Implementation
++++++++++++++

There are two components to this solution - the Ganeti changes needed to boot
the OS, and the OS image used for the zeroing. Due to the variety of filesystems
and architectures that instances can use, no single ready-to-run disk image can
satisfy the needs of all the Ganeti users. Instead, the instance-debootstrap
scripts can be used to generate a zeroing-capable OS image. This might not be
ideal, as there are lightweight distributions that take up less space and boot
up more quickly. Generating those with the right set of drivers for the
virtualization platform of choice is not easy. Thus we do not provide a script
for this purpose, but the user is free to provide any OS image which performs
the necessary steps: zero out all virtualization-provided devices on startup,
and shut down immediately afterwards. The cluster-wide parameter controlling the
image to be used would be called ``--zeroing-image``.

The modifications to Ganeti code needed are minor. The zeroing functionality
should be implemented as an extension of the instance export, and exposed as the
``--zero-free-space`` option. Prior to beginning the export, the instance
configuration is temporarily extended with a new read-only disk of sufficient
size to host the zeroing image, and the changes necessary for the image to be
used as the boot drive. The temporary nature of the configuration changes
requires that they are not propagated to other nodes. While this would normally
not be feasible with an instance using a disk template offering multi-node
redundancy, experiments with the code have shown that the restriction on
diverse disk templates can be bypassed to temporarily allow a plain
disk-template disk to host the zeroing image. Given that one of the planned
changes in Ganeti is to have instance disks as separate entities, with no
restriction on templates, this assumption is useful rather than harmful by
asserting the desired behavior. The image is dumped to the disk, and the
instance is started up.

Once the instance is started up, the zeroing will proceed until completion, when
a self-initiated shutdown will occur. The instance-shutdown detection
capabilities of 2.11 should prevent the watcher from restarting the instance
once this happens, allowing the host to take it as a sign the zeroing was
completed. Either way, the host waits until the instance is shut down, or a
timeout has been reached and the instance is forcibly shut down. As the time
needed to zero an instance is dependent on the size of the disk of the instance,
the user can provide a fixed and a per-size timeout. The per-size timeout is
recommended to be twice the time that the device hosting the instance needs to
write that amount of data at its maximum write speed.
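
How the two timeouts might combine is sketched below. The text does not spell
out the formula, so the sketch assumes that the fixed part and the per-size
part are simply added, and it reads the recommendation as "twice the time
needed to write the data at the maximum write speed". All names and units are
hypothetical.

```python
def recommended_per_mb_timeout(max_write_speed_mb_s):
    """Seconds allowed per MB: twice the time needed to write one MB at the
    maximum write speed of the device (an interpretation, see above)."""
    return 2.0 / max_write_speed_mb_s


def zeroing_timeout(disk_size_mb, fixed_timeout_s, per_mb_timeout_s):
    """Total time in seconds to wait before forcibly shutting the instance down.

    Assumes the fixed and the per-size timeouts are simply added together.
    """
    return fixed_timeout_s + per_mb_timeout_s * disk_size_mb
```

For an 8GB (8192MB) disk on a device writing 100MB/s, a fixed timeout of 60
seconds gives ``zeroing_timeout(8192, 60, recommended_per_mb_timeout(100.0))``,
i.e. 60 seconds plus twice the roughly 82 seconds needed to write the whole
disk once.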

Better progress monitoring can be implemented with the instance-host
communication channel proposed by the OS install design document. The first
version will most likely use only the shutdown detection, and will be improved
to account for the available communication channel at a later time.

After the shutdown, the temporary disk is destroyed and the instance
configuration is reverted to its original state. The very same action is done if
any error is encountered during the zeroing process. In the case that the
zeroing is interrupted while the zero-filled file is being written, the file may
remain on the disk of the instance. The script that performs the zeroing will be
made to react to system signals by deleting the zero-filled file, but there is
little else that can be done to recover.
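
The behaviour expected of the zeroing script inside the helper OS can be
sketched as follows: fill the free space of a mounted filesystem with a
zero-filled file, delete it afterwards, and also delete it when a termination
signal arrives. This is an illustration only; the file name, the chunk size,
and the ``max_bytes`` parameter (present so the sketch can be tried without
filling a filesystem) are assumptions.

```python
import errno
import os
import signal

CHUNK = b"\0" * (1024 * 1024)  # zeroes are written in 1 MiB pieces


def _abort(signum, frame):
    # Turn the signal into an exception so that the cleanup code below runs.
    raise SystemExit(1)


def zero_free_space(mount_point, filename="zero.fill", max_bytes=None):
    """Fill free space under `mount_point` with zeroes, then release it again.

    Returns the number of bytes written. `filename` and `max_bytes` are
    hypothetical parameters used for illustration.
    """
    path = os.path.join(mount_point, filename)
    signals = (signal.SIGTERM, signal.SIGINT)
    previous = [signal.signal(sig, _abort) for sig in signals]
    written = 0
    try:
        # Unbuffered, so that a full filesystem is reported by write() itself.
        with open(path, "wb", buffering=0) as handle:
            try:
                while max_bytes is None or written < max_bytes:
                    written += handle.write(CHUNK)
                os.fsync(handle.fileno())
            except OSError as err:
                # Running out of space is the expected way for the loop to end.
                if err.errno != errno.ENOSPC:
                    raise
    finally:
        # Runs on success, on errors, and on the signals handled above.
        if os.path.exists(path):
            os.unlink(path)
        for sig, handler in zip(signals, previous):
            signal.signal(sig, handler)
    return written
```

A signal that cannot be caught, or a forced power-off of the instance, still
leaves the file behind, which matches the limitation noted above.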

When to use zeroing
+++++++++++++++++++

The question of when it is useful to use zeroing is hard to answer because the
effectiveness of the approach depends on many factors. All compression tools
compress zeroes to almost nothing, but compressing them takes time. If the time
needed to compress zeroes were equal to zero, the question would boil down to
whether it is faster to zero unused space out, performing writes to disk, or to
transfer it compressed. For the example used above, the average compression
ratio, and write speeds of current disk drives, zeroing would almost always be
the faster option.

With a more realistic setup, where zeroes take time to compress, yet less time
than ordinary data, the gains depend on the previously mentioned tradeoff and
the free space available. Zeroing will definitely lessen the amount of bandwidth
used, but it can lead to the connection being underutilized due to the time
spent compressing data. It is up to the user to make these tradeoffs, but
zeroing should be seen primarily as a means of further reducing the amount of
data sent while increasing disk activity, with possible speed gains that should
not be relied upon.

In the future, the VM created for zeroing could also undertake other tasks
related to the move, such as compression and encryption, and produce a stream
of data rather than just modifying the disk. This would lessen the strain on
the resources of the hypervisor, both disk I/O and CPU usage, and allow moves to
obey the resource constraints placed on the instance being moved.

Lock reduction
==============

An instance move as executed by the move-instance tool consists of several
preparatory RAPI calls, leading up to two long-lasting opcodes: OpCreateInstance
and OpBackupExport. While OpBackupExport locks only the instance, the locks of
OpCreateInstance require more attention.

When executed, this opcode attempts to lock all nodes on which the instance may
be created and obtain shared locks on the groups they belong to. In the case
that an IAllocator is used, this means all nodes must be locked. Any operation
that requires a node lock to be present can delay the move operation, and there
is no shortage of these.

The concept of opportunistic locking has been introduced to remedy exactly this
situation, allowing the IAllocator to lock as many nodes as possible. Depending
on whether the allocation can be made on these nodes, the operation either
proceeds as expected, or fails noting that it is temporarily infeasible. The
failure case would change the semantics of the move-instance tool, which is
expected to fail only if the move is impossible. To yield the benefits of
opportunistic locking yet satisfy this constraint, the move-instance tool can be
extended with the ``--opportunistic-tries`` and ``--opportunistic-try-delay``
options. A number of opportunistic instance creations are attempted, with a
delay between attempts. The delay is slightly altered every time to avoid timing
issues. Should all attempts fail, a normal instance creation is requested, which
blocks until all the locks can be acquired.
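
The retry behaviour described for ``--opportunistic-tries`` and
``--opportunistic-try-delay`` can be sketched as below. The exception class,
the ``create`` callable and its ``opportunistic`` argument, and the size of the
random variation are all hypothetical stand-ins for whatever move-instance
ends up using.

```python
import random
import time


class TemporarilyInfeasible(Exception):
    """Hypothetical error: the nodes that could be locked cannot host the instance."""


def create_with_opportunistic_tries(create, tries, delay,
                                    sleep=time.sleep, jitter=0.1):
    """Try `tries` opportunistic creations, then fall back to a blocking one.

    `create` is a callable accepting an `opportunistic` keyword argument.
    `delay` is the base delay in seconds between attempts; every wait is
    varied by up to +/- `jitter` (a fraction of `delay`).
    """
    for _ in range(tries):
        try:
            return create(opportunistic=True)
        except TemporarilyInfeasible:
            sleep(delay * (1.0 + random.uniform(-jitter, jitter)))
    # All opportunistic attempts failed: request a normal creation, which
    # blocks until every lock can be acquired.
    return create(opportunistic=False)
```

Passing the ``sleep`` function in as a parameter keeps the waiting out of tests
and makes the sequence of attempts easy to check.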

While it may seem excessive to grab so many node locks, the early release
mechanism is used to make the situation less dire, releasing all nodes that were
not chosen as candidates for allocation. This is taken to the extreme as all the
locks acquired are released prior to the start of the transfer, barring the
newly-acquired lock over the new instance. This works because all operations
that alter the node in a way which could affect the transfer:

* are prevented by the instance lock or instance presence, e.g. gnt-node remove,
  gnt-node evacuate,

* do not interrupt the transfer, e.g. a PV on the node can be set as
  unallocatable, and the transfer still proceeds as expected,

* do not care, e.g. a gnt-node powercycle explicitly ignores all locks.

This invariant should be kept in mind, and perhaps verified through tests.

All in all, there is very little room to reduce the number of locks used, and
the only improvement that can be made is introducing opportunistic locking as an
option of move-instance.

Introduction of changes
=======================

All the changes noted will be implemented in Ganeti 2.12, in the way described
in the previous chapters. They will be implemented as separate changes, first
the lock reduction, then the instance zeroing, then the compression
improvements, and finally the encryption changes.