(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>

An *exhaustive* paper (2010) shows additional performance details
linked on the QEMU wiki above.

Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of pc.ram
* Error handling
* TODO

Introduction:
=============

RDMA helps make your migration more deterministic under heavy load because
of the significantly lower latency and higher throughput over TCP/IP. This is
because the RDMA I/O architecture reduces the number of interrupts and
data copies by bypassing the host networking stack. In particular, a TCP-based
migration, under certain types of memory-bound workloads, may take a more
unpredictable amount of time to complete the migration if the amount of
memory tracked during each live migration iteration round cannot keep pace
with the rate of dirty memory produced by the workload.

RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
over Converged Ethernet) and Infiniband-based. This implementation of
migration using RDMA is capable of using both technologies because of
the use of the OpenFabrics OFED software stack that abstracts out the
programming model irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor for
an understanding of how to verify that you have the OFED software stack
installed in your environment. You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for a working build of QEMU to run successfully using RDMA Migration.

BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.

Experimental: Next, decide if you want dynamic page registration.
For example, if you have an 8GB RAM virtual machine, but only 1GB
is in active use, then disabling dynamic registration (pinning all
memory up front) will cause all 8GB to be pinned and resident in memory.
This feature mostly affects the bulk-phase round of the migration and
can be enabled for extremely high-performance RDMA hardware using the
following command:

QEMU Monitor Command:
$ migrate_set_capability x-rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning *all*
of the memory of your virtual machine in the kernel is very expensive
and may extend the initial bulk iteration time by many seconds,
thus extending the total migration time. However, this will not
affect the determinism or predictability of your migration; you will
still gain from the benefits of advanced pinning with RDMA.

RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming x-rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d x-rdma:host:port

PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA:
Using a 40gbps infiniband link performing a worst-case stress test,
using an 8GB RAM virtual machine:

Using the following command:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, in the same 8GB RAM example with all 8GB of memory in
active use and the VM itself completely idle, using the same 40 gbps
infiniband link:

1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, even without this feature, all of
the memory will have already been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.

RDMA Protocol Description:
==========================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for pc.ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt):
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)

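For illustration only, the listen/connect/accept handshake in steps 3-5
is typically driven through librdmacm roughly as follows. This is a
simplified sketch of the receiver (destination) side, not the actual QEMU
code: the names are invented, QP creation is elided, and error handling
is omitted.

    /* Illustrative receiver-side connection setup using librdmacm. */
    #include <stdint.h>
    #include <netinet/in.h>
    #include <rdma/rdma_cma.h>

    static int rdma_dest_listen_accept(uint16_t port)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *listen_id;
        struct rdma_cm_event *ev;
        struct rdma_conn_param param = { .responder_resources = 2 };
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port   = htons(port) };

        rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);
        rdma_bind_addr(listen_id, (struct sockaddr *)&addr);
        rdma_listen(listen_id, 1);                  /* step 3: listen()         */

        rdma_get_cm_event(ec, &ev);                 /* wait for CONNECT_REQUEST */
        struct rdma_cm_id *child = ev->id;          /* id of the new connection */
        rdma_ack_cm_event(ev);

        /* A QP and the two RQ work requests (step 2) would be created here. */

        return rdma_accept(child, &param);          /* step 5: accept()         */
    }

The sender side mirrors this with rdma_resolve_addr()/rdma_resolve_route()
followed by rdma_connect().
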
At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together are transmitted
as a single SEND message).

Header:
* Length (of the data portion, uint32, network byte order)
* Type   (what command to perform, uint32, network byte order)
* Repeat (Number of commands in data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself
so that the protocol is compatible against multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol
check this field and register all requests found in the array of commands
located in the data portion and return an equal number of results in the
response. The maximum number of repeats is hard-coded to 4096. This is a
conservative limit based on the maximum size of a SEND message along with
empirical observations on the maximum future benefit of simultaneous page
registrations.

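Written as code, this header might look like the following packed struct.
The struct and field names here are invented for illustration; only the
three uint32 fields and their network byte order come from the description
above.

    #include <stdint.h>
    #include <arpa/inet.h>   /* htonl() / ntohl() */

    /* Hypothetical layout of the control-message header described above. */
    typedef struct __attribute__((packed)) {
        uint32_t len;     /* Length of the data portion, in bytes      */
        uint32_t type;    /* Which command to perform (see list below) */
        uint32_t repeat;  /* Number of same-type commands in the data  */
    } RDMAControlHeaderExample;

    /* Each field is converted with htonl() before the SEND is posted and
     * with ntohl() after it is received. */
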
The 'type' field has 12 different command values:
 1. Unused
 2. Error               (sent to the source during bad things)
 3. Ready               (control-channel is available)
 4. QEMU File           (for sending non-live device state)
 5. RAM Blocks request  (used right after connection setup)
 6. RAM Blocks result   (used right after connection setup)
 7. Compress page       (zap zero page and skip registration)
 8. Register request    (dynamic chunk registration)
 9. Register result     ('rkey' to be used by sender)
10. Register finished   (registration for current iteration finished)
11. Unregister request  (unpin previously registered memory)
12. Unregister finished (confirmation that unpin completed)

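Purely as an illustration, the command values could be given symbolic names
such as these. The identifiers are invented; the comments only mirror the
positions in the list above, and the actual on-the-wire numeric values are
defined by the implementation.

    typedef enum {
        RDMA_CONTROL_UNUSED_EXAMPLE = 1,     /*  1. Unused              */
        RDMA_CONTROL_ERROR_EXAMPLE,          /*  2. Error               */
        RDMA_CONTROL_READY_EXAMPLE,          /*  3. Ready               */
        RDMA_CONTROL_QEMU_FILE_EXAMPLE,      /*  4. QEMU File           */
        RDMA_CONTROL_RAM_BLOCKS_REQ_EXAMPLE, /*  5. RAM Blocks request  */
        RDMA_CONTROL_RAM_BLOCKS_RES_EXAMPLE, /*  6. RAM Blocks result   */
        RDMA_CONTROL_COMPRESS_EXAMPLE,       /*  7. Compress page       */
        RDMA_CONTROL_REG_REQUEST_EXAMPLE,    /*  8. Register request    */
        RDMA_CONTROL_REG_RESULT_EXAMPLE,     /*  9. Register result     */
        RDMA_CONTROL_REG_FINISHED_EXAMPLE,   /* 10. Register finished   */
        RDMA_CONTROL_UNREG_REQUEST_EXAMPLE,  /* 11. Unregister request  */
        RDMA_CONTROL_UNREG_FINISHED_EXAMPLE  /* 12. Unregister finished */
    } RDMAControlTypeExample;
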
A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), let's post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands (#9 in the list above) back to the sender,
   which hold the rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection (described below).)

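To make "post an RQ work request" concrete at the ibverbs level, reposting
a receive buffer for the next incoming SEND looks roughly like the sketch
below. The 'qp', 'mr' and 'buf' arguments are assumed to have been created
during connection setup; error handling is left out.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Repost one receive work request so the peer's next SEND message
     * has a registered landing buffer. */
    static int repost_recv(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad_wr;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }
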
All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual addresses
   and lengths of each RAMBlock. This is used by the client to determine the
   start and stop locations of chunks and how to register them dynamically
   before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces also call these functions (described below)
   when transmitting non-live state, such as devices, or to send
   their own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side,
   as in the sketch below.

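A minimal sketch of the zero-page decision in item 4, using an invented
helper in place of QEMU's own optimized buffer routines:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* True if every byte in [buf, buf + len) is zero. */
    static bool buffer_all_zero(const uint8_t *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (buf[i]) {
                return false;
            }
        }
        return true;
    }

    /* Illustrative decision: 'registered' says whether the chunk holding
     * this page has already been pinned on both sides. */
    static bool should_send_compress(const uint8_t *page, size_t page_size,
                                     const uint8_t *chunk, size_t chunk_size,
                                     bool registered)
    {
        if (!registered) {
            /* Chunk not registered yet: the zero check is per page. */
            return buffer_all_zero(page, page_size);
        }
        /* Chunk already registered: only zap if the whole chunk is zero. */
        return buffer_all_zero(chunk, chunk_size);
    }
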
Versioning and Capabilities
===========================
Current version of the protocol is version #1.

The same version applies to both protocol traffic and capabilities
negotiation. (i.e. there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
* Version (protocol version validated before send/recv occurs),
  uint32, network byte order
* Flags (bitwise OR of each capability),
  uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

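For illustration, the private-data header could be declared as the packed
struct below. The names are invented; only the two uint32 fields and their
network byte order come from the description above.

    #include <stdint.h>

    /* Hypothetical capability-negotiation header carried in librdmacm's
     * 'private data' area; well under the 192-byte Infiniband limit. */
    typedef struct __attribute__((packed)) {
        uint32_t version;   /* protocol version, network byte order         */
        uint32_t flags;     /* bitwise OR of capability bits, network order */
    } RDMACapHeaderExample;
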
This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page registration.

Finally: Negotiation happens with the Flags field: if the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.

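In code, that negotiation reduces to a bitwise AND of the requested and
supported capability masks; a small sketch with an invented bit name:

    #include <stdint.h>

    #define CAP_DYNAMIC_REGISTRATION_EXAMPLE (1u << 0)   /* illustrative bit */

    /* Destination side: acknowledge only the requested capabilities that
     * this build supports; anything unknown comes back as a zero-bit. */
    static uint32_t negotiate_caps(uint32_t requested, uint32_t supported)
    {
        return requested & supported;
    }

    /* Source side: drop any capability the destination refused. */
    static void apply_negotiated(uint32_t *enabled, uint32_t acked)
    {
        *enabled &= acked;
    }
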
QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.

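To make the holding-area behaviour concrete, here is a small self-contained
sketch of the copy-out logic described above. The structure and names are
invented for this illustration and are not the QEMUFileOps API.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical holding area refilled each time a complete "QEMU File"
     * control message arrives over the SEND channel. */
    typedef struct {
        uint8_t data[4096];   /* bytes copied out of the last SEND  */
        size_t  len;          /* how many of them are valid         */
        size_t  pos;          /* how many have been consumed so far */
    } HoldingArea;

    /* Return up to 'want' bytes from the holding area; a return of 0 means
     * the caller must issue another "QEMU File" command to refill it. */
    static size_t holding_read(HoldingArea *h, uint8_t *dst, size_t want)
    {
        size_t avail = h->len - h->pos;
        size_t n = want < avail ? want : avail;

        memcpy(dst, h->data + h->pos, n);
        h->pos += n;
        return n;
    }
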
Migration of pc.ram:
====================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets, lengths, and virtual
addresses, and possibly pre-registered RDMA keys in case dynamic
page registration was disabled on the server-side (otherwise not).

Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

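For illustration, locating the chunk that a given page belongs to is simple
pointer arithmetic. The helper names below are invented; only the 1 Megabyte
size comes from the text above.

    #include <stdint.h>

    #define CHUNK_SIZE_EXAMPLE (1UL << 20)   /* 1 MB, the current hard-coded size */

    /* Which chunk of a RAMBlock does this host address fall into?
     * 'block_base' is the host virtual address where the block starts. */
    static inline uint64_t chunk_index(uint8_t *block_base, uint8_t *host_addr)
    {
        return (uint64_t)(host_addr - block_base) / CHUNK_SIZE_EXAMPLE;
    }

    /* Host virtual address where chunk 'i' of the block begins. */
    static inline uint8_t *chunk_start(uint8_t *block_base, uint64_t i)
    {
        return block_base + i * CHUNK_SIZE_EXAMPLE;
    }
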
When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: this means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.

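As a sketch of what "only the last chunk in a batch must be signaled" looks
like at the verbs layer, the IBV_SEND_SIGNALED flag is set only on the final
RDMA Write of a batch. The function and parameter names are invented; 'qp',
'mr', 'rkey' and the addresses are assumed to come from the registration
steps described earlier.

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post one RDMA Write for a chunk; request a completion only when
     * 'last_in_batch' is set, so the CQ is signaled roughly once per
     * 64 chunks rather than once per chunk. */
    static int post_chunk_write(struct ibv_qp *qp, struct ibv_mr *mr,
                                void *local, uint64_t remote_addr,
                                uint32_t rkey, uint32_t len, int last_in_batch)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = last_in_batch ? IBV_SEND_SIGNALED : 0,
            .sg_list    = &sge,
            .num_sge    = 1,
            .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
        };
        struct ibv_send_wr *bad_wr;

        return ibv_post_send(qp, &wr, &bad_wr);
    }
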
Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode we use for
RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely and
clean up all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way that it would be if the TCP
socket were broken during a non-RDMA based migration.

TODO:
=====
1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
   renamed to 'rdma' after the experimental phase of this work has
   completed upstream.
2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
3. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
4. Some form of balloon-device usage tracking would also
   help alleviate some issues.
5. Move UNREGISTER requests to a separate thread.
6. Use LRU to provide more fine-grained direction of UNREGISTER
   requests for unpinning memory in an overcommitted environment.
7. Expose UNREGISTER support to the user by way of workload-specific
   hints about application behavior.