(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>

An *exhaustive* paper (2010) with additional performance details is
linked on the QEMU wiki above.

    
Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of pc.ram
* Error handling
* TODO

    
Introduction:
=============

RDMA helps make your migration more deterministic under heavy load because
of the significantly lower latency and higher throughput over TCP/IP. This is
because the RDMA I/O architecture reduces the number of interrupts and
data copies by bypassing the host networking stack. In particular, a TCP-based
migration, under certain types of memory-bound workloads, may take a more
unpredictable amount of time to complete the migration if the amount of
memory tracked during each live migration iteration round cannot keep pace
with the rate of dirty memory produced by the workload.

RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
over Converged Ethernet) as well as Infiniband-based. This implementation of
migration using RDMA is capable of using both technologies because of
the use of the OpenFabrics OFED software stack that abstracts out the
programming model irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor for
guidance on how to verify that you have the OFED software stack
installed in your environment. You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for a working build of QEMU to run successfully using RDMA Migration.
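
If you are unsure whether the development packages are in place, a tiny
standalone program such as the following can serve as a quick link-time
check (this is purely illustrative and not part of QEMU; build it with
"cc ofed_check.c -libverbs -lrdmacm"):

    /* ofed_check.c: verify that the ibverbs headers and libraries resolve. */
    #include <stdio.h>
    #include <infiniband/verbs.h>
    #include <rdma/rdma_cma.h>

    int main(void)
    {
        int num_devices = 0;
        struct ibv_device **devs = ibv_get_device_list(&num_devices);

        if (!devs) {
            fprintf(stderr, "libibverbs is usable, but no RDMA devices were found\n");
            return 1;
        }
        printf("found %d RDMA device(s)\n", num_devices);
        ibv_free_device_list(devs);
        return 0;
    }

If this compiles, links and runs, a QEMU build with RDMA migration support
should be able to find the same libraries.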

    
BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.

Experimental: Next, decide if you want dynamic page registration or if you
want to pin all of the memory up front. For example, if you have an 8GB RAM
virtual machine, but only 1GB is in active use, then enabling the pin-all
feature will cause all 8GB to be pinned and resident in memory. This feature
mostly affects the bulk-phase round of the migration and can be enabled for
extremely high-performance RDMA hardware using the following command:

QEMU Monitor Command:
$ migrate_set_capability x-rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning
*all* of the memory of your virtual machine in the kernel is very expensive
and may extend the initial bulk iteration time by many seconds,
thus extending the total migration time. However, this will not
affect the determinism or predictability of your migration: you will
still gain from the benefits of advanced pinning with RDMA.

    
RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming x-rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d x-rdma:host:port
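
Putting the pieces together, a complete run might look like the following
(the destination address 192.168.1.2 and port 4444 are only examples):

Destination machine:
$ qemu-system-x86_64 [...usual options...] -incoming x-rdma:192.168.1.2:4444

Source machine (QEMU monitor):
$ migrate_set_capability x-rdma-pin-all on   # optional, experimental
$ migrate_set_speed 40g
$ migrate -d x-rdma:192.168.1.2:4444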

    
PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA:
Using a 40gbps infiniband link performing a worst-case stress test,
using an 8GB RAM virtual machine:

Using the following command:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, in the same 8GB RAM example with all 8GB of memory in
active use but the VM itself completely idle, using the same 40 gbps
infiniband link:

1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will have already been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.

    
RDMA Protocol Description:
==========================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications using infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for pc.ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.
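
As a rough illustration of these two steps (the function and variable
names below are made up for this document and do not come from
migration-rdma.c), registering a control buffer and posting a receive
work request with libibverbs looks roughly like this:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    static int post_control_recv(struct ibv_pd *pd, struct ibv_qp *qp,
                                 void *buf, size_t len)
    {
        /* 1. Register (and thereby pin) the buffer with the HCA. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr) {
            return -1;
        }

        /* 2. Post a receive work request so that an incoming SEND
         *    has somewhere to land. */
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = (uintptr_t)buf,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }

The protection domain (pd) and queue pair (qp) are assumed to have been
created during the usual librdmacm connection setup described below.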

    
RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

    
To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt).
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)

At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together they are transmitted
as a single SEND message).

Header:
    * Length               (of the data portion, uint32, network byte order)
    * Type                 (what command to perform, uint32, network byte order)
    * Repeat               (Number of commands in data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself
so that the protocol is compatible against multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol
check this field, register all requests found in the array of commands
located in the data portion, and return an equal number of results in the
response. The maximum number of repeats is hard-coded to 4096. This is a
conservative limit based on the maximum size of a SEND message along with
empirical observations on the maximum future benefit of simultaneous page
registrations.
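
To make the layout concrete, the header described above could be written
in C roughly as follows (this is an illustrative sketch; the real
definition lives in migration-rdma.c and may differ in naming):

    #include <stdint.h>
    #include <arpa/inet.h>   /* htonl() for network byte order */

    struct rdma_control_header {
        uint32_t len;     /* length of the data portion */
        uint32_t type;    /* which command to perform */
        uint32_t repeat;  /* number of same-type commands in the data portion */
    };

    /* Convert the fields to network byte order before the SEND is posted. */
    static inline void control_header_to_network(struct rdma_control_header *h)
    {
        h->len    = htonl(h->len);
        h->type   = htonl(h->type);
        h->repeat = htonl(h->repeat);
    }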

    
The 'type' field has 12 different command values:
     1. Unused
     2. Error                      (sent to the source when something goes wrong)
     3. Ready                      (control-channel is available)
     4. QEMU File                  (for sending non-live device state)
     5. RAM Blocks request         (used right after connection setup)
     6. RAM Blocks result          (used right after connection setup)
     7. Compress page              (zap zero page and skip registration)
     8. Register request           (dynamic chunk registration)
     9. Register result            ('rkey' to be used by sender)
    10. Register finished          (registration for current iteration finished)
    11. Unregister request         (unpin previously registered memory)
    12. Unregister finished        (confirmation that unpin completed)
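
Expressed as a C enumeration (again purely illustrative; the names and the
actual on-the-wire values are defined by the implementation in
migration-rdma.c, not by this sketch), the commands map out as:

    enum rdma_control_type {
        RDMA_CONTROL_UNUSED = 1,
        RDMA_CONTROL_ERROR,               /* sent to the source when something goes wrong */
        RDMA_CONTROL_READY,               /* control channel is available */
        RDMA_CONTROL_QEMU_FILE,           /* non-live device state */
        RDMA_CONTROL_RAM_BLOCKS_REQUEST,
        RDMA_CONTROL_RAM_BLOCKS_RESULT,
        RDMA_CONTROL_COMPRESS_PAGE,
        RDMA_CONTROL_REGISTER_REQUEST,
        RDMA_CONTROL_REGISTER_RESULT,
        RDMA_CONTROL_REGISTER_FINISHED,
        RDMA_CONTROL_UNREGISTER_REQUEST,
        RDMA_CONTROL_UNREGISTER_FINISHED,
    };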

    
A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

    
Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), let's post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands (#9 above) back to the sender, which
   hold the rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection, as described below.)
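
The "block on the CQ event channel" step that both functions rely on maps
onto a small amount of ibverbs boilerplate. A hedged sketch (the names here
are invented, and error handling is reduced to the bare minimum):

    #include <infiniband/verbs.h>

    static int wait_for_completion(struct ibv_comp_channel *comp_chan,
                                   struct ibv_cq *cq)
    {
        struct ibv_cq *ev_cq;
        void *ev_ctx;
        struct ibv_wc wc;

        /* Ask for a notification on the next completion, then block. */
        if (ibv_req_notify_cq(cq, 0)) {
            return -1;
        }
        if (ibv_get_cq_event(comp_chan, &ev_cq, &ev_ctx)) {
            return -1;
        }
        ibv_ack_cq_events(ev_cq, 1);

        /* Drain the completion that woke us up. */
        if (ibv_poll_cq(cq, 1, &wc) < 1 || wc.status != IBV_WC_SUCCESS) {
            return -1;
        }
        return 0;
    }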

    
All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual addresses
   and lengths of each RAMBlock. This is used by the client to determine the
   start and stop locations of chunks and how to register them dynamically
   before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. Also, the QEMUFile interfaces call these functions (described below)
   when transmitting non-live state, such as devices, or to send
   its own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side.
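
The "check the entire chunk for zero" step is an ordinary byte scan over
the chunk; something along the lines of the following sketch (the helper
name is invented here; the actual implementation uses QEMU's own optimized
zero-detection helpers):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Return true only if every byte in the chunk is zero. */
    static bool chunk_is_zero(const uint8_t *chunk, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (chunk[i]) {
                return false;
            }
        }
        return true;
    }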

    
Versioning and Capabilities
===========================

Current version of the protocol is version #1.

The same version applies to both protocol traffic and capabilities
negotiation. (i.e. there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
    * Version (protocol version validated before send/recv occurs),
                                               uint32, network byte order
    * Flags   (bitwise OR of each capability),
                                               uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page registration.

Finally: Negotiation happens with the Flags field: if the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.
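
Because librdmacm exposes the private data area directly through
struct rdma_conn_param, the exchange boils down to filling in a small
structure before calling rdma_connect() (or rdma_accept() on the
receiver). A hedged sketch, with invented names:

    #include <stdint.h>
    #include <arpa/inet.h>
    #include <rdma/rdma_cma.h>

    struct rdma_capability_header {
        uint32_t version;  /* protocol version, network byte order */
        uint32_t flags;    /* bitwise OR of requested capabilities */
    };

    static int connect_with_capabilities(struct rdma_cm_id *id,
                                         uint32_t version, uint32_t flags)
    {
        struct rdma_capability_header cap = {
            .version = htonl(version),
            .flags   = htonl(flags),
        };
        struct rdma_conn_param param = {
            .private_data     = &cap,
            .private_data_len = sizeof(cap),
            .retry_count      = 5,          /* arbitrary example value */
        };

        return rdma_connect(id, &param);
    }

The receiver reads the same bytes from the private data attached to its
connection-request event and can reply, via rdma_accept(), with only the
capabilities it actually supports.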

    
QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.
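
In other words, get_buffer() behaves like a read from a tiny staging
buffer that is refilled on demand. A simplified sketch of that logic
(the data structure, sizes and names are invented for illustration and
are not the ones used by QEMU):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct holding_area {
        uint8_t data[32 * 1024];    /* example size only */
        size_t  len;                /* valid bytes currently held */
        size_t  pos;                /* how far get_buffer() has consumed */
    };

    /* Hypothetical refill hook: issues a "QEMU File" control command and
     * blocks until the next SEND delivers more bytes. */
    extern size_t refill_from_control_channel(struct holding_area *h);

    static size_t rdma_get_buffer(struct holding_area *h,
                                  uint8_t *dst, size_t want)
    {
        if (h->pos == h->len) {
            /* Buffer drained: ask the peer for another SEND message. */
            h->len = refill_from_control_channel(h);
            h->pos = 0;
            if (h->len == 0) {
                return 0;
            }
        }

        size_t avail = h->len - h->pos;
        size_t n = want < avail ? want : avail;
        memcpy(dst, h->data + h->pos, n);
        h->pos += n;
        return n;
    }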

    
Migration of pc.ram:
====================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, virtual
addresses, and, in case dynamic page registration was disabled on
the server-side, pre-registered RDMA keys.

Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: This means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.
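
At the ibverbs level, "signaling only the last chunk in a batch" comes
down to setting IBV_SEND_SIGNALED on only that one work request. A hedged
sketch of posting one chunk (the names and the batching flag below are
illustrative and not taken from migration-rdma.c):

    #include <stdint.h>
    #include <infiniband/verbs.h>

    static int post_chunk_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                                void *chunk, uint32_t chunk_len,
                                uint64_t remote_addr, uint32_t rkey,
                                int last_in_batch)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)chunk,
            .length = chunk_len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uintptr_t)chunk,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            /* Only the final chunk of a batch asks for a completion. */
            .send_flags = last_in_batch ? IBV_SEND_SIGNALED : 0,
        };
        wr.wr.rdma.remote_addr = remote_addr;  /* from the RAMBlock exchange */
        wr.wr.rdma.rkey        = rkey;         /* from the 'Register result' */

        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr);
    }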

    
Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode we use
for RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely and
clean up all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way that would happen if the TCP
socket were broken during a non-RDMA based migration.

    
TODO:
=====
1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
   renamed to 'rdma' after the experimental phase of this work has
   completed upstream.
2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
3. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
4. Also, some form of balloon-device usage tracking would
   help alleviate some issues.
5. Move UNREGISTER requests to a separate thread.
6. Use LRU to provide more fine-grained direction of UNREGISTER
   requests for unpinning memory in an overcommitted environment.
7. Expose UNREGISTER support to the user by way of workload-specific
   hints about application behavior.