= Migration =

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations. Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times, i.e. it can only restore the state into a guest that has the
same devices as the one whose state was saved (this last requirement
can be relaxed a bit, but for now we can consider that the
configuration has to be exactly the same).

Once we are able to save/restore a guest, a new functionality is
requested: migration. This means that QEMU is able to start on one
machine and be "migrated" to another machine, i.e. be moved to
another machine.

Next came the "live migration" functionality. This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of the state from one machine to another.
Live migration allows the guest to continue running while the state is
transferred. The guest has to be stopped only while the last part of
the state is transferred. Typically the time that the guest is
unresponsive during live migration is in the low hundreds of
milliseconds (note that this depends on a lot of things).

=== Types of migration ===

Now that we have talked about live migration, there are several ways
to do migration:

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU. QEMU doesn't care how this file descriptor is opened.

All four of these migration protocols use the same infrastructure to
save/restore device state. This infrastructure is shared with the
savevm/loadvm functionality.

=== State Live Migration ===

This is used for RAM and block devices. It is not yet ported to vmstate.
<Fill more information here>

=== What is the common infrastructure ===

QEMU uses a QEMUFile abstraction to be able to do migration. Any type
of migration that wants to use the QEMU infrastructure has to create a
QEMUFile with:

QEMUFile *qemu_fopen_ops(void *opaque,
                         QEMUFilePutBufferFunc *put_buffer,
                         QEMUFileGetBufferFunc *get_buffer,
                         QEMUFileCloseFunc *close,
                         QEMUFileRateLimit *rate_limit,
                         QEMUFileSetRateLimit *set_rate_limit,
                         QEMUFileGetRateLimit *get_rate_limit);

The functions have the following functionality:

This function writes a chunk of data to a file at the given position.
The pos argument can be ignored if the file is only being used for
streaming. The handler should try to write all of the data it can.

typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
                                    int64_t pos, int size);

Read a chunk of data from a file at the given position. The pos argument
can be ignored if the file is only being used for streaming. The number of
bytes actually read should be returned.

typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
                                    int64_t pos, int size);

Close a file and return an error code.

typedef int (QEMUFileCloseFunc)(void *opaque);

Called to determine if the file has exceeded its bandwidth allocation. The
bandwidth capping is a soft limit, not a hard limit.

typedef int (QEMUFileRateLimit)(void *opaque);

Called to change the current bandwidth allocation. This function must return
the new actual bandwidth. It should be new_rate if everything goes OK, and
the old rate otherwise.

typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate);
typedef size_t (QEMUFileGetRateLimit)(void *opaque);

You can keep any internal state that you need behind the opaque void *
pointer that is passed to all functions.

The rate limiting functions are used to limit the bandwidth used by
QEMU migration.

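As a rough illustration of the three rate limiting handlers, the sketch
below keeps a byte budget behind the opaque pointer. The typedefs are
the ones quoted above; DemoRate and the demo_* functions are
hypothetical names invented for this example, and rejecting a rate of
zero is an assumption of the sketch, not documented QEMU behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Handler types as quoted in the text above. */
typedef int (QEMUFileRateLimit)(void *opaque);
typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate);
typedef size_t (QEMUFileGetRateLimit)(void *opaque);

/* Hypothetical accounting state reached through the opaque pointer. */
typedef struct DemoRate {
    size_t limit;   /* bytes allowed in the current accounting period */
    size_t sent;    /* bytes already sent in the current period */
} DemoRate;

/* Report whether the budget has been exceeded.  Nothing here stops a
 * writer from overshooting, which is why the limit is only "soft". */
static int demo_rate_limit(void *opaque)
{
    DemoRate *r = opaque;

    assert(r != NULL);
    return r->sent > r->limit;
}

/* Return the new rate on success and the old rate otherwise. */
static size_t demo_set_rate_limit(void *opaque, size_t new_rate)
{
    DemoRate *r = opaque;

    assert(r != NULL);
    if (new_rate == 0) {
        return r->limit;    /* assumption: zero is refused, old rate kept */
    }
    r->limit = new_rate;
    return r->limit;
}

static size_t demo_get_rate_limit(void *opaque)
{
    DemoRate *r = opaque;

    assert(r != NULL);
    return r->limit;
}
```

A caller would check demo_rate_limit() between writes and pause sending
once it returns non-zero.
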
The important functions for us are put_buffer()/get_buffer(), which
allow us to write a buffer into, and read a buffer from, the QEMUFile.

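To make the two handlers more concrete, here is a minimal sketch that
backs them with a fixed-size memory buffer. The typedefs are the ones
quoted above; DemoMemFile, DEMO_MEM_SIZE and the demo_mem_* functions
are hypothetical names used only for illustration, and returning the
number of bytes written from the put handler is an assumption of this
sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Handler types as quoted in the text above. */
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
                                    int64_t pos, int size);
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
                                    int64_t pos, int size);

#define DEMO_MEM_SIZE 4096

/* Hypothetical backend state reached through the opaque pointer. */
typedef struct DemoMemFile {
    uint8_t data[DEMO_MEM_SIZE];
    int64_t len;                /* number of valid bytes in data[] */
} DemoMemFile;

/* Write as much of buf as fits at pos; return the bytes written. */
static int demo_mem_put_buffer(void *opaque, const uint8_t *buf,
                               int64_t pos, int size)
{
    DemoMemFile *m = opaque;

    assert(m != NULL);
    if (pos < 0 || size < 0 || pos > DEMO_MEM_SIZE) {
        return -1;
    }
    if (size > DEMO_MEM_SIZE - pos) {
        size = (int)(DEMO_MEM_SIZE - pos);
    }
    memcpy(m->data + pos, buf, (size_t)size);
    if (pos + size > m->len) {
        m->len = pos + size;
    }
    return size;
}

/* Read up to size bytes from pos; return the bytes actually read. */
static int demo_mem_get_buffer(void *opaque, uint8_t *buf,
                               int64_t pos, int size)
{
    DemoMemFile *m = opaque;

    assert(m != NULL);
    if (pos < 0 || size < 0 || pos > m->len) {
        return -1;
    }
    if (size > m->len - pos) {
        size = (int)(m->len - pos);
    }
    memcpy(buf, m->data + pos, (size_t)size);
    return size;
}
```

A streaming backend such as a socket could ignore pos entirely, as the
descriptions above allow.
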
=== How to save the state of one device ===

The state of a device is saved using intermediate buffers. There are
some helper functions to assist with this saving.

There is a new concept that we have to explain here: device state
version. When we migrate a device, we save/load the state as a series
of fields. Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information. We use the
version to identify each time that we make a change. Each version is
associated with a series of saved fields. save_state always saves
the state as the newest version, but load_state is sometimes able to
load state from an older version.

=== Legacy way ===

This way is going to disappear as soon as all current users are ported to VMSTATE.

Each device has to register two functions, one to save the state and
another to load the state back.

int register_savevm(DeviceState *dev,
                    const char *idstr,
                    int instance_id,
                    int version_id,
                    SaveStateHandler *save_state,
                    LoadStateHandler *load_state,
                    void *opaque);

typedef void SaveStateHandler(QEMUFile *f, void *opaque);
typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);

The important functions for the device state format are save_state
and load_state. Notice that load_state receives a version_id
parameter so that it knows which state format it is receiving.
save_state doesn't have a version_id parameter because it always uses
the latest version.

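The following sketch shows the shape of such a pair of handlers for a
made-up device whose state gained a field in version 2. It does not use
the real QEMUFile API: DemoStream is a hypothetical byte queue standing
in for QEMUFile, and DemoDev, its fields and all demo_* functions are
invented for this example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_DEV_VERSION 2      /* newest state format of the demo device */

/* Hypothetical stand-in for QEMUFile: a small byte queue. */
typedef struct DemoStream {
    uint8_t data[16];
    int wpos;
    int rpos;
} DemoStream;

/* Hypothetical device state. */
typedef struct DemoDev {
    uint8_t ctrl;
    uint8_t irq_level;          /* added in version 2 */
} DemoDev;

static void demo_put_byte(DemoStream *f, uint8_t v)
{
    assert(f->wpos < (int)sizeof(f->data));
    f->data[f->wpos++] = v;
}

static uint8_t demo_get_byte(DemoStream *f)
{
    assert(f->rpos < f->wpos);
    return f->data[f->rpos++];
}

/* Shaped like SaveStateHandler: always writes the newest format. */
static void demo_dev_save(DemoStream *f, void *opaque)
{
    DemoDev *d = opaque;

    demo_put_byte(f, d->ctrl);
    demo_put_byte(f, d->irq_level);
}

/* Shaped like LoadStateHandler: version_id selects the format to read. */
static int demo_dev_load(DemoStream *f, void *opaque, int version_id)
{
    DemoDev *d = opaque;

    if (version_id < 1 || version_id > DEMO_DEV_VERSION) {
        return -1;              /* unknown format */
    }
    d->ctrl = demo_get_byte(f);
    if (version_id >= 2) {
        d->irq_level = demo_get_byte(f);
    } else {
        d->irq_level = 0;       /* field absent in version 1: use a default */
    }
    return 0;
}
```

Note how the field order and the version check have to be kept
consistent by hand between the two functions.
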
=== VMState ===

The legacy way of saving/loading the state of a device had the problem
that we had to keep two functions in sync. If we made a change in one
of them and not in the other, we got a failed migration.

VMState changed the way that state is saved/loaded. Instead of using
one function to save the state and another to load it, the state is
now described declaratively. VMState interprets that description both
to save and to load the state. As the state is declared only once,
the save and load paths can't go out of sync.

An example (from hw/pckbd.c)

static const VMStateDescription vmstate_kbd = {
    .name = "pckbd",
    .version_id = 3,
    .minimum_version_id = 3,
    .minimum_version_id_old = 3,
    .fields      = (VMStateField []) {
        VMSTATE_UINT8(write_cmd, KBDState),
        VMSTATE_UINT8(status, KBDState),
        VMSTATE_UINT8(mode, KBDState),
        VMSTATE_UINT8(pending, KBDState),
        VMSTATE_END_OF_LIST()
    }
};

We are declaring the state with name "pckbd".
The version_id is 3, and the fields are 4 uint8_t in a KBDState structure.
We registered this with:

    vmstate_register(NULL, 0, &vmstate_kbd, s);

Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.

You can search for VMSTATE_* macros for lots of types used in QEMU in
hw/hw.h.

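To illustrate the declarative idea in isolation, the sketch below
describes four uint8_t fields in a table and lets one generic saver and
one generic loader walk that table. It is a toy model, not QEMU's
implementation: DemoField, DemoKbd, the DEMO_* macros and the
demo_fields_* functions are all hypothetical, and only uint8_t fields
are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one uint8_t field inside a structure. */
typedef struct DemoField {
    const char *name;
    size_t offset;
} DemoField;

#define DEMO_UINT8(field, type) { #field, offsetof(type, field) }
#define DEMO_END_OF_LIST()      { NULL, 0 }

/* Hypothetical device structure mirroring the four fields above. */
typedef struct DemoKbd {
    uint8_t write_cmd;
    uint8_t status;
    uint8_t mode;
    uint8_t pending;
} DemoKbd;

/* The state is declared exactly once. */
static const DemoField demo_kbd_fields[] = {
    DEMO_UINT8(write_cmd, DemoKbd),
    DEMO_UINT8(status, DemoKbd),
    DEMO_UINT8(mode, DemoKbd),
    DEMO_UINT8(pending, DemoKbd),
    DEMO_END_OF_LIST()
};

/* Copy every described field into out[]; return bytes written or -1. */
static int demo_fields_save(const DemoField *fields, const void *state,
                            uint8_t *out, int max)
{
    int n = 0;

    assert(fields != NULL && state != NULL);
    for (; fields->name != NULL; fields++) {
        if (n >= max) {
            return -1;
        }
        out[n++] = *((const uint8_t *)state + fields->offset);
    }
    return n;
}

/* Fill every described field from in[]; return bytes consumed or -1. */
static int demo_fields_load(const DemoField *fields, void *state,
                            const uint8_t *in, int len)
{
    int n = 0;

    assert(fields != NULL && state != NULL);
    for (; fields->name != NULL; fields++) {
        if (n >= len) {
            return -1;
        }
        *((uint8_t *)state + fields->offset) = in[n++];
    }
    return n;
}
```

Because both directions are driven by the same table, adding a field
means touching one place only.
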
=== More about versions ===

You can see that there are several version fields:

- version_id: the maximum version_id supported by VMState for that device
- minimum_version_id: the minimum version_id that VMState is able to understand
  for that device.
- minimum_version_id_old: For devices that could not be ported to vmstate, we can
  assign a function that knows how to read this old state.

So, VMState is able to read versions from minimum_version_id to
version_id. And the function load_state_old() is able to load state
from minimum_version_id_old to minimum_version_id. This function is
deprecated and will be removed when no more users are left.

=== Massaging functions ===

Sometimes it is not enough to be able to save the state directly
from one structure; we first need to fill in the correct values there.
One example is when we are using kvm. Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using. And the
opposite when we are loading the state: we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- int (*pre_load)(void *opaque);

  This function is called before we load the state of one device.

- int (*post_load)(void *opaque, int version_id);

  This function is called after we load the state of one device.

- void (*pre_save)(void *opaque);

  This function is called before we save the state of one device.

Example: You can look at hpet.c, which uses the three functions to
massage the state that is transferred.

=== Subsections ===

The use of version_id allows us to migrate from older versions
to newer versions of a device, but not the other way around. This
makes it very complicated to fix bugs in stable branches. If we need to
add anything to the state to fix a bug, we have to disable migration
to older versions that don't have that bug-fix (i.e. a new field).

But sometimes that bug-fix is only needed in some situations, not
always. For instance, if the device is in the middle of a DMA
operation, it is using a specific functionality, ....

It is impossible to create a way to make migration from any version to
any other version work. But we can do better than only allowing
migration from older versions to newer ones. For those fields that are
only needed sometimes, we add the idea of subsections. A subsection
is "like" a device vmstate, but with one particularity: it has a Boolean
function that tells whether those values need to be sent or not. If
this function returns false, the subsection is not sent.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration. If we understand all
the subsections, then we load the state successfully.

One important note is that the post_load() function is called "after"
loading all subsections, because a newer subsection could change the
same value that it uses.

Example:

static bool ide_drive_pio_state_needed(void *opaque)
{
    IDEState *s = opaque;

    return (s->status & DRQ_STAT) != 0;
}

const VMStateDescription vmstate_ide_drive_pio_state = {
    .name = "ide_drive/pio_state",
    .version_id = 1,
    .minimum_version_id = 1,
    .minimum_version_id_old = 1,
    .pre_save = ide_drive_pio_pre_save,
    .post_load = ide_drive_pio_post_load,
    .fields      = (VMStateField []) {
        VMSTATE_INT32(req_nb_sectors, IDEState),
        VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                             vmstate_info_uint8, uint8_t),
        VMSTATE_INT32(cur_io_buffer_offset, IDEState),
        VMSTATE_INT32(cur_io_buffer_len, IDEState),
        VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
        VMSTATE_INT32(elementary_transfer_size, IDEState),
        VMSTATE_INT32(packet_transfer_size, IDEState),
        VMSTATE_END_OF_LIST()
    }
};

const VMStateDescription vmstate_ide_drive = {
    .name = "ide_drive",
    .version_id = 3,
    .minimum_version_id = 0,
    .minimum_version_id_old = 0,
    .post_load = ide_drive_post_load,
    .fields      = (VMStateField []) {
        .... several fields ....
        VMSTATE_END_OF_LIST()
    },
    .subsections = (VMStateSubsection []) {
        {
            .vmsd = &vmstate_ide_drive_pio_state,
            .needed = ide_drive_pio_state_needed,
        }, {
            /* empty */
        }
    }
};

Here we have a subsection for the pio state. We only need to
save/send this state when we are in the middle of a pio operation
(that is what ide_drive_pio_state_needed() checks). If DRQ_STAT is
not set, the values in those fields are garbage and don't need to
be sent.
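
The sending and receiving rules for subsections can be modelled with a
few lines of stand-alone C. This is a sketch under stated assumptions,
not QEMU code: DemoDrive, DemoSubsection, DEMO_DRQ_STAT (whose value is
arbitrary here) and the demo_* functions are hypothetical, and
subsections are identified purely by name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_DRQ_STAT 0x08      /* arbitrary flag value for this sketch */

/* Hypothetical device state. */
typedef struct DemoDrive {
    unsigned status;
} DemoDrive;

/* Hypothetical subsection entry: a name plus a "needed" predicate. */
typedef struct DemoSubsection {
    const char *name;
    bool (*needed)(void *opaque);
} DemoSubsection;

static bool demo_pio_state_needed(void *opaque)
{
    DemoDrive *s = opaque;

    return (s->status & DEMO_DRQ_STAT) != 0;
}

/* List terminated by an entry whose name is NULL. */
static const DemoSubsection demo_drive_subsections[] = {
    { "demo_drive/pio_state", demo_pio_state_needed },
    { NULL, NULL }
};

/* Sender: count the subsections whose predicate asks to be sent. */
static int demo_count_sent(const DemoSubsection *subs, void *opaque)
{
    int n = 0;

    assert(subs != NULL);
    for (; subs->name != NULL; subs++) {
        if (subs->needed(opaque)) {
            n++;
        }
    }
    return n;
}

/* Receiver: 0 if the incoming subsection is known, -1 to fail otherwise. */
static int demo_accept_incoming(const DemoSubsection *subs, const char *name)
{
    assert(subs != NULL && name != NULL);
    for (; subs->name != NULL; subs++) {
        if (strcmp(subs->name, name) == 0) {
            return 0;
        }
    }
    return -1;
}
```

A destination that only knows the subsections in its own list would
therefore refuse a stream carrying any other subsection name, while a
source that is not in the middle of a pio operation sends nothing extra.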