Revision f58ae59c
b/docs/migration.txt

= Migration =

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations. Saving the state just does
that: it saves the state of each device that the guest is running.
Restoring a guest is just the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times. I.e. it can only restore the state into a guest that has the
same devices as the one whose state was saved (this last requirement
can be relaxed a bit, but for now we can consider that the
configuration has to be exactly the same).

Once we are able to save/restore a guest, a new functionality is
requested: migration. This means that QEMU is able to start on one
machine and be "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality. This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all the state from one machine to another.
Live migration allows the guest to continue running while the state is
transferred. The guest has to be stopped only while the last part of
the state is transferred. Typically the time that the guest is
unresponsive during live migration is in the low hundreds of
milliseconds (notice that this depends on a lot of things).

=== Types of migration ===

Now that we have talked about live migration, there are several ways
to do migration:

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout of a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU. QEMU doesn't care how this file descriptor is opened.

All four of these migration protocols use the same infrastructure to
save/restore device state. This infrastructure is shared with the
savevm/loadvm functionality.

=== State Live Migration ===

This is used for RAM and block devices. It is not yet ported to vmstate.
<Fill more information here>

=== What is the common infrastructure ===

QEMU uses a QEMUFile abstraction to be able to do migration. Any type
of migration that wants to use the QEMU infrastructure has to create a
QEMUFile with:

QEMUFile *qemu_fopen_ops(void *opaque,
                         QEMUFilePutBufferFunc *put_buffer,
                         QEMUFileGetBufferFunc *get_buffer,
                         QEMUFileCloseFunc *close,
                         QEMUFileRateLimit *rate_limit,
                         QEMUFileSetRateLimit *set_rate_limit,
                         QEMUFileGetRateLimit *get_rate_limit);

The functions have the following functionality:

This function writes a chunk of data to a file at the given position.
The pos argument can be ignored if the file is only being used for
streaming. The handler should try to write all of the data it can.

typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
                                    int64_t pos, int size);

Read a chunk of data from a file at the given position. The pos argument
can be ignored if the file is only being used for streaming. The number of
bytes actually read should be returned.

typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
                                    int64_t pos, int size);

Close a file and return an error code.

typedef int (QEMUFileCloseFunc)(void *opaque);

Called to determine if the file has exceeded its bandwidth allocation. The
bandwidth capping is a soft limit, not a hard limit.

typedef int (QEMUFileRateLimit)(void *opaque);

Called to change the current bandwidth allocation. This function must return
the new actual bandwidth. It should be new_rate if everything goes OK, and
the old rate otherwise.

typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate);
typedef size_t (QEMUFileGetRateLimit)(void *opaque);

You can use any internal state that you need using the opaque void *
pointer that is passed to all functions.

The rate limiting functions are used to limit the bandwidth used by
QEMU migration.

The important functions for us are put_buffer()/get_buffer(), which
allow us to write/read a buffer to/from the QEMUFile.
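
As an illustration of the callback contract described above, here is a
minimal sketch of a backend that keeps its data in memory. The callback
typedefs are copied from this document; MemBackend and the mem_*
functions are hypothetical names invented for this example, and the
choice to return the number of bytes written from the put callback is an
assumption. The sketch does not call qemu_fopen_ops(), so it builds
outside the QEMU tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Callback typedefs copied from the text above. */
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
                                    int64_t pos, int size);
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
                                    int64_t pos, int size);
typedef int (QEMUFileCloseFunc)(void *opaque);

/* Hypothetical backend state, reached through the opaque pointer. */
typedef struct MemBackend {
    uint8_t data[256];
    int     len;        /* number of valid bytes in data[] */
    int     closed;
} MemBackend;

/* Write as much of buf as fits at pos; return the bytes written
 * (the return convention is an assumption of this sketch). */
static int mem_put_buffer(void *opaque, const uint8_t *buf,
                          int64_t pos, int size)
{
    MemBackend *m = opaque;
    int64_t room = (int64_t)sizeof(m->data) - pos;

    if (pos < 0 || size < 0 || room <= 0) {
        return 0;
    }
    if (size > room) {
        size = (int)room;
    }
    memcpy(m->data + pos, buf, (size_t)size);
    if (pos + size > m->len) {
        m->len = (int)(pos + size);
    }
    return size;
}

/* Read up to size bytes from pos; return the bytes actually read. */
static int mem_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
{
    MemBackend *m = opaque;
    int64_t avail = m->len - pos;

    if (pos < 0 || size < 0 || avail <= 0) {
        return 0;
    }
    if (size > avail) {
        size = (int)avail;
    }
    memcpy(buf, m->data + pos, (size_t)size);
    return size;
}

/* Mark the backend closed and report success. */
static int mem_close(void *opaque)
{
    MemBackend *m = opaque;
    m->closed = 1;
    return 0;
}

/* Round trip through the typedef'd function pointers; returns 1 on success. */
int mem_backend_demo(void)
{
    MemBackend m = { .len = 0 };
    QEMUFilePutBufferFunc *put = mem_put_buffer;
    QEMUFileGetBufferFunc *get = mem_get_buffer;
    QEMUFileCloseFunc *close_fn = mem_close;
    uint8_t out[4] = { 0 };

    put(&m, (const uint8_t *)"abcd", 0, 4);
    int n = get(&m, out, 0, (int)sizeof(out));
    close_fn(&m);
    return n == 4 && memcmp(out, "abcd", 4) == 0 && m.closed;
}
```

In a real migration backend the same three functions, plus the rate
limiting ones, would be the arguments handed to qemu_fopen_ops(), with
a pointer to the backend state as opaque.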

=== How to save the state of one device ===

The state of a device is saved using intermediate buffers. There are
some helper functions to assist this saving.

There is a new concept that we have to explain here: device state
version. When we migrate a device, we save/load the state as a series
of fields. Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information. We use the
version to identify each time that we make a change. Each version is
associated with a series of fields saved. The save_state always saves
the state as the newest version. But load_state is sometimes able to
load state from an older version.

=== Legacy way ===

This way is going to disappear as soon as all current users are ported to VMSTATE.

Each device has to register two functions, one to save the state and
another to load the state back.

int register_savevm(DeviceState *dev,
                    const char *idstr,
                    int instance_id,
                    int version_id,
                    SaveStateHandler *save_state,
                    LoadStateHandler *load_state,
                    void *opaque);

typedef void SaveStateHandler(QEMUFile *f, void *opaque);
typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);

The important functions for the device state format are save_state
and load_state. Notice that load_state receives a version_id
parameter to know what state format it is receiving. save_state doesn't
have a version_id parameter because it always uses the latest version.
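
The asymmetry between the two handlers can be sketched as follows. This
is a toy model, not QEMU code: ToyFile stands in for QEMUFile, ToyDev is
a made-up device whose "mode" field is assumed to have been added in
version 2, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMUFile: a tiny byte stream. */
typedef struct ToyFile {
    uint8_t buf[16];
    int     wpos;
    int     rpos;
} ToyFile;

static void toy_put_byte(ToyFile *f, uint8_t v) { f->buf[f->wpos++] = v; }
static uint8_t toy_get_byte(ToyFile *f) { return f->buf[f->rpos++]; }

/* Hypothetical device: "mode" was added in version 2 of its state. */
typedef struct ToyDev {
    uint8_t status;
    uint8_t mode;
} ToyDev;

#define TOY_DEV_VERSION 2

/* save_state: no version_id parameter, always writes the latest format. */
static void toy_dev_save(ToyFile *f, void *opaque)
{
    ToyDev *d = opaque;
    toy_put_byte(f, d->status);
    toy_put_byte(f, d->mode);
}

/* load_state: version_id says which format is in the stream. */
static int toy_dev_load(ToyFile *f, void *opaque, int version_id)
{
    ToyDev *d = opaque;

    if (version_id < 1 || version_id > TOY_DEV_VERSION) {
        return -1;                   /* unknown format */
    }
    d->status = toy_get_byte(f);
    if (version_id >= 2) {
        d->mode = toy_get_byte(f);
    } else {
        d->mode = 0;                 /* field absent in version 1 streams */
    }
    return 0;
}

/* Save with the current version and load it back; returns 1 on success. */
int toy_legacy_roundtrip(void)
{
    ToyFile f = { .wpos = 0 };
    ToyDev src = { .status = 0x1d, .mode = 0x03 };
    ToyDev dst = { 0 };

    toy_dev_save(&f, &src);
    int ret = toy_dev_load(&f, &dst, TOY_DEV_VERSION);
    return ret == 0 && dst.status == 0x1d && dst.mode == 0x03;
}

/* Load a hand-written version 1 stream (status only); returns 1 on success. */
int toy_legacy_load_v1(void)
{
    ToyFile f = { .wpos = 0 };
    ToyDev dst = { .status = 0, .mode = 9 };

    toy_put_byte(&f, 0x55);
    int ret = toy_dev_load(&f, &dst, 1);
    return ret == 0 && dst.status == 0x55 && dst.mode == 0;
}
```

Note how the field order is written out twice, once in each handler;
nothing checks that the two stay consistent.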

=== VMState ===

The legacy way of saving/loading the state of a device had the problem
that we had to keep two functions in sync. If we made a change
in one of them and not in the other, we got a failed migration.

VMState changed the way that state is saved/loaded. Instead of using
a function to save the state and another to load it, it was changed to
a declarative description of what the state consists of. VMState
interprets that definition to load/save the state. As
the state is declared only once, it can't go out of sync in the
save/load functions.

An example (from hw/pckbd.c)

static const VMStateDescription vmstate_kbd = {
    .name = "pckbd",
    .version_id = 3,
    .minimum_version_id = 3,
    .minimum_version_id_old = 3,
    .fields = (VMStateField []) {
        VMSTATE_UINT8(write_cmd, KBDState),
        VMSTATE_UINT8(status, KBDState),
        VMSTATE_UINT8(mode, KBDState),
        VMSTATE_UINT8(pending, KBDState),
        VMSTATE_END_OF_LIST()
    }
};

We are declaring the state with name "pckbd".
The version_id is 3, and the fields are 4 uint8_t in a KBDState structure.
We registered this with:

    vmstate_register(NULL, 0, &vmstate_kbd, s);

Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.

You can search for VMSTATE_* macros for lots of types used in QEMU in
hw/hw.h.
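
To make the "declare once, interpret twice" idea concrete, here is a
heavily simplified sketch of a field table that drives both save and
load. It is not the real VMState implementation: ToyField, TOY_UINT8
and the toy_* functions are hypothetical, only uint8_t fields are
supported, and the KBDState layout is assumed from the four fields named
in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified field descriptor (uint8_t fields only). */
typedef struct ToyField {
    const char *name;
    size_t      offset;   /* offset of the field inside the state struct */
} ToyField;

#define TOY_UINT8(_field, _state) { #_field, offsetof(_state, _field) }
#define TOY_END_OF_LIST()         { NULL, 0 }

/* Assumed layout: just the four fields named in the pckbd example. */
typedef struct KBDState {
    uint8_t write_cmd;
    uint8_t status;
    uint8_t mode;
    uint8_t pending;
} KBDState;

static const ToyField toy_kbd_fields[] = {
    TOY_UINT8(write_cmd, KBDState),
    TOY_UINT8(status, KBDState),
    TOY_UINT8(mode, KBDState),
    TOY_UINT8(pending, KBDState),
    TOY_END_OF_LIST()
};

/* Save: walk the table and copy each field into the stream. */
static int toy_table_save(const ToyField *fields, const void *state,
                          uint8_t *out)
{
    int n = 0;
    for (const ToyField *f = fields; f->name != NULL; f++) {
        out[n++] = *((const uint8_t *)state + f->offset);
    }
    return n;
}

/* Load: walk the same table in the same order. */
static int toy_table_load(const ToyField *fields, void *state,
                          const uint8_t *in)
{
    int n = 0;
    for (const ToyField *f = fields; f->name != NULL; f++) {
        *((uint8_t *)state + f->offset) = in[n++];
    }
    return n;
}

/* Save one KBDState and load it into another; returns 1 on success. */
int toy_vmstate_demo(void)
{
    KBDState src = { .write_cmd = 1, .status = 2, .mode = 3, .pending = 4 };
    KBDState dst = { 0 };
    uint8_t stream[8];

    int saved = toy_table_save(toy_kbd_fields, &src, stream);
    int loaded = toy_table_load(toy_kbd_fields, &dst, stream);
    return saved == 4 && loaded == 4 && memcmp(&src, &dst, sizeof(src)) == 0;
}
```

Because both directions iterate over the same table, adding or
reordering a field changes save and load together.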

=== More about versions ===

You can see that there are several version fields:

- version_id: the maximum version_id supported by VMState for that device
- minimum_version_id: the minimum version_id that VMState is able to understand
  for that device.
- minimum_version_id_old: For devices that could not be ported to vmstate, we can
  assign a function that knows how to read this old state.

So, VMState is able to read versions from minimum_version_id to
version_id. And the function load_state_old() is able to load state
from minimum_version_id_old to minimum_version_id. This function is
deprecated and will be removed when no more users are left.
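
The version ranges described above amount to a small decision
procedure, sketched below. ToyVersions, ToyLoader and toy_pick_loader()
are hypothetical names for this illustration; the sketch assumes that a
stream at exactly minimum_version_id is handled by VMState rather than
by load_state_old().

```c
#include <assert.h>

/* Hypothetical subset of a state description: version fields only. */
typedef struct ToyVersions {
    int version_id;
    int minimum_version_id;
    int minimum_version_id_old;
    int has_load_state_old;    /* non-zero if a load_state_old() exists */
} ToyVersions;

enum ToyLoader { TOY_REJECT, TOY_USE_VMSTATE, TOY_USE_LOAD_STATE_OLD };

/* Decide which loader handles a stream with the given incoming version. */
enum ToyLoader toy_pick_loader(const ToyVersions *v, int incoming)
{
    if (incoming > v->version_id) {
        return TOY_REJECT;                  /* newer than we understand */
    }
    if (incoming >= v->minimum_version_id) {
        return TOY_USE_VMSTATE;
    }
    if (v->has_load_state_old && incoming >= v->minimum_version_id_old) {
        return TOY_USE_LOAD_STATE_OLD;
    }
    return TOY_REJECT;                      /* too old */
}

/* Made-up device: VMState reads versions 2..3, the old loader reads 1. */
const ToyVersions toy_example = {
    .version_id = 3,
    .minimum_version_id = 2,
    .minimum_version_id_old = 1,
    .has_load_state_old = 1,
};
```

With toy_example, incoming versions 2 and 3 go to VMState, version 1
goes to the old loader, and versions 0 and 4 are rejected.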

=== Massaging functions ===

Sometimes it is not enough to be able to save the state directly
from one structure; we need to fill in the correct values there first. One
example is when we are using kvm. Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using. And the
opposite when we are loading the state: we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- int (*pre_load)(void *opaque);

  This function is called before we load the state of one device.

- int (*post_load)(void *opaque, int version_id);

  This function is called after we load the state of one device.

- void (*pre_save)(void *opaque);

  This function is called before we save the state of one device.

Example: You can look at hpet.c, which uses the three functions to
massage the state that is transferred.
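
The order in which the three hooks run around a save and a load can be
sketched with a toy device. Nothing here is taken from hpet.c or from
the kvm code: ToyCpu, its "live_reg" field (standing in for state held
outside the saved structure) and all toy_* functions are hypothetical.
Only the three hook signatures come from the list above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device whose "live" value sits outside the saved field,
 * e.g. in an accelerator such as kvm. */
typedef struct ToyCpu {
    uint32_t saved_reg;       /* the field that goes into the stream */
    uint32_t live_reg;        /* stand-in for state held elsewhere */
    int      pre_load_calls;  /* counts pre_load invocations */
} ToyCpu;

/* pre_save: pull the live state into the struct before it is saved. */
static void toy_pre_save(void *opaque)
{
    ToyCpu *c = opaque;
    c->saved_reg = c->live_reg;
}

/* pre_load: runs before any field is read from the stream. */
static int toy_pre_load(void *opaque)
{
    ToyCpu *c = opaque;
    c->pre_load_calls++;
    return 0;
}

/* post_load: push the freshly loaded value back to where it is used. */
static int toy_post_load(void *opaque, int version_id)
{
    ToyCpu *c = opaque;
    (void)version_id;         /* unused in this toy example */
    c->live_reg = c->saved_reg;
    return 0;
}

/* Toy one-field "save": hook first, then emit the field. */
static uint32_t toy_cpu_save(ToyCpu *c)
{
    toy_pre_save(c);
    return c->saved_reg;
}

/* Toy one-field "load": pre_load, read the field, then post_load. */
static int toy_cpu_load(ToyCpu *c, uint32_t stream, int version_id)
{
    int ret = toy_pre_load(c);
    if (ret != 0) {
        return ret;
    }
    c->saved_reg = stream;
    return toy_post_load(c, version_id);
}

/* Move live_reg from one ToyCpu to another; returns 1 on success. */
int toy_hooks_demo(void)
{
    ToyCpu src = { .live_reg = 0xabcd };
    ToyCpu dst = { 0 };

    uint32_t stream = toy_cpu_save(&src);
    int ret = toy_cpu_load(&dst, stream, 1);
    return ret == 0 && dst.live_reg == 0xabcd && dst.pre_load_calls == 1;
}
```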

=== Subsections ===

The use of version_id allows us to migrate from older versions
to newer versions of a device, but not the other way around. This
makes it very complicated to fix bugs in stable branches. If we need to
add anything to the state to fix a bug, we have to disable migration
to older versions that don't have that bug-fix (i.e. a new field).

But sometimes that bug-fix is only needed in some situations, not
always. For instance, only if the device is in the middle of a DMA
operation, is using a specific functionality, ....

It is impossible to create a way to make migration from any version to
any other version work. But we can do better than only allowing
migration from older versions to newer ones. For those fields that are
only needed sometimes, we add the idea of subsections. A subsection
is "like" a device vmstate, but with a particularity: it has a Boolean
function that tells whether its values need to be sent or not. If
this function returns false, the subsection is not sent.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration. If we understand all
the subsections, then we load the state with success.

One important note is that the post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.

Example:

static bool ide_drive_pio_state_needed(void *opaque)
{
    IDEState *s = opaque;

    return (s->status & DRQ_STAT) != 0;
}

const VMStateDescription vmstate_ide_drive_pio_state = {
    .name = "ide_drive/pio_state",
    .version_id = 1,
    .minimum_version_id = 1,
    .minimum_version_id_old = 1,
    .pre_save = ide_drive_pio_pre_save,
    .post_load = ide_drive_pio_post_load,
    .fields = (VMStateField []) {
        VMSTATE_INT32(req_nb_sectors, IDEState),
        VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                             vmstate_info_uint8, uint8_t),
        VMSTATE_INT32(cur_io_buffer_offset, IDEState),
        VMSTATE_INT32(cur_io_buffer_len, IDEState),
        VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
        VMSTATE_INT32(elementary_transfer_size, IDEState),
        VMSTATE_INT32(packet_transfer_size, IDEState),
        VMSTATE_END_OF_LIST()
    }
};

const VMStateDescription vmstate_ide_drive = {
    .name = "ide_drive",
    .version_id = 3,
    .minimum_version_id = 0,
    .minimum_version_id_old = 0,
    .post_load = ide_drive_post_load,
    .fields = (VMStateField []) {
        .... several fields ....
        VMSTATE_END_OF_LIST()
    },
    .subsections = (VMStateSubsection []) {
        {
            .vmsd = &vmstate_ide_drive_pio_state,
            .needed = ide_drive_pio_state_needed,
        }, {
            /* empty */
        }
    }
};

Here we have a subsection for the pio state. We only need to
save/send this state when we are in the middle of a pio operation
(that is what ide_drive_pio_state_needed() checks). If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.
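
The sending-side decision can be sketched in isolation as follows. This
is a toy model rather than the VMState code: ToyIDE, ToySubsection and
the toy_* functions are hypothetical, and the TOY_DRQ_STAT bit value is
chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DRQ_STAT 0x08   /* hypothetical bit value, for illustration only */

/* Hypothetical device state with one optional piece of pio state. */
typedef struct ToyIDE {
    uint8_t status;
    int32_t req_nb_sectors;
} ToyIDE;

/* Hypothetical subsection descriptor: a name plus a needed() predicate. */
typedef struct ToySubsection {
    const char *name;
    bool (*needed)(void *opaque);
} ToySubsection;

/* The pio state matters only while a pio transfer is pending. */
static bool toy_pio_state_needed(void *opaque)
{
    ToyIDE *s = opaque;
    return (s->status & TOY_DRQ_STAT) != 0;
}

static const ToySubsection toy_subsections[] = {
    { "ide_drive/pio_state", toy_pio_state_needed },
    { NULL, NULL }   /* empty terminator */
};

/* Count the subsections that would be sent for this device state. */
static int toy_count_sent(const ToySubsection *subs, void *opaque)
{
    int sent = 0;
    for (; subs->name != NULL; subs++) {
        if (subs->needed(opaque)) {
            sent++;
        }
    }
    return sent;
}

/* How many subsections are sent for a device with the given status? */
int toy_subsection_demo(int status)
{
    ToyIDE s = { .status = (uint8_t)status, .req_nb_sectors = 0 };
    return toy_count_sent(toy_subsections, &s);
}
```

A destination that predates the subsection can still accept the stream
whenever needed() returned false on the source, since the subsection is
then simply absent.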
Also available in: Unified diff