root / doc / design-hotplug.rst @ 0565f862
1 | 3798b89a | Dimitris Aragiorgis | ======= |
---|---|---|---|
2 | 3798b89a | Dimitris Aragiorgis | Hotplug |
3 | 3798b89a | Dimitris Aragiorgis | ======= |
4 | 3798b89a | Dimitris Aragiorgis | |
5 | 3798b89a | Dimitris Aragiorgis | .. contents:: :depth: 4 |
6 | 3798b89a | Dimitris Aragiorgis | |
7 | 3798b89a | Dimitris Aragiorgis | This is a design document detailing the implementation of device |
8 | 3798b89a | Dimitris Aragiorgis | hotplugging in Ganeti. The logic used is hypervisor-agnostic, but |
9 | 3798b89a | Dimitris Aragiorgis | the initial implementation will target the KVM hypervisor. The |
10 | 3798b89a | Dimitris Aragiorgis | implementation adds ``python-fdsend`` as a new dependency. If it |
11 | 3798b89a | Dimitris Aragiorgis | is not installed, hotplug will not be possible and the user will |
12 | 3798b89a | Dimitris Aragiorgis | be notified with a warning. |
13 | 3798b89a | Dimitris Aragiorgis | |
14 | 3798b89a | Dimitris Aragiorgis | |
15 | 3798b89a | Dimitris Aragiorgis | Current state and shortcomings |
16 | 3798b89a | Dimitris Aragiorgis | ============================== |
17 | 3798b89a | Dimitris Aragiorgis | |
18 | 3798b89a | Dimitris Aragiorgis | Currently, Ganeti supports addition/removal/modification of devices |
19 | 3798b89a | Dimitris Aragiorgis | (NICs, disks), but the actual modification takes place only after |
20 | 3798b89a | Dimitris Aragiorgis | rebooting the instance. As a result, an instance cannot change network, |
21 | 3798b89a | Dimitris Aragiorgis | get a new disk, etc. without a hard reboot. |
22 | 3798b89a | Dimitris Aragiorgis | |
23 | 3798b89a | Dimitris Aragiorgis | Until now, in the case of the KVM hypervisor, the code neither names |
24 | 3798b89a | Dimitris Aragiorgis | devices nor places them in specific PCI slots. Devices are appended to |
25 | 3798b89a | Dimitris Aragiorgis | the KVM command and Ganeti lets KVM decide where to place them. This |
26 | 3798b89a | Dimitris Aragiorgis | means that a device residing in PCI slot 5 may, after a reboot (due to |
27 | 3798b89a | Dimitris Aragiorgis | the removal of another device), be moved to another PCI slot and |
28 | 3798b89a | Dimitris Aragiorgis | probably get renamed too (due to udev rules, etc.). |
29 | 3798b89a | Dimitris Aragiorgis | |
30 | 3798b89a | Dimitris Aragiorgis | In order for a migration to succeed, the process on the target node |
31 | 3798b89a | Dimitris Aragiorgis | should be started with exactly the same machine version, CPU |
32 | 3798b89a | Dimitris Aragiorgis | architecture and PCI configuration as the running process. During |
33 | 3798b89a | Dimitris Aragiorgis | instance creation/startup Ganeti creates a KVM runtime file with all the |
34 | 3798b89a | Dimitris Aragiorgis | necessary information to generate the KVM command. This runtime file is |
35 | 3798b89a | Dimitris Aragiorgis | used during instance migration to start a new identical KVM process. The |
36 | 3798b89a | Dimitris Aragiorgis | current format includes the fixed part of the final KVM command, a list |
37 | 3798b89a | Dimitris Aragiorgis | of NICs, and the hvparams dict. It does not allow easy manipulation |
38 | 3798b89a | Dimitris Aragiorgis | of disks, because they are encapsulated in the fixed KVM |
39 | 3798b89a | Dimitris Aragiorgis | command. |
40 | 3798b89a | Dimitris Aragiorgis | |
41 | 3798b89a | Dimitris Aragiorgis | |
42 | 3798b89a | Dimitris Aragiorgis | Proposed changes |
43 | 3798b89a | Dimitris Aragiorgis | ================ |
44 | 3798b89a | Dimitris Aragiorgis | |
45 | 3798b89a | Dimitris Aragiorgis | For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the |
46 | 3798b89a | Dimitris Aragiorgis | instance. Disks and NICs occupy some of these slots. Recent versions of |
47 | 3798b89a | Dimitris Aragiorgis | QEMU have introduced monitor commands that allow addition/removal of PCI |
48 | 3798b89a | Dimitris Aragiorgis | devices. Devices are referenced based on their name or position on the |
49 | 3798b89a | Dimitris Aragiorgis | virtual PCI bus. To be able to use these commands, we need to be able to |
50 | 3798b89a | Dimitris Aragiorgis | assign each device a unique name. |
51 | 3798b89a | Dimitris Aragiorgis | |
52 | 3798b89a | Dimitris Aragiorgis | To keep track of where each device is plugged in, we add the |
53 | 3798b89a | Dimitris Aragiorgis | ``pci`` slot to Disk and NIC objects, but we save it only in runtime |
54 | 3798b89a | Dimitris Aragiorgis | files, since it is hypervisor-specific info. This is added for easy |
55 | 3798b89a | Dimitris Aragiorgis | object manipulation and is never written back to the config. |
56 | 3798b89a | Dimitris Aragiorgis | |
57 | 3798b89a | Dimitris Aragiorgis | We propose to make use of QEMU 1.0 monitor commands so that |
58 | 3798b89a | Dimitris Aragiorgis | modifications to devices take effect instantly without the need for a hard |
59 | 3798b89a | Dimitris Aragiorgis | reboot. The only change exposed to the end-user will be the addition of |
60 | 3798b89a | Dimitris Aragiorgis | a ``--hotplug`` option to the ``gnt-instance modify`` command. |
61 | 3798b89a | Dimitris Aragiorgis | |
62 | 3798b89a | Dimitris Aragiorgis | Upon hotplugging, the PCI configuration of an instance changes, and |
63 | 3798b89a | Dimitris Aragiorgis | runtime files should be updated correspondingly. Currently this is |
64 | 3798b89a | Dimitris Aragiorgis | impossible in the case of disk hotplug because disks are included in the |
65 | 3798b89a | Dimitris Aragiorgis | command line entry of the runtime file, unlike NICs, which are correctly |
66 | 3798b89a | Dimitris Aragiorgis | treated separately. We therefore change the format of runtime files: we |
67 | 3798b89a | Dimitris Aragiorgis | remove disks from the fixed KVM command and create a new entry containing |
68 | 3798b89a | Dimitris Aragiorgis | only them. KVM options concerning disks are generated during |
69 | 3798b89a | Dimitris Aragiorgis | ``_ExecuteKVMCommand()``, just like those for NICs. |
70 | 3798b89a | Dimitris Aragiorgis | |
71 | 3798b89a | Dimitris Aragiorgis | Design decisions |
72 | 3798b89a | Dimitris Aragiorgis | ================ |
73 | 3798b89a | Dimitris Aragiorgis | |
74 | 3798b89a | Dimitris Aragiorgis | What should each device ID be? Currently KVM does not support arbitrary |
75 | 3798b89a | Dimitris Aragiorgis | IDs for devices; only names that start with a letter, are at most 32 |
76 | 3798b89a | Dimitris Aragiorgis | characters long, and include no special characters other than '.', '_' |
77 | 3798b89a | Dimitris Aragiorgis | and '-' are supported. For debugging purposes and to be more informative, |
78 | 3798b89a | Dimitris Aragiorgis | devices will be named: <device type>-<part of uuid>-pci-<slot>. |
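As an illustration of the naming scheme above, the following Python sketch builds a device ID and checks it against the stated constraints. The function name, the regular expression and the choice of the first 8 characters of the UUID are assumptions made for this example; the design does not fix them.

```python
import re

# Constraints quoted in the text: the ID starts with a letter, is at most
# 32 characters long, and uses only '.', '_' and '-' as special characters.
_VALID_ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9._-]{0,31}")


def generate_device_id(dev_type, uuid, pci_slot, uuid_chars=8):
    """Build an ID of the form <device type>-<part of uuid>-pci-<slot>.

    uuid_chars (how much of the UUID to keep) is a hypothetical parameter.
    """
    dev_id = "%s-%s-pci-%d" % (dev_type, uuid[:uuid_chars], pci_slot)
    if _VALID_ID_RE.fullmatch(dev_id) is None:
        raise ValueError("invalid device ID: %r" % dev_id)
    return dev_id


if __name__ == "__main__":
    print(generate_device_id("nic", "9e7c85f6-b6e5-4243-b7ea-cbb8b2d2b4a4", 6))
```

Embedding the slot in the name makes the ID self-describing in monitor output, at the cost of the name changing whenever the device is re-plugged into a different slot.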
79 | 3798b89a | Dimitris Aragiorgis | |
80 | 3798b89a | Dimitris Aragiorgis | Who decides where to hotplug each device? Since this is a |
81 | 3798b89a | Dimitris Aragiorgis | hypervisor-specific matter, there is no point in the master node |
82 | 3798b89a | Dimitris Aragiorgis | deciding such a thing. The master node just has to ask noded to hotplug a |
83 | 3798b89a | Dimitris Aragiorgis | device. To this end, hypervisor-specific code should parse the current |
84 | 3798b89a | Dimitris Aragiorgis | PCI configuration (i.e. the output of the ``info pci`` QEMU monitor |
85 | 3798b89a | Dimitris Aragiorgis | command), find the first available slot and hotplug the device. By having |
86 | 3798b89a | Dimitris Aragiorgis | noded decide where to hotplug a device, we ensure that no error will occur |
87 | 3798b89a | Dimitris Aragiorgis | due to duplicate slot assignment (if masterd kept track of PCI |
88 | 3798b89a | Dimitris Aragiorgis | reservations and noded failed to return the PCI slot that the device was |
89 | 3798b89a | Dimitris Aragiorgis | plugged into, the next hotplug would fail). |
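The slot selection described above can be sketched as follows. This is a minimal example, not the Ganeti implementation: the function names, the regular expression and the sample text are assumptions, and the sample only imitates the ``Bus N, device N, function N:`` header lines that the monitor prints for ``info pci``.

```python
import re

# Assumed layout of the header line printed for each PCI function.
_PCI_DEVICE_RE = re.compile(
    r"^\s*Bus\s+(\d+), device\s+(\d+), function\s+(\d+):", re.M)

# The text states that QEMU exposes 32 PCI slots to the instance.
MAX_PCI_SLOTS = 32


def occupied_slots(info_pci_output, bus=0):
    """Return the set of slot numbers in use on the given bus."""
    return {int(m.group(2))
            for m in _PCI_DEVICE_RE.finditer(info_pci_output)
            if int(m.group(1)) == bus}


def first_free_slot(info_pci_output):
    """Return the lowest unused slot, or raise if the bus is full."""
    used = occupied_slots(info_pci_output)
    for slot in range(MAX_PCI_SLOTS):
        if slot not in used:
            return slot
    raise RuntimeError("no free PCI slot available")


# Illustrative monitor output (hypothetical instance with one NIC).
SAMPLE_INFO_PCI = """\
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
  Bus  0, device   1, function 1:
    IDE controller: PCI device 8086:7010
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1013:00b8
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 1af4:1000
"""

if __name__ == "__main__":
    print(first_free_slot(SAMPLE_INFO_PCI))
```

Because the slot is computed from the live monitor state at hotplug time, no reservation table has to be kept in sync between masterd and noded.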
90 | 3798b89a | Dimitris Aragiorgis | |
91 | 3798b89a | Dimitris Aragiorgis | Where should we keep track of devices' PCI slots? As already mentioned, |
92 | 3798b89a | Dimitris Aragiorgis | we must keep track of devices' PCI slots to successfully migrate |
93 | 3798b89a | Dimitris Aragiorgis | instances. The first option is to save this info in config data, which |
94 | 3798b89a | Dimitris Aragiorgis | would allow us to place each device at the same PCI slot after reboot. This |
95 | 3798b89a | Dimitris Aragiorgis | would require making the hypervisor return the PCI slot chosen for each |
96 | 3798b89a | Dimitris Aragiorgis | device and storing this information in config data. Additionally, the |
97 | 3798b89a | Dimitris Aragiorgis | whole instance configuration should be returned with PCI slots filled |
98 | 3798b89a | Dimitris Aragiorgis | after instance start and each instance should keep track of current PCI |
99 | 3798b89a | Dimitris Aragiorgis | reservations. We decide not to go in this direction, in order to keep |
100 | 3798b89a | Dimitris Aragiorgis | things simple and to avoid adding hypervisor-specific info to configuration |
101 | 3798b89a | Dimitris Aragiorgis | data (``pci_reservations`` at instance level and ``pci`` at device |
102 | 3798b89a | Dimitris Aragiorgis | level). For the aforementioned reason, we decide to store this info only |
103 | 3798b89a | Dimitris Aragiorgis | in KVM runtime files. |
104 | 3798b89a | Dimitris Aragiorgis | |
105 | 3798b89a | Dimitris Aragiorgis | Where to place the devices upon instance startup? QEMU has by default 4 |
106 | 3798b89a | Dimitris Aragiorgis | pre-occupied PCI slots, so the hypervisor can use the remaining ones for |
107 | 3798b89a | Dimitris Aragiorgis | disks and NICs. Currently, the PCI configuration is not preserved after |
108 | 3798b89a | Dimitris Aragiorgis | reboot. Each time an instance starts, KVM assigns PCI slots to devices |
109 | 3798b89a | Dimitris Aragiorgis | based on their ordering in Ganeti configuration, i.e. the second disk |
110 | 3798b89a | Dimitris Aragiorgis | will be placed after the first, the third NIC after the second, etc. |
111 | 3798b89a | Dimitris Aragiorgis | Since we decided not to keep track of devices' PCI slots in the |
112 | 3798b89a | Dimitris Aragiorgis | configuration, there is no need to change the current functionality. |
113 | 3798b89a | Dimitris Aragiorgis | |
114 | 3798b89a | Dimitris Aragiorgis | How to deal with existing instances? Hotplug depends on runtime file |
115 | 3798b89a | Dimitris Aragiorgis | manipulation. The runtime file stores PCI info and every device the KVM process |
116 | 3798b89a | Dimitris Aragiorgis | is currently using. Existing files have no PCI info in devices and have block |
117 | 3798b89a | Dimitris Aragiorgis | devices encapsulated inside the kvm_cmd entry. Thus hotplugging of existing |
118 | 3798b89a | Dimitris Aragiorgis | devices will not be possible. Still, migration and hotplugging of new devices |
119 | 3798b89a | Dimitris Aragiorgis | will succeed. The workaround is applied upon loading the KVM runtime: if we |
120 | 3798b89a | Dimitris Aragiorgis | detect the old-style format, we add an empty list for block devices, and upon |
121 | 3798b89a | Dimitris Aragiorgis | saving the KVM runtime we include this empty list as well. Switching entirely |
122 | 3798b89a | Dimitris Aragiorgis | to the new format will happen upon instance reboot. |
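The compatibility workaround can be illustrated with a short sketch. It assumes, purely for the example, that the runtime is serialized as a JSON list whose old form has three entries (KVM command, NICs, hvparams) and whose new form has a fourth entry for block devices. The function names and the on-disk layout are hypothetical.

```python
import json


def load_runtime(serialized):
    """Parse a runtime blob, upgrading the old three-entry format in memory."""
    loaded = json.loads(serialized)
    if len(loaded) == 3:
        # Old-style file: disks are still embedded in kvm_cmd, so the
        # separate block device list starts out empty.
        kvm_cmd, nics, hvparams = loaded
        block_devices = []
    else:
        kvm_cmd, nics, hvparams, block_devices = loaded
    return kvm_cmd, nics, hvparams, block_devices


def save_runtime(kvm_cmd, nics, hvparams, block_devices):
    """Always write the new four-entry format, even if the list is empty."""
    return json.dumps([kvm_cmd, nics, hvparams, block_devices])


if __name__ == "__main__":
    old = json.dumps([["kvm", "-m", "512"], [], {"acpi": True}])
    print(save_runtime(*load_runtime(old)))
```

With this approach an old-style instance keeps its original disks inside the fixed command until the next reboot, while newly hotplugged disks land in the separate list.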
123 | 3798b89a | Dimitris Aragiorgis | |
124 | 3798b89a | Dimitris Aragiorgis | |
125 | 3798b89a | Dimitris Aragiorgis | Configuration changes |
126 | 3798b89a | Dimitris Aragiorgis | --------------------- |
127 | 3798b89a | Dimitris Aragiorgis | |
128 | 3798b89a | Dimitris Aragiorgis | The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It refers to |
129 | 3798b89a | Dimitris Aragiorgis | the PCI slot that the device gets plugged into. |
130 | 3798b89a | Dimitris Aragiorgis | |
131 | 3798b89a | Dimitris Aragiorgis | In order to be able to live migrate successfully, runtime files should |
132 | 3798b89a | Dimitris Aragiorgis | be updated every time a live modification (hotplug) takes place. To this |
133 | 3798b89a | Dimitris Aragiorgis | end we change the format of runtime files. The KVM options referring to |
134 | 3798b89a | Dimitris Aragiorgis | the instance's disks are no longer recorded as part of the KVM command |
135 | 3798b89a | Dimitris Aragiorgis | line. Disks are treated separately, just as we treat NICs right now. We |
136 | 3798b89a | Dimitris Aragiorgis | insert and remove entries to reflect the current PCI configuration. |
137 | 3798b89a | Dimitris Aragiorgis | |
138 | 3798b89a | Dimitris Aragiorgis | |
139 | 3798b89a | Dimitris Aragiorgis | Backend changes |
140 | 3798b89a | Dimitris Aragiorgis | --------------- |
141 | 3798b89a | Dimitris Aragiorgis | |
142 | 3798b89a | Dimitris Aragiorgis | Introduce one new RPC call: |
143 | 3798b89a | Dimitris Aragiorgis | |
144 | 3798b89a | Dimitris Aragiorgis | - hotplug_device(DEVICE_TYPE, ACTION, device, ...) |
145 | 3798b89a | Dimitris Aragiorgis | |
146 | 3798b89a | Dimitris Aragiorgis | where DEVICE_TYPE can be either NIC or Disk, and ACTION either REMOVE or ADD. |
147 | 3798b89a | Dimitris Aragiorgis | |
148 | 3798b89a | Dimitris Aragiorgis | Hypervisor changes |
149 | 3798b89a | Dimitris Aragiorgis | ------------------ |
150 | 3798b89a | Dimitris Aragiorgis | |
151 | 3798b89a | Dimitris Aragiorgis | We implement hotplug on top of the KVM hypervisor. We take advantage of |
152 | 3798b89a | Dimitris Aragiorgis | QEMU 1.0 monitor commands (``device_add``, ``device_del``, |
153 | 3798b89a | Dimitris Aragiorgis | ``drive_add``, ``drive_del``, ``netdev_add``, ``netdev_del``). QEMU |
154 | 3798b89a | Dimitris Aragiorgis | refers to devices based on their id. We use ``uuid`` to name them |
155 | 3798b89a | Dimitris Aragiorgis | properly. If a device is about to be hotplugged, we parse the output of |
156 | 3798b89a | Dimitris Aragiorgis | ``info pci`` and find the occupied PCI slots. We choose the first |
157 | 3798b89a | Dimitris Aragiorgis | available slot, and the whole device object is appended to the |
158 | 3798b89a | Dimitris Aragiorgis | corresponding entry in the runtime file. |
159 | 3798b89a | Dimitris Aragiorgis | |
160 | 3798b89a | Dimitris Aragiorgis | Concerning NIC handling, we build on top of the existing logic |
161 | 3798b89a | Dimitris Aragiorgis | (first create a tap with ``_OpenTap()`` and then pass its file descriptor |
162 | 3798b89a | Dimitris Aragiorgis | to the KVM process). To this end we need to pass access rights to the |
163 | 3798b89a | Dimitris Aragiorgis | corresponding file descriptor over the monitor socket (UNIX domain |
164 | 3798b89a | Dimitris Aragiorgis | socket). The open file is passed as a socket-level control message |
165 | 3798b89a | Dimitris Aragiorgis | (SCM), using the ``fdsend`` Python library. |
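To show what passing a file descriptor as a socket-level control message involves, here is a minimal sketch that uses the Python 3 standard library ``socket`` module instead of ``fdsend``. It is a stand-in for illustration only: the helper names are hypothetical, a socket pair plays the role of the monitor socket, and a pipe plays the role of the tap device.

```python
import array
import os
import socket


def send_fd(sock, fd, payload=b"x"):
    """Send payload plus one file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock, bufsize=1024):
    """Receive a message and return (payload, received file descriptor)."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any truncated trailing bytes.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    return data, fds[0]


if __name__ == "__main__":
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    send_fd(left, write_end)
    _payload, received = recv_fd(right)
    # The receiver now holds its own descriptor for the same open file.
    os.write(received, b"ping")
    print(os.read(read_end, 4))
```

The receiving process gets a new descriptor number that refers to the same open file, which is why the KVM process does not need permission to open the tap device itself.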
166 | 3798b89a | Dimitris Aragiorgis | |
167 | 3798b89a | Dimitris Aragiorgis | |
168 | 3798b89a | Dimitris Aragiorgis | User interface |
169 | 3798b89a | Dimitris Aragiorgis | -------------- |
170 | 3798b89a | Dimitris Aragiorgis | |
171 | 3798b89a | Dimitris Aragiorgis | A new ``--hotplug`` option to ``gnt-instance modify`` is introduced, which |
172 | 3798b89a | Dimitris Aragiorgis | forces live modifications. |
173 | 3798b89a | Dimitris Aragiorgis | |
174 | 3798b89a | Dimitris Aragiorgis | |
175 | 3798b89a | Dimitris Aragiorgis | Enabling hotplug |
176 | 3798b89a | Dimitris Aragiorgis | ++++++++++++++++ |
177 | 3798b89a | Dimitris Aragiorgis | |
178 | 3798b89a | Dimitris Aragiorgis | Hotplug will be optional during ``gnt-instance modify``. For existing |
179 | 3798b89a | Dimitris Aragiorgis | instances, after installing a version that supports hotplugging we |
180 | 3798b89a | Dimitris Aragiorgis | have the restriction that hotplug will not be supported for existing |
181 | 3798b89a | Dimitris Aragiorgis | devices. The reason is that old runtime files lack: |
182 | 3798b89a | Dimitris Aragiorgis | |
183 | 3798b89a | Dimitris Aragiorgis | 1. Device PCI configuration info. |
184 | 3798b89a | Dimitris Aragiorgis | |
185 | 3798b89a | Dimitris Aragiorgis | 2. A separate block device entry. |
186 | 3798b89a | Dimitris Aragiorgis | |
187 | 3798b89a | Dimitris Aragiorgis | Hotplug will be supported only for KVM in the first implementation. For |
188 | 3798b89a | Dimitris Aragiorgis | all other hypervisors, the backend will raise an exception if hotplug is |
189 | 3798b89a | Dimitris Aragiorgis | requested. |
190 | 3798b89a | Dimitris Aragiorgis | |
191 | 3798b89a | Dimitris Aragiorgis | |
192 | 3798b89a | Dimitris Aragiorgis | NIC Hotplug |
193 | 3798b89a | Dimitris Aragiorgis | +++++++++++ |
194 | 3798b89a | Dimitris Aragiorgis | |
195 | 3798b89a | Dimitris Aragiorgis | The user can add/modify/remove NICs with or without hotplugging. If a |
196 | 3798b89a | Dimitris Aragiorgis | NIC is to be added, a tap is created first and configured properly with |
197 | 3798b89a | Dimitris Aragiorgis | the ``kvm-vif-bridge`` script. Then the instance gets a new network |
198 | 3798b89a | Dimitris Aragiorgis | interface. Since there is no QEMU monitor command to modify a NIC, we |
199 | 3798b89a | Dimitris Aragiorgis | modify a NIC by temporarily removing the existing one and adding a new |
200 | 3798b89a | Dimitris Aragiorgis | one with the new configuration. When removing a NIC, the corresponding |
201 | 3798b89a | Dimitris Aragiorgis | tap gets removed as well. |
202 | 3798b89a | Dimitris Aragiorgis | |
203 | 3798b89a | Dimitris Aragiorgis | :: |
204 | 3798b89a | Dimitris Aragiorgis | |
205 | 3798b89a | Dimitris Aragiorgis | gnt-instance modify --net add --hotplug test |
206 | 3798b89a | Dimitris Aragiorgis | gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test |
207 | 3798b89a | Dimitris Aragiorgis | gnt-instance modify --net 1:remove --hotplug test |
208 | 3798b89a | Dimitris Aragiorgis | |
209 | 3798b89a | Dimitris Aragiorgis | |
210 | 3798b89a | Dimitris Aragiorgis | Disk Hotplug |
211 | 3798b89a | Dimitris Aragiorgis | ++++++++++++ |
212 | 3798b89a | Dimitris Aragiorgis | |
213 | 3798b89a | Dimitris Aragiorgis | The user can add and remove disks with or without hotplugging. The QEMU |
214 | 3798b89a | Dimitris Aragiorgis | monitor supports resizing of disks; however, the initial implementation |
215 | 3798b89a | Dimitris Aragiorgis | will support only disk addition/deletion. |
216 | 3798b89a | Dimitris Aragiorgis | |
217 | 3798b89a | Dimitris Aragiorgis | :: |
218 | 3798b89a | Dimitris Aragiorgis | |
219 | 3798b89a | Dimitris Aragiorgis | gnt-instance modify --disk add:size=1G --hotplug test |
220 | 3798b89a | Dimitris Aragiorgis | gnt-instance modify --disk 1:remove --hotplug test |
221 | 3798b89a | Dimitris Aragiorgis | |
222 | 3798b89a | Dimitris Aragiorgis | |
223 | 3798b89a | Dimitris Aragiorgis | Dealing with chroot and uid pool |
224 | 3798b89a | Dimitris Aragiorgis | -------------------------------- |
225 | 3798b89a | Dimitris Aragiorgis | |
226 | 3798b89a | Dimitris Aragiorgis | The design so far covers all the issues that arise, except for the |
227 | 3798b89a | Dimitris Aragiorgis | case where the KVM process does not run with root privileges. |
228 | 3798b89a | Dimitris Aragiorgis | Specifically: |
229 | 3798b89a | Dimitris Aragiorgis | |
230 | 3798b89a | Dimitris Aragiorgis | - in case of chroot, the kvm process cannot see the newly created device |
231 | 3798b89a | Dimitris Aragiorgis | |
232 | 3798b89a | Dimitris Aragiorgis | - in case of uid pool security model, the kvm process is not allowed |
233 | 3798b89a | Dimitris Aragiorgis | to access the device |
234 | 3798b89a | Dimitris Aragiorgis | |
235 | 3798b89a | Dimitris Aragiorgis | For NIC hotplug we address this problem by using the ``getfd`` monitor |
236 | 3798b89a | Dimitris Aragiorgis | command and passing the file descriptor to the kvm process over the |
237 | 3798b89a | Dimitris Aragiorgis | monitor socket using SCM_RIGHTS. For disk hotplug and in case of uid |
238 | 3798b89a | Dimitris Aragiorgis | pool we can let the hypervisor code temporarily ``chown()`` the device |
239 | 3798b89a | Dimitris Aragiorgis | before the actual hotplug. Still, this is insufficient in case of chroot. |
240 | 3798b89a | Dimitris Aragiorgis | In this case, we need to ``mknod()`` the device inside the chroot. Both |
241 | 3798b89a | Dimitris Aragiorgis | workarounds can be avoided if we make use of the ``add-fd`` QEMU |
242 | 3798b89a | Dimitris Aragiorgis | monitor command, which was introduced in version 1.3. This command is the |
243 | 3798b89a | Dimitris Aragiorgis | equivalent of NICs' ``getfd`` for disks and will allow disk hotplug in |
244 | 3798b89a | Dimitris Aragiorgis | every case. So, if the QEMU monitor does not support the ``add-fd`` |
245 | 3798b89a | Dimitris Aragiorgis | command, we will not allow disk hotplug for the chroot and uid pool |
246 | 3798b89a | Dimitris Aragiorgis | security models and will notify the user with a corresponding warning. |
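The resulting policy for disk hotplug can be summarized in a few lines of Python. This is only a sketch of the rule stated above: the function name, its parameters and the ``"pool"`` string used to denote the uid pool security model are assumptions made for the example.

```python
def check_disk_hotplug(use_chroot, security_model, monitor_has_add_fd):
    """Return (allowed, warning) for a disk hotplug request.

    Disk hotplug is refused only when the KVM process is restricted
    (chroot or uid pool) and the monitor lacks the add-fd command.
    """
    restricted = use_chroot or security_model == "pool"
    if restricted and not monitor_has_add_fd:
        return (False,
                "disk hotplug requires the add-fd monitor command when "
                "chroot or the uid pool security model is in use")
    return (True, None)


if __name__ == "__main__":
    print(check_disk_hotplug(True, "none", False))
    print(check_disk_hotplug(False, "pool", True))
```

Returning the warning alongside the decision lets the caller surface it to the user instead of failing silently.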
247 | 3798b89a | Dimitris Aragiorgis | |
248 | 3798b89a | Dimitris Aragiorgis | .. vim: set textwidth=72 : |
249 | 3798b89a | Dimitris Aragiorgis | .. Local Variables: |
250 | 3798b89a | Dimitris Aragiorgis | .. mode: rst |
251 | 3798b89a | Dimitris Aragiorgis | .. fill-column: 72 |
252 | 3798b89a | Dimitris Aragiorgis | .. End: |