=======
Hotplug
=======

.. contents:: :depth: 4

This is a design document detailing the implementation of device
hotplugging in Ganeti. The logic used is hypervisor agnostic, but the
initial implementation will target only the KVM hypervisor. The
implementation adds ``python-fdsend`` as a new dependency. If it is not
installed, hotplug will not be possible and the user will be notified
with a warning.


Current state and shortcomings
==============================

Currently, Ganeti supports addition/removal/modification of devices
(NICs, disks), but the actual modification takes place only after
rebooting the instance. As a result, an instance cannot change network,
get a new disk etc. without a hard reboot.

Until now, in the case of the KVM hypervisor, the code neither names
devices nor places them in specific PCI slots. Devices are appended to
the KVM command line and Ganeti lets KVM decide where to place them.
This means that a device that resides in PCI slot 5 may, after a reboot
(for example due to the removal of another device), be moved to another
PCI slot and probably get renamed too (due to udev rules, etc.).

In order for a migration to succeed, the process on the target node
should be started with exactly the same machine version, CPU
architecture and PCI configuration as the running process. During
instance creation/startup Ganeti creates a KVM runtime file with all the
necessary information to generate the KVM command. This runtime file is
used during instance migration to start a new identical KVM process. The
current format includes the fixed part of the final KVM command, a list
of NICs, and the hvparams dict. It does not favor easy manipulation of
disks, because they are encapsulated in the fixed KVM command.


Proposed changes
================

For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the
instance. Disks and NICs occupy some of these slots. Recent versions of
QEMU have introduced monitor commands that allow addition/removal of PCI
devices. Devices are referenced based on their name or position on the
virtual PCI bus. To be able to use these commands, we need to be able to
assign each device a unique name.

To keep track of where each device is plugged in, we add the ``pci``
slot to Disk and NIC objects, but we save it only in runtime files,
since it is hypervisor specific info. This is added for easy object
manipulation and is guaranteed not to be written back to the config.

We propose to make use of QEMU 1.0 monitor commands so that
modifications to devices take effect instantly without the need for a
hard reboot. The only change exposed to the end-user will be the
addition of a ``--hotplug`` option to the ``gnt-instance modify``
command.

Upon hotplugging, the PCI configuration of an instance changes and the
runtime files should be updated correspondingly. Currently this is
impossible in the case of disk hotplug because disks are included in the
command line entry of the runtime file, contrary to NICs, which are
correctly treated separately. We therefore change the format of runtime
files: we remove disks from the fixed KVM command and create a new entry
containing only them. KVM options concerning disks are generated during
``_ExecuteKVMCommand()``, just like those for NICs.

Design decisions
================

What should each device ID be? Currently KVM does not support arbitrary
IDs for devices; only names that start with a letter, are at most 32
characters long, and contain no special characters other than '.', '_'
and '-' are supported. For debugging purposes and in order to be more
informative, devices will be named after:
``<device type>-<part of uuid>-pci-<slot>``.
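
The naming rule above can be sketched as follows. This is only an
illustration: the function name ``generate_device_id``, the number of
UUID characters kept (8) and the exact regular expression are
assumptions derived from the constraints listed above, not the actual
Ganeti code.

```python
import re

# Assumed limits, taken from the constraints described in the text.
MAX_ID_LEN = 32
VALID_ID_RE = re.compile(r"^[A-Za-z][A-Za-z0-9._-]*$")


def generate_device_id(dev_type, uuid, pci_slot, uuid_chars=8):
    """Build an ID of the form <device type>-<part of uuid>-pci-<slot>.

    uuid_chars is a hypothetical choice for how much of the UUID to keep.
    """
    dev_id = "%s-%s-pci-%d" % (dev_type, uuid[:uuid_chars], pci_slot)
    # Reject anything the hypervisor would not accept as a device name.
    if len(dev_id) > MAX_ID_LEN or not VALID_ID_RE.match(dev_id):
        raise ValueError("invalid device ID: %r" % dev_id)
    return dev_id
```

For example, a NIC whose UUID starts with ``9e7c85f6`` and that sits in
slot 6 would be named ``nic-9e7c85f6-pci-6`` under these assumptions.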

Who decides where to hotplug each device? Since this is a hypervisor
specific matter, there is no point in the master node deciding such a
thing. The master node just has to request noded to hotplug a device. To
this end, hypervisor specific code should parse the current PCI
configuration (i.e. the output of the ``info pci`` QEMU monitor
command), find the first available slot and hotplug the device. By
having noded decide where to hotplug a device, we ensure that no error
will occur due to duplicate slot assignment (if masterd kept track of
PCI reservations and noded failed to return the PCI slot that the device
was plugged into, the next hotplug would fail).

Where should we keep track of devices' PCI slots? As already mentioned,
we must keep track of devices' PCI slots to successfully migrate
instances. The first option is to save this info to config data, which
would allow us to place each device at the same PCI slot after reboot.
This would require making the hypervisor return the PCI slot chosen for
each device, and storing this information in config data. Additionally,
the whole instance configuration would have to be returned with PCI
slots filled in after instance start, and each instance would have to
keep track of its current PCI reservations. We decide not to go in this
direction, in order to keep things simple and not add hypervisor
specific info to configuration data (``pci_reservations`` at instance
level and ``pci`` at device level). For this reason, we decide to store
this info only in KVM runtime files.

Where to place the devices upon instance startup? QEMU has by default 4
pre-occupied PCI slots, so the hypervisor can use the remaining ones for
disks and NICs. Currently, PCI configuration is not preserved after
reboot. Each time an instance starts, KVM assigns PCI slots to devices
based on their ordering in the Ganeti configuration, i.e. the second
disk will be placed after the first, the third NIC after the second,
etc. Since we decided that there is no need to keep track of devices'
PCI slots in the configuration, there is no need to change the current
functionality.

How to deal with existing instances? Hotplug depends on runtime file
manipulation: the runtime file stores the PCI info and every device the
KVM process is currently using. Existing files have no PCI info in
devices and have block devices encapsulated inside the kvm_cmd entry.
Thus hotplugging of existing devices will not be possible. Still,
migration and hotplugging of new devices will succeed. The workaround
will happen upon loading the KVM runtime: if we detect the old style
format we will add an empty list for block devices, and upon saving the
KVM runtime we will include this empty list as well. Switching entirely
to the new format will happen upon instance reboot.
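
The load/save workaround could look roughly like the sketch below. It
assumes, purely for illustration, that the runtime file is a JSON list
whose old layout has three entries (kvm_cmd, NICs, hvparams) and whose
new layout appends a fourth entry for block devices; the function names
are hypothetical.

```python
import json


def load_kvm_runtime(serialized):
    """Return (kvm_cmd, nics, hvparams, block_devices) from a runtime file."""
    loaded = json.loads(serialized)
    if len(loaded) == 3:
        # Old style format: disks are still buried inside kvm_cmd, so the
        # list of separately tracked block devices starts out empty.
        kvm_cmd, nics, hvparams = loaded
        block_devices = []
    else:
        kvm_cmd, nics, hvparams, block_devices = loaded
    return kvm_cmd, nics, hvparams, block_devices


def save_kvm_runtime(kvm_cmd, nics, hvparams, block_devices):
    """Always write the four-entry layout, even if the disk list is empty."""
    return json.dumps([kvm_cmd, nics, hvparams, block_devices])
```

With this approach an old file keeps its original ``kvm_cmd`` untouched,
so migration still reproduces the running process, while newly
hotplugged disks have a place to be recorded.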


Configuration changes
---------------------

The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It refers
to the PCI slot that the device gets plugged into.

In order to be able to live migrate successfully, runtime files should
be updated every time a live modification (hotplug) takes place. To this
end we change the format of runtime files. The KVM options referring to
the instance's disks are no longer recorded as part of the KVM command
line. Disks are treated separately, just as we treat NICs right now. We
insert and remove entries to reflect the current PCI configuration.


Backend changes
---------------

Introduce one new RPC call:

- hotplug_device(DEVICE_TYPE, ACTION, device, ...)

where DEVICE_TYPE can be either NIC or Disk, and ACTION either REMOVE or
ADD.
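
A minimal sketch of how the node side might dispatch such a call is
given below. All class names, constant values and the exception type are
hypothetical; the sketch only illustrates validating DEVICE_TYPE and
ACTION and refusing the request on hypervisors without hotplug support.

```python
# Hypothetical constants; the real values are not given in this document.
ACTION_ADD = "ADD"
ACTION_REMOVE = "REMOVE"
TARGET_NIC = "NIC"
TARGET_DISK = "Disk"


class HotplugError(Exception):
    """Raised when a hotplug request cannot be served."""


class BaseHypervisor(object):
    """Default behaviour: hotplug is not supported."""

    def hotplug_device(self, dev_type, action, device):
        raise HotplugError("hotplug is not supported by this hypervisor")


class KVMHypervisor(BaseHypervisor):
    """Records requests instead of talking to a real QEMU monitor."""

    def __init__(self):
        self.requests = []

    def hotplug_device(self, dev_type, action, device):
        if dev_type not in (TARGET_NIC, TARGET_DISK):
            raise HotplugError("unknown device type: %s" % dev_type)
        if action not in (ACTION_ADD, ACTION_REMOVE):
            raise HotplugError("unknown action: %s" % action)
        self.requests.append((action, dev_type, device))


def hotplug_device(hypervisor, dev_type, action, device):
    """Node-side entry point for the RPC; delegates to the hypervisor."""
    hypervisor.hotplug_device(dev_type, action, device)
```

Keeping the default implementation in the base class means that only
hypervisors that explicitly opt in ever attempt a live modification.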

Hypervisor changes
------------------

We implement hotplug on top of the KVM hypervisor. We take advantage of
QEMU 1.0 monitor commands (``device_add``, ``device_del``,
``drive_add``, ``drive_del``, ``netdev_add``, ``netdev_del``). QEMU
refers to devices based on their id. We use ``uuid`` to name them
properly. If a device is about to be hotplugged we parse the output of
``info pci`` and find the occupied PCI slots. We choose the first
available slot and the whole device object is appended to the
corresponding entry in the runtime file.
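
Choosing the first available slot could be sketched as below. The sample
text only approximates what ``info pci`` prints (a ``Bus N, device N,
function N:`` header per device); the exact layout may differ between
QEMU versions, and the helper names are hypothetical.

```python
import re

PCI_SLOTS = 32  # number of slots QEMU exposes to the instance
PCI_INFO_RE = re.compile(
    r"Bus\s+(\d+),\s+device\s+(\d+),\s+function\s+(\d+):")


def occupied_slots(info_pci_output, bus=0):
    """Collect the device numbers already in use on the given bus."""
    slots = set()
    for match in PCI_INFO_RE.finditer(info_pci_output):
        if int(match.group(1)) == bus:
            slots.add(int(match.group(2)))
    return slots


def first_free_slot(info_pci_output):
    """Return the lowest slot number that no device occupies."""
    used = occupied_slots(info_pci_output)
    for slot in range(PCI_SLOTS):
        if slot not in used:
            return slot
    raise RuntimeError("no free PCI slot")


# Assumed sample of monitor output, for illustration only.
SAMPLE = """\
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
  Bus  0, device   1, function 1:
    IDE controller: PCI device 8086:7010
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1013:00b8
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 1af4:1000
"""
```

With the sample above, slots 0 to 3 are taken, so ``first_free_slot``
picks slot 4 for the device being hotplugged.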

Concerning NIC handling, we build on top of the existing logic (first
create a tap with ``_OpenTap()`` and then pass its file descriptor to
the KVM process). To this end we need to pass access rights to the
corresponding file descriptor over the monitor socket (UNIX domain
socket). The open file is passed as a socket-level control message
(SCM), using the ``fdsend`` python library.
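
The descriptor passing mechanism itself can be illustrated with the
Python 3 standard library instead of ``fdsend`` (a substitution made
here only to keep the example self-contained). The sketch sends the read
end of a pipe across a socket pair; in the real design the receiving
side would be the KVM process listening on its monitor socket, and the
helper names are hypothetical.

```python
import array
import os
import socket


def send_fd(sock, fd, payload=b"x"):
    """Send one open file descriptor over a UNIX domain socket."""
    fds = array.array("i", [fd])
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock, bufsize=16):
    """Receive a payload and one file descriptor sent with send_fd()."""
    fds = array.array("i")
    payload, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole integer.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return payload, fds[0]


def demo():
    """Pass the read end of a pipe across a socketpair and read from it."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_fd(sender, read_end)
        _payload, passed_fd = recv_fd(receiver)
        os.write(write_end, b"tap")
        data = os.read(passed_fd, 3)
        os.close(passed_fd)
        return data
    finally:
        os.close(read_end)
        os.close(write_end)
        sender.close()
        receiver.close()
```

The receiver ends up with its own descriptor referring to the same open
file, which is why the KVM process does not need permission to open the
tap device itself.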


User interface
--------------

A new ``--hotplug`` option to ``gnt-instance modify`` is introduced,
which forces live modifications.


Enabling hotplug
++++++++++++++++

Hotplug will be optional during ``gnt-instance modify``. For existing
instances, after installing a version that supports hotplugging, we
have the restriction that hotplug will not be supported for existing
devices. The reason is that old runtime files lack:

1. Device PCI configuration info.

2. A separate block device entry.

Hotplug will be supported only for KVM in the first implementation. For
all other hypervisors, the backend will raise an exception if hotplug is
requested.


NIC Hotplug
+++++++++++

The user can add/modify/remove NICs either with hotplugging or not. If a
NIC is to be added, a tap is created first and configured properly with
the kvm-vif-bridge script. Then the instance gets a new network
interface. Since there is no QEMU monitor command to modify a NIC, we
modify a NIC by temporarily removing the existing one and adding a new
one with the new configuration. When removing a NIC the corresponding
tap gets removed as well.

::

 gnt-instance modify --net add --hotplug test
 gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
 gnt-instance modify --net 1:remove --hotplug test


Disk Hotplug
++++++++++++

The user can add and remove disks with hotplugging or not. The QEMU
monitor supports resizing of disks; however, the initial implementation
will support only disk addition/deletion.

::

 gnt-instance modify --disk add:size=1G --hotplug test
 gnt-instance modify --disk 1:remove --hotplug test


Dealing with chroot and uid pool
--------------------------------

The design so far covers the issues that arise when the kvm process runs
with root privileges, but does not address the case where it does not.
Specifically:

- in case of chroot, the kvm process cannot see the newly created device

- in case of uid pool security model, the kvm process is not allowed
  to access the device

For NIC hotplug we address this problem by using the ``getfd`` monitor
command and passing the file descriptor to the kvm process over the
monitor socket using SCM_RIGHTS. For disk hotplug, in the case of uid
pool we can let the hypervisor code temporarily ``chown()`` the device
before the actual hotplug. Still, this is insufficient in the case of
chroot, where we need to ``mknod()`` the device inside the chroot. Both
workarounds can be avoided if we make use of the ``add-fd`` QEMU monitor
command, which was introduced in version 1.3. This command is the
equivalent for disks of the NICs' ``getfd`` and will allow disk hotplug
in every case. So, if the QEMU monitor does not support the ``add-fd``
command, we will not allow disk hotplug for the chroot and uid pool
security models and will notify the user with a corresponding warning.
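
The resulting policy for disk hotplug can be summarised in a few lines.
This is a sketch only: the function name, its parameters and the
``"pool"`` value used for the uid pool security model are assumptions
made for illustration.

```python
def disk_hotplug_allowed(security_model, use_chroot, monitor_has_add_fd):
    """Decide whether disk hotplug can be attempted.

    Returns (allowed, warning); warning is None when hotplug is allowed.
    """
    # chroot and uid pool both keep the kvm process away from the device.
    restricted = use_chroot or security_model == "pool"
    if restricted and not monitor_has_add_fd:
        return (False,
                "disk hotplug is not possible with chroot or uid pool"
                " unless the QEMU monitor supports add-fd")
    return (True, None)
```

An unrestricted instance is always allowed to hotplug disks, while a
chrooted or uid pool instance is allowed only when the monitor offers
``add-fd``.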

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: