=======
Hotplug
=======

.. contents:: :depth: 4

This is a design document detailing the implementation of device
hotplugging in Ganeti. The logic used is hypervisor agnostic, but the
initial implementation will target the KVM hypervisor. The
implementation adds ``python-fdsend`` as a new dependency. If it is not
installed, hotplug will not be possible and the user will be notified
with a warning.


Current state and shortcomings
==============================

Currently, Ganeti supports addition/removal/modification of devices
(NICs, Disks), but the actual modification takes place only after
rebooting the instance. As a result, an instance cannot change network,
get a new disk, etc. without a hard reboot.

Until now, in the case of the KVM hypervisor, the code neither names
devices nor places them in specific PCI slots. Devices are appended to
the KVM command and Ganeti lets KVM decide where to place them. This
means that a device residing in PCI slot 5 may be moved to another PCI
slot after a reboot (due to the removal of another device) and will
probably get renamed too (due to udev rules, etc.).

In order for a migration to succeed, the process on the target node
should be started with exactly the same machine version, CPU
architecture and PCI configuration as the running process. During
instance creation/startup Ganeti creates a KVM runtime file with all the
necessary information to generate the KVM command. This runtime file is
used during instance migration to start a new identical KVM process. The
current format includes the fixed part of the final KVM command, a list
of NICs, and the hvparams dict. It does not favor easy manipulation of
disks, because they are encapsulated in the fixed KVM command.


Proposed changes
================

For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the
instance. Disks and NICs occupy some of these slots. Recent versions of
QEMU have introduced monitor commands that allow addition/removal of PCI
devices. Devices are referenced based on their name or position on the
virtual PCI bus. To be able to use these commands, we need to be able to
assign each device a unique name.

To keep track of where each device is plugged in, we add a ``pci`` slot
to the Disk and NIC objects, but we save it only in runtime files, since
it is hypervisor-specific information. The slot is added for easy object
manipulation and is guaranteed not to be written back to the config.

We propose to make use of QEMU 1.0 monitor commands so that
modifications to devices take effect instantly without the need for a
hard reboot. The only change exposed to the end-user will be the
addition of a ``--hotplug`` option to the ``gnt-instance modify``
command.

Hotplugging changes the PCI configuration of an instance, so runtime
files should be updated correspondingly. Currently this is impossible
for disk hotplug because disks are included in the command line entry of
the runtime file, contrary to NICs, which are correctly treated
separately. We therefore change the format of runtime files: we remove
disks from the fixed KVM command and create a new entry containing only
them. KVM options concerning disks are generated during
``_ExecuteKVMCommand()``, just like those for NICs.

Design decisions
================

What should each device ID be? Currently KVM does not support arbitrary
IDs for devices; only names that start with a letter, are at most 32
characters long, and include no special characters other than '.', '_'
and '-' are supported. For debugging purposes and in order to be more
informative, each device will be named after:
<device type>-<part of uuid>-pci-<slot>.

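The naming scheme above can be sketched as follows. This is only an
illustration: the helper name ``make_device_id``, the choice of the
first eight characters of the UUID, and the validation regex are
assumptions made here, not part of the design.

```python
import re

# Constraints quoted in the text: the ID starts with a letter, is at
# most 32 characters long, and uses only '.', '_' and '-' as special
# characters.
_VALID_ID_RE = re.compile(r"[A-Za-z][A-Za-z0-9._-]{0,31}")


def make_device_id(dev_type, uuid, pci_slot, uuid_chars=8):
    """Build '<device type>-<part of uuid>-pci-<slot>' (hypothetical helper)."""
    dev_id = "%s-%s-pci-%d" % (dev_type, uuid[:uuid_chars], pci_slot)
    if not _VALID_ID_RE.fullmatch(dev_id):
        raise ValueError("invalid device id: %r" % dev_id)
    return dev_id


if __name__ == "__main__":
    print(make_device_id("nic", "9e7c85f6-b6e5-4243-b9a8-e6cb6ff6be5d", 6))
```

Keeping the slot in the name means the ID alone tells an operator where
the device sits on the virtual PCI bus.
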
Who decides where to hotplug each device? Since this is a
hypervisor-specific matter, there is no point in the master node
deciding such a thing. The master node just has to request noded to
hotplug a device. To this end, hypervisor-specific code should parse the
current PCI configuration (i.e. the output of the ``info pci`` QEMU
monitor command), find the first available slot and hotplug the device.
By having noded decide where to hotplug a device, we ensure that no
error will occur due to duplicate slot assignment (if masterd kept track
of PCI reservations and noded failed to return the PCI slot that the
device was plugged into, the next hotplug would fail).

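A minimal sketch of the slot search is shown below. The sample text and
the regular expression assume that ``info pci`` prints one header line
of the form ``Bus  0, device   3, function 0:`` per PCI function; the
names ``first_free_pci_slot`` and ``SAMPLE`` are hypothetical and the
exact output may differ between QEMU versions.

```python
import re

# Assumed shape of the header lines in the ``info pci`` output.
_INFO_PCI_RE = re.compile(
    r"Bus\s+(\d+),\s+device\s+(\d+),\s+function\s+(\d+)")
_MAX_PCI_SLOTS = 32  # QEMU exposes 32 PCI slots to the instance


def first_free_pci_slot(info_pci_output, bus=0):
    """Return the lowest slot on ``bus`` that no device occupies."""
    occupied = set()
    for line in info_pci_output.splitlines():
        match = _INFO_PCI_RE.search(line)
        if match and int(match.group(1)) == bus:
            occupied.add(int(match.group(2)))
    for slot in range(_MAX_PCI_SLOTS):
        if slot not in occupied:
            return slot
    raise RuntimeError("no free PCI slot on bus %d" % bus)


# Illustrative monitor output, not captured from a real instance.
SAMPLE = """\
  Bus  0, device   0, function 0:
    Host bridge: PCI device 8086:1237
  Bus  0, device   1, function 0:
    ISA bridge: PCI device 8086:7000
  Bus  0, device   1, function 1:
    IDE controller: PCI device 8086:7010
  Bus  0, device   2, function 0:
    VGA controller: PCI device 1013:00b8
  Bus  0, device   3, function 0:
    Ethernet controller: PCI device 1af4:1000
"""

if __name__ == "__main__":
    print(first_free_pci_slot(SAMPLE))
```

Because the search runs on the node against the live monitor output, it
always reflects the slots the running process actually uses.
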
Where should we keep track of devices' PCI slots? As already mentioned,
we must keep track of devices' PCI slots to successfully migrate
instances. The first option is to save this info to config data, which
would allow us to place each device at the same PCI slot after reboot.
This would require making the hypervisor return the PCI slot chosen for
each device and storing this information in config data. Additionally,
the whole instance configuration would have to be returned with PCI
slots filled in after instance start, and each instance would have to
keep track of its current PCI reservations. We decide against this
direction in order to keep things simple and to avoid adding
hypervisor-specific info to configuration data (``pci_reservations`` at
instance level and ``pci`` at device level). For this reason, we decide
to store this info only in KVM runtime files.

Where to place the devices upon instance startup? QEMU has 4
pre-occupied PCI slots by default, so the hypervisor can use the
remaining ones for disks and NICs. Currently, PCI configuration is not
preserved after reboot. Each time an instance starts, KVM assigns PCI
slots to devices based on their ordering in the Ganeti configuration,
i.e. the second disk will be placed after the first, the third NIC after
the second, etc. Since we decided that there is no need to keep track of
devices' PCI slots in the configuration, there is no need to change
current functionality.

How to deal with existing instances? Hotplug depends on runtime file
manipulation: the runtime file stores PCI info and every device the KVM
process is currently using. Existing files have no PCI info in devices
and have block devices encapsulated inside the kvm_cmd entry. Thus
hotplugging of existing devices will not be possible. Still, migration
and hotplugging of new devices will succeed. The workaround is applied
upon loading the KVM runtime: if we detect the old-style format, we add
an empty list for block devices, and upon saving the KVM runtime we
include this empty list as well. Switching entirely to the new format
will happen upon instance reboot.


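The old-format workaround can be illustrated with the sketch below. It
assumes the runtime file is a JSON-serialized list whose entries are the
KVM command, the NICs and the hvparams, with block devices as a fourth
entry in the new format; the function names ``load_runtime`` and
``save_runtime`` are hypothetical.

```python
import json


def load_runtime(serialized):
    """Return (kvm_cmd, nics, hvparams, block_devices)."""
    data = json.loads(serialized)
    if len(data) == 3:
        # Old-style file: disks are still part of kvm_cmd, so we only
        # add an empty list for block devices.
        kvm_cmd, nics, hvparams = data
        block_devices = []
    else:
        kvm_cmd, nics, hvparams, block_devices = data
    return kvm_cmd, nics, hvparams, block_devices


def save_runtime(kvm_cmd, nics, hvparams, block_devices):
    """Always write the four-entry format, even with an empty disk list."""
    return json.dumps([kvm_cmd, nics, hvparams, block_devices])


if __name__ == "__main__":
    old = json.dumps([["kvm", "-drive", "file=/dev/xvda"], [], {}])
    print(save_runtime(*load_runtime(old)))
```

Disks that were started from an old-style file stay inside ``kvm_cmd``
until the next reboot, which is why they cannot be hotplugged while
newly added disks can.
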
Configuration changes
---------------------

The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It refers
to the PCI slot that the device gets plugged into.

In order to be able to live migrate successfully, runtime files should
be updated every time a live modification (hotplug) takes place. To this
end we change the format of runtime files. The KVM options referring to
the instance's disks are no longer recorded as part of the KVM command
line. Disks are treated separately, just as we treat NICs right now. We
insert and remove entries to reflect the current PCI configuration.


Backend changes
---------------

Introduce one new RPC call:

- hotplug_device(DEVICE_TYPE, ACTION, device, ...)

where DEVICE_TYPE can be either NIC or Disk, and ACTION either REMOVE or ADD.

Hypervisor changes
------------------

We implement hotplug on top of the KVM hypervisor. We take advantage of
QEMU 1.0 monitor commands (``device_add``, ``device_del``,
``drive_add``, ``drive_del``, ``netdev_add``, ``netdev_del``). QEMU
refers to devices based on their id. We use ``uuid`` to name them
properly. If a device is about to be hotplugged, we parse the output of
``info pci`` and find the occupied PCI slots. We choose the first
available slot, and the whole device object is appended to the
corresponding entry in the runtime file.

Concerning NIC handling, we build on top of the existing logic (first
create a tap with ``_OpenTap()`` and then pass its file descriptor to
the KVM process). To this end we need to pass access rights to the
corresponding file descriptor over the monitor socket (UNIX domain
socket). The open file is passed as a socket-level control message
(SCM), using the ``fdsend`` python library.


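The descriptor-passing mechanism can be sketched with the Python 3
standard library instead of ``fdsend``; the swap is made here only to
keep the example self-contained. A socket pair stands in for the monitor
socket and a pipe stands in for the tap device. The helper names
``send_fd``, ``recv_fd`` and ``demo`` are hypothetical.

```python
import array
import os
import socket


def send_fd(sock, fd, payload=b"x"):
    """Send one open file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    sock.sendmsg([payload],
                 [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock, bufsize=16):
    """Receive the payload and one descriptor sent with send_fd()."""
    fds = array.array("i")
    payload, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any truncated trailing bytes.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    return payload, fds[0]


def demo():
    """Pass the write end of a pipe across a UNIX socket pair."""
    sender, receiver = socket.socketpair(socket.AF_UNIX,
                                         socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_fd(sender, write_end)
        _payload, received = recv_fd(receiver)
        # The received descriptor refers to the same open pipe.
        os.write(received, b"hello")
        os.close(received)
        return os.read(read_end, 5)
    finally:
        os.close(read_end)
        os.close(write_end)
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(demo())
```

The receiving process gets its own descriptor number for the same open
file, so it needs no permission to open the device itself.
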
User interface
--------------

The new ``--hotplug`` option to ``gnt-instance modify`` is introduced,
which forces live modifications.


Enabling hotplug
++++++++++++++++

Hotplug will be optional during ``gnt-instance modify``. For existing
instances, after installing a version that supports hotplugging, we have
the restriction that hotplug will not be supported for existing devices.
The reason is that old runtime files lack:

1. Device PCI configuration info.

2. A separate block device entry.

Hotplug will be supported only for KVM in the first implementation. For
all other hypervisors, the backend will raise an exception in case
hotplug is requested.


NIC Hotplug
+++++++++++

The user can add/modify/remove NICs either with hotplugging or not. If a
NIC is to be added, a tap is created first and configured properly with
the kvm-vif-bridge script. Then the instance gets a new network
interface. Since there is no QEMU monitor command to modify a NIC, we
modify a NIC by temporarily removing the existing one and adding a new
one with the new configuration. When removing a NIC, the corresponding
tap gets removed as well.

::

 gnt-instance modify --net add --hotplug test
 gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test
 gnt-instance modify --net 1:remove --hotplug test


Disk Hotplug
++++++++++++

The user can add and remove disks with hotplugging or not. The QEMU
monitor supports resizing of disks; however, the initial implementation
will support only disk addition/deletion.

::

 gnt-instance modify --disk add:size=1G --hotplug test
 gnt-instance modify --disk 1:remove --hotplug test


Dealing with chroot and uid pool
--------------------------------

The design so far covers all issues that arise, but does not address the
case where the kvm process will not run with root privileges.
Specifically:

- in case of chroot, the kvm process cannot see the newly created device

- in case of uid pool security model, the kvm process is not allowed
  to access the device

For NIC hotplug we address this problem by using the ``getfd`` monitor
command and passing the file descriptor to the kvm process over the
monitor socket using SCM_RIGHTS. For disk hotplug in the case of uid
pool, we can let the hypervisor code temporarily ``chown()`` the device
before the actual hotplug. Still, this is insufficient in the case of
chroot, where we need to ``mknod()`` the device inside the chroot. Both
workarounds can be avoided if we make use of the ``add-fd`` QEMU
monitor command, which was introduced in version 1.3. This command is
the equivalent of the NICs' ``getfd`` for disks and will allow disk
hotplug in every case. So, if the QEMU monitor does not support the
``add-fd`` command, we will not allow disk hotplug for the chroot and
uid pool security models and will notify the user with the corresponding
warning.

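The resulting rule for disk hotplug can be written down as a small
decision function. This is a sketch only: the function name
``disk_hotplug_allowed``, its boolean parameters and the warning text
are assumptions made for illustration, and how the monitor is probed for
``add-fd`` support is left out.

```python
def disk_hotplug_allowed(monitor_has_add_fd, use_chroot, uid_pool):
    """Return (allowed, warning) for a disk hotplug request.

    All three arguments are hypothetical booleans: whether the QEMU
    monitor supports ``add-fd``, whether the instance runs in a chroot,
    and whether it uses the uid pool security model.
    """
    if monitor_has_add_fd:
        # The device can be handed over as an open file descriptor.
        return True, None
    if use_chroot or uid_pool:
        return False, ("disk hotplug is not supported with chroot or"
                       " uid pool when the monitor lacks add-fd")
    # The kvm process can see and open the device on its own.
    return True, None


if __name__ == "__main__":
    print(disk_hotplug_allowed(False, True, False))
```
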
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: