==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do a completely "best effort"
sharing, it's much harder to completely reserve resources for the use
of a particular instance. In particular, this has to be done manually
for CPUs and disk, is implemented for RAM under Xen, but not under KVM,
and there's no provision for network-level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. Some sharing will still need to
happen (e.g. for operations that use the host domain, or that use
resources, like buses, which are unique or very scarce on host
systems). We'll strive to keep contention to a minimum, but won't try
to avoid all possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a configuration flag at node-group and cluster
level. When it's enabled, Ganeti will allocate entire disks to
instances. Though it's possible to think of ways of doing something
similar for other storage back-ends, this design targets only ``plain``
and ``drbd``. The name is generic enough in case the feature is
extended to other back-ends.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is
added, and as part of ``cluster-verify``.
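The size check described above could look roughly like the sketch
below. It is only an illustration: the function name
``pv_sizes_uniform`` and the ``margin`` argument are hypothetical, and
comparing the largest PV against the smallest one is just one possible
reading of "same size within a margin".

```python
def pv_sizes_uniform(pv_sizes, margin=0.01):
    """Return True if all PV sizes are within `margin` of the smallest PV.

    `pv_sizes` is a list of PV sizes in any consistent unit (e.g. MiB).
    `margin` is the allowed relative difference (0.01 means 1%); in the
    design this would come from a new, not yet named, parameter.
    """
    if not pv_sizes:
        return True  # an empty volume group has nothing to compare
    smallest = min(pv_sizes)
    largest = max(pv_sizes)
    return largest <= smallest * (1 + margin)
```

With the default margin, PVs of 1000, 1005 and 1009 MiB pass the check,
while PVs of 1000 and 1020 MiB do not.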

When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs by marking them as unallocatable; in this
way, PVs won't be shared between instance disks, and any remaining space
won't be used by mistake for anything else. The underlying LV will be
striped, when striping is allowed by the current configuration. Ganeti
will continue to track only the LVs, and query the LVM layer to figure
out which PVs are available and how much space is free.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.
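As a rough illustration of this calculation, the sketch below computes
the minimum number of PVs for a disk after subtracting the reserved
fraction from each PV. The names ``pvs_needed`` and ``reserved`` are
hypothetical and not part of Ganeti.

```python
import math


def pvs_needed(disk_size, pv_size, reserved=0.02):
    """Minimum number of PVs needed to hold a disk of `disk_size`.

    Sizes are in the same unit (e.g. MiB). `reserved` is the fraction
    of each PV that is subtracted before the calculation (2% in the
    design, to be made configurable).
    """
    usable_per_pv = pv_size * (1 - reserved)
    return math.ceil(disk_size / usable_per_pv)
```

Note the consequence of the reserved space: with PVs of 1,000,000 MiB,
a disk of 2,000,000 MiB needs three PVs rather than two.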

The obvious target for this option is the plain disk template, which
doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under PVs, but this
is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than is
strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.
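The two error conditions above can be sketched as a small validation
helper. Everything here is hypothetical (``SpindleError``,
``check_disk_spindles``, the argument names), and applying the 2%
reserve to the minimum is an assumption carried over from the PV
calculation, not something this section states.

```python
import math


class SpindleError(ValueError):
    """Raised when a spindle request is not acceptable (hypothetical)."""


def check_disk_spindles(disk_size, spindles, pv_size, exclusive_storage,
                        reserved=0.02):
    """Validate the number of spindles requested for one disk.

    `spindles` is None when the user did not specify a value.
    """
    if spindles is None:
        return
    if not exclusive_storage:
        raise SpindleError(
            "spindles can be specified only with exclusive_storage")
    # Assumption: the minimum is computed on the PV size minus the reserve.
    minimum = math.ceil(disk_size / (pv_size * (1 - reserved)))
    if spindles < minimum:
        raise SpindleError(
            "%d spindles requested, at least %d needed" % (spindles, minimum))
```

Requesting more spindles than the minimum is accepted, which matches the
idea of giving an instance more spindles than strictly needed.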

When ``exclusive_storage`` is not enabled, spindles are not used in free
space calculation, allocation algorithms, or policies. When it's
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead
of disk size for their computation. For each node, the number of all the
spindles in every LVM group is recorded, and different LVM groups are
accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. These two
parameters will be renamed to ``storage_io_use`` and
``storage_io_ratio`` to better reflect their meaning. When
``exclusive_storage`` is enabled, such parameters are ignored, as
balancing the use of storage I/O is already addressed by the exclusive
assignment of PVs.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.
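One way to read this for Xen is sketched below: the VCPUs of
``Domain-0`` are counted against the same budget as instance VCPUs. The
function and argument names are hypothetical, and charging ``Domain-0``
against the ratio-scaled budget is an assumption, not a decision made by
this design.

```python
def free_vcpus_for_instances(physical_cpus, dom0_vcpus, instance_vcpus,
                             vcpu_ratio=1.0):
    """VCPUs still assignable to instances on a Xen node.

    `instance_vcpus` lists the VCPU counts of the instances already on
    the node. A negative result means the node is over its budget.
    """
    budget = int(physical_cpus * vcpu_ratio)
    return budget - dom0_vcpus - sum(instance_vcpus)
```

With a ratio of 1.0, a node with 16 CPUs, 2 VCPUs for ``Domain-0`` and
two 4-VCPU instances has 6 VCPUs left for new instances.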

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to
some of the CPUs, leaving the other ones to instances and KVM processes.
For KVM, the number of CPUs for the host system should also be a
hypervisor parameter (set at the node group level).

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but it
is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC, memory is fully shared between the host system and
all the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used by
the host OS. This requires finishing the implementation of the memory
hypervisor status (set at the node group level) that changes how free
memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide on the size of the
cgroup versus the size of the VM: some overhead will exist, because an
instance and its encapsulating KVM process share the same space. For
KVM systems the physical memory allocatable to instances should be
computed by subtracting an overhead for the KVM processes, whose value
can be either statically configured or set in a hypervisor status
parameter.
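The free-memory computation this implies can be sketched as follows.
The names are hypothetical, and charging a fixed overhead per KVM
process is an assumption; the design only says that an overhead is
subtracted and that its value is configured statically or through a
hypervisor status parameter.

```python
def kvm_free_memory(total_mib, host_reserved_mib, instance_mib,
                    kvm_overhead_mib):
    """Memory (MiB) still allocatable to new instances on a KVM node.

    `host_reserved_mib` is the memory left to the host OS (e.g. the
    size of its cgroup). `instance_mib` lists the memory of running
    instances; each one is assumed to cost `kvm_overhead_mib` extra
    for its KVM process.
    """
    used = sum(mem + kvm_overhead_mib for mem in instance_mib)
    return total_mib - host_reserved_mib - used
```

For example, a 64 GiB node that reserves 4 GiB for the host and runs an
8 GiB and a 16 GiB instance with 256 MiB of overhead each has 36352 MiB
left.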

NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for every
instance is proportionate to the number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always allocate
memory correctly even without pinning. Therefore, we don't need to
address this problem specifically; it will be solved by future versions
of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of specifications,
defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently from each
other. We extend the policy by allowing it to contain multiple
specifications, where each specification contains the limits (minimum,
maximum, and standard) for all the resources. Each specification has a
unique priority (an integer) associated with it, which is used by
``hspace`` (see below).

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk
would be illegal.
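The example policy above can be modelled with a few lines of Python. The
classes ``Range`` and ``Spec``, the function ``is_legal`` and the
priority values are hypothetical and chosen only for illustration; the
"standard" value of each specification is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    lo: int
    hi: int

    def contains(self, value):
        return self.lo <= value <= self.hi


@dataclass(frozen=True)
class Spec:
    priority: int  # unique per specification; values here are made up
    cpus: Range
    ram_gb: Range
    disk_gb: Range

    def allows(self, cpus, ram_gb, disk_gb):
        # All resources must fit within the same specification.
        return (self.cpus.contains(cpus)
                and self.ram_gb.contains(ram_gb)
                and self.disk_gb.contains(disk_gb))


POLICY = [
    Spec(priority=2, cpus=Range(4, 4), ram_gb=Range(4, 4),
         disk_gb=Range(10, 800)),
    Spec(priority=1, cpus=Range(1, 2), ram_gb=Range(2, 2),
         disk_gb=Range(10, 400)),
]


def is_legal(policy, cpus, ram_gb, disk_gb):
    """An instance is legal if at least one specification allows it."""
    return any(spec.allows(cpus, ram_gb, disk_gb) for spec in policy)
```

The instance with 2 CPUs and 4 GB of RAM is rejected because its CPU
count fits only the smaller specification while its RAM fits only the
larger one.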

Ganeti will refuse to create (or modify) instances that violate instance
policy constraints, unless the flag ``--ignore-ipolicy`` is passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments. For both
standard and tiered allocation, ``hspace`` will start to allocate
instances using the specification with the highest priority, then it
will fall back to second highest priority, and so on. For tiered
allocation, it will try to lower the most constrained resources (without
breaking the policy) before going to the next specification.
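The priority fallback can be illustrated with a deliberately simplified
model: a single node, and one standard instance size per specification.
This is not how ``hspace`` is implemented; the function name and data
layout are hypothetical, and the resource-lowering step of tiered
allocation is not modelled.

```python
def allocate_by_priority(capacity, specs):
    """Greedily place standard instances, highest priority first.

    `capacity` maps resource name -> free amount on the node.
    `specs` is a list of (priority, standard_size) pairs, where
    standard_size maps resource name -> amount per instance.
    Returns the priorities of the placed instances and the leftover
    capacity.
    """
    free = dict(capacity)
    placed = []
    for priority, std in sorted(specs, key=lambda s: s[0], reverse=True):
        if not any(amount > 0 for amount in std.values()):
            continue  # an empty specification would never stop fitting
        while all(free[res] >= amount for res, amount in std.items()):
            for res, amount in std.items():
                free[res] -= amount
            placed.append(priority)
    return placed, free
```

On a node with 10 CPUs, 12 GB of RAM and 1000 GB of disk, two instances
of a 4 CPU / 4 GB / 100 GB specification are placed first, and the
remaining space is filled with two instances of a lower-priority
1 CPU / 2 GB / 50 GB specification.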

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti won't check or enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of spindle size, so it's
been moved after constrained instance sizes. If it turns out that it's
easier to implement dedicated disks with spindles as a resource, then we
will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of an
instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for example
via Open vSwitch or normal QoS for bridging or routing. The bandwidth
resource represents the average bandwidth usage, so a few new back-end
parameters are needed to configure how to deal with bursts (they depend
on the actual way used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel, so
we won't try to outsmart their developers. If we need pinning, it's more
to have predictable performance than to get the maximum performance
(which is best done by the hypervisor), so we'll implement a very simple
algorithm that allocates CPUs when an instance is assigned to a node
(either when it's created or when it's moved) and takes into account
NUMA and maybe CPU multithreading. A more refined version might also run
when an instance is deleted, but that would involve reassigning CPUs,
which could be bad with NUMA.
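A very simple NUMA-aware assignment could look like the sketch below.
The design does not specify the algorithm, so this is only one possible
shape: the function name, the input format and the best-fit choice of
NUMA node are all assumptions, and CPU multithreading is ignored.

```python
def pin_cpus(free_by_numa_node, vcpus):
    """Pick physical CPUs for an instance with `vcpus` VCPUs.

    `free_by_numa_node` maps a NUMA node id to the list of its free
    CPU ids. Returns the chosen CPU ids, or None if there are not
    enough free CPUs. The input is not modified.
    """
    # Prefer a single NUMA node; among those that fit, take the one
    # with the fewest free CPUs (best fit) to limit fragmentation.
    fitting = [node for node, cpus in free_by_numa_node.items()
               if len(cpus) >= vcpus]
    if fitting:
        node = min(fitting, key=lambda n: (len(free_by_numa_node[n]), n))
        return sorted(free_by_numa_node[node])[:vcpus]
    # Otherwise spread over several NUMA nodes, if possible at all.
    all_free = sorted(cpu for cpus in free_by_numa_node.values()
                      for cpu in cpus)
    if len(all_free) < vcpus:
        return None
    return all_free[:vcpus]
```

Because CPUs are chosen only when an instance is placed, existing
instances are never repinned, which matches the simple variant described
above.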

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because average CPU
usage is normally well below 100%. There are ways to share memory pages
(e.g. KSM, transcendent memory) and disk blocks, so we could add new
parameters to overcommit memory and disks, similar to ``vcpu_ratio``.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: