==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do a completely "best effort"
sharing, it's much harder to completely reserve resources for the use of
a particular instance. In particular this has to be done manually for
CPUs and disk, is implemented for RAM under Xen, but not under KVM, and
there's no provision for network level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. Some sharing will still need to
happen (e.g. for operations that use the host domain, or that use
resources, like buses, which are unique or very scarce on host
systems). We'll strive to keep contention to a minimum, but won't try
to avoid all possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a new node parameter. When it's enabled, Ganeti
will allocate entire disks to instances. Though it's possible to think
of ways of doing something similar for other storage back-ends, this
design targets only ``plain`` and ``drbd``. The name is generic enough
in case the feature is extended to other back-ends. The flag value
should be homogeneous within a node-group; ``cluster-verify`` will
report any violation of this condition.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is added
and as part of ``cluster-verify``.
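
The size check could look roughly like the sketch below. This design
does not say how the margin is measured, so the sketch assumes it is the
spread between the largest and the smallest PV, relative to the
smallest. The function name and the MiB unit are also illustrative
assumptions, not existing Ganeti code.

.. code-block:: python

  def pvs_have_uniform_size(pv_sizes_mib, margin=0.01):
      """Return True if all PV sizes agree within the relative margin.

      Assumption: the margin is (largest - smallest) / smallest.
      """
      if not pv_sizes_mib:
          # Nothing to compare, so the condition holds trivially.
          return True
      smallest = min(pv_sizes_mib)
      largest = max(pv_sizes_mib)
      return (largest - smallest) <= margin * smallest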

When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs for further disk creations. The
underlying LV will be striped, when striping is allowed by the current
configuration. Ganeti will continue to track only the LVs, and query the
LVM layer to figure out which PVs are available and how much space is
free. Yet, creation, disk growing, and free-space reporting will ignore
any partially allocated PVs, so that PVs won't be shared between
instance disks.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.
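
As a rough illustration of the arithmetic, the sketch below computes the
number of PVs needed for a disk and the free space a node would report.
All names are hypothetical; the 2% default mirrors the value given
above, and sizes are assumed to be in MiB.

.. code-block:: python

  import math


  def usable_pv_size(pv_size_mib, reserved_fraction=0.02):
      """Space of one PV that is counted after subtracting the reserve."""
      return pv_size_mib * (1.0 - reserved_fraction)


  def pvs_needed(disk_size_mib, pv_size_mib, reserved_fraction=0.02):
      """Minimum number of whole PVs required to hold a disk."""
      usable = usable_pv_size(pv_size_mib, reserved_fraction)
      return int(math.ceil(disk_size_mib / usable))


  def reported_free_space(pvs, reserved_fraction=0.02):
      """Free space, counting only PVs that have no allocation at all.

      ``pvs`` is a list of (size_mib, allocated_mib) tuples; partially
      allocated PVs are ignored, as described above.
      """
      return sum(usable_pv_size(size, reserved_fraction)
                 for size, allocated in pvs if allocated == 0)

Under these assumptions a disk exactly as large as one PV needs two PVs,
because the reserve leaves slightly less than a full PV usable.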

The obvious target for this option is the ``plain`` disk template, which
doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under PVs, but this
is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than what
is strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.
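
The two error conditions above can be sketched as a small validation
helper. The function and its parameters are hypothetical, and for
brevity the sketch compares against the raw PV size rather than the
size reduced by the safety margin.

.. code-block:: python

  def check_disk_spindles(disk_size_mib, spindles, pv_size_mib,
                          exclusive_storage):
      """Raise ValueError if a spindle request is not acceptable."""
      if spindles is None:
          # No explicit request: nothing to validate.
          return
      if not exclusive_storage:
          raise ValueError("spindles can be specified only when"
                           " exclusive_storage is enabled")
      if spindles * pv_size_mib < disk_size_mib:
          raise ValueError("%d spindles cannot hold a disk of %d MiB"
                           % (spindles, disk_size_mib))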

When ``exclusive_storage`` is not enabled, spindles are not used in free
space calculation, in allocation algorithms, or in policies. When it's
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead
of disk size for their computation. For each node, the number of all the
spindles in every LVM group is recorded, and different LVM groups are
accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. When
``exclusive_storage`` is enabled, these parameters as currently defined
won't make any sense, so their meaning will be changed in this way:

- ``spindle_use`` refers to the resource, hence to the actual spindles
  (PVs in LVM), used by an instance. The values specified in the instance
  policy specifications are compared to the run-time number of spindles
  used by an instance. The ``spindle_use`` back-end parameter will be
  ignored.
- ``spindle_ratio`` in instance policies and ``spindle_count`` in node
  parameters are ignored, as the exclusive assignment of PVs already
  implies a value of 1.0 for the first, and the second is replaced by
  the actual number of spindles.

When ``exclusive_storage`` is disabled, the existing spindle parameters
behave as before.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to use
some of the CPUs, leaving the other ones to instances and KVM processes.
For KVM, the number of CPUs for the host system should also be a
hypervisor parameter (set at the node group level).
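
One possible way to account for the host CPUs is sketched below. It
assumes the CPUs reserved for the host (the ``Domain-0`` VCPUs on Xen,
or the proposed hypervisor parameter on KVM) are subtracted before
``vcpu_ratio`` is applied; the function name is made up for this
example.

.. code-block:: python

  def max_instance_vcpus(physical_cpus, host_cpus, vcpu_ratio=1.0):
      """Number of VCPUs that may be assigned to instances on a node."""
      if host_cpus > physical_cpus:
          raise ValueError("host cannot use more CPUs than the node has")
      # Only the CPUs left over by the host are multiplied by the ratio.
      return int((physical_cpus - host_cpus) * vcpu_ratio)

With a ratio of 1.0 this yields a strict one-to-one mapping between
VCPUs and the physical CPUs not used by the host.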

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but it
is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC memory is fully shared between the host system and all
the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used by
the host OS. This requires finishing the implementation of the memory
hypervisor status (set at the node group level) that changes how free
memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide on the size of the
cgroup relative to the size of the VM: some overhead will exist, because
an instance and its encapsulating KVM process share the same space. For
KVM systems the physical memory allocatable to instances should be
computed by subtracting an overhead for the KVM processes, whose value
can be either statically configured or set in a hypervisor status
parameter.
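
The computation could be sketched as follows. The design leaves open
how the overhead is expressed; this example assumes a fixed amount per
KVM process, and all names and units (MiB) are illustrative.

.. code-block:: python

  def kvm_allocatable_memory(total_mib, host_reserved_mib,
                             overhead_per_instance_mib, num_instances):
      """Memory that can be handed to instances on a KVM node.

      Assumption: each KVM process costs a fixed overhead on top of
      the memory of the instance it encapsulates.
      """
      overhead = overhead_per_instance_mib * num_instances
      return max(0, total_mib - host_reserved_mib - overhead)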

NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for every
instance is proportionate to the number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always allocate
memory correctly even without pinning. Therefore, we don't need to
address this problem specifically; it will be solved by future versions
of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of specifications,
defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently from each
other. We extend the policy by allowing it to contain multiple
occurrences of the minimum and maximum specifications for the instance
resources. Each specification pair (minimum and maximum) has a unique
priority associated to it (or in other words, specifications are
ordered), which is used by ``hspace`` (see below). The standard
specification doesn't change: there is one for the whole cluster.

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk
would be illegal.
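
The example policy can be expressed as a list of (minimum, maximum)
pairs and checked with a few lines of code. This is only a sketch: the
dictionary keys and function names are invented for the example and do
not reflect the actual policy representation in Ganeti.

.. code-block:: python

  # Each entry is a (minimum, maximum) pair; list order stands in for
  # priority, with the biggest specification first.
  POLICY = [
      ({"cpu": 4, "ram_gb": 4, "disk_gb": 10},
       {"cpu": 4, "ram_gb": 4, "disk_gb": 800}),
      ({"cpu": 1, "ram_gb": 2, "disk_gb": 10},
       {"cpu": 2, "ram_gb": 2, "disk_gb": 400}),
  ]


  def fits_spec(instance, minimum, maximum):
      """True if every resource lies within one min/max pair."""
      return all(minimum[res] <= instance[res] <= maximum[res]
                 for res in minimum)


  def instance_is_legal(instance, policy=POLICY):
      """True if the instance fits at least one specification pair."""
      return any(fits_spec(instance, lo, hi) for lo, hi in policy)

The instance with 2 CPUs and 4 GB of RAM is rejected because each of
its values is allowed by some pair, but no single pair allows all of
them together.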

Ganeti will refuse to create (or modify) instances that violate instance
policy constraints, unless the flag ``--ignore-ipolicy`` is passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments for tiered
allocation. ``hspace`` will start to allocate instances using the
maximum specification with the highest priority, then it will try to
lower the most constrained resources (without breaking the policy)
before moving to the second highest priority, and so on.
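
The following is a heavily simplified sketch of that idea, not the
actual ``hspace`` algorithm. It assumes a single node, integer resource
amounts, and that a resource which no longer fits is lowered directly
to the amount still free. The function name and the dictionary layout
are hypothetical.

.. code-block:: python

  def tiered_allocation(free, policy):
      """Greedy tiered allocation on one node.

      ``free`` maps resource names to free amounts; ``policy`` is a
      list of (minimum, maximum) dicts in decreasing priority.
      Returns the list of instance sizes that were placed.
      """
      free = dict(free)
      placed = []
      for minimum, maximum in policy:
          spec = dict(maximum)
          while True:
              # Lower every resource that no longer fits on the node.
              for res in spec:
                  spec[res] = min(spec[res], free[res])
              # Stop when the policy would be broken, or when nothing
              # would be allocated at all.
              if (any(spec[res] < minimum[res] for res in spec)
                      or not any(spec.values())):
                  break
              for res in spec:
                  free[res] -= spec[res]
              placed.append(dict(spec))
      return placed

For instance, with 6 CPUs, 6 GB of RAM, and 2000 GB of disk free, and
the two example specifications ordered biggest first, this sketch places
one 4-CPU instance and then one 2-CPU instance.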

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti will neither check nor enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of spindle size, so it's
been moved after constrained instance sizes. If it turns out that it's
easier to implement dedicated disks with spindles as a resource, then we
will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of an
instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for example
via openvswitch or normal QoS for bridging or routing. The bandwidth
resource represents the average bandwidth usage, so a few new back-end
parameters are needed to configure how to deal with bursts (they depend
on the actual way used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel, so
we won't try to outsmart their developers. If we need pinning it's more
to have predictable performance than to get the maximum performance
(which is best done by the hypervisor), so we'll implement a very simple
algorithm that allocates CPUs when an instance is assigned to a node
(either when it's created or when it's moved) and takes into account
NUMA and maybe CPU multithreading. A more refined version might also run
when an instance is deleted, but that would involve reassigning CPUs,
which could be bad with NUMA.
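
To give an idea of what "very simple" could mean, the sketch below
picks CPUs for a new instance, preferring a single NUMA node. The
best-fit choice and the fallback that spans NUMA nodes are assumptions
made for this example; CPU multithreading is ignored, and the function
name is hypothetical.

.. code-block:: python

  def pin_cpus(free_cpus_by_numa_node, vcpus):
      """Choose physical CPUs for an instance being placed on a node.

      ``free_cpus_by_numa_node`` maps a NUMA node id to the list of
      its free CPU ids and is updated in place. Returns the chosen
      CPU ids, or None if there are not enough free CPUs.
      """
      free = free_cpus_by_numa_node
      fitting = [n for n in free if len(free[n]) >= vcpus]
      if fitting:
          # Best fit: the NUMA node left with the fewest spare CPUs.
          node = min(fitting, key=lambda n: (len(free[n]), n))
          chosen = free[node][:vcpus]
          free[node] = free[node][vcpus:]
          return chosen
      if sum(len(cpus) for cpus in free.values()) < vcpus:
          return None
      # No single NUMA node is large enough: span nodes, fullest first.
      chosen = []
      for node in sorted(free, key=lambda n: (-len(free[n]), n)):
          take = free[node][:vcpus - len(chosen)]
          free[node] = free[node][len(take):]
          chosen.extend(take)
          if len(chosen) == vcpus:
              break
      return chosen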

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because average CPU
usage is normally well below 100%. There are ways to share memory pages
(e.g. KSM, transcendent memory) and disk blocks, so we could add new
parameters to overcommit memory and disks, similar to ``vcpu_ratio``.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: