code.grnet.gr Git - ganeti-local/blob - doc/design-htools-2.3.rst

   1 ====================================
   2  Synchronising htools to Ganeti 2.3
   3 ====================================
   4
   5 Ganeti 2.3 introduces a number of new features that change the cluster
   6 internals significantly enough that the htools suite needs to be
   7 updated accordingly in order to function correctly.
   8
   9 Shared storage support
  10 ======================
  11
  12 Currently, the htools algorithms presume a model where all of an
  13 instance's resources is served from within the cluster, more
  14 specifically from the nodes comprising the cluster. While is this
  15 usual for memory and CPU, deployments which use shared storage will
  16 invalidate this assumption for storage.
  17
  18 To account for this, we need to move some assumptions from being
  19 implicit (and hardcoded) to being explicitly exported from Ganeti.
  20
  21
  22 New instance parameters
  23 -----------------------
  24
  25 It is presumed that Ganeti will export for all instances a new
  26 ``storage_type`` parameter, that will denote either internal storage
  27 (e.g. *plain* or *drbd*), or external storage.
  28
  29 Furthermore, a new ``storage_pool`` parameter will classify, for both
  30 internal and external storage, the pool out of which the storage is
  31 allocated. For internal storage, this will be either ``lvm`` (the pool
  32 that provides space to both ``plain`` and ``drbd`` instances) or
  33 ``file`` (for file-storage-based instances). For external storage,
  34 this will be the respective NAS/SAN/cloud storage that backs up the
  35 instance. Note that for htools, external storage pools are opaque; we
  36 only care that they have an identifier, so that we can distinguish
  37 between two different pools.
  38
  39 If these two parameters are not present, the instances will be
  40 presumed to be ``internal/lvm``.
  41
  42 New node parameters
  43 -------------------
  44
  45 For each node, it is expected that Ganeti will export what storage
  46 types it supports and pools it has access to. So a classic 2.2 cluster
  47 will have all nodes supporting ``internal/lvm`` and/or
  48 ``internal/file``, whereas a new shared storage only 2.3 cluster could
  49 have ``external/my-nas`` storage.
  50
  51 Whatever the mechanism that Ganeti will use internally to configure
  52 the associations between nodes and storage pools, we consider that
  53 we'll have available two node attributes inside htools: the list of internal
  54 and external storage pools.
  55
  56 External storage and instances
  57 ------------------------------
  58
  59 Currently, for an instance we allow one cheap move type: failover to
  60 the current secondary, if it is a healthy node, and four other
  61 “expensive” (as in, including data copies) moves that involve changing
  62 either the secondary or the primary node or both.
  63
  64 In presence of an external storage type, the following things will
  65 change:
  66
  67 - the disk-based moves will be disallowed; this is already a feature
  68   in the algorithm, controlled by a boolean switch, so adapting
  69   external storage here will be trivial
  70 - instead of the current one secondary node, the secondaries will
  71   become a list of potential secondaries, based on access to the
  72   instance's storage pool
  73
  74 Except for this, the basic move algorithm remains unchanged.
  75
  76 External storage and nodes
  77 --------------------------
  78
  79 Two separate areas will have to change for nodes and external storage.
  80
  81 First, then allocating instances (either as part of a move or a new
  82 allocation), if the instance is using external storage, then the
  83 internal disk metrics should be ignored (for both the primary and
  84 secondary cases).
  85
  86 Second, the per-node metrics used in the cluster scoring must take
  87 into account that nodes might not have internal storage at all, and
  88 handle this as a well-balanced case (score 0).
  89
  90 N+1 status
  91 ----------
  92
  93 Currently, computing the N+1 status of a node is simple:
  94
  95 - group the current secondary instances by their primary node, and
  96   compute the sum of each instance group memory
  97 - choose the maximum sum, and check if it's smaller than the current
  98   available memory on this node
  99
 100 In effect, computing the N+1 status is a per-node matter. However,
 101 with shared storage, we don't have secondary nodes, just potential
 102 secondaries. Thus computing the N+1 status will be a cluster-level
 103 matter, and much more expensive.
 104
 105 A simple version of the N+1 checks would be that for each instance
 106 having said node as primary, we have enough memory in the cluster for
 107 relocation. This means we would actually need to run allocation
 108 checks, and update the cluster status from within allocation on one
 109 node, while being careful that we don't recursively check N+1 status
 110 during this relocation, which is too expensive.
 111
 112 However, the shared storage model has some properties that changes the
 113 rules of the computation. Speaking broadly (and ignoring hard
 114 restrictions like tag based exclusion and CPU limits), the exact
 115 location of an instance in the cluster doesn't matter as long as
 116 memory is available. This results in two changes:
 117
 118 - simply tracking the amount of free memory buckets is enough,
 119   cluster-wide
 120 - moving an instance from one node to another would not change the N+1
 121   status of any node, and only allocation needs to deal with N+1
 122   checks
 123
 124 Unfortunately, this very cheap solution fails in case of any other
 125 exclusion or prevention factors.
 126
 127 TODO: find a solution for N+1 checks.
 128
 129
 130 Node groups support
 131 ===================
 132
 133 The addition of node groups has a small impact on the actual
 134 algorithms, which will simply operate at node group level instead of
 135 cluster level, but it requires the addition of new algorithms for
 136 inter-node group operations.
 137
 138 The following two definitions will be used in the following
 139 paragraphs:
 140
 141 local group
 142   The local group refers to a node's own node group, or when speaking
 143   about an instance, the node group of its primary node
 144
 145 regular cluster
 146   A cluster composed of a single node group, or pre-2.3 cluster
 147
 148 super cluster
 149   This term refers to a cluster which comprises multiple node groups,
 150   as opposed to a 2.2 and earlier cluster with a single node group
 151
 152 In all the below operations, it's assumed that Ganeti can gather the
 153 entire super cluster state cheaply.
 154
 155
 156 Balancing changes
 157 -----------------
 158
 159 Balancing will move from cluster-level balancing to group
 160 balancing. In order to achieve a reasonable improvement in a super
 161 cluster, without needing to keep state of what groups have been
 162 already balanced previously, the balancing algorithm will run as
 163 follows:
 164
 165 #. the cluster data is gathered
 166 #. if this is a regular cluster, as opposed to a super cluster,
 167    balancing will proceed normally as previously
 168 #. otherwise, compute the cluster scores for all groups
 169 #. choose the group with the worst score and see if we can improve it;
 170    if not choose the next-worst group, so on
 171 #. once a group has been identified, run the balancing for it
 172
 173 Of course, explicit selection of a group will be allowed.
 174
 175 Super cluster operations
 176 ++++++++++++++++++++++++
 177
 178 Beside the regular group balancing, in a super cluster we have more
 179 operations.
 180
 181
 182 Redistribution
 183 ^^^^^^^^^^^^^^
 184
 185 In a regular cluster, once we run out of resources (offline nodes
 186 which can't be fully evacuated, N+1 failures, etc.) there is nothing
 187 we can do unless nodes are added or instances are removed.
 188
 189 In a super cluster however, there might be resources available in
 190 another group, so there is the possibility of relocating instances
 191 between groups to re-establish N+1 success within each group.
 192
 193 One difficulty in the presence of both super clusters and shared
 194 storage is that the move paths of instances are quite complicated;
 195 basically an instance can move inside its local group, and to any
 196 other groups which have access to the same storage type and storage
 197 pool pair. In effect, the super cluster is composed of multiple
 198 ‘partitions’, each containing one or more groups, but a node is
 199 simultaneously present in multiple partitions, one for each storage
 200 type and storage pool it supports. As such, the interactions between
 201 the individual partitions are too complex for non-trivial clusters to
 202 assume we can compute a perfect solution: we might need to move some
 203 instances using shared storage pool ‘A’ in order to clear some more
 204 memory to accept an instance using local storage, which will further
 205 clear more VCPUs in a third partition, etc. As such, we'll limit
 206 ourselves at simple relocation steps within a single partition.
 207
 208 Algorithm:
 209
 210 #. read super cluster data, and exit if cluster doesn't allow
 211    inter-group moves
 212 #. filter out any groups that are “alone” in their partition
 213    (i.e. no other group sharing at least one storage method)
 214 #. determine list of healthy versus unhealthy groups:
 215
 216     #. a group which contains offline nodes still hosting instances is
 217        definitely not healthy
 218     #. a group which has nodes failing N+1 is ‘weakly’ unhealthy
 219
 220 #. if either list is empty, exit (no work to do, or no way to fix problems)
 221 #. for each unhealthy group:
 222
 223     #. compute the instances that are causing the problems: all
 224        instances living on offline nodes, all instances living as
 225        secondary on N+1 failing nodes, all instances living as primaries
 226        on N+1 failing nodes (in this order)
 227     #. remove instances, one by one, until the source group is healthy
 228        again
 229     #. try to run a standard allocation procedure for each instance on
 230        all potential groups in its partition
 231     #. if all instances were relocated successfully, it means we have a
 232        solution for repairing the original group
 233
 234 Compression
 235 ^^^^^^^^^^^
 236
 237 In a super cluster which has had many instance reclamations, it is
 238 possible that while none of the groups is empty, overall there is
 239 enough empty capacity that an entire group could be removed.
 240
 241 The algorithm for “compressing” the super cluster is as follows:
 242
 243 #. read super cluster data
 244 #. compute total *(memory, disk, cpu)*, and free *(memory, disk, cpu)*
 245    for the super-cluster
 246 #. computer per-group used and free *(memory, disk, cpu)*
 247 #. select candidate groups for evacuation:
 248
 249     #. they must be connected to other groups via a common storage type
 250        and pool
 251     #. they must have fewer used resources than the global free
 252        resources (minus their own free resources)
 253
 254 #. for each of these groups, try to relocate all its instances to
 255    connected peer groups
 256 #. report the list of groups that could be evacuated, or if instructed
 257    so, perform the evacuation of the group with the largest free
 258    resources (i.e. in order to reclaim the most capacity)
 259
 260 Load balancing
 261 ^^^^^^^^^^^^^^
 262
 263 Assuming a super cluster using shared storage, where instance failover
 264 is cheap, it should be possible to do a load-based balancing across
 265 groups.
 266
 267 As opposed to the normal balancing, where we want to balance on all
 268 node attributes, here we should look only at the load attributes; in
 269 other words, compare the available (total) node capacity with the
 270 (total) load generated by instances in a given group, and computing
 271 such scores for all groups, trying to see if we have any outliers.
 272
 273 Once a reliable load-weighting method for groups exists, we can apply
 274 a modified version of the cluster scoring method to score not
 275 imbalances across nodes, but imbalances across groups which result in
 276 a super cluster load-related score.
 277
 278 Allocation changes
 279 ------------------
 280
 281 It is important to keep the allocation method across groups internal
 282 (in the Ganeti/Iallocator combination), instead of delegating it to an
 283 external party (e.g. a RAPI client). For this, the IAllocator protocol
 284 should be extended to provide proper group support.
 285
 286 For htools, the new algorithm will work as follows:
 287
 288 #. read/receive cluster data from Ganeti
 289 #. filter out any groups that do not supports the requested storage
 290    method
 291 #. for remaining groups, try allocation and compute scores after
 292    allocation
 293 #. sort valid allocation solutions accordingly and return the entire
 294    list to Ganeti
 295
 296 The rationale for returning the entire group list, and not only the
 297 best choice, is that we anyway have the list, and Ganeti might have
 298 other criteria (e.g. the best group might be busy/locked down, etc.)
 299 so even if from the point of view of resources it is the best choice,
 300 it might not be the overall best one.
 301
 302 Node evacuation changes
 303 -----------------------
 304
 305 While the basic concept in the ``multi-evac`` iallocator
 306 mode remains unchanged (it's a simple local group issue), when failing
 307 to evacuate and running in a super cluster, we could have resources
 308 available elsewhere in the cluster for evacuation.
 309
 310 The algorithm for computing this will be the same as the one for super
 311 cluster compression and redistribution, except that the list of
 312 instances is fixed to the ones living on the nodes to-be-evacuated.
 313
 314 If the inter-group relocation is successful, the result to Ganeti will
 315 not be a local group evacuation target, but instead (for each
 316 instance) a pair *(remote group, nodes)*. Ganeti itself will have to
 317 decide (based on user input) whether to continue with inter-group
 318 evacuation or not.
 319
 320 In case that Ganeti doesn't provide complete cluster data, just the
 321 local group, the inter-group relocation won't be attempted.
 322
 323 .. vim: set textwidth=72 :
 324 .. Local Variables:
 325 .. mode: rst
 326 .. fill-column: 72
 327 .. End: