=============
HSqueeze tool
=============

.. contents:: :depth: 4

This is a design document detailing the node-freeing scheduler, HSqueeze.


Current state and shortcomings
==============================

Externally-mirrored instances can be moved between nodes at low
cost. Therefore, it is attractive to free up nodes and power them down
at times of low usage, even for short periods of time, like nights or
weekends.

Currently, the best way to find a suitable set of nodes to shut down
is to use the property of our balancedness metric to move instances
away from drained nodes. So, one would manually drain more and more
nodes and see whether `hbal` could find a solution freeing up all those
drained nodes.


Proposed changes
================

We propose the addition of a new htool command-line tool, called
`hsqueeze`, that aims at keeping resource usage at a constant high
level by evacuating and powering down nodes, or powering up nodes and
rebalancing, as appropriate. By default, only externally-mirrored
instances are moved, but options are provided to additionally take
DRBD instances (which can be moved without downtime), or even all
instances, into consideration.

Tagging of standby nodes
------------------------

Powering down nodes that are technically healthy effectively creates a
new node state: nodes on standby. To avoid further state
proliferation, and as this information is only used by `hsqueeze`,
it is recorded in node tags. `hsqueeze` will assume
that offline nodes having a tag with prefix `htools:standby:` can
easily be powered on at any time.
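
As an illustration only, the tag convention described above can be
sketched in Python. The `Node` record and the `is_standby` helper are
hypothetical names introduced for this sketch; they are not part of the
htools code base, and only the `htools:standby:` prefix is taken from
this design.

```python
from dataclasses import dataclass, field

STANDBY_PREFIX = "htools:standby:"


@dataclass
class Node:
    """Hypothetical, simplified node record (not the htools type)."""
    name: str
    offline: bool
    tags: set = field(default_factory=set)


def is_standby(node):
    """True for an offline node carrying any tag with the standby prefix."""
    return node.offline and any(t.startswith(STANDBY_PREFIX) for t in node.tags)


nodes = [
    Node("node1", offline=False),
    Node("node2", offline=True, tags={"htools:standby:auto"}),
    Node("node3", offline=True),  # offline for another reason, e.g. repair
]
standby_names = [n.name for n in nodes if is_standby(n)]
```

In this sketch an offline node without a standby tag is not considered
a candidate for powering on, and an online node is never treated as
standby, whatever its tags.
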

Minimum available resources
---------------------------

To keep the squeezed cluster functional, a minimal amount of resources
will be left available on every node. While the precise amount will
be specifiable via command-line options, a sensible default is chosen,
like enough resources to start an additional instance at standard
allocation on each node. If the available resources fall below this
limit, `hsqueeze` will, in fact, try to power on more nodes, until
enough resources are available, or all standby nodes are online.

To avoid flapping behavior, a second, higher, amount of reserve
resources can be specified, and `hsqueeze` will only power down nodes
if, after the power down, this higher amount of reserve resources is
still available.
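
The two limits form a hysteresis band. The following Python sketch
illustrates the decision only. The function name `decide`, the use of a
single scalar as the measure of free resources, and the numbers are
assumptions made for illustration; they do not describe the actual
`hsqueeze` interface.

```python
def decide(free_now, free_after_power_down, min_reserve, target_reserve):
    """Return 'power-up', 'power-down' or 'keep' for a scalar resource measure.

    min_reserve is the lower limit; target_reserve is the second, higher
    limit that must still hold after a power-down. All names are hypothetical.
    """
    if target_reserve < min_reserve:
        raise ValueError("target_reserve must not be below min_reserve")
    if free_now < min_reserve:
        return "power-up"
    if free_after_power_down >= target_reserve:
        return "power-down"
    return "keep"


decisions = [
    decide(free_now=1, free_after_power_down=0, min_reserve=2, target_reserve=4),
    decide(free_now=3, free_after_power_down=1, min_reserve=2, target_reserve=4),
    decide(free_now=9, free_after_power_down=5, min_reserve=2, target_reserve=4),
]
```

Between the two limits the sketch returns `keep`, which is what
prevents a node from being powered down and immediately powered up
again.
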

Computation of the set to free up
---------------------------------

To determine which nodes can be powered down, `hsqueeze` basically
follows the same algorithm as the manual process. It greedily goes
through all non-master nodes and checks whether the algorithm used by `hbal`
would find a solution (with the appropriate move restriction) that
frees up the extended set of nodes to be drained, while keeping enough
resources free. As `hsqueeze` is based on the algorithm used by `hbal`, all
restrictions respected by `hbal`, in particular memory reservation
for N+1 redundancy, are also respected by `hsqueeze`.
The order in which the nodes are tried is chosen by a
suitable heuristic, like trying the nodes in order of increasing
number of instances; the hope is that this reduces the number of
instances that actually have to be moved.
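
A minimal Python sketch of this greedy loop follows. The callback
`can_free` is a hypothetical stand-in for the check "the algorithm used
by `hbal` finds a solution that empties these nodes while keeping
enough resources free"; the toy slot-counting check used in the example
is an assumption for illustration and is much simpler than what `hbal`
does.

```python
def nodes_to_free(instances_per_node, master, can_free):
    """Greedily grow the set of nodes to drain.

    instances_per_node maps a node name to its number of instances.
    can_free(candidate_set) is a hypothetical stand-in for the hbal-based
    feasibility check described in the text.
    """
    freed = set()
    # Heuristic from the text: try nodes with few instances first
    # (ties broken by name to keep the order deterministic).
    order = sorted(
        (n for n in instances_per_node if n != master),
        key=lambda n: (instances_per_node[n], n),
    )
    for node in order:
        candidate = freed | {node}
        if can_free(candidate):
            freed = candidate
    return freed


# Toy feasibility check (an assumption, not hbal): every instance needs one
# slot, each remaining node offers 4 slots, one of which must stay free.
load = {"node1": 3, "node2": 1, "node3": 2, "node4": 0}


def toy_can_free(candidate):
    remaining = [n for n in load if n not in candidate]
    usable_slots = 3 * len(remaining)
    return bool(remaining) and sum(load.values()) <= usable_slots


result = nodes_to_free(load, master="node1", can_free=toy_can_free)
```

With these toy numbers the loop accepts `node4` and `node2` and rejects
`node3`, because the six instances would no longer fit on the master
alone.
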

If the amount of free resources has fallen below the lower limit,
`hsqueeze` will determine the set of nodes to power up in a similar
way; it will hypothetically add more and more of the standby
nodes (in some suitable order) until the algorithm used by `hbal`
balances the cluster in a way that enough resources are available,
or all standby nodes are online.
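
The power-up direction can be sketched in the same spirit. Again,
`enough_after_balancing` is a hypothetical callback standing in for
"the algorithm used by `hbal` balances the cluster so that enough
resources are available", and the slot arithmetic in the example is an
assumption for illustration.

```python
def nodes_to_power_up(standby, enough_after_balancing):
    """Add standby nodes one by one until the check passes.

    standby lists the standby node names in the preferred order.
    enough_after_balancing(extra_nodes) is a hypothetical stand-in for the
    hbal-based check. If the check never passes, all standby nodes are
    returned.
    """
    chosen = []
    for node in standby:
        if enough_after_balancing(chosen):
            break
        chosen.append(node)
    return chosen


# Toy check (an assumption): 1 slot is free now, each powered-on node adds
# 2 free slots, and 5 free slots are required.
def toy_enough(extra_nodes):
    return 1 + 2 * len(extra_nodes) >= 5


to_power_up = nodes_to_power_up(["node5", "node6", "node7"], toy_enough)
```

Here two of the three standby nodes suffice, so `node7` stays powered
down.
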


Instance moves and execution
----------------------------

Once the final set of nodes to power down is determined, the instance
moves are determined by the algorithm used by `hbal`. If
requested by the `-X` option, the nodes freed up are drained, and the
instance moves are executed in the same way as `hbal` does. Finally,
those of the freed-up nodes that do not already have a
`htools:standby:` tag are tagged as `htools:standby:auto`, and all
freed-up nodes are marked as offline and powered down via the
:doc:`design-oob`.

Similarly, if it is determined that nodes need to be added, then first
the nodes are powered up via the :doc:`design-oob`, then they are marked
as online and, finally,
the cluster is balanced in the same way as `hbal` would do. For the
newly powered up nodes, the `htools:standby:auto` tag, if present, is
removed, but no other tags are removed (including other
`htools:standby:` tags).
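
The tag handling in both directions can be summarized by the following
Python sketch. The function names are hypothetical, and
`htools:standby:manual` is an invented example of an
administrator-set tag; only the prefix and the `htools:standby:auto`
tag come from this design.

```python
STANDBY_PREFIX = "htools:standby:"
AUTO_TAG = "htools:standby:auto"


def tags_after_power_down(tags):
    """Add the automatic tag only if no standby tag is present yet."""
    tags = set(tags)
    if not any(t.startswith(STANDBY_PREFIX) for t in tags):
        tags.add(AUTO_TAG)
    return tags


def tags_after_power_up(tags):
    """Remove only the automatic tag; every other tag is kept."""
    return set(tags) - {AUTO_TAG}


untagged = tags_after_power_down({"rack:a"})
pretagged = tags_after_power_down({"htools:standby:manual"})
powered_up = tags_after_power_up({AUTO_TAG, "htools:standby:manual", "rack:a"})
```

A node that was tagged by hand therefore keeps its standby tag across
a power cycle, while a node tagged automatically returns to its
original tag set.
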


Design choices
==============

The proposed algorithm builds on top of the already present balancing
algorithm, instead of greedily packing nodes as full as possible. The
reason is that, in the end, a balanced cluster is needed anyway;
therefore, building on the balancing algorithm reduces the number of
instance moves. Additionally, the final configuration will also
benefit from all improvements to the balancing algorithm, like taking
dynamic CPU data into account.

We decided to have a separate program instead of adding an option to
`hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is
not unlikely that, over time, additional `hsqueeze`-specific options
might be added, specifying, e.g., which nodes to prefer for
shutdown. With the approach of the `htools` of having a single binary
showing different behaviors, having an additional program also does not
introduce significant additional cost.

We decided to have a whole prefix instead of a single tag reserved
for marking standby nodes (we consider all tags starting with
`htools:standby:` as serving only this purpose). This is not only in
accordance with the tag reservations for other tools, but it also
allows for further extension (like specifying priorities on which
nodes to power up first) without changing name spaces.