=============
HSqueeze tool
=============

.. contents:: :depth: 4

This is a design document detailing the node-freeing scheduler, HSqueeze.

Current state and shortcomings
==============================

Externally-mirrored instances can be moved between nodes at low
cost. Therefore, it is attractive to free up nodes and power them down
at times of low usage, even for short periods of time, such as nights
or weekends.

Currently, the best way to find a suitable set of nodes to shut down
is to exploit the property of our balancedness metric that instances
are moved away from drained nodes. So one would manually drain more
and more nodes and check whether `hbal` finds a solution that frees up
all those drained nodes.


Proposed changes
================

We propose the addition of a new htool command-line tool, called
`hsqueeze`, that aims at keeping resource usage at a constant high
level by evacuating and powering down nodes, or powering up nodes and
rebalancing, as appropriate. By default, only externally-mirrored
instances are moved, but options are provided to additionally take
DRBD instances (which can be moved without downtime), or even all
instances, into consideration.

Tagging of standby nodes
------------------------

Powering down nodes that are technically healthy effectively creates a
new node state: nodes on standby. To avoid further state
proliferation, and as this information is only used by `hsqueeze`, it
is recorded in node tags. `hsqueeze` will assume that offline nodes
carrying a tag with the prefix `htools:standby:` can easily be powered
on at any time.
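As an illustration of this convention, a node counts as "standby" if it
is offline and carries at least one tag with the reserved prefix. The
helper below is a hypothetical sketch, not part of the actual tool:

```python
# Sketch of the tag convention described above: a node is on standby
# iff it is offline and carries a tag starting with the reserved prefix.
STANDBY_PREFIX = "htools:standby:"

def is_standby(offline, tags):
    """Return True if an offline node is marked as easily powered on."""
    return offline and any(t.startswith(STANDBY_PREFIX) for t in tags)
```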

Minimum available resources
---------------------------

To keep the squeezed cluster functional, a minimal amount of resources
will be left available on every node. While the precise amount will be
specifiable via command-line options, a sensible default is chosen,
such as enough resources to start one additional instance at standard
allocation on each node. If the available resources fall below this
limit, `hsqueeze` will instead try to power on more nodes, until
enough resources are available or all standby nodes are online.

To avoid flapping behavior, a second, higher amount of reserve
resources can be specified, and `hsqueeze` will only power down nodes
if this higher amount of reserve resources is still available after
the power-down.
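The two-threshold hysteresis described above can be sketched as
follows; the threshold values and the decision function are
illustrative assumptions, not part of the actual `hsqueeze` interface:

```python
# Illustrative sketch of the two-threshold (hysteresis) policy.
# The names and values are assumptions for the example only.
LOWER_LIMIT = 10  # minimum free resources the cluster must keep
UPPER_LIMIT = 25  # reserve required *after* any further power-down

def decide(free_now, free_after_powerdown):
    """Return the action an hsqueeze-like policy would take."""
    if free_now < LOWER_LIMIT:
        return "power-up"    # below the hard minimum: bring nodes back
    if free_after_powerdown >= UPPER_LIMIT:
        return "power-down"  # squeezing still leaves a comfortable reserve
    return "no-op"           # between the limits: do nothing, avoid flapping
```

Powering down only when the *higher* limit would still be met, while
powering up as soon as the *lower* limit is violated, leaves a gap in
which no action is taken, which is what prevents flapping.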

Computation of the set to free up
---------------------------------

To determine which nodes can be powered down, `hsqueeze` basically
follows the same algorithm as the manual process. It greedily goes
through all non-master nodes and checks whether the algorithm used by
`hbal` would find a solution (with the appropriate move restriction)
that frees up the extended set of nodes to be drained, while keeping
enough resources free. Being based on the algorithm used by `hbal`,
all restrictions respected by `hbal`, in particular memory reservation
for N+1 redundancy, are also respected by `hsqueeze`. The order in
which the nodes are tried is chosen by a suitable heuristic, such as
trying the nodes in order of increasing number of instances; the hope
is that this reduces the number of instances that actually have to be
moved.

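A minimal sketch of this greedy selection, where the `can_free`
predicate stands in for the `hbal`-based feasibility check (it is a
placeholder assumption, not the real algorithm):

```python
# Sketch of the greedy node-selection loop described above. "can_free"
# stands in for the hbal-based check: does a balancing solution exist
# that evacuates all candidate nodes while keeping enough resources free?
def select_nodes_to_free(nodes, instance_count, can_free):
    # Heuristic from the text: try nodes with fewer instances first.
    candidates = sorted(nodes, key=lambda n: instance_count[n])
    freed = []
    for node in candidates:
        if can_free(freed + [node]):
            freed.append(node)  # extend the set of nodes to be drained
    return freed
```
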

If the amount of free resources has fallen below the lower limit,
`hsqueeze` will determine the set of nodes to power up in a similar
way; it will hypothetically add more and more of the standby nodes (in
some suitable order) until the algorithm used by `hbal` finds a
balancing of the cluster in which enough resources are available, or
all standby nodes are online.

Instance moves and execution
----------------------------

Once the final set of nodes to power down is determined, the instance
moves are computed by the algorithm used by `hbal`. If requested by
the `-X` option, the nodes to be freed up are drained and the instance
moves are executed in the same way as `hbal` does. Finally, those of
the freed-up nodes that do not already have a `htools:standby:` tag
are tagged as `htools:standby:auto`, all freed-up nodes are marked as
offline, and they are powered down via the mechanisms described in
:doc:`design-oob`.

Similarly, if it is determined that nodes need to be added, the nodes
are first powered up via the mechanisms described in
:doc:`design-oob`, then they are marked as online, and finally the
cluster is balanced in the same way `hbal` would do it. For the newly
powered-up nodes, the `htools:standby:auto` tag, if present, is
removed, but no other tags are removed (including other
`htools:standby:` tags).
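The tag handling on power-up can be illustrated as follows; this is a
sketch of the rule stated above, not the actual implementation:

```python
# Sketch of the tag rule described above: on power-up, drop exactly the
# automatically added "htools:standby:auto" tag and keep everything else,
# including manually set "htools:standby:" tags.
AUTO_TAG = "htools:standby:auto"

def tags_after_powerup(tags):
    return [t for t in tags if t != AUTO_TAG]
```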


Design choices
==============

The proposed algorithm builds on top of the already present balancing
algorithm, instead of greedily packing nodes as full as possible. The
reason is that, in the end, a balanced cluster is needed anyway;
therefore, basing the tool on the balancing algorithm reduces the
number of instance moves. Additionally, the final configuration will
also benefit from all improvements to the balancing algorithm, such as
taking dynamic CPU data into account.

We decided to have a separate program instead of adding an option to
`hbal` in order to keep the interfaces, especially that of `hbal`,
cleaner. It is not unlikely that, over time, additional
`hsqueeze`-specific options might be added, specifying, e.g., which
nodes to prefer for shutdown. With the `htools` approach of having a
single binary showing different behaviors, an additional program also
does not introduce significant additional cost.

We decided to reserve a whole prefix instead of a single tag for
marking standby nodes (we consider all tags starting with
`htools:standby:` as serving only this purpose). This is not only in
accordance with the tag reservations for other tools, but it also
allows for further extension (such as specifying priorities on which
nodes to power up first) without changing name spaces.