=============
HSqueeze tool
=============

.. contents:: :depth: 4

This is a design document detailing the node-freeing scheduler, HSqueeze.


Current state and shortcomings
==============================

Externally-mirrored instances can be moved between nodes at low
cost. Therefore, it is attractive to free up nodes and power them down
at times of low usage, even for short periods of time, like nights or
weekends.

Currently, the best way to find a suitable set of nodes to shut down
is to exploit the tendency of our balancedness metric to move instances
away from drained nodes. So, one would manually drain more and more
nodes and see whether `hbal` could find a solution freeing up all those
drained nodes.


Proposed changes
================

We propose the addition of a new htool command-line tool, called
`hsqueeze`, that aims at keeping resource usage at a constant high
level by evacuating and powering down nodes, or powering up nodes and
rebalancing, as appropriate. By default, only externally-mirrored
instances are moved, but options are provided to additionally take
DRBD instances (which can be moved without downtime), or even all
instances, into consideration.

Tagging of standby nodes
------------------------

Powering down nodes that are technically healthy effectively creates a
new node state: nodes on standby. To avoid further state
proliferation, and as this information is only used by `hsqueeze`,
it is recorded in node tags. `hsqueeze` will assume
that offline nodes having a tag with prefix `htools:standby:` can
easily be powered on at any time.
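The rule above can be illustrated with a minimal Python sketch. The `Node`
class, its fields, and the sample node names are hypothetical and exist only
for this illustration; only the `htools:standby:` prefix is taken from this
design.

```python
from dataclasses import dataclass, field

# Tag prefix reserved by this design for marking standby nodes.
STANDBY_PREFIX = "htools:standby:"


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a cluster node."""
    name: str
    offline: bool
    tags: set = field(default_factory=set)


def is_standby(node):
    """True for an offline node carrying at least one standby tag."""
    return node.offline and any(
        tag.startswith(STANDBY_PREFIX) for tag in node.tags
    )


nodes = [
    Node("node1", offline=False),
    Node("node2", offline=True, tags={"htools:standby:auto"}),
    Node("node3", offline=True),  # offline for some other reason
]
standby_names = [n.name for n in nodes if is_standby(n)]
```

In this sketch, an offline node without a standby tag (`node3`) is not
treated as available for power-on.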

Minimum available resources
---------------------------

To keep the squeezed cluster functional, a minimal amount of resources
will be left available on every node. While the precise amount will
be specifiable via command-line options, a sensible default is chosen,
like enough resources to start an additional instance at standard
allocation on each node. If the available resources fall below this
limit, `hsqueeze` will try to power on more nodes until
enough resources are available, or all standby nodes are online.

To avoid flapping behavior, a second, higher amount of reserve
resources can be specified, and `hsqueeze` will only power down nodes
if this higher amount of reserve resources is still available after
the power down.
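The interplay of the two limits can be sketched as follows. This is only an
illustration under simplifying assumptions: resources are collapsed into a
single number (for example, how many additional standard instances still
fit), and the function and parameter names are hypothetical, not actual
`hsqueeze` options.

```python
def decide(free_now, free_after_power_down, min_reserve, target_reserve):
    """Pick an action from two thresholds (a simple hysteresis).

    free_now:              reserve currently available
    free_after_power_down: reserve that would remain after powering
                           down the candidate nodes
    min_reserve:           lower limit; below it, nodes are powered up
    target_reserve:        higher limit that must still hold after a
                           power down
    """
    if target_reserve < min_reserve:
        raise ValueError("target_reserve must not be below min_reserve")
    if free_now < min_reserve:
        return "power-up"
    if free_after_power_down >= target_reserve:
        return "power-down"
    return "keep"
```

Because the power-down limit is higher than the power-up limit, a cluster
that has just been squeezed does not immediately fall below the lower limit
and trigger a power-up again.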

Computation of the set to free up
---------------------------------

To determine which nodes can be powered down, `hsqueeze` basically
follows the same algorithm as the manual process. It greedily goes
through all non-master nodes and checks whether the algorithm used by
`hbal` would find a solution (with the appropriate move restriction) that
frees up the extended set of nodes to be drained, while keeping enough
resources free. Being based on the algorithm used by `hbal`, `hsqueeze`
also respects all restrictions respected by `hbal`, in particular memory
reservation for N+1 redundancy.
The order in which the nodes are tried is chosen by a
suitable heuristic, like trying the nodes in order of increasing
number of instances; the hope is that this reduces the number of
instances that actually have to be moved.
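The greedy selection can be sketched in Python as below. The callback
`can_free` is a hypothetical stand-in for "the algorithm used by `hbal`
finds a solution that frees these nodes while keeping enough resources
free"; it is not an existing API. The caller is assumed to pass only
non-master nodes.

```python
def select_nodes_to_free(candidates, instance_count, can_free):
    """Greedily extend the set of nodes to drain.

    candidates:     names of non-master nodes
    instance_count: mapping from node name to its number of instances
    can_free:       hypothetical oracle taking a list of node names and
                    returning True if all of them can be freed
    """
    # Heuristic from the text: try nodes with fewer instances first.
    ordered = sorted(candidates, key=lambda name: instance_count[name])
    to_free = []
    for name in ordered:
        trial = to_free + [name]
        if can_free(trial):
            to_free = trial  # keep the extended set
    return to_free


# Toy oracle: at most two nodes can be freed, and node-b never can.
counts = {"node-a": 3, "node-b": 0, "node-c": 1, "node-d": 2}
freed = select_nodes_to_free(
    list(counts),
    counts,
    lambda names: len(names) <= 2 and "node-b" not in names,
)
```

A node that cannot be added to the set is simply skipped, and later
candidates are still tried against the set accumulated so far.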

If the amount of free resources has fallen below the lower limit,
`hsqueeze` will determine the set of nodes to power up in a similar
way; it will hypothetically add more and more of the standby
nodes (in some suitable order) until the algorithm used by `hbal`
balances the cluster in a way that leaves enough resources available,
or all standby nodes are online.
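A corresponding sketch for the power-up direction follows. As before,
`enough_after_adding` is a hypothetical stand-in for running the balancing
algorithm with the given standby nodes hypothetically online and checking
the lower resource limit; the order of `standby` is assumed to have been
chosen by the caller.

```python
def select_nodes_to_power_up(standby, enough_after_adding):
    """Add standby nodes one by one until resources suffice.

    standby:             standby node names, in the order to try them
    enough_after_adding: hypothetical oracle taking the list of nodes
                         hypothetically brought online and returning
                         True if enough resources are then available
    """
    chosen = []
    for name in standby:
        if enough_after_adding(chosen):
            break  # current selection already suffices
        chosen.append(name)
    # If the loop runs out, all standby nodes are selected.
    return chosen


# Toy oracle: two additional nodes are enough.
powered_up = select_nodes_to_power_up(
    ["s1", "s2", "s3"], lambda added: len(added) >= 2
)
```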


Instance moves and execution
----------------------------

Once the final set of nodes to power down is determined, the instance
moves are determined by the algorithm used by `hbal`. If
requested by the `-X` option, the nodes freed up are drained, and the
instance moves are executed in the same way as `hbal` does. Finally,
those of the freed-up nodes that do not already have a
`htools:standby:` tag are tagged as `htools:standby:auto`, and all
freed-up nodes are marked as offline and powered down via the
:doc:`design-oob`.

Similarly, if it is determined that nodes need to be added, then first
the nodes are powered up via the :doc:`design-oob`, then they are marked
as online, and finally
the cluster is balanced in the same way as `hbal` would do. For the
newly powered up nodes, the `htools:standby:auto` tag, if present, is
removed, but no other tags are removed (including other
`htools:standby:` tags).
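The tag bookkeeping described above can be sketched as two pure functions on
tag sets. The function names are hypothetical, and `htools:standby:manual`
in the usage lines is only an example of some other tag under the reserved
prefix; it is not defined by this design.

```python
STANDBY_PREFIX = "htools:standby:"
AUTO_TAG = "htools:standby:auto"


def tags_after_power_down(tags):
    """Add the auto tag unless some standby tag is already present."""
    if any(tag.startswith(STANDBY_PREFIX) for tag in tags):
        return set(tags)
    return set(tags) | {AUTO_TAG}


def tags_after_power_up(tags):
    """Remove only the auto tag; all other tags are kept."""
    return set(tags) - {AUTO_TAG}


# A node without standby tags gets the auto tag on power down ...
down = tags_after_power_down({"rack:1"})
# ... and an administrator-set standby tag survives a power up.
up = tags_after_power_up({AUTO_TAG, "htools:standby:manual"})
```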


Design choices
==============

The proposed algorithm builds on top of the already present balancing
algorithm, instead of greedily packing nodes as full as possible. The
reason is that, in the end, a balanced cluster is needed anyway;
therefore, building on the balancing algorithm reduces the number of
instance moves. Additionally, the final configuration will also
benefit from all improvements to the balancing algorithm, like taking
dynamic CPU data into account.

We decided to have a separate program instead of adding an option to
`hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is
quite possible that, over time, additional `hsqueeze`-specific options
will be added, specifying, e.g., which nodes to prefer for
shutdown. With the approach of the `htools` of having a single binary
showing different behaviors, having an additional program also does not
introduce significant additional cost.

We decided to have a whole prefix instead of a single tag reserved
for marking standby nodes (we consider all tags starting with
`htools:standby:` as serving only this purpose). This is not only in
accordance with the tag
reservations for other tools, but it also allows for further extension
(like specifying priorities on which nodes to power up first) without
changing namespaces.