=============
HSqueeze tool
=============

.. contents:: :depth: 4

This is a design document detailing the node-freeing scheduler, HSqueeze.

Current state and shortcomings
==============================

Externally-mirrored instances can be moved between nodes at low
cost. Therefore, it is attractive to free up nodes and power them down
at times of low usage, even for short periods of time, such as nights
or weekends.

Currently, the best way to find a suitable set of nodes to shut down
is to exploit the property of our balancedness metric that instances
are moved away from drained nodes. So one would manually drain more
and more nodes and check whether `hbal` finds a solution that frees up
all those drained nodes.


Proposed changes
================

We propose the addition of a new htool command-line tool, called
`hsqueeze`, that aims at keeping resource usage at a constant high
level by evacuating and powering down nodes, or powering up nodes and
rebalancing, as appropriate. By default, only externally-mirrored
instances are moved, but options are provided to additionally take
DRBD instances (which can be moved without downtime), or even all
instances, into consideration.

Tagging of standby nodes
------------------------

Powering down nodes that are technically healthy effectively creates a
new node state: nodes on standby. To avoid further state
proliferation, and as this information is only used by `hsqueeze`, it
is recorded in node tags. `hsqueeze` will assume that offline nodes
carrying a tag with the prefix `htools:standby:` can easily be powered
on at any time.
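As an illustration of this convention, a node counts as "standby" if it
is offline and carries at least one tag with the reserved prefix. The
helper below is a hypothetical sketch, not part of the actual tool:

```python
# Sketch of the tag convention described above: a node is on standby
# iff it is offline and carries a tag starting with the reserved prefix.
STANDBY_PREFIX = "htools:standby:"

def is_standby(offline, tags):
    """Return True if an offline node is marked as easily powered on."""
    return offline and any(t.startswith(STANDBY_PREFIX) for t in tags)
```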

Minimum available resources
---------------------------

To keep the squeezed cluster functional, a minimal amount of resources
will be left available on every node. While the precise amount will be
specifiable via command-line options, a sensible default is chosen,
such as enough resources to start one additional instance at standard
allocation on each node. If the available resources fall below this
limit, `hsqueeze` will instead try to power on more nodes, until
enough resources are available or all standby nodes are online.

To avoid flapping behavior, a second, higher amount of reserve
resources can be specified, and `hsqueeze` will only power down nodes
if this higher amount of reserve resources is still available after
the power-down.
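The two-threshold hysteresis described above can be sketched as
follows; the threshold values and the decision function are
illustrative assumptions, not part of the actual `hsqueeze` interface:

```python
# Illustrative sketch of the two-threshold (hysteresis) policy.
# The names and values are assumptions for the example only.
LOWER_LIMIT = 10  # minimum free resources the cluster must keep
UPPER_LIMIT = 25  # reserve required *after* any further power-down

def decide(free_now, free_after_powerdown):
    """Return the action an hsqueeze-like policy would take."""
    if free_now < LOWER_LIMIT:
        return "power-up"    # below the hard minimum: bring nodes back
    if free_after_powerdown >= UPPER_LIMIT:
        return "power-down"  # squeezing still leaves a comfortable reserve
    return "no-op"           # between the limits: do nothing, avoid flapping
```

Powering down only when the *higher* limit would still be met, while
powering up as soon as the *lower* limit is violated, leaves a gap in
which no action is taken, which is what prevents flapping.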

Computation of the set to free up
---------------------------------

To determine which nodes can be powered down, `hsqueeze` basically
follows the same algorithm as the manual process. It greedily goes
through all non-master nodes and checks whether the algorithm used by
`hbal` would find a solution (with the appropriate move restriction)
that frees up the extended set of nodes to be drained, while keeping
enough resources free. Being based on the algorithm used by `hbal`,
all restrictions respected by `hbal`, in particular memory reservation
for N+1 redundancy, are also respected by `hsqueeze`. The order in
which the nodes are tried is chosen by a suitable heuristic, such as
trying the nodes in order of increasing number of instances; the hope
is that this reduces the number of instances that actually have to be
moved.

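A minimal sketch of this greedy selection, where the `can_free`
predicate stands in for the `hbal`-based feasibility check (it is a
placeholder assumption, not the real algorithm):

```python
# Sketch of the greedy node-selection loop described above. "can_free"
# stands in for the hbal-based check: does a balancing solution exist
# that evacuates all candidate nodes while keeping enough resources free?
def select_nodes_to_free(nodes, instance_count, can_free):
    # Heuristic from the text: try nodes with fewer instances first.
    candidates = sorted(nodes, key=lambda n: instance_count[n])
    freed = []
    for node in candidates:
        if can_free(freed + [node]):
            freed.append(node)  # extend the set of nodes to be drained
    return freed
```
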

If the amount of free resources has fallen below the lower limit,
`hsqueeze` will determine the set of nodes to power up in a similar
way; it will hypothetically add more and more of the standby nodes (in
some suitable order) until the algorithm used by `hbal` finds a
balancing of the cluster in which enough resources are available, or
all standby nodes are online.

Instance moves and execution
----------------------------

Once the final set of nodes to power down is determined, the instance
moves are computed by the algorithm used by `hbal`. If requested by
the `-X` option, the nodes to be freed up are drained and the instance
moves are executed in the same way as `hbal` does. Finally, those of
the freed-up nodes that do not already have a `htools:standby:` tag
are tagged as `htools:standby:auto`, all freed-up nodes are marked as
offline, and they are powered down via the mechanisms described in
:doc:`design-oob`.

Similarly, if it is determined that nodes need to be added, the nodes
are first powered up via the mechanisms described in
:doc:`design-oob`, then they are marked as online, and finally the
cluster is balanced in the same way `hbal` would do it. For the newly
powered-up nodes, the `htools:standby:auto` tag, if present, is
removed, but no other tags are removed (including other
`htools:standby:` tags).
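The tag handling on power-up can be illustrated as follows; this is a
sketch of the rule stated above, not the actual implementation:

```python
# Sketch of the tag rule described above: on power-up, drop exactly the
# automatically added "htools:standby:auto" tag and keep everything else,
# including manually set "htools:standby:" tags.
AUTO_TAG = "htools:standby:auto"

def tags_after_powerup(tags):
    return [t for t in tags if t != AUTO_TAG]
```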


Design choices
==============

The proposed algorithm builds on top of the already present balancing
algorithm, instead of greedily packing nodes as full as possible. The
reason is that, in the end, a balanced cluster is needed anyway;
therefore, basing the tool on the balancing algorithm reduces the
number of instance moves. Additionally, the final configuration will
also benefit from all improvements to the balancing algorithm, such as
taking dynamic CPU data into account.

We decided to have a separate program instead of adding an option to
`hbal` in order to keep the interfaces, especially that of `hbal`,
cleaner. It is not unlikely that, over time, additional
`hsqueeze`-specific options might be added, specifying, e.g., which
nodes to prefer for shutdown. With the `htools` approach of having a
single binary showing different behaviors, an additional program also
does not introduce significant additional cost.

We decided to reserve a whole prefix instead of a single tag for
marking standby nodes (we consider all tags starting with
`htools:standby:` as serving only this purpose). This is not only in
accordance with the tag reservations for other tools, but it also
allows for further extension (such as specifying priorities on which
nodes to power up first) without changing name spaces.