=============
HSqueeze tool
=============

.. contents:: :depth: 4

This is a design document detailing the node-freeing scheduler, HSqueeze.



Current state and shortcomings
==============================

Externally-mirrored instances can be moved between nodes at low
cost. Therefore, it is attractive to free up nodes and power them down
at times of low usage, even for short periods of time, like nights or
weekends.

Currently, the best way to find a suitable set of nodes to shut down
is to use the property of our balancedness metric to move instances
away from drained nodes. So, one would manually drain more and more
nodes and see whether `hbal` could find a solution freeing up all those
drained nodes.



Proposed changes
================

We propose the addition of a new htool command-line tool, called
`hsqueeze`, that aims at keeping resource usage at a constant high
level by evacuating and powering down nodes, or powering up nodes and
rebalancing, as appropriate. By default, only externally-mirrored
instances are moved, but options are provided to additionally take
DRBD instances (which can be moved without downtime), or even all
instances, into consideration.

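
The move restriction described above can be pictured as a simple filter
on instances. The following is only a sketch: the names `Mirror`,
`MoveRestriction`, `Instance` and `may_move` are made up for
illustration and are not taken from the htools source or its
command-line interface.

```python
from dataclasses import dataclass
from enum import Enum


class Mirror(Enum):
    """Hypothetical mirroring categories, for illustration only."""
    EXTERNAL = "external"   # externally-mirrored storage
    DRBD = "drbd"           # DRBD; can be moved without downtime
    NONE = "none"           # anything else


class MoveRestriction(Enum):
    """Hypothetical stand-ins for the three choices described above."""
    EXTERNAL_ONLY = 1       # the default
    ALLOW_DRBD = 2
    ALLOW_ALL = 3


@dataclass(frozen=True)
class Instance:
    name: str
    mirror: Mirror


def may_move(instance, restriction=MoveRestriction.EXTERNAL_ONLY):
    """Return True if the instance may be moved under the restriction."""
    if restriction is MoveRestriction.ALLOW_ALL:
        return True
    if instance.mirror is Mirror.EXTERNAL:
        return True
    # DRBD instances are considered only when explicitly allowed.
    return (restriction is MoveRestriction.ALLOW_DRBD
            and instance.mirror is Mirror.DRBD)
```

With the default restriction, `may_move` accepts only instances whose
mirroring category is `Mirror.EXTERNAL`.
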

Tagging of standby nodes
------------------------

Powering down nodes that are technically healthy effectively creates a
new node state: nodes on standby. To avoid further state
proliferation, and as this information is only used by `hsqueeze`,
it is recorded in node tags. `hsqueeze` will assume
that offline nodes having a tag with prefix `htools:standby:` can
easily be powered on at any time.

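
As a minimal sketch of this rule, a node counts as being on standby if
it is offline and carries at least one tag with the reserved prefix.
The `Node` record and the helper names below are assumptions made for
illustration; they are not the htools data types.

```python
from dataclasses import dataclass, field

STANDBY_PREFIX = "htools:standby:"


@dataclass
class Node:
    """Minimal illustrative node record; not the htools data type."""
    name: str
    offline: bool = False
    tags: set = field(default_factory=set)


def is_standby(node):
    """True for an offline node carrying any tag with the standby prefix."""
    return node.offline and any(
        tag.startswith(STANDBY_PREFIX) for tag in node.tags
    )


def standby_nodes(nodes):
    """Names of all nodes assumed to be easy to power on again."""
    return [node.name for node in nodes if is_standby(node)]
```

Note that an online node with a standby tag, or an offline node without
one, is not treated as standby in this sketch.
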

Minimum available resources
---------------------------

To keep the squeezed cluster functional, a minimal amount of resources
will be left available on every node. While the precise amount will
be specifiable via command-line options, a sensible default is chosen,
like enough resources to start an additional instance at standard
allocation on each node. If the available resources fall below this
limit, `hsqueeze` will, in fact, try to power on more nodes, until
enough resources are available, or all standby nodes are online.

To avoid flapping behavior, a second, higher, amount of reserve
resources can be specified, and `hsqueeze` will only power down nodes
if, after the power down, this higher amount of reserve resources is
still available.

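
The two limits form a hysteresis band. The sketch below illustrates the
decision under a strong simplification: resources are reduced to a
single number per node, whereas the real reserve would cover several
resources. The function name, its parameters and the returned strings
are all assumptions for illustration.

```python
def squeeze_action(free_now, free_after_power_down,
                   min_reserve, target_reserve):
    """Decide the next step from per-node free resources.

    free_now: free amount on each online node (non-empty list).
    free_after_power_down: predicted free amount on each remaining node
        after powering down a candidate set, or None if there is no
        candidate set.
    min_reserve: lower limit; falling below it triggers powering up.
    target_reserve: higher limit that must still hold after powering down.
    """
    if target_reserve < min_reserve:
        raise ValueError("target_reserve must not be below min_reserve")
    if min(free_now) < min_reserve:
        return "power-up"
    if free_after_power_down and min(free_after_power_down) >= target_reserve:
        return "power-down"
    # Between the two limits nothing is done, which avoids flapping.
    return "hold"
```

A cluster whose free resources sit between the two limits is left
alone: it is neither short of resources nor roomy enough to lose a node.
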

Computation of the set to free up
---------------------------------

To determine which nodes can be powered down, `hsqueeze` basically
follows the same algorithm as the manual process. It greedily goes
through all non-master nodes and checks whether the algorithm used by
`hbal` would find a solution (with the appropriate move restriction) that
frees up the extended set of nodes to be drained, while keeping enough
resources free. Being based on the algorithm used by `hbal`, all
restrictions respected by `hbal`, in particular memory reservation
for N+1 redundancy, are also respected by `hsqueeze`.
The order in which the nodes are tried is chosen by a
suitable heuristic, like trying the nodes in order of increasing
number of instances; the hope is that this reduces the number of
instances that actually have to be moved.

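
The greedy search can be sketched as follows. The balancing run itself
is not modelled: the `can_free` callback is a hypothetical stand-in for
asking the `hbal` algorithm whether a given set of nodes can be emptied
while keeping the reserve. The tuple layout of `nodes` is likewise an
assumption.

```python
def nodes_to_power_down(nodes, can_free):
    """Greedily grow the set of nodes to drain.

    nodes: iterable of (name, instance_count, is_master) tuples.
    can_free: callable taking a frozenset of node names and returning
        True if a balancing run could free all of them while keeping
        enough resources available.
    """
    candidates = sorted(
        (node for node in nodes if not node[2]),   # skip the master node
        key=lambda node: (node[1], node[0]),       # fewest instances first
    )
    chosen = frozenset()
    for name, _count, _is_master in candidates:
        extended = chosen | {name}
        # Keep the node only if the extended set can still be freed.
        if can_free(extended):
            chosen = extended
    return chosen
```

A node that cannot be added is simply skipped, and later candidates are
still tried against the set chosen so far.
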

If the amount of free resources has fallen below the lower limit,
`hsqueeze` will determine the set of nodes to power up in a similar
way; it will hypothetically add more and more of the standby
nodes (in some suitable order) until the algorithm used by `hbal`
finally balances the cluster in a way that enough resources are available,
or all standby nodes are online.


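
The power-up direction can be sketched in the same spirit. Again the
balancing run is replaced by a hypothetical callback, here called
`enough_with`, and the order of the standby nodes is taken as given.

```python
def nodes_to_power_up(standby, enough_with):
    """Add standby nodes one at a time until the reserve can be met.

    standby: standby node names, already in the preferred order.
    enough_with: callable taking the list of node names hypothetically
        brought online and returning True if balancing would then leave
        enough resources available.
    """
    added = []
    for name in standby:
        if enough_with(added):
            break
        added.append(name)
    # If the loop runs to the end, every standby node is powered up,
    # whether or not that is enough.
    return added
```
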

Instance moves and execution
----------------------------

Once the final set of nodes to power down is determined, the instance
moves are determined by the algorithm used by `hbal`. If
requested by the `-X` option, the nodes freed up are drained, and the
instance moves are executed in the same way as `hbal` does. Finally,
those of the freed-up nodes that do not already have a
`htools:standby:` tag are tagged as `htools:standby:auto`, and all
freed-up nodes are marked as offline and powered down via the
:doc:`design-oob`.

Similarly, if it is determined that nodes need to be added, then first
the nodes are powered up via the :doc:`design-oob`, then they are marked
as online, and finally
the cluster is balanced in the same way as `hbal` would do. For the
newly powered up nodes, the `htools:standby:auto` tag, if present, is
removed, but no other tags are removed (including other
`htools:standby:` tags).


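
The tag bookkeeping on power down and power up can be sketched as two
pure functions on a node's tag set. The function names are assumptions
for illustration, and `htools:standby:manual` in the usage note is a
made-up example of another tag under the reserved prefix.

```python
STANDBY_PREFIX = "htools:standby:"
AUTO_TAG = "htools:standby:auto"


def tags_after_power_down(tags):
    """Add the automatic tag unless some standby tag is already present."""
    result = set(tags)
    if not any(tag.startswith(STANDBY_PREFIX) for tag in result):
        result.add(AUTO_TAG)
    return result


def tags_after_power_up(tags):
    """Remove only the automatic tag; all other tags are kept."""
    return set(tags) - {AUTO_TAG}
```

For example, a node tagged `htools:standby:manual` keeps exactly that
tag across a power-down and power-up cycle, while a node without any
standby tag gains `htools:standby:auto` on power down and loses it
again on power up.
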

Design choices
==============

The proposed algorithm builds on top of the already present balancing
algorithm, instead of greedily packing nodes as full as possible. The
reason is that, in the end, a balanced cluster is needed anyway;
therefore, building on the balancing algorithm reduces the number of
instance moves. Additionally, the final configuration will also
benefit from all improvements to the balancing algorithm, like taking
dynamic CPU data into account.

We decided to have a separate program instead of adding an option to
`hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is
not unlikely that, over time, additional `hsqueeze`-specific options
might be added, specifying, e.g., which nodes to prefer for
shutdown. Given the `htools` approach of having a single binary
exhibit different behaviors, having an additional program also does not
introduce significant additional cost.

We decided to have a whole prefix instead of a single tag reserved
for marking standby nodes (we consider all tags starting with
`htools:standby:` as serving only this purpose). This is not only in
accordance with the tag
reservations for other tools, but it also allows for further extension
(like specifying priorities on which nodes to power up first) without
changing namespaces.
130 | 30a31713 | Klaus Aehlig | reservations for other tools, but it also allows for further extension |
131 | 30a31713 | Klaus Aehlig | (like specifying priorities on which nodes to power up first) without |
132 | 30a31713 | Klaus Aehlig | changing name spaces. |