=============
HSqueeze tool
=============

    
.. contents:: :depth: 4

    
This is a design document detailing the node-freeing scheduler, HSqueeze.

    
Current state and shortcomings
==============================

    
Externally-mirrored instances can be moved between nodes at low
cost. Therefore, it is attractive to free up nodes and power them down
at times of low usage, even for short periods of time, like nights or
weekends.

    
Currently, the best way to find a suitable set of nodes to shut down
is to use the property of our balancedness metric to move instances
away from drained nodes. So, one would manually drain more and more
nodes and see whether `hbal` could find a solution freeing up all those
drained nodes.

    
Proposed changes
================

    
We propose the addition of a new htool command-line tool, called
`hsqueeze`, that aims at keeping resource usage at a constant high
level by evacuating and powering down nodes, or powering up nodes and
rebalancing, as appropriate. By default, only externally-mirrored
instances are moved, but options are provided to additionally take
DRBD instances (which can be moved without downtime), or even all
instances, into consideration.

    
Tagging of standby nodes
------------------------

    
Powering down nodes that are technically healthy effectively creates a
new node state: nodes on standby. To avoid further state
proliferation, and because this information is only used by `hsqueeze`,
it is recorded in node tags. `hsqueeze` will assume
that offline nodes having a tag with prefix `htools:standby:` can
easily be powered on at any time.
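
The standby test described above can be sketched as follows. This is
only an illustration: the function name `is_standby` and the
representation of a node as an offline flag plus a list of tag strings
are assumptions made here, not part of the htools code.

```python
# Prefix reserved for marking standby nodes (taken from the text above).
STANDBY_PREFIX = "htools:standby:"


def is_standby(offline, tags):
    """Return True if a node counts as standby.

    Hypothetical helper: a node is on standby when it is offline and
    carries at least one tag starting with the reserved prefix.
    """
    return bool(offline) and any(t.startswith(STANDBY_PREFIX) for t in tags)
```

For example, an offline node tagged `htools:standby:auto` is treated as
standby, while an online node with the same tag is not.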

    
Minimum available resources
---------------------------

    
To keep the squeezed cluster functional, a minimal amount of resources
will be left available on every node. While the precise amount will
be specifiable via command-line options, a sensible default is chosen,
such as enough resources to start an additional instance at standard
allocation on each node. If the available resources fall below this
limit, `hsqueeze` will, in fact, try to power on more nodes, until
enough resources are available or all standby nodes are online.

    
To avoid flapping behavior, a second, higher, amount of reserve
resources can be specified, and `hsqueeze` will only power down nodes
if, after the power down, this higher amount of reserve resources is
still available.
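
The interplay of the two limits can be illustrated with a small sketch.
It simplifies the per-node resource check to a single number (for
instance, the smallest free amount over all online nodes). The function
name `squeeze_action` and its parameters are hypothetical; they do not
correspond to actual `hsqueeze` options.

```python
def squeeze_action(free_now, free_after_power_down, minimum, target):
    """Decide what to do, given two reserve limits (hysteresis).

    free_now: free resources in the current cluster state.
    free_after_power_down: free resources if a candidate node were freed.
    minimum: lower limit; falling below it triggers powering nodes up.
    target: higher limit; must still hold after a power down.
    All four values are illustrative scalars, not real htools data.
    """
    if target < minimum:
        raise ValueError("target reserve must not be below the minimum")
    if free_now < minimum:
        return "power-up"
    if free_after_power_down >= target:
        return "power-down"
    return "keep"
```

With `minimum=2` and `target=4`, a cluster whose free amount would drop
to 3 after a power down is left alone, although 3 is above the minimum.
The gap between the two limits is what prevents flapping.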

    
Computation of the set to free up
---------------------------------

    
To determine which nodes can be powered down, `hsqueeze` basically
follows the same algorithm as the manual process. It greedily goes
through all non-master nodes and checks whether the algorithm used by
`hbal` would find a solution (with the appropriate move restriction) that
frees up the extended set of nodes to be drained, while keeping enough
resources free. Because `hsqueeze` is based on the algorithm used by
`hbal`, all restrictions respected by `hbal`, in particular memory
reservation for N+1 redundancy, are also respected by `hsqueeze`.
The order in which the nodes are tried is chosen by a
suitable heuristic, like trying the nodes in order of increasing
number of instances; the hope is that this reduces the number of
instances that actually have to be moved.
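
The greedy loop described above might look roughly like this. The
balancing algorithm itself is not reproduced; it is represented by a
caller-supplied predicate `can_free`, which is an assumption of this
sketch, as are the function name and the data layout.

```python
def nodes_to_free(candidates, instance_count, can_free):
    """Greedily grow the set of nodes to drain.

    candidates: names of the non-master nodes.
    instance_count: dict mapping node name to its number of instances.
    can_free: hypothetical stand-in for the hbal-based check; it takes
        a frozenset of node names and returns True if those nodes can
        all be freed while keeping enough resources available.
    """
    drained = []
    # Heuristic from the text: try nodes with few instances first.
    for node in sorted(candidates, key=lambda n: instance_count[n]):
        trial = drained + [node]
        if can_free(frozenset(trial)):
            drained = trial  # keep the extended set
        # otherwise skip this node and continue with the next one
    return drained


# Toy predicate: pretend at most two nodes can be freed.
counts = {"node2": 3, "node3": 0, "node4": 1}
result = nodes_to_free(counts.keys(), counts, lambda s: len(s) <= 2)
```

In this toy run the nodes are tried in the order `node3`, `node4`,
`node2`, and the first two are selected.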

    
If the amount of free resources has fallen below the lower limit,
`hsqueeze` will determine the set of nodes to power up in a similar
way; it will hypothetically add more and more of the standby
nodes (in some suitable order) until the algorithm used by `hbal`
balances the cluster in a way that enough resources are available,
or all standby nodes are online.
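
A corresponding sketch for the power-up direction, under the same
caveats: `enough_after_adding` is a hypothetical stand-in for running
the balancing algorithm on the cluster with the given nodes added, and
the order of `standby` is assumed to be already chosen by the caller.

```python
def nodes_to_power_up(standby, enough_after_adding):
    """Pick standby nodes to power up.

    standby: standby node names in the order they should be tried.
    enough_after_adding: hypothetical predicate; takes a tuple of node
        names and returns True if, with those nodes online and the
        cluster rebalanced, enough resources are available.
    The function is meant to be called only once free resources have
    already fallen below the lower limit.
    """
    chosen = []
    for node in standby:
        chosen.append(node)
        if enough_after_adding(tuple(chosen)):
            break  # enough resources; stop adding nodes
    # If the predicate never held, all standby nodes are returned.
    return chosen
```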

    
Instance moves and execution
----------------------------

    
Once the final set of nodes to power down is determined, the instance
moves are computed by the algorithm used by `hbal`. If
requested by the `-X` option, the nodes freed up are drained, and the
instance moves are executed in the same way as `hbal` does. Finally,
those of the freed-up nodes that do not already have a
`htools:standby:` tag are tagged as `htools:standby:auto`, all freed-up
nodes are marked as offline and powered down via the
:doc:`design-oob`.

    
Similarly, if it is determined that nodes need to be added, then first
the nodes are powered up via the :doc:`design-oob`, then they are marked
as online, and finally
the cluster is balanced in the same way as `hbal` would do. For the
newly powered up nodes, the `htools:standby:auto` tag, if present, is
removed, but no other tags are removed (including other
`htools:standby:` tags).
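
The tag handling in both directions can be summarised in a short
sketch. Tags are modelled as plain sets of strings and the two function
names are made up for illustration; the actual tagging would go through
the usual Ganeti tag operations.

```python
STANDBY_PREFIX = "htools:standby:"
AUTO_TAG = STANDBY_PREFIX + "auto"


def tags_after_power_down(tags):
    """Tags of a freed-up node after it has been put on standby.

    The auto tag is added only if no standby tag is present yet.
    """
    tags = set(tags)
    if not any(t.startswith(STANDBY_PREFIX) for t in tags):
        tags.add(AUTO_TAG)
    return tags


def tags_after_power_up(tags):
    """Tags of a node after it has been powered up again.

    Only the auto tag is removed; other standby tags are kept.
    """
    return set(tags) - {AUTO_TAG}
```

A node that an administrator tagged with some other `htools:standby:`
tag therefore keeps that tag across power cycles, whereas the automatic
tag only lives as long as the node is powered down.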

    
Design choices
==============

    
The proposed algorithm builds on top of the already present balancing
algorithm, instead of greedily packing nodes as full as possible. The
reason is that, in the end, a balanced cluster is needed anyway;
therefore, building on the balancing algorithm reduces the number of
instance moves. Additionally, the final configuration will also
benefit from all improvements to the balancing algorithm, like taking
dynamic CPU data into account.

    
We decided to have a separate program instead of adding an option to
`hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is
not unlikely that, over time, additional `hsqueeze`-specific options
might be added, specifying, e.g., which nodes to prefer for
shutdown. Since the `htools` are built as a single binary
showing different behaviors, having an additional program also does not
introduce significant additional cost.

    
We decided to have a whole prefix instead of a single tag reserved
for marking standby nodes (we consider all tags starting with
`htools:standby:` as serving only this purpose). This is not only in
accordance with the tag
reservations for other tools, but it also allows for further extension
(like specifying priorities on which nodes to power up first) without
changing namespaces.