1 |
|
Cluster tools (h-aneti?)
|
2 |
|
========================
|
|
1 |
Ganeti Cluster tools (htools)
|
|
2 |
=============================
|
3 |
3 |
|
4 |
|
These are some simple cluster tools for fixing common problems. Right now N+1
|
5 |
|
and rebalancing are included.
|
|
4 |
These are some simple cluster tools for fixing common problems. Right
|
|
5 |
now N+1 and rebalancing are included.
|
6 |
6 |
|
7 |
|
.. contents::
|
8 |
7 |
|
9 |
8 |
Cluster N+1 solver
|
10 |
9 |
------------------
|
... | ... | |
14 |
13 |
needed to fix the cluster. Note this means we won't get a balanced cluster,
|
15 |
14 |
just one that passes N+1 checks.
|
16 |
15 |
|
17 |
|
Also note that the set of all instance placements on a 20/80 cluster is
|
18 |
|
(20*19)^80, that is ~10^200, so...
|
19 |
|
|
20 |
|
Algorithm
|
21 |
|
+++++++++
|
22 |
|
|
23 |
|
The algorithm is a simple two-phase process.
|
24 |
|
|
25 |
|
In phase 1 we determine the removal set, that is the set of instances that when
|
26 |
|
removed completely from the cluster, make it healthy again. The instances that
|
27 |
|
can go into the set are all the primary and secondary instances of the failing
|
28 |
|
nodes. The result from this phase is actually a list - we compute all sets of
|
29 |
|
the same minimum length.
|
30 |
|
|
31 |
|
So basically we aim to determine here: what is the minimum number of instances
|
32 |
|
that need to be removed (this is called the removal depth) and which are the
|
33 |
|
actual combinations that fit (called the list of removal sets).
|
34 |
|
|
35 |
|
In phase 2, for each removal set computed in the previous phase, we take the
|
36 |
|
removed instances and try to determine where we can put them so that the
|
37 |
|
cluster is still passing N+1 checks. From this list of possible solutions
|
38 |
|
(called the list of solutions), we compute the one that has the smallest delta
|
39 |
|
from the original state (the delta is the number of replace disks that needs to
|
40 |
|
be run) and chose this as the final solution.
|
41 |
|
|
42 |
|
Implementation
|
43 |
|
++++++++++++++
|
44 |
|
|
45 |
|
Of course, a naive implementation based on the above description will run for
|
46 |
|
long periods of time, so the implementation has to be smart in order to prune
|
47 |
|
the solution space as eagerly as possible.
|
48 |
|
|
49 |
|
In the following, we use as example a set of test data (a cluster with 20
|
50 |
|
nodes, 80 instances that has 5 nodes failing N+1 checks for a total of 12
|
51 |
|
warnings).
|
52 |
|
|
53 |
|
On this set, the minimum depth is 4 (anything below fails), and for this depth
|
54 |
|
the current version of the algorithm generates 5 removal sets; a previous
|
55 |
|
version of the first phase generated a slightly different set of instances, with
|
56 |
|
two removal sets. For the original version of the algorithm:
|
57 |
|
|
58 |
|
- the first, non-optimized implementation computed a solution of delta=4 in 30
|
59 |
|
minutes on server-class CPUs and was still running when aborted 10 minutes
|
60 |
|
later
|
61 |
|
- the intermediate optimized version computed the whole solution space and
|
62 |
|
found a delta=3 solution in around 10 seconds on a laptop-class CPU (total
|
63 |
|
number of solutions ~600k)
|
64 |
|
- latest version on server CPUs (which actually computes more removal sets)
|
65 |
|
computes depth=4 in less than a second and depth=5 in around 2 seconds, and
|
66 |
|
depth=6 in less than 20 seconds; depth=8 takes under five minutes (this is
|
67 |
|
10^10 bigger solution space)
|
68 |
|
|
69 |
|
Note that when (artificially) increasing the depth to 5 the number of removal
|
70 |
|
sets grows fast (~3000) and a (again artificial) depth 6 generates 61k removal
|
71 |
|
sets. Therefore, it is possible to restrict the number of solution sets
|
72 |
|
examined via a command-line option.
|
73 |
|
|
74 |
|
The factors that influence the run time are:
|
75 |
|
|
76 |
|
- the removal depth; for each increase with one of the depth, we grow the
|
77 |
|
solution space by the number of nodes squared (since a new instance can live
|
78 |
|
any two nodes as primary/secondary, therefore (almost) N times N); i.e.,
|
79 |
|
depth=1 will create a N^2 solution space, depth two will make this N^4,
|
80 |
|
depth three will be N^6, etc.
|
81 |
|
- the removal depth again; for each increase in the depth, there will be more
|
82 |
|
valid removal sets, and the space of solutions increases linearly with the
|
83 |
|
number of removal sets
|
84 |
|
|
85 |
|
Therefore, the smaller the depth the faster the algorithm will be; it doesn't
|
86 |
|
seem like this algorithm will work for clusters of 100 nodes and many many
|
87 |
|
small instances (e.g. 256MB instances on 16GB nodes).
|
88 |
|
|
89 |
|
Currently applied optimizations:
|
90 |
|
|
91 |
|
- when choosing where to place an instance in phase two, there are N*(N-1)
|
92 |
|
possible primary/secondary options; however, if instead of iterating over all
|
93 |
|
p * s pairs, we first determine the set of primary nodes that can hold this
|
94 |
|
instance (without failing N+1), we can cut (N-1) secondary placements for
|
95 |
|
each primary node removed; and since this applies at every iteration of phase
|
96 |
|
2 it linearly decreases the solution space, and on full clusters, this can
|
97 |
|
mean a four-five times reductions of solution space
|
98 |
|
- since the number of solutions is very high even for smaller depths (on the
|
99 |
|
test data, depth=4 results in 1.8M solutions) we can't compare them at the
|
100 |
|
end, so at each iteration in phase 2 we only promote the best solution out of
|
101 |
|
our own set of solutions
|
102 |
|
- since the placement of instances can only increase the delta of the solution
|
103 |
|
(placing a new instance will add zero or more replace-disks steps), it means
|
104 |
|
the delta will only increase while recursing during phase 2; therefore, if we
|
105 |
|
know at one point that we have a current delta that is equal or higher to the
|
106 |
|
delta of the best solution so far, we can abort the recursion; this cuts a
|
107 |
|
tremendous number of branches; further promotion of the best solution from
|
108 |
|
one removal set to another can cut entire removal sets after a few recursions
|
109 |
|
|
110 |
|
Command line usage
|
111 |
|
++++++++++++++++++
|
112 |
|
|
113 |
|
Synopsis::
|
114 |
|
|
115 |
|
hn1 { [-n NODES_FILE] [-i INSTANCES_FILE] | [-m CLUSTER] } \
|
116 |
|
[-d START_DEPTH] \
|
117 |
|
[-r MAX_REMOVALS] [-l MIN_DELTA] [-L MAX_DELTA] \
|
118 |
|
[-p] [-C]
|
119 |
|
|
120 |
|
The -n and -i options change the names of the input files.
|
121 |
|
Alternatively, the -m option specifies collection of data via RAPI.
|
122 |
|
|
123 |
|
The -d option
|
124 |
|
changes the start depth, as a higher depth can give (with a longer computation
|
125 |
|
time) a solution with better delta. The -r option restricts at each depth the
|
126 |
|
number of solutions considered - with r=1000 for example even depth=10 finishes
|
127 |
|
in less than a second.
|
128 |
|
|
129 |
|
The -p option will show the cluster state after the solution is implemented,
|
130 |
|
while the -C option will show the needed gnt-instance commands to implement
|
131 |
|
it.
|
132 |
|
|
133 |
|
The -l (--min-delta) and -L (--max-delta) options restrict the solution in the
|
134 |
|
following ways:
|
135 |
|
|
136 |
|
- min-delta will cause the search to abort early once we find a solution with
|
137 |
|
delta less than or equal to this parameter; this can cause extremely fast
|
138 |
|
results in case a desired solution is found quickly; the default value for
|
139 |
|
this parameter is zero, so once we find a "perfect" solution we finish early
|
140 |
|
- max-delta causes rejection of valid solution but which have delta higher
|
141 |
|
than the value of this parameter; this can reduce the depth of the search
|
142 |
|
tree, with sometimes significant speedups; by default, this optimization is
|
143 |
|
not used
|
144 |
|
|
145 |
|
Individually or combined, these two parameters can (if there are any) very
|
146 |
|
fast result; on our test data, depth=34 (max depth!) is solved in 2 seconds
|
147 |
|
with min-delta=0/max-delta=1 (since there is such a solution), and the
|
148 |
|
extremely low max-delta causes extreme pruning.
|
|
16 |
For algorithm details and usage, see the man page hn1(1).
|
149 |
17 |
|
150 |
18 |
Cluster rebalancer
|
151 |
19 |
------------------
|
... | ... | |
154 |
22 |
repeatedly try to move each instance one step, so that the cluster score
|
155 |
23 |
becomes better. We stop when no further move can improve the score.
|
156 |
24 |
|
157 |
|
The algorithm is divided into rounds (all identical):
|
158 |
|
|
159 |
|
#. Repeat for each instance:
|
160 |
|
|
161 |
|
#. Compute score after the potential failover of the instance
|
162 |
|
|
163 |
|
#. For each node that is different from the current primary/secondary
|
164 |
|
|
165 |
|
#. Compute score after replacing the primary with this new node
|
166 |
|
|
167 |
|
#. Compute score after replacing the secondary with this new node
|
168 |
|
|
169 |
|
|
170 |
|
#. Out of this N*2+1 possible new scores (and their associated move) for
|
171 |
|
this instance, we choose the one that is the best in terms of cluster
|
172 |
|
score, and then proceed to the next instance
|
173 |
|
|
174 |
|
Since we don't compute all combinations of moves for instances (e.g. the first
|
175 |
|
instance's all moves Cartesian product with second instance's all moves, etc.)
|
176 |
|
but we proceed serially instance A, then B, then C, the total computations we
|
177 |
|
make in one steps is simply N(number of nodes)*2+1 times I(number of instances),
|
178 |
|
instead of (N*2+1)^I. So therefore the runtime for a round is trivial.
|
179 |
|
|
180 |
|
Further rounds are done, since the relocation of instances might offer better
|
181 |
|
places for instances which we didn't move, or simply didn't move to the best
|
182 |
|
place. It is possible to limit the rounds, but usually the algorithm finishes
|
183 |
|
after a few rounds by itself.
|
184 |
|
|
185 |
|
Note that the cluster *must* be N+1 compliant before this algorithm is run, and
|
186 |
|
will stay at each move N+1 compliant. Therefore, the final cluster will be N+1
|
187 |
|
compliant.
|
188 |
|
|
189 |
|
Single-round solutions
|
190 |
|
++++++++++++++++++++++
|
191 |
|
|
192 |
|
Single-round solutions have the very nice property that they are
|
193 |
|
incrementally-valid. In other words, if you have a 10-step solution, at each
|
194 |
|
step the cluster is both N+1 compliant and better than the previous step.
|
195 |
|
|
196 |
|
This means that you can stop at any point and you will have a better cluster.
|
197 |
|
For this reason, single-round solutions are recommended in the common case of
|
198 |
|
let's make this better. Multi-round solutions will be better though when adding
|
199 |
|
a couple of new, empty nodes to the cluster due to the many relocations needed.
|
200 |
|
|
201 |
|
|
202 |
|
Multi-round solutions
|
203 |
|
+++++++++++++++++++++
|
204 |
|
|
205 |
|
A multi-round solution (not for a single round), due to de-duplication of moves
|
206 |
|
(i.e. just put the instance directly in its final place, and not move it five
|
207 |
|
times around) loses both these properties. It might be that it's not possible to
|
208 |
|
directly put the instance on the final nodes. So it can be possible that yes,
|
209 |
|
the cluster is happy in the final solution and nice, but you cannot do the steps
|
210 |
|
in the shown order. Solving this (via additional instance move(s)) is left to
|
211 |
|
the user.
|
212 |
|
|
213 |
|
Command line usage
|
214 |
|
++++++++++++++++++
|
215 |
|
|
216 |
|
Synopsis::
|
217 |
|
|
218 |
|
hbal { [-n NODES_FILE] [-i INSTANCES_FILE] | [-m CLUSTER] } \
|
219 |
|
[-r MAX_ROUNDS] \
|
220 |
|
[-p] [-C] [-o]
|
221 |
|
|
222 |
|
The -n and -i options change the names of the input files.
|
223 |
|
Alternatively, the -m option specifies collection of data via RAPI.
|
224 |
|
|
225 |
|
The -r option restricts the maximum number of rounds (and is more of
|
226 |
|
safety measure).
|
227 |
|
|
228 |
|
The -p option will show the cluster state after the solution is implemented,
|
229 |
|
while the -C option will show the needed gnt-instance commands to implement
|
230 |
|
it. The -o option specifies that instead the default, quite verbose
|
231 |
|
output, a single line of output should be shown, in the format::
|
232 |
|
|
233 |
|
initial_score number_of_moves final_score improvement
|
234 |
|
|
|
25 |
For algorithm details and usage, see the man page hbal(1).
|
235 |
26 |
|
236 |
27 |
Integration with Ganeti
|
237 |
28 |
-----------------------
|
238 |
29 |
|
239 |
|
The programs can either get their input from text files, or directly
|
240 |
|
from a cluster via RAPI. For text files, the following two commands
|
241 |
|
should be run::
|
|
30 |
The programs can either get their input from text files, or online
|
|
31 |
from a cluster via RAPI. For online collection via RAPI, the "-m"
|
|
32 |
argument to both hn1 and hbal should specify the cluster or master
|
|
33 |
node name.
|
|
34 |
|
|
35 |
For text files, a separate tool (hscan) is provided to automate their
|
|
36 |
gathering if RAPI is available, which is better since it can extract
|
|
37 |
more precise information. In case RAPI is not usable for whatever
|
|
38 |
reason, the following two commands should be run::
|
242 |
39 |
|
243 |
40 |
gnt-node list -oname,mtotal,mnode,mfree,dtotal,dfree \
|
244 |
41 |
--separator '|' --no-headers > nodes
|
245 |
42 |
gnt-instance list -oname,admin_ram,sda_size,status,pnode,snodes \
|
246 |
43 |
--separator '|' --no-head > instances
|
247 |
44 |
|
248 |
|
These two files should be saved under the names of 'nodes' and 'instances'.
|
|
45 |
These two files should be saved under the names of *nodes* and *instances*.
|
|
46 |
|
|
47 |
Installation
|
|
48 |
------------
|
|
49 |
|
|
50 |
If installing from source, you need a working ghc compiler (6.8 at
|
|
51 |
least) and some extra Haskell libraries which usually need to be
|
|
52 |
installed manually:
|
249 |
53 |
|
250 |
|
For RAPI, the "-m" argument to both hn1 and hbal should specify the
|
251 |
|
cluster or master node name.
|
|
54 |
- json
|
|
55 |
- curl
|
252 |
56 |
|
253 |
|
When run, the programs will show some informational messages and output the
|
254 |
|
chosen solution, in the form of a list of instance name and chosen
|
255 |
|
primary/secondary nodes. The user then needs to run the necessary commands to
|
256 |
|
get the instances to live on those nodes.
|
|
57 |
One these are available, just typing *make* in the top-level directory
|
|
58 |
should be enough.
|
257 |
59 |
|
258 |
|
Note that sda_size is less than the total disk size of an instance by 4352
|
259 |
|
MiB, so if disk space is at a premium the calculation could be wrong; in this
|
260 |
|
case, please adjust the values manually.
|
|
60 |
Internal (implementation) documentation is available in the ``apidoc``
|
|
61 |
directory.
|