.TH HBAL 1 2009-03-22 htools "Ganeti H-tools"
.SH NAME
hbal \- Cluster balancer for Ganeti

.SH SYNOPSIS
.B hbal
.B "[-C]"
.B "[-p]"
.B "[-o]"
.BI "[-l " limit "]"
.BI "[-O " name... "]"
.BI "[-m " cluster "]"
.BI "[-n " nodes-file "]"
.BI "[-i " instances-file "]"

.B hbal
.B --version

.SH DESCRIPTION
hbal is a cluster balancer that looks at the current state of the
cluster (nodes with their total and free disk, memory, etc.) and
instance placement and computes a series of steps designed to bring
the cluster into a better state.

The algorithm to do so is designed to be stable (i.e. it will give you
the same results when restarting it from the middle of the solution)
and reasonably fast. It is not, however, designed to be a perfect
algorithm: it is possible to make it go into a corner from which it
can find no improvement, because it only looks one "step" ahead.

By default, the program will show the solution incrementally as it is
computed, in a somewhat cryptic format; for getting the actual Ganeti
command list, use the \fB-C\fR option.

.SS ALGORITHM

The program works in independent steps; at each step, we compute the
best instance move that lowers the cluster score.

The possible move types for an instance are combinations of
failover/migrate and replace-disks such that we change one of the
instance nodes, and the other one remains (but possibly with changed
role, e.g. from primary it becomes secondary). The list is:
- failover (f)
- replace secondary (r)
- replace primary, a composite move (f, r, f)
- failover and replace secondary, also composite (f, r)
- replace secondary and failover, also composite (r, f)
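
As an illustration only, the move list above can be modelled as
sequences of the two primitive operations applied to a (primary,
secondary) pair. The sketch below is Python, not hbal's own code; the
names failover, replace_secondary, MOVES and apply_move are
hypothetical and exist only for this example.
.in +4n
.nf
```python
# Illustrative model of the move types; not taken from hbal itself.

def failover(pair):
    """Swap the primary and secondary roles."""
    primary, secondary = pair
    return (secondary, primary)

def replace_secondary(pair, new_node):
    """Keep the primary and move the secondary to new_node."""
    primary, _old_secondary = pair
    return (primary, new_node)

# Each move type is a sequence of primitive steps ('f' or 'r').
MOVES = {
    "failover": "f",
    "replace secondary": "r",
    "replace primary": "frf",
    "failover and replace secondary": "fr",
    "replace secondary and failover": "rf",
}

def apply_move(pair, steps, new_node=None):
    """Apply a sequence of primitive steps to a (primary, secondary) pair."""
    for step in steps:
        if step == "f":
            pair = failover(pair)
        elif step == "r":
            pair = replace_secondary(pair, new_node)
        else:
            raise ValueError("unknown step: " + step)
    return pair
```
.fi
.in
In every move exactly one of the two original nodes is kept, which
matches the restriction described above.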

We don't do the only remaining possibility of replacing both nodes
(r,f,r,f or the equivalent f,r,f,r) since this move needs an
exhaustive search over both candidate primary and secondary nodes, and
is O(n*n) in the number of nodes. Furthermore, it doesn't seem to
give better scores but will result in more disk replacements.
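
The one-step-ahead search described in this section can be sketched as
a generic greedy loop. This is a minimal Python sketch under stated
assumptions, not hbal's implementation: greedy_balance and the toy
helpers toy_moves, toy_apply and toy_score are hypothetical, and the
toy state is just a tuple of per-node loads rather than real cluster
data.
.in +4n
.nf
```python
def greedy_balance(state, candidate_moves, apply, score, max_steps=None):
    """Repeatedly take the single best score-lowering move."""
    steps = []
    current = score(state)
    while max_steps is None or len(steps) < max_steps:
        best = None
        for move in candidate_moves(state):
            new_state = apply(state, move)
            new_score = score(new_state)
            # Only moves that lower the score are considered.
            if new_score < current and (best is None or new_score < best[0]):
                best = (new_score, move, new_state)
        if best is None:
            break  # no single move improves the score: stuck or done
        current, move, state = best
        steps.append((move, current))
    return state, steps

# Toy problem (an assumption for illustration): move one unit of load
# from one node to another.
def toy_moves(state):
    n = len(state)
    return [(s, d) for s in range(n) for d in range(n)
            if s != d and state[s] > 0]

def toy_apply(state, move):
    src, dst = move
    loads = list(state)
    loads[src] -= 1
    loads[dst] += 1
    return tuple(loads)

def toy_score(state):
    avg = sum(state) / len(state)
    return sum((x - avg) ** 2 for x in state)

final, steps = greedy_balance((4, 0, 2), toy_moves, toy_apply, toy_score)
print(final, len(steps))
```
.fi
.in
Because each iteration depends only on the current state, restarting
from an intermediate state continues with the same remaining steps,
which is the stability property mentioned in the description. The
max_steps parameter plays the role of a solution length limit.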

.SS CLUSTER SCORING

As said before, the algorithm tries to minimise the cluster score at
each step. Currently this score is computed as a sum of the following
components:
- coefficient of variance of the percent of free memory
- coefficient of variance of the percent of reserved memory
- coefficient of variance of the percent of free disk
- percentage of nodes failing N+1 check
- percentage of instances living (either as primary or secondary) on
offline nodes

The free memory and free disk values help ensure that all nodes are
somewhat balanced in their resource usage. The reserved memory helps
to ensure that nodes are somewhat balanced in holding secondary
instances, and that no node keeps too much memory reserved for
N+1. Finally, the N+1 percentage helps guide the algorithm towards
eliminating N+1 failures, if possible.

Except for the N+1 failures and offline instances percentage, we use
the coefficient of variance since this brings the values into the same
unit so to speak, and with a restricted domain of values (between zero
and one). The percentage of N+1 failures, while also in this numeric
range, doesn't actually have the same meaning, but it has been shown to
work well.
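
The sum of components listed in this section could be computed along
the following lines. This is a hedged Python sketch, not hbal's code:
it assumes the coefficient is the population standard deviation
divided by the mean, which may differ from the formula hbal actually
uses, and the dictionary keys and the functions coeff_var and
cluster_score are hypothetical names chosen for the example.
.in +4n
.nf
```python
from statistics import mean, pstdev

def coeff_var(values):
    """Standard deviation over mean (assumed definition)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0

def cluster_score(nodes, offline_instance_fraction=0.0):
    """Sum of the five components; fractions are in the range 0..1."""
    p_fmem = [n["f_mem"] / n["t_mem"] for n in nodes]
    p_rmem = [n["r_mem"] / n["t_mem"] for n in nodes]
    p_fdsk = [n["f_dsk"] / n["t_dsk"] for n in nodes]
    n1_fail = sum(1 for n in nodes if n["n1_fail"]) / len(nodes)
    return (coeff_var(p_fmem) + coeff_var(p_rmem) + coeff_var(p_fdsk)
            + n1_fail + offline_instance_fraction)

# Two identical nodes: every component is zero.
node = {"t_mem": 32762, "f_mem": 7280, "r_mem": 6000,
        "t_dsk": 1861, "f_dsk": 1026, "n1_fail": False}
print(cluster_score([dict(node), dict(node)]))
```
.fi
.in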

The other alternative, using for N+1 checks the coefficient of
variance of (N+1 fail=1, N+1 pass=0) across nodes, could hint the
algorithm to make more N+1 failures if most nodes are N+1 fail
already. Since this (making N+1 failures) is not allowed by other
rules of the algorithm, the N+1 checks would simply not work
anymore in this case.
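
A small worked example may make this argument concrete. The Python
sketch below assumes, as an illustration only, that the coefficient is
the population standard deviation divided by the mean; hbal's exact
formula is not stated here, and the function names are hypothetical.
On a 20-node cluster where most nodes already fail N+1, one more
failing node lowers the coefficient of the 0/1 flags, while the plain
fraction of failing nodes rises as intended.
.in +4n
.nf
```python
from statistics import mean, pstdev

def coeff_var(values):
    """Standard deviation over mean (assumed definition)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0

def n1_indicator_cv(failing, total):
    """Coefficient over per-node flags: 1 = N+1 fail, 0 = N+1 pass."""
    flags = [1] * failing + [0] * (total - failing)
    return coeff_var(flags)

def n1_fraction(failing, total):
    """Fraction of nodes failing the N+1 check."""
    return failing / total

for failing in (12, 13, 14):
    print(failing,
          round(n1_indicator_cv(failing, 20), 4),
          n1_fraction(failing, 20))
```
.fi
.in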

The offline instances percentage (meaning the percentage of instances
living on offline nodes) will cause the algorithm to actively move
instances away from offline nodes. This, coupled with the restriction
on placement given by offline nodes, will cause evacuation of such
nodes.

On a perfectly balanced cluster (all nodes the same size, all
instances the same size and spread across the nodes equally), all
values would be zero. This doesn't happen too often in practice :)

.SS OTHER POSSIBLE METRICS

It would be desirable to add more metrics to the algorithm, especially
dynamically-computed metrics, such as:
- CPU usage of instances, combined with VCPU versus PCPU count
- Disk IO usage
- Network IO

.SH OPTIONS
The options that can be passed to the program are as follows:
.TP
.B -C, --print-commands
Print the command list at the end of the run. Without this, the
program will only show a shorter, but cryptic output.
.TP
.B -p, --print-nodes
Prints the before and after node status, in a format designed to allow
the user to understand the node's most important parameters.

The node list will contain this information:
.RS
.TP
.B F
a character denoting the status of the node, with '-' meaning an
offline node, '*' meaning N+1 failure and blank meaning a good node
.TP
.B Name
the node name
.TP
.B t_mem
the total node memory
.TP
.B n_mem
the memory used by the node itself
.TP
.B i_mem
the memory used by instances
.TP
.B x_mem
amount of memory which seems to be in use but cannot be attributed to
the node or to any instance; usually this means that the hypervisor
has some overhead or that there are other reporting errors
.TP
.B f_mem
the free node memory
.TP
.B r_mem
the reserved node memory, which is the amount of free memory needed
for N+1 compliance
.TP
.B t_dsk
total disk
.TP
.B f_dsk
free disk
.TP
.B pri
number of primary instances
.TP
.B sec
number of secondary instances
.TP
.B p_fmem
percent of free memory
.TP
.B p_fdsk
percent of free disk
.RE

.TP
.B -o, --oneline
Only shows a one-line output from the program, designed for the case
when one wants to look at multiple clusters at once and check their
status.

The line will contain four fields:
- initial cluster score
- number of steps in the solution
- final cluster score
- improvement in the cluster score

.TP
.BI "-O " name
This option (which can be given multiple times) will mark nodes as
being \fIoffline\fR. This means a couple of things:
.RS
.TP
-
instances won't be placed on these nodes, not even temporarily;
e.g. the \fIreplace primary\fR move is not available if the secondary
node is offline, since this move requires a failover.
.TP
-
these nodes will not be included in the score calculation (except for
the percentage of instances on offline nodes)
.RE

.TP
.BI "-n " nodefile ", --nodes=" nodefile
The name of the file holding node information (if not collecting via
RAPI), instead of the default
.I nodes
file.

.TP
.BI "-i " instancefile ", --instances=" instancefile
The name of the file holding instance information (if not collecting
via RAPI), instead of the default
.I instances
file.

.TP
.BI "-m " cluster
Collect data not from files but directly from the
.I cluster
given as an argument via RAPI. This works for both Ganeti 1.2 and
Ganeti 2.0.

.TP
.BI "-l " N ", --max-length=" N
Restrict the solution to this length. This can be used for example to
automate the execution of the balancing.

.TP
.B -v, --verbose
Increase the output verbosity. Each usage of this option will increase
the verbosity (currently more than 2 doesn't make sense) from the
default of zero.

.TP
.B -V, --version
Just show the program version and exit.

.SH EXIT STATUS

The exit status of the command will be zero, unless for some reason
the algorithm fatally failed (e.g. wrong node or instance data).

.SH BUGS

The program does not check its input data for consistency, and aborts
with cryptic error messages in this case.

The algorithm is not perfect.

The output format is not easily scriptable, and the program should
feed moves directly into Ganeti (either via RAPI or via a gnt-debug
input file).

.SH EXAMPLE

Note that these examples are not for the latest version (they don't have
full node data).

.SS Default output

With the default options, the program shows each individual step and
the improvements it brings in cluster score:

.in +4n
.nf
.RB "$" " hbal"
Loaded 20 nodes, 80 instances
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
Initial score: 0.52329131
Trying to minimize the CV...
1. instance14 node1:node10 => node16:node10 0.42109120 a=f r:node16 f
2. instance54 node4:node15 => node16:node15 0.31904594 a=f r:node16 f
3. instance4 node5:node2 => node2:node16 0.26611015 a=f r:node16
4. instance48 node18:node20 => node2:node18 0.21361717 a=r:node2 f
5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f
6. instance89 node3:node20 => node2:node3 0.11005629 a=r:node2 f
7. instance5 node6:node2 => node16:node6 0.05841589 a=r:node16 f
8. instance94 node7:node20 => node20:node16 0.00658759 a=f r:node16
9. instance44 node20:node2 => node2:node15 0.00438740 a=f r:node15
10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16
11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16
12. instance19 node10:node11 => node10:node7 0.00336636 a=r:node7
13. instance43 node12:node13 => node12:node1 0.00305681 a=r:node1
14. instance1 node1:node2 => node1:node4 0.00263124 a=r:node4
15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17
Cluster score improved from 0.52329131 to 0.00252594
.fi
.in

In the above output, we can see:
- the input data (here from files) shows a cluster with 20 nodes and
80 instances
- the cluster is not initially N+1 compliant
- the initial score is 0.52329131

The step list follows, showing the instance, its initial
primary/secondary nodes, the new primary/secondary nodes, the cluster
score after the step, and the actions taken in this step (with 'f'
denoting failover/migrate and 'r' denoting replace secondary).
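
Since the step lines follow a regular layout, they can be split into
fields with a few lines of code. The following Python sketch is only
an illustration based on the sample output shown here; parse_step and
the dictionary keys are hypothetical names, and the layout may differ
in other versions of the program.
.in +4n
.nf
```python
def parse_step(line):
    """Split one step line of the default output into its fields."""
    fields = line.split()
    number = int(fields[0].rstrip("."))
    instance = fields[1]
    old_primary, old_secondary = fields[2].split(":")
    if fields[3] != "=>":
        raise ValueError("unexpected step line: " + line)
    new_primary, new_secondary = fields[4].split(":")
    score = float(fields[5])
    actions_text = " ".join(fields[6:])
    if not actions_text.startswith("a="):
        raise ValueError("missing action list: " + line)
    actions = actions_text[2:].split()  # e.g. ['f', 'r:node16', 'f']
    return {
        "step": number,
        "instance": instance,
        "old": (old_primary, old_secondary),
        "new": (new_primary, new_secondary),
        "score": score,
        "actions": actions,
    }

sample = "3. instance4 node5:node2 => node2:node16 0.26611015 a=f r:node16"
print(parse_step(sample))
```
.fi
.in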

Finally, the program shows the improvement in cluster score.

A more detailed output is obtained via the \fB-C\fR and \fB-p\fR options:

.in +4n
.nf
.RB "$" " hbal -C -p"
Loaded 20 nodes, 80 instances
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
Initial cluster status:
N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
 * node1  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node2  32762 31280 12000  1861  1026   0   8 0.95476 0.55179
 * node3  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node4  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node5  32762  1280  6000  1861   978   5   5 0.03907 0.52573
 * node6  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node7  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 * node10 32762  7280 12000  1861  1026   4   4 0.22221 0.55179
   node11 32762  7280  6000  1861   922   4   5 0.22221 0.49577
   node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node13 32762  7280  6000  1861   922   4   5 0.22221 0.49577
   node14 32762  7280  6000  1861   922   4   5 0.22221 0.49577
 * node15 32762  7280 12000  1861  1131   4   3 0.22221 0.60782
   node16 32762 31280     0  1861  1860   0   0 0.95476 1.00000
   node17 32762  7280  6000  1861  1106   5   3 0.22221 0.59479
 * node18 32762  1280  6000  1396   561   5   3 0.03907 0.40239
 * node19 32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node20 32762 13280 12000  1861   689   3   9 0.40535 0.37068

Initial score: 0.52329131
Trying to minimize the CV...
1. instance14 node1:node10 => node16:node10 0.42109120 a=f r:node16 f
2. instance54 node4:node15 => node16:node15 0.31904594 a=f r:node16 f
3. instance4 node5:node2 => node2:node16 0.26611015 a=f r:node16
4. instance48 node18:node20 => node2:node18 0.21361717 a=r:node2 f
5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f
6. instance89 node3:node20 => node2:node3 0.11005629 a=r:node2 f
7. instance5 node6:node2 => node16:node6 0.05841589 a=r:node16 f
8. instance94 node7:node20 => node20:node16 0.00658759 a=f r:node16
9. instance44 node20:node2 => node2:node15 0.00438740 a=f r:node15
10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16
11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16
12. instance19 node10:node11 => node10:node7 0.00336636 a=r:node7
13. instance43 node12:node13 => node12:node1 0.00305681 a=r:node1
14. instance1 node1:node2 => node1:node4 0.00263124 a=r:node4
15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17
Cluster score improved from 0.52329131 to 0.00252594

Commands to run to reach the above solution:
echo step 1
echo gnt-instance migrate instance14
echo gnt-instance replace-disks -n node16 instance14
echo gnt-instance migrate instance14
echo step 2
echo gnt-instance migrate instance54
echo gnt-instance replace-disks -n node16 instance54
echo gnt-instance migrate instance54
echo step 3
echo gnt-instance migrate instance4
echo gnt-instance replace-disks -n node16 instance4
echo step 4
echo gnt-instance replace-disks -n node2 instance48
echo gnt-instance migrate instance48
echo step 5
echo gnt-instance replace-disks -n node16 instance93
echo gnt-instance migrate instance93
echo step 6
echo gnt-instance replace-disks -n node2 instance89
echo gnt-instance migrate instance89
echo step 7
echo gnt-instance replace-disks -n node16 instance5
echo gnt-instance migrate instance5
echo step 8
echo gnt-instance migrate instance94
echo gnt-instance replace-disks -n node16 instance94
echo step 9
echo gnt-instance migrate instance44
echo gnt-instance replace-disks -n node15 instance44
echo step 10
echo gnt-instance replace-disks -n node16 instance62
echo step 11
echo gnt-instance replace-disks -n node16 instance13
echo step 12
echo gnt-instance replace-disks -n node7 instance19
echo step 13
echo gnt-instance replace-disks -n node1 instance43
echo step 14
echo gnt-instance replace-disks -n node4 instance1
echo step 15
echo gnt-instance replace-disks -n node17 instance58

Final cluster status:
N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
   node1  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node2  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node3  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node4  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node5  32762  7280  6000  1861  1078   4   5 0.22221 0.57947
   node6  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node7  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node10 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node11 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node13 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node14 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node15 32762  7280  6000  1861  1031   4   4 0.22221 0.55408
   node16 32762  7280  6000  1861  1060   4   4 0.22221 0.57007
   node17 32762  7280  6000  1861  1006   5   4 0.22221 0.54105
   node18 32762  7280  6000  1396   761   4   2 0.22221 0.54570
   node19 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node20 32762 13280  6000  1861  1089   3   5 0.40535 0.58565

.fi
.in

Here we see, besides the step list, the initial and final cluster
status, with the final one showing all nodes being N+1 compliant, and
the command list to reach the final solution. In the initial listing,
we see which nodes are not N+1 compliant.

The algorithm is stable as long as each step above is fully completed,
e.g. in step 8, both the migrate and the replace-disks are
done. Otherwise, if only the migrate is done, the input data is
changed in a way that the program will output a different solution
list (but hopefully will end in the same state).
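
The command list in the example above follows a simple pattern: each
\&'f' action becomes a migrate and each 'r:NODE' action becomes a
replace-disks with NODE as the new secondary. The Python sketch below
reproduces that mapping purely as an illustration of the sample
output; commands_for is a hypothetical helper, not part of hbal, and
it only builds strings without running anything.
.in +4n
.nf
```python
def commands_for(instance, actions):
    """Translate one step's action list into gnt-instance command lines."""
    commands = []
    for action in actions:
        if action == "f":
            commands.append("gnt-instance migrate " + instance)
        elif action.startswith("r:"):
            new_secondary = action[2:]
            commands.append("gnt-instance replace-disks -n %s %s"
                            % (new_secondary, instance))
        else:
            raise ValueError("unknown action: " + action)
    return commands

# Step 8 of the example: a=f r:node16
for command in commands_for("instance94", ["f", "r:node16"]):
    print(command)
```
.fi
.in
All commands returned for one step should be completed before the
program is run again, for the stability reason given above.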

.SH SEE ALSO
.BR hn1 "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
.BR gnt-node "(8)"