
.TH HBAL 1 2009-03-23 htools "Ganeti H-tools"
.SH NAME
hbal \- Cluster balancer for Ganeti

.SH SYNOPSIS
.B hbal
.B "[backend options...]"
.B "[algorithm options...]"
.B "[reporting options...]"

.B hbal
.B --version

.TP
Backend options:
.BI "[ -m " cluster " ]"
|
.BI "[ -L[" path "]]"
|
.BI "[ -n " nodes-file " ]"
.BI "[ -i " instances-file " ]"

.TP
Algorithm options:
.BI "[ --max-cpu " cpu-ratio " ]"
.BI "[ --min-disk " disk-ratio " ]"
.BI "[ -l " limit " ]"
.BI "[ -e " score " ]"
.BI "[ -O " name... " ]"

.TP
Reporting options:
.BI "[ -C[" file "] ]"
.B "[ -p ]"
.B "[ -o ]"
.B "[ -v... | -q ]"

.SH DESCRIPTION
hbal is a cluster balancer that looks at the current state of the
cluster (nodes with their total and free disk, memory, etc.) and
instance placement and computes a series of steps designed to bring
the cluster into a better state.

The algorithm to do so is designed to be stable (i.e. it will give you
the same results when restarting it from the middle of the solution)
and reasonably fast. It is not, however, designed to be a perfect
algorithm - it is possible to make it go into a corner from which it
can find no improvement, because it only looks one "step" ahead.

By default, the program will show the solution incrementally as it is
computed, in a somewhat cryptic format; for getting the actual Ganeti
command list, use the \fB-C\fR option.

.SS ALGORITHM

The program works in independent steps; at each step, we compute the
best instance move that lowers the cluster score.

The possible move types for an instance are combinations of
failover/migrate and replace-disks such that we change one of the
instance nodes, and the other one remains (but possibly with a changed
role, e.g. from primary it becomes secondary). The list is:
.RS 4
.TP 3
\(em
failover (f)
.TP
\(em
replace secondary (r)
.TP
\(em
replace primary, a composite move (f, r, f)
.TP
\(em
failover and replace secondary, also composite (f, r)
.TP
\(em
replace secondary and failover, also composite (r, f)
.RE

We don't do the only remaining possibility of replacing both nodes
(r,f,r,f or the equivalent f,r,f,r) since this move needs an
exhaustive search over both candidate primary and secondary nodes, and
is O(n*n) in the number of nodes. Furthermore, it doesn't seem to
give better scores but will result in more disk replacements.
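The step loop above can be sketched as a greedy one-step-lookahead search. This is an illustrative Python sketch on a toy cluster with a single metric (per-node free memory) and a toy "move" that shifts one memory unit between nodes; hbal's real move set and scoring are richer than this, so the helper names here are illustrative, not hbal's actual code:

```python
import statistics

def cluster_score(free_mem):
    """Toy cluster score: coefficient of variance of per-node free memory."""
    mean = statistics.mean(free_mem)
    return statistics.pstdev(free_mem) / mean if mean else 0.0

def possible_moves(free_mem):
    """A toy 'move' shifts one memory unit from node i to node j."""
    for i in range(len(free_mem)):
        for j in range(len(free_mem)):
            if i != j and free_mem[i] > 0:
                yield (i, j)

def apply_move(free_mem, move):
    i, j = move
    new = list(free_mem)
    new[i] -= 1
    new[j] += 1
    return new

def balance(free_mem, min_score=1e-9, max_steps=100):
    """Greedy loop: at each step take the single move that most lowers
    the score; stop when no move improves it (the 'corner' case)."""
    steps = []
    current = cluster_score(free_mem)
    for _ in range(max_steps):
        if current <= min_score:
            break
        best = min(possible_moves(free_mem),
                   key=lambda m: cluster_score(apply_move(free_mem, m)))
        if cluster_score(apply_move(free_mem, best)) >= current:
            break  # no single move improves the score: we are stuck
        free_mem = apply_move(free_mem, best)
        current = cluster_score(free_mem)
        steps.append(best)
    return free_mem, steps

balanced, steps = balance([10, 2, 6])
```

Because each step only looks one move ahead, a sequence of individually-best moves can still end in a local minimum, which is exactly the "corner" behaviour described above.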

.SS CLUSTER SCORING

As said before, the algorithm tries to minimise the cluster score at
each step. Currently this score is computed as a sum of the following
components:
.RS 4
.TP 3
\(em
coefficient of variance of the percent of free memory
.TP
\(em
coefficient of variance of the percent of reserved memory
.TP
\(em
coefficient of variance of the percent of free disk
.TP
\(em
percentage of nodes failing the N+1 check
.TP
\(em
percentage of instances living (either as primary or secondary) on
offline nodes
.TP
\(em
coefficient of variance of the ratio of virtual-to-physical cpus (for
primary instances of the node)
.RE

The free memory and free disk values help ensure that all nodes are
somewhat balanced in their resource usage. The reserved memory helps
to ensure that nodes are somewhat balanced in holding secondary
instances, and that no node keeps too much memory reserved for
N+1. And finally, the N+1 percentage helps guide the algorithm towards
eliminating N+1 failures, if possible.

Except for the N+1 failures and offline instances percentage, we use
the coefficient of variance since this brings the values into the same
unit, so to speak, and with a restricted domain of values (between zero
and one). The percentage of N+1 failures, while also in this numeric
range, doesn't actually have the same meaning, but it has been shown
to work well.
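The normalisation argument can be illustrated numerically. The following Python sketch uses made-up per-node values, and the plain unweighted sum is a simplification of hbal's actual formula:

```python
import statistics

def coeff_of_variance(values):
    """stddev / mean: dimensionless, so metrics with different units
    (memory, disk, cpu ratios) become directly comparable."""
    mean = statistics.mean(values)
    return statistics.pstdev(values) / mean if mean else 0.0

# Made-up per-node fractions of free memory and free disk.
p_fmem = [0.04, 0.95, 0.22, 0.22]
p_fdsk = [0.55, 0.55, 0.50, 0.55]
n1_fail_pct = 2 / 4        # 2 of 4 nodes fail the N+1 check
offline_pct = 0.0          # no instances on offline nodes

# Simplified cluster score: a plain sum of the components.
score = (coeff_of_variance(p_fmem)
         + coeff_of_variance(p_fdsk)
         + n1_fail_pct
         + offline_pct)
```

Note how the very uneven free memory dominates the score here, while the nearly uniform free disk contributes almost nothing; a perfectly even spread would contribute exactly zero.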

The other alternative, using for the N+1 checks the coefficient of
variance of (N+1 fail=1, N+1 pass=0) across nodes, could hint the
algorithm to make more N+1 failures if most nodes are N+1 fail
already. Since this (making N+1 failures) is not allowed by other
rules of the algorithm, the N+1 checks would simply stop working
in this case.

The offline instances percentage (meaning the percentage of instances
living on offline nodes) will cause the algorithm to actively move
instances away from offline nodes. This, coupled with the restriction
on placement given by offline nodes, will cause evacuation of such
nodes.

On a perfectly balanced cluster (all nodes the same size, all
instances the same size and spread across the nodes equally), all
values would be zero. This doesn't happen too often in practice :)

.SS OFFLINE INSTANCES

Since current Ganeti versions do not report the memory used by offline
(down) instances, ignoring the run status of instances will cause
wrong calculations. For this reason, the algorithm subtracts the
memory size of down instances from the free node memory of their
primary node, in effect simulating the startup of such instances.
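The adjustment can be sketched as follows (illustrative Python; the field names are assumptions for the sketch, not Ganeti's data model):

```python
def adjusted_free_mem(node_free_mem, instances):
    """Subtract the memory of down instances from their primary node's
    free memory, simulating their startup as described above."""
    free = dict(node_free_mem)
    for inst in instances:
        if not inst["running"]:
            free[inst["pnode"]] -= inst["mem"]
    return free

free = adjusted_free_mem(
    {"node1": 4096, "node2": 8192},
    [{"pnode": "node1", "mem": 1024, "running": False},
     {"pnode": "node2", "mem": 2048, "running": True}],
)
```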

.SS OTHER POSSIBLE METRICS

It would be desirable to add more metrics to the algorithm, especially
dynamically-computed metrics, such as:
.RS 4
.TP 3
\(em
CPU usage of instances
.TP
\(em
Disk IO usage
.TP
\(em
Network IO
.RE

.SH OPTIONS
The options that can be passed to the program are as follows:
.TP
.B -C, --print-commands
Print the command list at the end of the run. Without this, the
program will only show a shorter, but cryptic output.
.TP
.B -p, --print-nodes
Prints the before and after node status, in a format designed to allow
the user to understand the node's most important parameters.

The node list will contain the following information:
.RS
.TP
.B F
a character denoting the status of the node, with '-' meaning an
offline node, '*' meaning N+1 failure and blank meaning a good node
.TP
.B Name
the node name
.TP
.B t_mem
the total node memory
.TP
.B n_mem
the memory used by the node itself
.TP
.B i_mem
the memory used by instances
.TP
.B x_mem
the amount of memory which seems to be in use but for which it cannot
be determined why or by which instance; usually this means that the
hypervisor has some overhead or that there are other reporting errors
.TP
.B f_mem
the free node memory
.TP
.B r_mem
the reserved node memory, which is the amount of free memory needed
for N+1 compliance
.TP
.B t_dsk
total disk
.TP
.B f_dsk
free disk
.TP
.B pcpu
the number of physical cpus on the node
.TP
.B vcpu
the number of virtual cpus allocated to primary instances
.TP
.B pri
number of primary instances
.TP
.B sec
number of secondary instances
.TP
.B p_fmem
percent of free memory
.TP
.B p_fdsk
percent of free disk
.TP
.B r_cpu
ratio of virtual to physical cpus
.RE

.TP
.B -o, --oneline
Only shows a one-line output from the program, designed for the case
when one wants to look at multiple clusters at once and check their
status.

The line will contain four fields:
.RS
.RS 4
.TP 3
\(em
initial cluster score
.TP
\(em
number of steps in the solution
.TP
\(em
final cluster score
.TP
\(em
improvement in the cluster score
.RE
.RE
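When scripting over many clusters, such a line can be picked apart as in this Python sketch; note that the exact field separators and ordering are assumptions based on the field list above, so verify against your hbal version's actual output:

```python
def parse_oneline(line):
    """Parse hbal's one-line output, assuming the four documented fields
    appear whitespace-separated and in the documented order."""
    init_score, steps, final_score, improvement = line.split()
    return {
        "initial": float(init_score),
        "steps": int(steps),
        "final": float(final_score),
        "improvement": float(improvement),
    }

fields = parse_oneline("0.52329131 15 0.00252594 0.52076537")
```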

.TP
.BI "-O " name
This option (which can be given multiple times) will mark nodes as
being \fIoffline\fR. This means a couple of things:
.RS
.RS 4
.TP 3
\(em
instances won't be placed on these nodes, not even temporarily;
e.g. the \fIreplace primary\fR move is not available if the secondary
node is offline, since this move requires a failover.
.TP
\(em
these nodes will not be included in the score calculation (except for
the percentage of instances on offline nodes)
.RE
Note that hbal will also mark as offline any nodes which are reported
by RAPI as such, or that have "?" in file-based input in any numeric
fields.
.RE

.TP
.BI "-e" score ", --min-score=" score
This parameter denotes the minimum score we are happy with and alters
the computation in two ways:
.RS
.RS 4
.TP 3
\(em
if the cluster has an initial score lower than this value, then we
don't enter the algorithm at all, and exit with success
.TP
\(em
during the iterative process, if we reach a score lower than this
value, we exit the algorithm
.RE
The default value of the parameter is currently \fI1e-9\fR (chosen
empirically).
.RE

.TP
.BI "-n" nodefile ", --nodes=" nodefile
The name of the file holding node information (if not collecting via
RAPI), instead of the default \fInodes\fR file (but see below how to
customize the default value via the environment).

.TP
.BI "-i" instancefile ", --instances=" instancefile
The name of the file holding instance information (if not collecting
via RAPI), instead of the default \fIinstances\fR file (but see below
how to customize the default value via the environment).

.TP
.BI "-m" cluster
Collect data not from files but directly from the
.I cluster
given as an argument via RAPI. If the argument doesn't contain a colon
(:), then it is converted into a fully-built URL by prepending
https:// and appending the default RAPI port; otherwise it's
considered a fully-specified URL and is used as-is.
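The URL construction rule can be sketched as follows (Python; the default port value here is an assumption for illustration, so consult your Ganeti installation's RAPI documentation for the real default):

```python
DEFAULT_RAPI_PORT = 5080  # assumed default; verify for your installation

def build_rapi_url(cluster):
    """No colon in the argument -> build https://<cluster>:<port>;
    otherwise treat the argument as a fully-specified URL."""
    if ":" not in cluster:
        return "https://%s:%d" % (cluster, DEFAULT_RAPI_PORT)
    return cluster
```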

.TP
.BI "-L[" path "]"
Collect data not from files but directly from the master daemon, which
is to be contacted via LUXI (an internal Ganeti protocol). An
optional \fIpath\fR argument is interpreted as the path to the unix
socket on which the master daemon listens; otherwise, the default path
used by Ganeti when installed with "--localstatedir=/var" is used.

.TP
.BI "-l" N ", --max-length=" N
Restrict the solution to this length. This can be used for example to
automate the execution of the balancing.

.TP
.BI "--max-cpu " cpu-ratio
The maximum virtual-to-physical cpu ratio, as a floating point number
greater than or equal to one. For example, specifying \fIcpu-ratio\fR
as \fB2.5\fR means that, for a 4-cpu machine, a maximum of 10 virtual
cpus should be allowed to be in use for primary instances. A value
below one doesn't make sense, as that would not even allow one virtual
cpu per physical cpu.
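The 4-cpu/2.5-ratio example works out as in this small Python sketch (the exact comparison hbal performs internally is an assumption here):

```python
def cpu_ok(physical_cpus, primary_vcpus, max_cpu_ratio):
    """A node passes the CPU check when the vcpus of its primary
    instances stay within physical_cpus * max_cpu_ratio."""
    return primary_vcpus <= physical_cpus * max_cpu_ratio
```

With a ratio of 2.5, a 4-cpu node may carry up to 10 primary vcpus; an 11th would fail the check.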

.TP
.BI "--min-disk " disk-ratio
The minimum amount of free disk space remaining, as a floating point
number. For example, specifying \fIdisk-ratio\fR as \fB0.25\fR means
that at least one quarter of disk space should be left free on nodes.

.TP
.B -v, --verbose
Increase the output verbosity. Each usage of this option will increase
the verbosity (currently more than 2 doesn't make sense) from the
default of one.

.TP
.B -q, --quiet
Decrease the output verbosity. Each usage of this option will decrease
the verbosity (less than zero doesn't make sense) from the default of
one.

.TP
.B -V, --version
Just show the program version and exit.

.SH EXIT STATUS

The exit status of the command will be zero, unless for some reason
the algorithm failed fatally (e.g. wrong node or instance data).

.SH ENVIRONMENT

If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
present in the environment, they will override the default names for
the nodes and instances files. These will, of course, have no effect
when RAPI is used.

.SH BUGS

The program does not check its input data for consistency, and aborts
with cryptic error messages in this case.

The algorithm is not perfect.

The output format is not easily scriptable, and the program should
feed moves directly into Ganeti (either via RAPI or via a gnt-debug
input file).

.SH EXAMPLE

Note that these examples are not for the latest version (they don't
have full node data).

.SS Default output

With the default options, the program shows each individual step and
the improvements it brings in cluster score:

.in +4n
.nf
.RB "$" " hbal"
Loaded 20 nodes, 80 instances
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
Initial score: 0.52329131
Trying to minimize the CV...
    1. instance14  node1:node10  => node16:node10 0.42109120 a=f r:node16 f
    2. instance54  node4:node15  => node16:node15 0.31904594 a=f r:node16 f
    3. instance4   node5:node2   => node2:node16  0.26611015 a=f r:node16
    4. instance48  node18:node20 => node2:node18  0.21361717 a=r:node2 f
    5. instance93  node19:node18 => node16:node19 0.16166425 a=r:node16 f
    6. instance89  node3:node20  => node2:node3   0.11005629 a=r:node2 f
    7. instance5   node6:node2   => node16:node6  0.05841589 a=r:node16 f
    8. instance94  node7:node20  => node20:node16 0.00658759 a=f r:node16
    9. instance44  node20:node2  => node2:node15  0.00438740 a=f r:node15
   10. instance62  node14:node18 => node14:node16 0.00390087 a=r:node16
   11. instance13  node11:node14 => node11:node16 0.00361787 a=r:node16
   12. instance19  node10:node11 => node10:node7  0.00336636 a=r:node7
   13. instance43  node12:node13 => node12:node1  0.00305681 a=r:node1
   14. instance1   node1:node2   => node1:node4   0.00263124 a=r:node4
   15. instance58  node19:node20 => node19:node17 0.00252594 a=r:node17
Cluster score improved from 0.52329131 to 0.00252594
.fi
.in

In the above output, we can see:
  - the input data (here from files) shows a cluster with 20 nodes and
    80 instances
  - the cluster is not initially N+1 compliant
  - the initial score is 0.52329131

The step list follows, showing the instance, its initial
primary/secondary nodes, the new primary/secondary nodes, the new
cluster score, and the actions taken in this step (with 'f' denoting
failover/migrate and 'r' denoting replace secondary).

Finally, the program shows the improvement in cluster score.

A more detailed output is obtained via the \fB-C\fR and \fB-p\fR options:

.in +4n
.nf
.RB "$" " hbal"
Loaded 20 nodes, 80 instances
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
Initial cluster status:
N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
 * node1  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node2  32762 31280 12000  1861  1026   0   8 0.95476 0.55179
 * node3  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node4  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node5  32762  1280  6000  1861   978   5   5 0.03907 0.52573
 * node6  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node7  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 * node10 32762  7280 12000  1861  1026   4   4 0.22221 0.55179
   node11 32762  7280  6000  1861   922   4   5 0.22221 0.49577
   node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node13 32762  7280  6000  1861   922   4   5 0.22221 0.49577
   node14 32762  7280  6000  1861   922   4   5 0.22221 0.49577
 * node15 32762  7280 12000  1861  1131   4   3 0.22221 0.60782
   node16 32762 31280     0  1861  1860   0   0 0.95476 1.00000
   node17 32762  7280  6000  1861  1106   5   3 0.22221 0.59479
 * node18 32762  1280  6000  1396   561   5   3 0.03907 0.40239
 * node19 32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node20 32762 13280 12000  1861   689   3   9 0.40535 0.37068

Initial score: 0.52329131
Trying to minimize the CV...
    1. instance14  node1:node10  => node16:node10 0.42109120 a=f r:node16 f
    2. instance54  node4:node15  => node16:node15 0.31904594 a=f r:node16 f
    3. instance4   node5:node2   => node2:node16  0.26611015 a=f r:node16
    4. instance48  node18:node20 => node2:node18  0.21361717 a=r:node2 f
    5. instance93  node19:node18 => node16:node19 0.16166425 a=r:node16 f
    6. instance89  node3:node20  => node2:node3   0.11005629 a=r:node2 f
    7. instance5   node6:node2   => node16:node6  0.05841589 a=r:node16 f
    8. instance94  node7:node20  => node20:node16 0.00658759 a=f r:node16
    9. instance44  node20:node2  => node2:node15  0.00438740 a=f r:node15
   10. instance62  node14:node18 => node14:node16 0.00390087 a=r:node16
   11. instance13  node11:node14 => node11:node16 0.00361787 a=r:node16
   12. instance19  node10:node11 => node10:node7  0.00336636 a=r:node7
   13. instance43  node12:node13 => node12:node1  0.00305681 a=r:node1
   14. instance1   node1:node2   => node1:node4   0.00263124 a=r:node4
   15. instance58  node19:node20 => node19:node17 0.00252594 a=r:node17
Cluster score improved from 0.52329131 to 0.00252594

Commands to run to reach the above solution:
  echo step 1
  echo gnt-instance migrate instance14
  echo gnt-instance replace-disks -n node16 instance14
  echo gnt-instance migrate instance14
  echo step 2
  echo gnt-instance migrate instance54
  echo gnt-instance replace-disks -n node16 instance54
  echo gnt-instance migrate instance54
  echo step 3
  echo gnt-instance migrate instance4
  echo gnt-instance replace-disks -n node16 instance4
  echo step 4
  echo gnt-instance replace-disks -n node2 instance48
  echo gnt-instance migrate instance48
  echo step 5
  echo gnt-instance replace-disks -n node16 instance93
  echo gnt-instance migrate instance93
  echo step 6
  echo gnt-instance replace-disks -n node2 instance89
  echo gnt-instance migrate instance89
  echo step 7
  echo gnt-instance replace-disks -n node16 instance5
  echo gnt-instance migrate instance5
  echo step 8
  echo gnt-instance migrate instance94
  echo gnt-instance replace-disks -n node16 instance94
  echo step 9
  echo gnt-instance migrate instance44
  echo gnt-instance replace-disks -n node15 instance44
  echo step 10
  echo gnt-instance replace-disks -n node16 instance62
  echo step 11
  echo gnt-instance replace-disks -n node16 instance13
  echo step 12
  echo gnt-instance replace-disks -n node7 instance19
  echo step 13
  echo gnt-instance replace-disks -n node1 instance43
  echo step 14
  echo gnt-instance replace-disks -n node4 instance1
  echo step 15
  echo gnt-instance replace-disks -n node17 instance58

Final cluster status:
N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
   node1  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node2  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node3  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node4  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node5  32762  7280  6000  1861  1078   4   5 0.22221 0.57947
   node6  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node7  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node10 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node11 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node13 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node14 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node15 32762  7280  6000  1861  1031   4   4 0.22221 0.55408
   node16 32762  7280  6000  1861  1060   4   4 0.22221 0.57007
   node17 32762  7280  6000  1861  1006   5   4 0.22221 0.54105
   node18 32762  7280  6000  1396   761   4   2 0.22221 0.54570
   node19 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node20 32762 13280  6000  1861  1089   3   5 0.40535 0.58565

.fi
.in

Here we see, besides the step list, the initial and final cluster
status, with the final one showing all nodes being N+1 compliant, and
the command list to reach the final solution. In the initial listing,
we see which nodes are not N+1 compliant.

The algorithm is stable as long as each step above is fully completed,
e.g. in step 8, both the migrate and the replace-disks are
done. Otherwise, if only the migrate is done, the input data is
changed in a way that the program will output a different solution
list (but hopefully will end in the same state).

.SH SEE ALSO
.BR hspace "(1), " hscan "(1), " hail "(1), "
.BR ganeti "(7), " gnt-instance "(8), " gnt-node "(8)"

.SH "COPYRIGHT"
.PP
Copyright (C) 2009 Google Inc. Permission is granted to copy,
distribute and/or modify under the terms of the GNU General Public
License as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
.PP
On Debian systems, the complete text of the GNU General Public License
can be found in /usr/share/common-licenses/GPL.