   1 .TH HBAL 1 2009-03-23 htools "Ganeti H-tools"
   2 .SH NAME
   3 hbal \- Cluster balancer for Ganeti
   4
   5 .SH SYNOPSIS
   6 .B hbal
   7 .B "[-C]"
   8 .B "[-p]"
   9 .B "[-o]"
  10 .B "[-v... | -q]"
  11 .BI "[-l" limit "]"
  12 .BI "[-O" name... "]"
  13 .BI "[-e" score "]"
  14 .BI "[-m " cluster "]"
  15 .BI "[-n " nodes-file "]"
  16 .BI "[-i " instances-file "]"
  17 .BI "[--max-cpu " cpu-ratio "]"
  18 .BI "[--min-disk " disk-ratio "]"
  19
  20 .B hbal
  21 .B --version
  22
  23 .SH DESCRIPTION
  24 hbal is a cluster balancer that looks at the current state of the
  25 cluster (nodes with their total and free disk, memory, etc.) and
  26 instance placement and computes a series of steps designed to bring
  27 the cluster into a better state.
  28
  29 The algorithm to do so is designed to be stable (i.e. it will give you
  30 the same results when restarting it from the middle of the solution)
  31 and reasonably fast. It is not, however, designed to be a perfect
  32 algorithm - it is possible to make it go into a corner from which it
  33 can find no improvement, because it only looks one "step" ahead.
  34
  35 By default, the program will show the solution incrementally as it is
  36 computed, in a somewhat cryptic format; for getting the actual Ganeti
  37 command list, use the \fB-C\fR option.
  38
  39 .SS ALGORITHM
  40
  41 The program works in independent steps; at each step, we compute the
  42 best instance move that lowers the cluster score.
  43
  44 The possible move types for an instance are combinations of
  45 failover/migrate and replace-disks such that we change one of the
  46 instance nodes, and the other one remains (but possibly with changed
  47 role, e.g. from primary it becomes secondary). The list is:
  48 .RS 4
  49 .TP 3
  50 \(em
  51 failover (f)
  52 .TP
  53 \(em
  54 replace secondary (r)
  55 .TP
  56 \(em
  57 replace primary, a composite move (f, r, f)
  58 .TP
  59 \(em
  60 failover and replace secondary, also composite (f, r)
  61 .TP
  62 \(em
  63 replace secondary and failover, also composite (r, f)
  64 .RE
  65
  66 We don't do the only remaining possibility of replacing both nodes
  67 (r,f,r,f or the equivalent f,r,f,r) since this move needs an
  68 exhaustive search over both candidate primary and secondary nodes, and
  69 is O(n*n) in the number of nodes. Furthermore, it doesn't seem to
  70 give better scores but will result in more disk replacements.
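As a rough illustration of the greedy, one-step-ahead search described
above, the following Python sketch balances a toy cluster. Everything in
it is an assumption made for illustration: the cluster is reduced to a
list of per-node loads, a "move" shifts one unit of load between two
nodes, and the score is the coefficient of variation of the loads. The
names score, best_move and balance are hypothetical and are not part of
hbal; the min_score and max_length parameters only mimic the -e and -l
options.

```python
from itertools import permutations
from statistics import mean, pstdev


def score(loads):
    """Toy cluster score: coefficient of variation of the per-node load."""
    m = mean(loads)
    return pstdev(loads) / m if m else 0.0


def best_move(loads):
    """Return (new_score, src, dst) for the best single-unit move, or None."""
    best = None
    for src, dst in permutations(range(len(loads)), 2):
        if loads[src] == 0:
            continue  # nothing to move away from this node
        trial = list(loads)
        trial[src] -= 1
        trial[dst] += 1
        s = score(trial)
        if best is None or s < best[0]:
            best = (s, src, dst)
    return best


def balance(loads, min_score=1e-9, max_length=None):
    """Greedily apply the best move until no move lowers the score."""
    loads = list(loads)
    steps = []
    current = score(loads)
    while current >= min_score:
        if max_length is not None and len(steps) >= max_length:
            break  # solution length limit reached
        candidate = best_move(loads)
        if candidate is None or candidate[0] >= current:
            break  # stuck: no single move improves the score
        current, src, dst = candidate
        loads[src] -= 1
        loads[dst] += 1
        steps.append((src, dst, current))
    return loads, steps
```

Because each step depends only on the current state, restarting this
sketch from the middle of a solution yields the remaining steps
unchanged, which mirrors the stability property mentioned in the
description.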
  71
  72 .SS CLUSTER SCORING
  73
  74 As said before, the algorithm tries to minimise the cluster score at
  75 each step. Currently this score is computed as a sum of the following
  76 components:
  77 .RS 4
  78 .TP 3
  79 \(em
  80 coefficient of variation of the percent of free memory
  81 .TP
  82 \(em
  83 coefficient of variation of the percent of reserved memory
  84 .TP
  85 \(em
  86 coefficient of variation of the percent of free disk
  87 .TP
  88 \(em
  89 percentage of nodes failing N+1 check
  90 .TP
  91 \(em
  92 percentage of instances living (either as primary or secondary) on
  93 offline nodes
  94 .TP
  95 \(em
  96 coefficient of variation of the ratio of virtual-to-physical cpus (for
  97 primary instances of the node)
  98 .RE
  99
 100 The free memory and free disk values help ensure that all nodes are
 101 somewhat balanced in their resource usage. The reserved memory helps
 102 to ensure that nodes are somewhat balanced in holding secondary
 103 instances, and that no node keeps too much memory reserved for
 104 N+1. And finally, the N+1 percentage helps guide the algorithm towards
 105 eliminating N+1 failures, if possible.
 106
 107 Except for the N+1 failures and offline instances percentage, we use
 108 the coefficient of variation since this brings the values into the same
 109 unit so to speak, and with a restricted domain of values (between zero
 110 and one). The percentage of N+1 failures, while also in this numeric
 111 range, doesn't actually have the same meaning, but it has been shown to
 112 work well.
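The score components listed above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the Node class and
the cluster_score function are hypothetical names, the field names are
borrowed from the node listing printed by -p, the population standard
deviation is assumed for the coefficient of variation, and the two
"percentage" components are assumed to be fractions between zero and
one.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Node:
    t_mem: float  # total memory
    f_mem: float  # free memory
    r_mem: float  # memory reserved for N+1
    t_dsk: float  # total disk
    f_dsk: float  # free disk
    pcpu: int     # physical cpus
    vcpu: int     # virtual cpus of primary instances
    n1_fail: bool = False


def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def cluster_score(nodes, offline_instance_fraction=0.0):
    """Sum of the components described in the text (assumed unweighted)."""
    return (
        cv([n.f_mem / n.t_mem for n in nodes])
        + cv([n.r_mem / n.t_mem for n in nodes])
        + cv([n.f_dsk / n.t_dsk for n in nodes])
        + sum(n.n1_fail for n in nodes) / len(nodes)
        + offline_instance_fraction
        + cv([n.vcpu / n.pcpu for n in nodes])
    )
```

With identical nodes and no N+1 failures every component is zero, which
matches the "perfectly balanced cluster" case.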
 113
 114 The other alternative, using for N+1 checks the coefficient of
 115 variation of (N+1 fail=1, N+1 pass=0) across nodes, could hint the
 116 algorithm to make more N+1 failures if most nodes are N+1 fail
 117 already. Since this (making N+1 failures) is not allowed by other
 118 rules of the algorithm, the N+1 checks would simply not work
 119 anymore in this case.
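A small numeric check of this argument, as a sketch only: the functions
indicator_cv and failing_fraction are hypothetical helpers, and the
population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def indicator_cv(failing, total):
    """CV of a 0/1 vector with one entry per node (1 = N+1 fail)."""
    flags = [1] * failing + [0] * (total - failing)
    m = mean(flags)
    return pstdev(flags) / m if m else 0.0


def failing_fraction(failing, total):
    """Plain fraction of nodes failing the N+1 check."""
    return failing / total


for failing in (18, 19, 20):
    print(failing,
          round(indicator_cv(failing, 20), 3),
          failing_fraction(failing, 20))
```

For 20 nodes, the indicator-based value falls from about 0.333 (18
failing) to about 0.229 (19 failing) and to 0 (all failing), so a
minimiser would be rewarded for creating failures, while the plain
fraction rises from 0.9 to 0.95 to 1.0.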
 120
 121 The offline instances percentage (meaning the percentage of instances
 122 living on offline nodes) will cause the algorithm to actively move
 123 instances away from offline nodes. This, coupled with the restriction
 124 on placement given by offline nodes, will cause evacuation of such
 125 nodes.
 126
 127 On a perfectly balanced cluster (all nodes the same size, all
 128 instances the same size and spread across the nodes equally), all
 129 values would be zero. This doesn't happen too often in practice :)
 130
 131 .SS OFFLINE INSTANCES
 132
 133 Since current Ganeti versions do not report the memory used by offline
 134 (down) instances, ignoring the run status of instances will cause
 135 wrong calculations. For this reason, the algorithm subtracts the
 136 memory size of down instances from the free node memory of their
 137 primary node, in effect simulating the startup of such instances.
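The adjustment can be pictured with the following Python sketch. The
function name and the tuple layout of the instance records are
assumptions made for this illustration; they do not reflect hbal's
internal data structures.

```python
def adjusted_free_memory(node_free_mem, instances):
    """Subtract the memory of down instances from their primary node.

    node_free_mem: dict mapping node name to reported free memory.
    instances: iterable of (name, primary_node, mem_size, is_running).
    """
    free = dict(node_free_mem)  # do not modify the caller's data
    for _name, primary, mem, running in instances:
        if not running:
            # Simulate starting the instance on its primary node.
            free[primary] -= mem
    return free
```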
 138
 139 .SS OTHER POSSIBLE METRICS
 140
 141 It would be desirable to add more metrics to the algorithm, especially
 142 dynamically-computed metrics, such as:
 143 .RS 4
 144 .TP 3
 145 \(em
 146 CPU usage of instances
 147 .TP
 148 \(em
 149 Disk IO usage
 150 .TP
 151 \(em
 152 Network IO
 153 .RE
 154
 155 .SH OPTIONS
 156 The options that can be passed to the program are as follows:
 157 .TP
 158 .B -C, --print-commands
 159 Print the command list at the end of the run. Without this, the
 160 program will only show a shorter, but cryptic output.
 161 .TP
 162 .B -p, --print-nodes
 163 Prints the before and after node status, in a format designed to allow
 164 the user to understand the node's most important parameters.
 165
 166 The node list will contain the following information:
 167 .RS
 168 .TP
 169 .B F
 170 a character denoting the status of the node, with '-' meaning an
 171 offline node, '*' meaning N+1 failure and blank meaning a good node
 172 .TP
 173 .B Name
 174 the node name
 175 .TP
 176 .B t_mem
 177 the total node memory
 178 .TP
 179 .B n_mem
 180 the memory used by the node itself
 181 .TP
 182 .B i_mem
 183 the memory used by instances
 184 .TP
 185 .B x_mem
 186 the amount of memory which seems to be in use, but for which it cannot
 187 be determined why or by which instance; usually this means that the
 188 hypervisor has some overhead or that there are other reporting errors
 189 .TP
 190 .B f_mem
 191 the free node memory
 192 .TP
 193 .B r_mem
 194 the reserved node memory, which is the amount of free memory needed
 195 for N+1 compliance
 196 .TP
 197 .B t_dsk
 198 total disk
 199 .TP
 200 .B f_dsk
 201 free disk
 202 .TP
 203 .B pcpu
 204 the number of physical cpus on the node
 205 .TP
 206 .B vcpu
 207 the number of virtual cpus allocated to primary instances
 208 .TP
 209 .B pri
 210 number of primary instances
 211 .TP
 212 .B sec
 213 number of secondary instances
 214 .TP
 215 .B p_fmem
 216 percent of free memory
 217 .TP
 218 .B p_fdsk
 219 percent of free disk
 220 .TP
 221 .B r_cpu
 222 ratio of virtual to physical cpus
 223 .RE
 224
 225 .TP
 226 .B -o, --oneline
 227 Only shows a one-line output from the program, designed for the case
 228 when one wants to look at multiple clusters at once and check their
 229 status.
 230
 231 The line will contain four fields:
 232 .RS
 233 .RS 4
 234 .TP 3
 235 \(em
 236 initial cluster score
 237 .TP
 238 \(em
 239 number of steps in the solution
 240 .TP
 241 \(em
 242 final cluster score
 243 .TP
 244 \(em
 245 improvement in the cluster score
 246 .RE
 247 .RE
 248
 249 .TP
 250 .BI "-O " name
 251 This option (which can be given multiple times) will mark nodes as
 252 being \fIoffline\fR. This means a couple of things:
 253 .RS
 254 .RS 4
 255 .TP 3
 256 \(em
 257 instances won't be placed on these nodes, not even temporarily;
 258 e.g. the \fIreplace primary\fR move is not available if the secondary
 259 node is offline, since this move requires a failover.
 260 .TP
 261 \(em
 262 these nodes will not be included in the score calculation (except for
 263 the percentage of instances on offline nodes)
 264 .RE
 265 Note that hbal will also mark as offline any nodes which are reported
 266 by RAPI as such, or that have "?" in file-based input in any numeric
 267 fields.
 268 .RE
 269
 270 .TP
 271 .BI "-e" score ", --min-score=" score
 272 This parameter denotes the minimum score we are happy with and alters
 273 the computation in two ways:
 274 .RS
 275 .RS 4
 276 .TP 3
 277 \(em
 278 if the initial cluster score is lower than this value, then we
 279 don't enter the algorithm at all, and exit with success
 280 .TP
 281 \(em
 282 during the iterative process, if we reach a score lower than this
 283 value, we exit the algorithm
 284 .RE
 285 The default value of the parameter is currently \fI1e-9\fR (chosen
 286 empirically).
 287 .RE
 288
 289 .TP
 290 .BI "-n" nodefile ", --nodes=" nodefile
 291 The name of the file holding node information (if not collecting via
 292 RAPI), instead of the default \fInodes\fR file (but see below how to
 293 customize the default value via the environment).
 294
 295 .TP
 296 .BI "-i" instancefile ", --instances=" instancefile
 297 The name of the file holding instance information (if not collecting
 298 via RAPI), instead of the default \fIinstances\fR file (but see below
 299 how to customize the default value via the environment).
 300
 301 .TP
 302 .BI "-m" cluster
 303 Collect data not from files but directly from the
 304 .I cluster
 305 given as an argument via RAPI. If the argument doesn't contain a colon
 306 (:), then it is converted into a fully-built URL via prepending
 307 https:// and appending the default RAPI port, otherwise it's
 308 considered a fully-specified URL and is used as-is.
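The rule for the cluster argument can be sketched as below. The function
name rapi_url is hypothetical, and the port value 5080 is an assumption
about the default RAPI port; check it against the installed Ganeti
version.

```python
DEFAULT_RAPI_PORT = 5080  # assumption: default Ganeti RAPI port


def rapi_url(cluster, port=DEFAULT_RAPI_PORT):
    """Turn the -m argument into a URL following the rule in the text."""
    if ":" in cluster:
        # Contains a colon: treated as a fully-specified URL, used as-is.
        return cluster
    return f"https://{cluster}:{port}"
```

Note that, by this rule, an argument such as host:1234 is passed through
unchanged because it contains a colon.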
 309
 310 .TP
 311 .BI "-l" N ", --max-length=" N
 312 Restrict the solution to this length. This can be used for example to
 313 automate the execution of the balancing.
 314
 315 .TP
 316 .BI "--max-cpu " cpu-ratio
 317 The maximum virtual-to-physical cpu ratio, as a floating point
 318 number. For example, specifying \fIcpu-ratio\fR as \fB2.5\fR means
 319 that, for a 4-cpu machine, a maximum of 10 virtual cpus should be
 320 allowed to be in use for primary instances.
 321
 322 .TP
 323 .BI "--min-disk " disk-ratio
 324 The minimum amount of free disk space remaining, as a floating point
 325 number between zero and one. For example, specifying \fIdisk-ratio\fR
 326 as \fB0.25\fR means that at least one quarter of disk space should be
 327 left free on nodes. A value of one doesn't make sense though, as that
 328 means no disk space can be used on the node.
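The arithmetic behind the two limits can be sketched as follows. Both
function names are hypothetical and the parameter names are borrowed
from the node listing; how hbal applies the limits internally is not
described here, so treat this as an illustration only.

```python
def max_primary_vcpus(pcpu, cpu_ratio):
    """Largest number of virtual cpus allowed for primary instances."""
    return pcpu * cpu_ratio


def within_limits(pcpu, vcpu, t_dsk, f_dsk, max_cpu=None, min_disk=None):
    """Check a node against the optional cpu and disk ratios."""
    if max_cpu is not None and vcpu / pcpu > max_cpu:
        return False  # too many virtual cpus per physical cpu
    if min_disk is not None and f_dsk / t_dsk < min_disk:
        return False  # not enough free disk left
    return True
```

For the example in the text, max_primary_vcpus(4, 2.5) gives 10.0.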
 329
 330 .TP
 331 .B -v, --verbose
 332 Increase the output verbosity. Each usage of this option will increase
 333 the verbosity (currently more than 2 doesn't make sense) from the
 334 default of one.
 335
 336 .TP
 337 .B -q, --quiet
 338 Decrease the output verbosity. Each usage of this option will decrease
 339 the verbosity (less than zero doesn't make sense) from the default of
 340 one.
 341
 342 .TP
 343 .B -V, --version
 344 Just show the program version and exit.
 345
 346 .SH EXIT STATUS
 347
 348 The exit status of the command will be zero, unless for some reason
 349 the algorithm fatally failed (e.g. wrong node or instance data).
 350
 351 .SH ENVIRONMENT
 352
 353 If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
 354 present in the environment, they will override the default names for
 355 the nodes and instances files. These will of course have no effect
 356 when RAPI is used.
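The resulting precedence (command-line option, then environment, then
built-in default) might look like this in Python. The function
input_files and its parameters are hypothetical; only the variable names
and the default file names come from this page.

```python
import os


def input_files(nodes_opt=None, instances_opt=None, environ=None):
    """Resolve the nodes and instances file names.

    nodes_opt / instances_opt stand for the values of -n and -i, if given.
    """
    env = os.environ if environ is None else environ
    nodes = nodes_opt or env.get("HTOOLS_NODES", "nodes")
    instances = instances_opt or env.get("HTOOLS_INSTANCES", "instances")
    return nodes, instances
```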
 357
 358 .SH BUGS
 359
 360 The program does not check its input data for consistency, and aborts
 361 with cryptic error messages in this case.
 362
 363 The algorithm is not perfect.
 364
 365 The output format is not easily scriptable, and the program should
 366 feed moves directly into Ganeti (either via RAPI or via a gnt-debug
 367 input file).
 368
 369 .SH EXAMPLE
 370
 371 Note that these examples are not for the latest version (they don't have
 372 full node data).
 373
 374 .SS Default output
 375
 376 With the default options, the program shows each individual step and
 377 the improvements it brings in cluster score:
 378
 379 .in +4n
 380 .nf
 381 .RB "$" " hbal"
 382 Loaded 20 nodes, 80 instances
 383 Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
 384 Initial score: 0.52329131
 385 Trying to minimize the CV...
 386     1. instance14  node1:node10  => node16:node10 0.42109120 a=f r:node16 f
 387     2. instance54  node4:node15  => node16:node15 0.31904594 a=f r:node16 f
 388     3. instance4   node5:node2   => node2:node16  0.26611015 a=f r:node16
 389     4. instance48  node18:node20 => node2:node18  0.21361717 a=r:node2 f
 390     5. instance93  node19:node18 => node16:node19 0.16166425 a=r:node16 f
 391     6. instance89  node3:node20  => node2:node3   0.11005629 a=r:node2 f
 392     7. instance5   node6:node2   => node16:node6  0.05841589 a=r:node16 f
 393     8. instance94  node7:node20  => node20:node16 0.00658759 a=f r:node16
 394     9. instance44  node20:node2  => node2:node15  0.00438740 a=f r:node15
 395    10. instance62  node14:node18 => node14:node16 0.00390087 a=r:node16
 396    11. instance13  node11:node14 => node11:node16 0.00361787 a=r:node16
 397    12. instance19  node10:node11 => node10:node7  0.00336636 a=r:node7
 398    13. instance43  node12:node13 => node12:node1  0.00305681 a=r:node1
 399    14. instance1   node1:node2   => node1:node4   0.00263124 a=r:node4
 400    15. instance58  node19:node20 => node19:node17 0.00252594 a=r:node17
 401 Cluster score improved from 0.52329131 to 0.00252594
 402 .fi
 403 .in
 404
 405 In the above output, we can see:
 406   - the input data (here from files) shows a cluster with 20 nodes and
 407     80 instances
 408   - the cluster is not initially N+1 compliant
 409   - the initial score is 0.52329131
 410
 411 The step list follows, showing the instance, its initial
 412 primary/secondary nodes, the new primary/secondary nodes, the cluster
 413 score after the step, and the actions taken in this step (with 'f'
 414 denoting failover/migrate and 'r' denoting replace secondary).
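Since the step list is not meant to be machine-readable, any parser for
it is fragile. Still, a minimal Python sketch that splits one step line
of the output shown above into its fields could look like this. The
regular expression and the parse_step name are assumptions based only on
the example output, which may differ between versions.

```python
import re

# One step line, e.g.:
#   1. instance14  node1:node10  => node16:node10 0.42109120 a=f r:node16 f
STEP_RE = re.compile(
    r"^\s*(?P<step>\d+)\.\s+(?P<instance>\S+)\s+"
    r"(?P<old_pri>[^:\s]+):(?P<old_sec>\S+)\s+=>\s+"
    r"(?P<new_pri>[^:\s]+):(?P<new_sec>\S+)\s+"
    r"(?P<score>\d+\.\d+)\s+a=(?P<actions>.+)$"
)


def parse_step(line):
    """Return a dict describing one step line, or None if it is not one."""
    m = STEP_RE.match(line)
    if m is None:
        return None
    step = m.groupdict()
    step["step"] = int(step["step"])
    step["score"] = float(step["score"])
    step["actions"] = step["actions"].split()
    return step
```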
 415
 416 Finally, the program shows the improvement in cluster score.
 417
 418 A more detailed output is obtained via the \fB-C\fR and \fB-p\fR options:
 419
 420 .in +4n
 421 .nf
 422 .RB "$" " hbal -C -p"
 423 Loaded 20 nodes, 80 instances
 424 Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
 425 Initial cluster status:
 426 N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
 427  * node1  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 428    node2  32762 31280 12000  1861  1026   0   8 0.95476 0.55179
 429  * node3  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 430  * node4  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 431  * node5  32762  1280  6000  1861   978   5   5 0.03907 0.52573
 432  * node6  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 433  * node7  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 434    node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 435    node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 436  * node10 32762  7280 12000  1861  1026   4   4 0.22221 0.55179
 437    node11 32762  7280  6000  1861   922   4   5 0.22221 0.49577
 438    node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 439    node13 32762  7280  6000  1861   922   4   5 0.22221 0.49577
 440    node14 32762  7280  6000  1861   922   4   5 0.22221 0.49577
 441  * node15 32762  7280 12000  1861  1131   4   3 0.22221 0.60782
 442    node16 32762 31280     0  1861  1860   0   0 0.95476 1.00000
 443    node17 32762  7280  6000  1861  1106   5   3 0.22221 0.59479
 444  * node18 32762  1280  6000  1396   561   5   3 0.03907 0.40239
 445  * node19 32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 446    node20 32762 13280 12000  1861   689   3   9 0.40535 0.37068
 447
 448 Initial score: 0.52329131
 449 Trying to minimize the CV...
 450     1. instance14  node1:node10  => node16:node10 0.42109120 a=f r:node16 f
 451     2. instance54  node4:node15  => node16:node15 0.31904594 a=f r:node16 f
 452     3. instance4   node5:node2   => node2:node16  0.26611015 a=f r:node16
 453     4. instance48  node18:node20 => node2:node18  0.21361717 a=r:node2 f
 454     5. instance93  node19:node18 => node16:node19 0.16166425 a=r:node16 f
 455     6. instance89  node3:node20  => node2:node3   0.11005629 a=r:node2 f
 456     7. instance5   node6:node2   => node16:node6  0.05841589 a=r:node16 f
 457     8. instance94  node7:node20  => node20:node16 0.00658759 a=f r:node16
 458     9. instance44  node20:node2  => node2:node15  0.00438740 a=f r:node15
 459    10. instance62  node14:node18 => node14:node16 0.00390087 a=r:node16
 460    11. instance13  node11:node14 => node11:node16 0.00361787 a=r:node16
 461    12. instance19  node10:node11 => node10:node7  0.00336636 a=r:node7
 462    13. instance43  node12:node13 => node12:node1  0.00305681 a=r:node1
 463    14. instance1   node1:node2   => node1:node4   0.00263124 a=r:node4
 464    15. instance58  node19:node20 => node19:node17 0.00252594 a=r:node17
 465 Cluster score improved from 0.52329131 to 0.00252594
 466
 467 Commands to run to reach the above solution:
 468   echo step 1
 469   echo gnt-instance migrate instance14
 470   echo gnt-instance replace-disks -n node16 instance14
 471   echo gnt-instance migrate instance14
 472   echo step 2
 473   echo gnt-instance migrate instance54
 474   echo gnt-instance replace-disks -n node16 instance54
 475   echo gnt-instance migrate instance54
 476   echo step 3
 477   echo gnt-instance migrate instance4
 478   echo gnt-instance replace-disks -n node16 instance4
 479   echo step 4
 480   echo gnt-instance replace-disks -n node2 instance48
 481   echo gnt-instance migrate instance48
 482   echo step 5
 483   echo gnt-instance replace-disks -n node16 instance93
 484   echo gnt-instance migrate instance93
 485   echo step 6
 486   echo gnt-instance replace-disks -n node2 instance89
 487   echo gnt-instance migrate instance89
 488   echo step 7
 489   echo gnt-instance replace-disks -n node16 instance5
 490   echo gnt-instance migrate instance5
 491   echo step 8
 492   echo gnt-instance migrate instance94
 493   echo gnt-instance replace-disks -n node16 instance94
 494   echo step 9
 495   echo gnt-instance migrate instance44
 496   echo gnt-instance replace-disks -n node15 instance44
 497   echo step 10
 498   echo gnt-instance replace-disks -n node16 instance62
 499   echo step 11
 500   echo gnt-instance replace-disks -n node16 instance13
 501   echo step 12
 502   echo gnt-instance replace-disks -n node7 instance19
 503   echo step 13
 504   echo gnt-instance replace-disks -n node1 instance43
 505   echo step 14
 506   echo gnt-instance replace-disks -n node4 instance1
 507   echo step 15
 508   echo gnt-instance replace-disks -n node17 instance58
 509
 510 Final cluster status:
 511 N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
 512    node1  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 513    node2  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 514    node3  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 515    node4  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 516    node5  32762  7280  6000  1861  1078   4   5 0.22221 0.57947
 517    node6  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 518    node7  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 519    node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 520    node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 521    node10 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 522    node11 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
 523    node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 524    node13 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
 525    node14 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
 526    node15 32762  7280  6000  1861  1031   4   4 0.22221 0.55408
 527    node16 32762  7280  6000  1861  1060   4   4 0.22221 0.57007
 528    node17 32762  7280  6000  1861  1006   5   4 0.22221 0.54105
 529    node18 32762  7280  6000  1396   761   4   2 0.22221 0.54570
 530    node19 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 531    node20 32762 13280  6000  1861  1089   3   5 0.40535 0.58565
 532
 533 .fi
 534 .in
 535
 536 Here we see, besides the step list, the initial and final cluster
 537 status, with the final one showing all nodes being N+1 compliant, and
 538 the command list to reach the final solution. In the initial listing,
 539 we see which nodes are not N+1 compliant.
 540
 541 The algorithm is stable as long as each step above is fully completed,
 542 e.g. in step 8, both the migrate and the replace-disks are
 543 done. Otherwise, if only the migrate is done, the input data is
 544 changed in a way that the program will output a different solution
 545 list (but hopefully will end in the same state).
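The command list in the example follows a simple pattern: each 'f'
action becomes a gnt-instance migrate, and each 'r:node' action becomes a
gnt-instance replace-disks -n node. A Python sketch of that mapping,
inferred only from the example output (the function name commands_for is
hypothetical), is:

```python
def commands_for(instance, actions):
    """Translate one step's actions into gnt-instance command lines."""
    commands = []
    for action in actions:
        if action == "f":
            commands.append(f"gnt-instance migrate {instance}")
        elif action.startswith("r:"):
            new_secondary = action[2:]
            commands.append(
                f"gnt-instance replace-disks -n {new_secondary} {instance}"
            )
        else:
            raise ValueError(f"unknown action: {action!r}")
    return commands
```

All commands returned for one step should be run to completion before
moving on to the next step, for the stability reason given above.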
 546
 547 .SH SEE ALSO
 548 .BR hn1 "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
 549 .BR gnt-node "(8)"
 550
 551 .SH "COPYRIGHT"
 552 .PP
 553 Copyright (C) 2009 Google Inc. Permission is granted to copy,
 554 distribute and/or modify under the terms of the GNU General Public
 555 License as published by the Free Software Foundation; either version 2
 556 of the License, or (at your option) any later version.
 557 .PP
 558 On Debian systems, the complete text of the GNU General Public License
 559 can be found in /usr/share/common-licenses/GPL.