X-Git-Url: https://code.grnet.gr/git/ganeti-local/blobdiff_plain/04be800ae437719a9f0067829ab30c09bdbe99cd..2c9b2122732565c8972e0bdc560378371fdc9bec:/hbal.1
diff --git a/hbal.1 b/hbal.1
index 9955073..c7a8e79 100644
--- a/hbal.1
+++ b/hbal.1
@@ -1,4 +1,4 @@
-.TH HBAL 1 2009-03-14 htools "Ganeti H-tools"
+.TH HBAL 1 2009-03-23 htools "Ganeti H-tools"
 .SH NAME
 hbal \- Cluster balancer for Ganeti
 
@@ -7,10 +7,13 @@ hbal \- Cluster balancer for Ganeti
 .B "[-C]"
 .B "[-p]"
 .B "[-o]"
-.B "-l"
-.BI "[ -m " cluster "]"
+.B "[-v... | -q]"
+.BI "[-l" limit "]"
+.BI "[-O" name... "]"
+.BI "[-e" score "]"
+.BI "[-m " cluster "]"
 .BI "[-n " nodes-file " ]"
-.BI "[ -i " instances-file "]"
+.BI "[-i " instances-file "]"
 
 .B hbal
 .B --version
@@ -40,11 +43,23 @@ The possible move type for an instance are combinations of
 failover/migrate and replace-disks such that we change one of the
 instance nodes, and the other one remains (but possibly with changed
 role, e.g. from primary it becomes secondary). The list is:
-  - failover (f)
-  - replace secondary (r)
-  - replace primary, a composite move (f, r, f)
-  - failover and replace secondary, also composite (f, r)
-  - replace secondary and failover, also composite (r, f)
+.RS 4
+.TP 3
+\(em
+failover (f)
+.TP
+\(em
+replace secondary (r)
+.TP
+\(em
+replace primary, a composite move (f, r, f)
+.TP
+\(em
+failover and replace secondary, also composite (f, r)
+.TP
+\(em
+replace secondary and failover, also composite (r, f)
+.RE
 
 We don't do the only remaining possibility of replacing both nodes
 (r,f,r,f or the equivalent f,r,f,r) since these move needs an
@@ -57,10 +72,24 @@ give better scores but will result in more disk replacements.
 As said before, the algorithm tries to minimise the cluster score at
 each step.
Currently this score is computed as a sum of the following
 components:
-  - coefficient of variance of the percent of free memory
-  - coefficient of variance of the percent of reserved memory
-  - coefficient of variance of the percent of free disk
-  - percentage of nodes failing N+1 check
+.RS 4
+.TP 3
+\(em
+coefficient of variance of the percent of free memory
+.TP
+\(em
+coefficient of variance of the percent of reserved memory
+.TP
+\(em
+coefficient of variance of the percent of free disk
+.TP
+\(em
+percentage of nodes failing N+1 check
+.TP
+\(em
+percentage of instances living (either as primary or secondary) on
+offline nodes
+.RE
 
 The free memory and free disk values help ensure that all nodes are
 somewhat balanced in their resource usage. The reserved memory helps
@@ -69,11 +98,12 @@ instances, and that no node keeps too much memory reserved for N+1.
 And finally, the N+1 percentage helps guide the algorithm towards
 eliminating N+1 failures, if possible.
 
-Except for the N+1 failures, we use the coefficient of variance since
-this brings the values into the same unit so to speak, and with a
-restrict domain of values (between zero and one). The percentage of
-N+1 failures, while also in this numeric range, doesn't actually has
-the same meaning, but it has shown to work well.
+Except for the N+1 failures and offline instances percentage, we use
+the coefficient of variance since this brings the values into the same
+unit so to speak, and with a restrict domain of values (between zero
+and one). The percentage of N+1 failures, while also in this numeric
+range, doesn't actually has the same meaning, but it has shown to work
+well.
 
 The other alternative, using for N+1 checks the coefficient of
 variance of (N+1 fail=1, N+1 pass=0) across nodes could hint the
@@ -82,17 +112,39 @@ already. Since this (making N+1 failures) is not allowed by other
 rules of the algorithm, so the N+1 checks would simply not work
 anymore in this case.
 
+The offline instances percentage (meaning the percentage of instances
+living on offline nodes) will cause the algorithm to actively move
+instances away from offline nodes. This, coupled with the restriction
+on placement given by offline nodes, will cause evacuation of such
+nodes.
+
 On a perfectly balanced cluster (all nodes the same size, all
 instances the same size and spread across the nodes equally), all
 values would be zero. This doesn't happen too often in practice :)
 
+.SS OFFLINE INSTANCES
+
+Since current Ganeti versions do not report the memory used by offline
+(down) instances, ignoring the run status of instances will cause
+wrong calculations. For this reason, the algorithm subtracts the
+memory size of down instances from the free node memory of their
+primary node, in effect simulating the startup of such instances.
+
 .SS OTHER POSSIBLE METRICS
 
 It would be desirable to add more metrics to the algorithm, especially
 dynamically-computed metrics, such as:
-  - CPU usage of instances, combined with VCPU versus PCPU count
-  - Disk IO usage
-  - Network IO
+.RS 4
+.TP 3
+\(em
+CPU usage of instances, combined with VCPU versus PCPU count
+.TP
+\(em
+Disk IO usage
+.TP
+\(em
+Network IO
+.RE
 
 .SH OPTIONS
 The options that can be passed to the program are as follows:
@@ -106,21 +158,54 @@ Prints the before and after node status, in a format designed to allow
 the user to understand the node's most important parameters.
 
 The node list will contain these informations:
-  - a character denoting the status of the node, with '-' meaning an
-    offline node, '*' meaning N+1 failure and blank meaning a good
-    node
-  - the node name
-  - the total node memory
-  - the memory used by the node itself
-  - the free node memory
-  - the reserved node memory, which is the amount of free memory
-    needed for N+1 compliance
-  - total disk
-  - free disk
-  - number of primary instances
-  - number of secondary instances
-  - percent of free memory
-  - percent of free disk
+.RS
+.TP
+.B F
+a character denoting the status of the node, with '-' meaning an
+offline node, '*' meaning N+1 failure and blank meaning a good node
+.TP
+.B Name
+the node name
+.TP
+.B t_mem
+the total node memory
+.TP
+.B n_mem
+the memory used by the node itself
+.TP
+.B i_mem
+the memory used by instances
+.TP
+.B x_mem
+amount memory which seems to be in use but cannot be determined why or
+by which instance; usually this means that the hypervisor has some
+overhead or that there are other reporting errors
+.TP
+.B f_mem
+the free node memory
+.TP
+.B r_mem
+the reserved node memory, which is the amount of free memory needed
+for N+1 compliance
+.TP
+.B t_dsk
+total disk
+.TP
+.B f_dsk
+free disk
+.TP
+.B pri
+number of primary instances
+.TP
+.B sec
+number of secondary instances
+.TP
+.B p_fmem
+percent of free memory
+.TP
+.B p_fdsk
+percent of free disk
+.RE
 
 .TP
 .B -o, --oneline
@@ -129,31 +214,83 @@ when one wants to look at multiple clusters at once and check their
 status.
 
 The line will contain four fields:
-  - initial cluster score
-  - number of steps in the solution
-  - final cluster score
-  - improvement in the cluster score
+.RS
+.RS 4
+.TP 3
+\(em
+initial cluster score
+.TP
+\(em
+number of steps in the solution
+.TP
+\(em
+final cluster score
+.TP
+\(em
+improvement in the cluster score
+.RE
+.RE
+
+.TP
+.BI "-O " name
+This option (which can be given multiple times) will mark nodes as
+being \fIoffline\fR.
This means a couple of things:
+.RS
+.RS 4
+.TP 3
+\(em
+instances won't be placed on these nodes, not even temporarily;
+e.g. the \fIreplace primary\fR move is not available if the secondary
+node is offline, since this move requires a failover.
+.TP
+\(em
+these nodes will not be included in the score calculation (except for
+the percentage of instances on offline nodes)
+.RE
+Note that hbal will also mark as offline any nodes which are reported
+by RAPI as such, or that have "?" in file-based input in any numeric
+fields.
+.RE
+
+.TP
+.BI "-e" score ", --min-score=" score
+This parameter denotes the minimum score we are happy with and alters
+the computation in two ways:
+.RS
+.RS 4
+.TP 3
+\(em
+if the cluster has the initial score lower than this value, then we
+don't enter the algorithm at all, and exit with success
+.TP
+\(em
+during the iterative process, if we reach a score lower than this
+value, we exit the algorithm
+.RE
+The default value of the parameter is currently \fI1e-9\fR (chosen
+empirically).
+.RE
 
 .TP
 .BI "-n" nodefile ", --nodes=" nodefile
 The name of the file holding node information (if not collecting via
-RAPI), instead of the default
-.I nodes
-file.
+RAPI), instead of the default \fInodes\fR file (but see below how to
+customize the default value via the environment).
 
 .TP
 .BI "-i" instancefile ", --instances=" instancefile
 The name of the file holding instance information (if not collecting
-via RAPI), instead of the default
-.I instances
-file.
+via RAPI), instead of the default \fIinstances\fR file (but see below
+how to customize the default value via the environment).
 
 .TP
 .BI "-m" cluster
 Collect data not from files but directly from the
 .I cluster
-given as an argument via RAPI. This work for both Ganeti 1.2 and
-Ganeti 2.0.
+given as an argument via RAPI.
If the argument doesn't contain a colon
+(:), then it is converted into a fully-built URL via prepending
+https:// and appending the default RAPI port, otherwise it's
+considered a fully-specified URL and is used unchanged.
 
 .TP
 .BI "-l" N ", --max-length=" N
@@ -164,7 +301,13 @@ automate the execution of the balancing.
 .B -v, --verbose
 Increase the output verbosity. Each usage of this option will increase
 the verbosity (currently more than 2 doesn't make sense) from the
-default of zero.
+default of one.
+
+.TP
+.B -q, --quiet
+Decrease the output verbosity. Each usage of this option will decrease
+the verbosity (less than zero doesn't make sense) from the default of
+one.
 
 .TP
 .B -V, --version
@@ -175,6 +318,13 @@ Just show the program version and exit.
 The exist status of the command will be zero, unless for some reason
 the algorithm fatally failed (e.g. wrong node or instance data).
 
+.SH ENVIRONMENT
+
+If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
+present in the environment, they will override the default names for
+the nodes and instances files. These will have of course no effect
+when RAPI is used.
+
 .SH BUGS
 
 The program does not check its input data for consistency, and aborts
@@ -182,12 +332,18 @@ with cryptic errors messages in this case.
 
 The algorithm is not perfect.
 
+The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
+on input data which has such instances.
+
 The output format is not easily scriptable, and the program should
 feed moves directly into Ganeti (either via RAPI or via a gnt-debug
 input file).
 
 .SH EXAMPLE
 
+Note that this example are not for the latest version (they don't have
+full node data).
+
 .SS Default output
 
 With the default options, the program shows each individual step and
@@ -362,4 +518,5 @@ changed in a way that the program will output a different solution
 list (but hopefully will end in the same state).
 
 .SH SEE ALSO
-hn1(1), ganeti(7), gnt-instance(8), gnt-node(8)
+.BR hn1 "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
+.BR gnt-node "(8)"
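
The cluster score described in the patched text (three coefficients of variance plus the N+1 and offline-instance percentages) can be illustrated with a short sketch. This is not hbal's actual code, only a minimal Python illustration under stated assumptions: the `Node` class, `coeff_of_variance`, `cluster_score`, and the representation of instances as (primary, secondary) name pairs are all hypothetical. Percentages are expressed as fractions between 0 and 1. The man page does not say which standard deviation is used, so a population standard deviation is assumed. Offline nodes are left out of every component except the offline-instances one, as the `-O` description states.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Node:
    """Hypothetical node record; field names mirror the node-list columns."""
    name: str
    t_mem: float   # total memory
    f_mem: float   # free memory
    r_mem: float   # memory reserved for N+1
    t_dsk: float   # total disk
    f_dsk: float   # free disk
    fails_n1: bool = False
    offline: bool = False


def coeff_of_variance(values):
    """Population standard deviation over the mean; 0.0 if undefined."""
    if not values:
        return 0.0
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


def cluster_score(nodes, instances):
    """Sum the five components; instances are (primary, secondary) name pairs."""
    # Assumption: offline nodes only contribute to the last component.
    online = [n for n in nodes if not n.offline]
    offline_names = {n.name for n in nodes if n.offline}

    cv_free_mem = coeff_of_variance([n.f_mem / n.t_mem for n in online])
    cv_res_mem = coeff_of_variance([n.r_mem / n.t_mem for n in online])
    cv_free_dsk = coeff_of_variance([n.f_dsk / n.t_dsk for n in online])

    # Fraction of online nodes failing the N+1 check.
    n1_failing = sum(n.fails_n1 for n in online) / len(online) if online else 0.0

    # Fraction of instances with a primary or secondary on an offline node.
    on_offline = sum(1 for pri, sec in instances
                     if pri in offline_names or sec in offline_names)
    offline_inst = on_offline / len(instances) if instances else 0.0

    return cv_free_mem + cv_res_mem + cv_free_dsk + n1_failing + offline_inst
```

In this sketch, a cluster of identical, equally loaded nodes scores zero, matching the "perfectly balanced cluster" remark in the page. A score below the `-e` threshold (1e-9 by default, per the patched text) would mean no balancing is attempted.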