Update NEWS file for the 0.2.4 release

diff --git a/hbal.1 b/hbal.1

index e3e4162..10e4da1 100644
--- a/hbal.1
+++ b/hbal.1
@@ -1,16 +1,43 @@
-.TH HBAL 2 2009-03-13 htools "Ganeti H-tools"
+.TH HBAL 1 2009-03-23 htools "Ganeti H-tools"
  .SH NAME
  hbal \- Cluster balancer for Ganeti
  
  .SH SYNOPSIS
  .B hbal
-.B "[-C]"
-.B "[-p]"
-.B "[-o]"
-.B "-l"
-.BI "[ -m " cluster "]"
-.BI "[-n " nodes-file " ]"
-.BI "[ -i " instances-file "]"
+.B "[backend options...]"
+.B "[algorithm options...]"
+.B "[reporting options...]"
+
+.B hbal
+.B --version
+
+.TP
+Backend options:
+.BI "[ -m " cluster " ]"
+|
+.BI "[ -L[" path "] [-X]]"
+|
+.BI "[ -t " data-file " ]"
+
+.TP
+Algorithm options:
+.BI "[ --max-cpu " cpu-ratio " ]"
+.BI "[ --min-disk " disk-ratio " ]"
+.BI "[ -l " limit " ]"
+.BI "[ -e " score " ]"
+.BI "[ -O " name... " ]"
+.B "[ --no-disk-moves ]"
+.BI "[ -U " util-file " ]"
+.B "[ --evac-mode ]"
+
+.TP
+Reporting options:
+.BI "[ -C[" file "] ]"
+.BI "[ -p[" fields "] ]"
+.B "[ --print-instances ]"
+.B "[ -o ]"
+.B "[ -v... | -q ]"
+
  
  .SH DESCRIPTION
  hbal is a cluster balancer that looks at the current state of the
@@ -18,11 +45,11 @@ cluster (nodes with their total and free disk, memory, etc.) and
  instance placement and computes a series of steps designed to bring
  the cluster into a better state.
  
-The algorithm to do so is designed to be stable (i.e. it will give you
-the same results when restarting it from the middle of the solution)
-and reasonably fast. It is not, however, designed to be a perfect
-algorithm - it is possible to make it go into a corner from which it
-can find no improvement, because it only look one "step" ahead.
+The algorithm used is designed to be stable (i.e. it will give you the
+same results when restarting it from the middle of the solution) and
+reasonably fast. It is not, however, designed to be a perfect
+algorithm \(em it is possible to make it go into a corner from which
+it can find no improvement, because it looks only one "step" ahead.
  
  By default, the program will show the solution incrementally as it is
  computed, in a somewhat cryptic format; for getting the actual Ganeti
@@ -30,18 +57,30 @@ command list, use the \fB-C\fR option.
  
  .SS ALGORITHM
  
-The program works in indepentent steps; at each step, we compute the
+The program works in independent steps; at each step, we compute the
  best instance move that lowers the cluster score.
  
  The possible move types for an instance are combinations of
  failover/migrate and replace-disks such that we change one of the
  instance nodes, and the other one remains (but possibly with changed
  role, e.g. from primary it becomes secondary). The list is:
-  - failover (f)
-  - replace secondary (r)
-  - replace primary, a composite move (f, r, f)
-  - failover and replace secondary, also composite (f, r)
-  - replace secondary and failover, also composite (r, f)
+.RS 4
+.TP 3
+\(em
+failover (f)
+.TP
+\(em
+replace secondary (r)
+.TP
+\(em
+replace primary, a composite move (f, r, f)
+.TP
+\(em
+failover and replace secondary, also composite (f, r)
+.TP
+\(em
+replace secondary and failover, also composite (r, f)
+.RE
  
  We don't do the only remaining possibility of replacing both nodes
  (r,f,r,f or the equivalent f,r,f,r) since this move needs an
@@ -49,15 +88,66 @@ exhaustive search over both candidate primary and secondary nodes, and
  is O(n*n) in the number of nodes. Furthermore, it doesn't seem to
  give better scores but will result in more disk replacements.
  
+.SS PLACEMENT RESTRICTIONS
+
+At each step, we prevent an instance move if it would cause:
+
+.RS 4
+.TP 3
+\(em
+a node to go into N+1 failure state
+.TP
+\(em
+an instance to move onto an offline node (offline nodes are either
+read from the cluster or declared with \fI-O\fR)
+.TP
+\(em
+an exclusion-tag based conflict (exclusion tags are read from the
+cluster and/or defined via the \fI--exclusion-tags\fR option)
+.TP
+\(em
+a max vcpu/pcpu ratio to be exceeded (configured via \fI--max-cpu\fR)
+.TP
+\(em
+min disk free percentage to go below the configured limit (configured
+via \fI--min-disk\fR)
+.RE
  .SS CLUSTER SCORING
  
-As said before, the algorithm tries to minimize the cluster score at
+As said before, the algorithm tries to minimise the cluster score at
  each step. Currently this score is computed as a sum of the following
  components:
-  - coefficient of variance of the percent of free memory
-  - coefficient of variance of the percent of reserved memory
-  - coefficient of variance of the percent of free disk
-  - percentage of nodes failing N+1 check
+.RS 4
+.TP 3
+\(em
+standard deviation of the percent of free memory
+.TP
+\(em
+standard deviation of the percent of reserved memory
+.TP
+\(em
+standard deviation of the percent of free disk
+.TP
+\(em
+count of nodes failing N+1 check
+.TP
+\(em
+count of instances living (either as primary or secondary) on
+offline nodes
+.TP
+\(em
+count of instances living (as primary) on offline nodes; this differs
+from the above metric by helping failover of such instances in 2-node
+clusters
+.TP
+\(em
+standard deviation of the ratio of virtual-to-physical cpus (for
+primary instances of the node)
+.TP
+\(em
+standard deviation of the dynamic load on the nodes, for cpus,
+memory, disk and network
+.RE
  
  The free memory and free disk values help ensure that all nodes are
  somewhat balanced in their resource usage. The reserved memory helps
@@ -66,30 +156,75 @@ instances, and that no node keeps too much memory reserved for
  N+1. And finally, the N+1 percentage helps guide the algorithm towards
  eliminating N+1 failures, if possible.
  
-Except for the N+1 failures, we use the coefficient of variance since
-this brings the values into the same unit so to speak, and with a
-restrict domain of values (between zero and one). The percentange of
-N+1 failures, while also in this numeric range, doesn't actually has
-the same meaning, but it has shown to work well.
-
-The other alternative, using for N+1 checks the coefficient of
-variance of (N+1 fail=1, N+1 pass=0) across nodes could hint the
-algorithm to make more N+1 failures if most nodes are N+1 fail
-already. Since this (making N+1 failures) is not allowed by other
-rules of the algorithm, so the N+1 checks would simply not work
-anymore in this case.
+Except for the N+1 failures and offline instances counts, we use the
+standard deviation since when used with values within a fixed range
+(we use percents expressed as values between zero and one) it gives
+consistent results across all metrics (there are some small issues
+related to different means, but it works generally well). The 'count'
+type values will have higher score and thus will matter more for
+balancing; thus these are better for hard constraints (like evacuating
+nodes and fixing N+1 failures). For example, the offline instances
+count (i.e. the number of instances living on offline nodes) will
+cause the algorithm to actively move instances away from offline
+nodes. This, coupled with the restriction on placement given by
+offline nodes, will cause evacuation of such nodes.
+
+The dynamic load values need to be read from an external file (Ganeti
+doesn't supply them), and are computed for each node as: sum of
+primary instance cpu load, sum of primary instance memory load, sum of
+primary and secondary instance disk load (as DRBD generates write load
+on secondary nodes too in normal case and in degraded scenarios also
+read load), and sum of primary instance network load. An example of
+how to generate these values for input to hbal would be to track "xm
+list" for instances over a day, compute the delta of the cpu values,
+and feed that via the \fI-U\fR option for all instances (while
+keeping the other metrics as one). For the algorithm to work, all that is
+needed is that the values are consistent for a metric across all
+instances (e.g. all instances use cpu% to report cpu usage, and not
+something related to number of CPU seconds used if the CPUs are
+different), and that they are normalised to between zero and one. Note
+that it's recommended to not have zero as the load value for any
+instance metric since then secondary instances are not well balanced.
  
  On a perfectly balanced cluster (all nodes the same size, all
-instances the same size and spread across the nodes equally), all
-values would be zero. This doesn't happen too often in practice :)
+instances the same size and spread across the nodes equally), the
+values for all metrics would be zero. This doesn't happen too often in
+practice :)
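
A minimal Python sketch of the scoring idea described above. It is an illustration only and not hbal's implementation: it assumes a population standard deviation (the text does not say which variant is used) and covers just three of the listed components. The dictionary keys reuse the field names from the node listing; `n1_fail` is a hypothetical flag.

```python
from statistics import pstdev


def cluster_score(nodes):
    """Sum of a few score components for a list of node dicts.

    Percent values are expressed as fractions between zero and one,
    so the standard deviations stay comparable across metrics.
    """
    free_mem = [n["f_mem"] / n["t_mem"] for n in nodes]
    free_dsk = [n["f_dsk"] / n["t_dsk"] for n in nodes]
    # 'count' type component: weighs more than the deviations
    n1_failures = sum(1 for n in nodes if n["n1_fail"])
    return pstdev(free_mem) + pstdev(free_dsk) + n1_failures


balanced = [
    {"t_mem": 4096, "f_mem": 2048, "t_dsk": 100, "f_dsk": 50, "n1_fail": False},
    {"t_mem": 4096, "f_mem": 2048, "t_dsk": 100, "f_dsk": 50, "n1_fail": False},
]
unbalanced = [
    {"t_mem": 4096, "f_mem": 1024, "t_dsk": 100, "f_dsk": 50, "n1_fail": False},
    {"t_mem": 4096, "f_mem": 3072, "t_dsk": 100, "f_dsk": 50, "n1_fail": True},
]
print(cluster_score(balanced))
print(cluster_score(unbalanced))
```

In this sketch a single N+1 failure adds a full point to the score, which is why the count components dominate the deviation components.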
+
+.SS OFFLINE INSTANCES
+
+Since current Ganeti versions do not report the memory used by offline
+(down) instances, ignoring the run status of instances will cause
+wrong calculations. For this reason, the algorithm subtracts the
+memory size of down instances from the free node memory of their
+primary node, in effect simulating the startup of such instances.
+
+.SS EXCLUSION TAGS
+
+The exclusion tags mechanism is designed to prevent instances which
+run the same workload (e.g. two DNS servers) from landing on the same node,
+which would make the respective node a SPOF for the given service.
+
+It works by tagging instances with certain tags and then building
+exclusion maps based on these. Which tags are actually used is
+configured either via the command line (option \fI--exclusion-tags\fR)
+or via adding them to the cluster tags:
  
-.SS OTHER POSSIBLE METRICS
+.TP
+.B --exclusion-tags=a,b
+This will make all instance tags of the form \fIa:*\fR, \fIb:*\fR be
+considered for the exclusion map
+
+.TP
+cluster tags \fBhtools:iextags:a\fR, \fBhtools:iextags:b\fR
+This will make instance tags \fIa:*\fR, \fIb:*\fR be considered for
+the exclusion map. More precisely, the suffix of cluster tags starting
+with \fBhtools:iextags:\fR will become the prefix of the exclusion
+tags.
  
-It would be desirable to add more metrics to the algorithm, especially
-dynamically-computed metrics, such as:
-  - CPU usage of instances, combined with VCPU versus PCPU count
-  - Disk IO usage
-  - Network IO
+.P
+Both the above forms mean that two instances both having (e.g.) the
+tag \fIa:foo\fR or \fIb:bar\fR won't end up on the same node.
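
The exclusion-tag rules above can be sketched in Python as follows. This is a hedged illustration, not hbal's code: the function names are hypothetical, and only the `htools:iextags:` prefix and the `a:*` tag form come from the text.

```python
IEXTAGS_PREFIX = "htools:iextags:"


def prefixes_from_cluster_tags(cluster_tags):
    """The suffix of each htools:iextags: cluster tag becomes a prefix."""
    return [t[len(IEXTAGS_PREFIX):] for t in cluster_tags
            if t.startswith(IEXTAGS_PREFIX)]


def exclusion_tags(instance_tags, prefixes):
    """Keep only the instance tags of the form 'prefix:*'."""
    return {t for t in instance_tags
            if any(t.startswith(p + ":") for p in prefixes)}


def conflict(tags_a, tags_b, prefixes):
    """True if two instances share an exclusion tag (same node not allowed)."""
    return bool(exclusion_tags(tags_a, prefixes)
                & exclusion_tags(tags_b, prefixes))


prefixes = prefixes_from_cluster_tags(["htools:iextags:a", "htools:iextags:b", "other"])
print(prefixes)
print(conflict(["a:foo", "x:y"], ["a:foo"], prefixes))
print(conflict(["a:foo"], ["b:bar"], prefixes))
```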
  
  .SH OPTIONS
  The options that can be passed to the program are as follows:
@@ -97,58 +232,235 @@ The options that can be passed to the program are as follows:
  .B -C, --print-commands
  Print the command list at the end of the run. Without this, the
  program will only show a shorter, but cryptic output.
+
+Note that the moves list will be split into independent steps, called
+"jobsets", but only for visual inspection, not for actual
+parallelisation. It is not possible to parallelise these directly when
+executed via "gnt-instance" commands, since a compound command
+(e.g. failover and replace\-disks) must be executed serially. Parallel
+execution is only possible when using the Luxi backend and the
+\fI-L\fR option.
+
+The moves are split into jobsets by accumulating moves until the next
+move touches nodes already touched by the current moves; such a move
+cannot be executed in parallel with them (due to resource allocation
+in Ganeti), and thus a new jobset is started.
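
The jobset splitting rule can be sketched in Python as below. This is only an illustration of the accumulation rule as described, not hbal's implementation; the representation of a move as a (name, nodes) pair and the node names in the sample are hypothetical.

```python
def split_into_jobsets(moves):
    """Group moves into jobsets.

    Each move is a (name, nodes) pair. A new jobset is started as soon
    as a move touches a node already touched by the current jobset.
    """
    jobsets = []
    current = []
    touched = set()
    for name, nodes in moves:
        nodes = set(nodes)
        if current and touched & nodes:
            # Overlap with the current jobset: close it, start a new one
            jobsets.append(current)
            current, touched = [], set()
        current.append(name)
        touched |= nodes
    if current:
        jobsets.append(current)
    return jobsets


moves = [
    ("instance14", ["node1", "node16"]),
    ("instance54", ["node2", "node3"]),
    ("instance4", ["node16", "node5"]),
    ("instance48", ["node2", "node6"]),
]
print(split_into_jobsets(moves))
```

Here the third move touches node16, already used by the first move, so it opens the second jobset.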
+
  .TP
  .B -p, --print-nodes
  Prints the before and after node status, in a format designed to allow
  the user to understand the node's most important parameters.
  
-The node list will contain these informations:
-  - a character denoting the N+1 status of the node, with blank
-    meaning pass and an asterisk ('*') meaning fail
-  - the node name
-  - the total node memory
-  - the free node memory
-  - the reserved node memory, which is the amount of free memory
-    needed for N+1 compliancy
-  - total disk
-  - free disk
-  - number of primary instances
-  - number of secondary instances
-  - percent of free memory
-  - percent of free disk
+It is possible to customise the listed information by passing a
+comma\(hyseparated list of field names to this option (the field list is
+currently undocumented). By default, the node list will contain the
+following information:
+.RS
+.TP
+.B F
+a character denoting the status of the node, with '\-' meaning an
+offline node, '*' meaning N+1 failure and blank meaning a good node
+.TP
+.B Name
+the node name
+.TP
+.B t_mem
+the total node memory
+.TP
+.B n_mem
+the memory used by the node itself
+.TP
+.B i_mem
+the memory used by instances
+.TP
+.B x_mem
+amount of memory which seems to be in use but cannot be attributed to
+the node or to any instance; usually this means that the hypervisor has
+some overhead or that there are other reporting errors
+.TP
+.B f_mem
+the free node memory
+.TP
+.B r_mem
+the reserved node memory, which is the amount of free memory needed
+for N+1 compliance
+.TP
+.B t_dsk
+total disk
+.TP
+.B f_dsk
+free disk
+.TP
+.B pcpu
+the number of physical cpus on the node
+.TP
+.B vcpu
+the number of virtual cpus allocated to primary instances
+.TP
+.B pri
+number of primary instances
+.TP
+.B sec
+number of secondary instances
+.TP
+.B p_fmem
+percent of free memory
+.TP
+.B p_fdsk
+percent of free disk
+.TP
+.B r_cpu
+ratio of virtual to physical cpus
+.TP
+.B lCpu
+the dynamic CPU load (if the information is available)
+.TP
+.B lMem
+the dynamic memory load (if the information is available)
+.TP
+.B lDsk
+the dynamic disk load (if the information is available)
+.TP
+.B lNet
+the dynamic net load (if the information is available)
+.RE
+
+.TP
+.B --print-instances
+Prints the before and after instance map. This is less useful than the
+node status, but it can help in understanding instance moves.
  
  .TP
  .B -o, --oneline
-Only shows a one-line output from the program, designed for the case
+Only shows a one\(hyline output from the program, designed for the case
  when one wants to look at multiple clusters at once and check their
  status.
  
  The line will contain four fields:
-  - initial cluster score
-  - number of steps in the solution
-  - final cluster score
-  - improvement in the cluster score
+.RS
+.RS 4
+.TP 3
+\(em
+initial cluster score
+.TP
+\(em
+number of steps in the solution
+.TP
+\(em
+final cluster score
+.TP
+\(em
+improvement in the cluster score
+.RE
+.RE
+
+.TP
+.BI "-O " name
+This option (which can be given multiple times) will mark nodes as
+being \fIoffline\fR. This means a couple of things:
+.RS
+.RS 4
+.TP 3
+\(em
+instances won't be placed on these nodes, not even temporarily;
+e.g. the \fIreplace primary\fR move is not available if the secondary
+node is offline, since this move requires a failover.
+.TP
+\(em
+these nodes will not be included in the score calculation (except for
+the percentage of instances on offline nodes)
+.RE
+Note that hbal will also mark as offline any nodes which are reported
+by RAPI as such, or that have "?" in file\(hybased input in any numeric
+fields.
+.RE
+
+.TP
+.BI "-e" score ", --min-score=" score
+This parameter denotes the minimum score we are happy with and alters
+the computation in two ways:
+.RS
+.RS 4
+.TP 3
+\(em
+if the cluster has the initial score lower than this value, then we
+don't enter the algorithm at all, and exit with success
+.TP
+\(em
+during the iterative process, if we reach a score lower than this
+value, we exit the algorithm
+.RE
+The default value of the parameter is currently \fI1e-9\fR (chosen
+empirically).
+.RE
  
  .TP
-.BI "-n" nodefile ", --nodes=" nodefile
-The name of the file holding node information (if not collecting via
-RAPI), instead of the default
-.I nodes
-file.
+.BI "--no-disk-moves"
+This parameter prevents hbal from using disk move (i.e. "gnt\-instance
+replace\-disks") operations. This will result in a much quicker
+balancing, but of course the improvements are limited. It is up to the
+user to decide when to use one or another.
  
  .TP
-.BI "-i" instancefile ", --instances=" instancefile
-The name of the file holding instance information (if not collecting
-via RAPI), instead of the default
-.I instances
-file.
+.B "--evac-mode"
+This parameter restricts the list of instances considered for moving
+to the ones living on offline/drained nodes. It can be used as a
+(bulk) replacement for Ganeti's own \fIgnt-node evacuate\fR, with the
+note that it doesn't guarantee full evacuation.
+
+.TP
+.BI "-U" util-file
+This parameter specifies a file holding instance dynamic utilisation
+information that will be used to tweak the balancing algorithm to
+equalise load on the nodes (as opposed to static resource usage). The
+file is in the format "instance_name cpu_util mem_util disk_util
+net_util" where the "_util" parameters are interpreted as numbers and
+the instance name must match exactly the instance as read from
+Ganeti. In case of unknown instance names, the program will abort.
+
+If not given, the default values are one for all metrics and thus
+dynamic utilisation has only one effect on the algorithm: the
+equalisation of the secondary instances across nodes (this is the only
+metric that is not tracked by another, dedicated value, and thus the
+disk load of instances will cause secondary instance
+equalisation). Note that a value of one will also influence slightly the
+primary instance count, but that is already tracked via other metrics
+and thus the influence of the dynamic utilisation will be practically
+insignificant.
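
A small Python sketch of reading the utilisation file format described for the \fI-U\fR option ("instance_name cpu_util mem_util disk_util net_util"). It is not hbal's parser: the function names are hypothetical, and skipping blank lines is an assumption. Unknown instance names raise an error, mirroring the abort mentioned above.

```python
def parse_util_line(line):
    """Parse one 'instance_name cpu mem disk net' line."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    cpu, mem, disk, net = (float(f) for f in fields[1:])
    return fields[0], {"cpu": cpu, "mem": mem, "disk": disk, "net": net}


def load_util(lines, known_instances):
    """Build a name -> utilisation map, rejecting unknown instances."""
    util = {}
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        name, values = parse_util_line(line)
        if name not in known_instances:
            raise KeyError("unknown instance: %s" % name)
        util[name] = values
    return util


sample = ["instance1 0.5 0.2 0.1 0.3", "instance2 1 1 1 1"]
print(load_util(sample, {"instance1", "instance2"}))
```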
+
+.TP
+.BI "-t" datafile ", --text-data=" datafile
+The name of the file holding node and instance information (if not
+collecting via RAPI or LUXI). This or one of the other backends must
+be selected.
  
  .TP
  .BI "-m" cluster
-Collect data not from files but directly from the
+Collect data directly from the
  .I cluster
-given as an argument via RAPI. This work for both Ganeti 1.2 and
-Ganeti 2.0.
+given as an argument via RAPI. If the argument doesn't contain a colon
+(:), then it is converted into a fully\(hybuilt URL via prepending
+https:// and appending the default RAPI port, otherwise it's
+considered a fully\(hyspecified URL and is used as\(hyis.
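
The argument handling described for \fI-m\fR can be sketched in Python as follows. This is an illustration only; the function name is hypothetical, and the port value 5080 is an assumption about the default RAPI port rather than something stated in this page.

```python
DEFAULT_RAPI_PORT = 5080  # assumption: default RAPI port


def rapi_url(cluster):
    """Turn a bare cluster name into a full URL; keep full URLs as-is."""
    if ":" in cluster:
        # Contains a colon: treated as a fully-specified URL
        return cluster
    return "https://%s:%d" % (cluster, DEFAULT_RAPI_PORT)


print(rapi_url("cluster.example.com"))
print(rapi_url("https://cluster.example.com:8443"))
```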
+
+.TP
+.BI "-L[" path "]"
+Collect data directly from the master daemon, which is to be contacted
+via the luxi protocol (an internal Ganeti protocol). An optional
+\fIpath\fR argument is interpreted as the path to the unix socket on
+which the master daemon listens; otherwise, the default path used by
+Ganeti when installed with \fI--localstatedir=/var\fR is used.
+
+.TP
+.B "-X"
+When using the Luxi backend, hbal can also execute the given
+commands. The execution method is to execute the individual jobsets
+(see the \fI-C\fR option for details) in separate stages, aborting if
+at any time a jobset doesn't have all jobs successful. Each step in
+the balancing solution will be translated into exactly one Ganeti job
+(having between one and three OpCodes), and all the steps in a jobset
+will be executed in parallel. The jobsets themselves are executed
+serially.
  
  .TP
  .BI "-l" N ", --max-length=" N
@@ -156,10 +468,31 @@ Restrict the solution to this length. This can be used for example to
  automate the execution of the balancing.
  
  .TP
+.BI "--max-cpu " cpu-ratio
+The maximum virtual\(hyto\(hyphysical cpu ratio, as a floating point
+number. For example, specifying \fIcpu-ratio\fR as \fB2.5\fR means
+that, for a 4\(hycpu machine, a maximum of 10 virtual cpus should be
+allowed to be in use for primary instances.
+
+.TP
+.BI "--min-disk " disk-ratio
+The minimum amount of free disk space remaining, as a floating point
+number between zero and one. For example, specifying
+\fIdisk-ratio\fR as \fB0.25\fR means that at least one quarter of
+disk space should be left free on nodes. A value of one doesn't make
+sense though, as that means no disk space can be used on the nodes.
+
+.TP
  .B -v, --verbose
  Increase the output verbosity. Each usage of this option will increase
  the verbosity (currently more than 2 doesn't make sense) from the
-default of zero.
+default of one.
+
+.TP
+.B -q, --quiet
+Decrease the output verbosity. Each usage of this option will decrease
+the verbosity (less than zero doesn't make sense) from the default of
+one.
  
  .TP
  .B -V, --version
@@ -170,6 +503,13 @@ Just show the program version and exit.
  The exit status of the command will be zero, unless for some reason
  the algorithm fatally failed (e.g. wrong node or instance data).
  
+.SH ENVIRONMENT
+
+If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
+present in the environment, they will override the default names for
+the nodes and instances files. These will of course have no effect
+when the RAPI or Luxi backends are used.
+
  .SH BUGS
  
  The program does not check its input data for consistency, and aborts
@@ -178,11 +518,14 @@ with cryptic errors messages in this case.
  The algorithm is not perfect.
  
  The output format is not easily scriptable, and the program should
-feed moves directly into Ganeti (either via RAPI or via a gnt-debug
+feed moves directly into Ganeti (either via RAPI or via a gnt\-debug
  input file).
  
  .SH EXAMPLE
  
+Note that these examples are not for the latest version (they don't have
+full node data).
+
  .SS Default output
  
  With the default options, the program shows each individual step and
@@ -278,46 +621,46 @@ Cluster score improved from 0.52329131 to 0.00252594
  
  Commands to run to reach the above solution:
    echo step 1
-  echo gnt-instance migrate instance14
-  echo gnt-instance replace-disks -n node16 instance14
-  echo gnt-instance migrate instance14
+  echo gnt\-instance migrate instance14
+  echo gnt\-instance replace\-disks \-n node16 instance14
+  echo gnt\-instance migrate instance14
    echo step 2
-  echo gnt-instance migrate instance54
-  echo gnt-instance replace-disks -n node16 instance54
-  echo gnt-instance migrate instance54
+  echo gnt\-instance migrate instance54
+  echo gnt\-instance replace\-disks \-n node16 instance54
+  echo gnt\-instance migrate instance54
    echo step 3
-  echo gnt-instance migrate instance4
-  echo gnt-instance replace-disks -n node16 instance4
+  echo gnt\-instance migrate instance4
+  echo gnt\-instance replace\-disks \-n node16 instance4
    echo step 4
-  echo gnt-instance replace-disks -n node2 instance48
-  echo gnt-instance migrate instance48
+  echo gnt\-instance replace\-disks \-n node2 instance48
+  echo gnt\-instance migrate instance48
    echo step 5
-  echo gnt-instance replace-disks -n node16 instance93
-  echo gnt-instance migrate instance93
+  echo gnt\-instance replace\-disks \-n node16 instance93
+  echo gnt\-instance migrate instance93
    echo step 6
-  echo gnt-instance replace-disks -n node2 instance89
-  echo gnt-instance migrate instance89
+  echo gnt\-instance replace\-disks \-n node2 instance89
+  echo gnt\-instance migrate instance89
    echo step 7
-  echo gnt-instance replace-disks -n node16 instance5
-  echo gnt-instance migrate instance5
+  echo gnt\-instance replace\-disks \-n node16 instance5
+  echo gnt\-instance migrate instance5
    echo step 8
-  echo gnt-instance migrate instance94
-  echo gnt-instance replace-disks -n node16 instance94
+  echo gnt\-instance migrate instance94
+  echo gnt\-instance replace\-disks \-n node16 instance94
    echo step 9
-  echo gnt-instance migrate instance44
-  echo gnt-instance replace-disks -n node15 instance44
+  echo gnt\-instance migrate instance44
+  echo gnt\-instance replace\-disks \-n node15 instance44
    echo step 10
-  echo gnt-instance replace-disks -n node16 instance62
+  echo gnt\-instance replace\-disks \-n node16 instance62
    echo step 11
-  echo gnt-instance replace-disks -n node16 instance13
+  echo gnt\-instance replace\-disks \-n node16 instance13
    echo step 12
-  echo gnt-instance replace-disks -n node7 instance19
+  echo gnt\-instance replace\-disks \-n node7 instance19
    echo step 13
-  echo gnt-instance replace-disks -n node1 instance43
+  echo gnt\-instance replace\-disks \-n node1 instance43
    echo step 14
-  echo gnt-instance replace-disks -n node4 instance1
+  echo gnt\-instance replace\-disks \-n node4 instance1
    echo step 15
-  echo gnt-instance replace-disks -n node17 instance58
+  echo gnt\-instance replace\-disks \-n node17 instance58
  
  Final cluster status:
  N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
@@ -351,10 +694,21 @@ the command list to reach the final solution. In the initial listing,
  we see which nodes are not N+1 compliant.
  
  The algorithm is stable as long as each step above is fully completed,
-e.g. in step 8, both the migrate and the replace-disks are
+e.g. in step 8, both the migrate and the replace\-disks are
  done. Otherwise, if only the migrate is done, the input data is
  changed in a way that the program will output a different solution
  list (but hopefully will end in the same state).
  
  .SH SEE ALSO
-ganeti(7), gnt-instance(8), gnt-node(8)
+.BR hspace "(1), " hscan "(1), " hail "(1), "
+.BR ganeti "(7), " gnt-instance "(8), " gnt-node "(8)"
+
+.SH "COPYRIGHT"
+.PP
+Copyright (C) 2009 Google Inc. Permission is granted to copy,
+distribute and/or modify under the terms of the GNU General Public
+License as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+.PP
+On Debian systems, the complete text of the GNU General Public License
+can be found in /usr/share/common-licenses/GPL.