Merge branch 'stable-2.8' into stable-2.9

diff --git a/man/hbal.rst b/man/hbal.rst
index 2bf018f..4b1e5ef 100644
--- a/man/hbal.rst
+++ b/man/hbal.rst
@@ -16,7 +16,8 @@ SYNOPSIS
  
  Backend options:
  
-{ **-m** *cluster* | **-L[** *path* **] [-X]** | **-t** *data-file* }
+{ **-m** *cluster* | **-L[** *path* **] [-X]** | **-t** *data-file* |
+**-I** *path* }
  
  Algorithm options:
  
@@ -38,7 +39,7 @@ Reporting options:
  **[ -C[ *file* ] ]**
  **[ -p[ *fields* ] ]**
  **[ \--print-instances ]**
-**[ -o ]**
+**[ -S *file* ]**
  **[ -v... | -q ]**
  
  
@@ -90,7 +91,8 @@ At each step, we prevent an instance move if it would cause:
  
  - a node to go into N+1 failure state
  - an instance to move onto an offline node (offline nodes are either
-  read from the cluster or declared with *-O*)
+  read from the cluster or declared with *-O*; drained nodes are
+  considered offline)
  - an exclusion-tag based conflict (exclusion tags are read from the
    cluster and/or defined via the *\--exclusion-tags* option)
  - a max vcpu/pcpu ratio to be exceeded (configured via *\--max-cpu*)
@@ -101,15 +103,16 @@ CLUSTER SCORING
  ~~~~~~~~~~~~~~~
  
  As said before, the algorithm tries to minimise the cluster score at
-each step. Currently this score is computed as a sum of the following
-components:
+each step. Currently this score is computed as a weighted sum of the
+following components:
  
  - standard deviation of the percent of free memory
  - standard deviation of the percent of reserved memory
  - standard deviation of the percent of free disk
  - count of nodes failing N+1 check
  - count of instances living (either as primary or secondary) on
-  offline nodes
+  offline nodes; in the sense of hbal (and the other htools) drained
+  nodes are considered offline
  - count of instances living (as primary) on offline nodes; this
    differs from the above metric by helping failover of such instances
    in 2-node clusters
@@ -219,7 +222,7 @@ The options that can be passed to the program are as follows:
  -p, \--print-nodes
    Prints the before and after node status, in a format designed to allow
    the user to understand the node's most important parameters. See the
-  man page **htools**(1) for more details about this option.
+  man page **htools**\(1) for more details about this option.
  
  \--print-instances
    Prints the before and after instance map. This is less useful as the
@@ -325,17 +328,17 @@ The options that can be passed to the program are as follows:
    Backend specification: the name of the file holding node and instance
    information (if not collecting via RAPI or LUXI). This or one of the
    other backends must be selected. The option is described in the man
-  page **htools**(1).
+  page **htools**\(1).
  
  -m *cluster*
    Backend specification: collect data directly from the *cluster* given
    as an argument via RAPI. The option is described in the man page
-  **htools**(1).
+  **htools**\(1).
  
  -L [*path*]
    Backend specification: collect data directly from the master daemon,
    which is to be contacted via LUXI (an internal Ganeti protocol). The
-  option is described in the man page **htools**(1).
+  option is described in the man page **htools**\(1).
  
  -X
    When using the Luxi backend, hbal can also execute the given
@@ -399,24 +402,32 @@ in two ways:
  
  - by sending a ``SIGINT`` (``^C``), hbal will register the termination
    request, and will wait until the currently submitted jobs finish, at
-  which point it will exit (with exit code 1)
+  which point it will exit (with exit code 0 if all jobs finished
+  correctly, otherwise with exit code 1 as usual)
+
  - by sending a ``SIGTERM``, hbal will immediately exit (with exit code
-  2); it is the responsibility of the user to follow up with Ganeti the
-  result of the currently-executing jobs
+  2\); it is the responsibility of the user to follow up with Ganeti
+  and check the result of the currently-executing jobs
  
  Note that in any situation, it's perfectly safe to kill hbal, either via
  the above signals or via any other signal (e.g. ``SIGQUIT``,
  ``SIGKILL``), since the jobs themselves are processed by Ganeti whereas
  hbal (after submission) only watches their progression. In this case,
-the use will again have to query Ganeti for job results.
+the user will have to query Ganeti for job results.
  
  EXIT STATUS
  -----------
  
  The exit status of the command will be zero, unless for some reason the
-algorithm fatally failed (e.g. wrong node or instance data), or (in case
-of job execution) either one of the jobs has failed or the balancing was
-interrupted early.
+algorithm failed (e.g. wrong node or instance data), invalid command
+line options, or (in case of job execution) one of the jobs has failed.
+
+Once job execution via Luxi has started (``-X``), if the balancing was
+interrupted early (via *SIGINT*, or via ``--max-length``) but all jobs
+executed successfully, then the exit status is zero; a non-zero exit
+code means that the cluster state should be investigated, since a job
+failed or we couldn't compute its status and this can also point to a
+problem on the Ganeti side.
  
  BUGS
  ----