.TH HN1 1 2009-03-23 htools "Ganeti H-tools"
.SH NAME
hn1 \- N+1 fixer for Ganeti

.SH SYNOPSIS
.B hn1
.B "[-C]"
.B "[-p]"
.B "[-o]"
.BI "[ -m " cluster "]"
.BI "[-n " nodes-file " ]"
.BI "[ -i " instances-file "]"
.BI "[-d " depth "]"
.BI "[-r " max-removals "]"
.BI "[-L " max-delta "]"
.BI "[-l " min-delta "]"

.B hn1
.B --version

.SH DESCRIPTION
hn1 is a cluster N+1 fixer that tries to compute the minimum number of
moves needed for getting all nodes to be N+1 compliant.

The algorithm is designed to be a 'perfect' algorithm, so that we
always examine the entire solution space until we find the minimum
solution. The algorithm can be tweaked via the \fB-d\fR, \fB-r\fR,
\fB-L\fR and \fB-l\fR options.

By default, the program will show the solution in a somewhat cryptic
format; for getting the actual Ganeti command list, use the \fB-C\fR
option.

\fBNote:\fR this program is somewhat deprecated; \fBhbal(1)\fR gives
usually much faster results, and a better cluster. It is recommended
to use this program only when \fBhbal\fR doesn't give a N+1 compliant
cluster.

.SS ALGORITHM

The algorithm works in multiple rounds, of increasing \fIdepth\fR,
until we have a solution.

First, before starting the solution computation, we compute all the
N+1-fail nodes and the instances they hold. These instances are
candidate for replacement (and only these!).

The program start then with \fIdepth\fR one (unless overridden via the
\fB-d\fR option), and at each round:
.RS 4
.TP 3
\(em
it tries to remove from the cluster as many instances as the current
depth in order to make the cluster N+1 compliant

.TP
\(em
then, for each of the possible instance combinations that allow this
(unless the total size is reduced via the \fB-r\fR option), it tries
to put them back on the cluster while maintaining N+1 compliance
.RE

It might be that at a given round, the results are:
.RS 4
.TP 3
\(em
no instance combination that can be put back; this means it is not
possible to make the cluster N+1 compliant with this number of
instances being moved, so we increase the depth and go on to the next
round
.TP
\(em
one or more successful result, in which case we take the one that has
as few changes as possible (by change meaning a replace-disks needed)
.RE

The main problem with the algorithm is that, being an exhaustive
search, the CPU time required grows very very quickly based on
depth. On a 20-node, 80-instances cluster, depths up to 5-6 are
quickly computed, and depth 10 could already take days.

The main factors that influence the run time are:
.RS 4
.TP 3
\(em
the removal depth; for each increase with one of the depth, we grow
the solution space by the number of nodes squared (since a new
instance can live any two nodes as primary/secondary, therefore
(almost) N times N); i.e., depth=1 will create a N^2 solution space,
depth two will make this N^4, depth three will be N^6, etc.

.TP
\(em
the removal depth again; for each increase in the depth, there will be
more valid removal sets, and the space of solutions increases linearly
with the number of removal sets
.RE

Therefore, the smaller the depth the faster the algorithm will be; it doesn't
seem like this algorithm will work for clusters of 100 nodes and many many
small instances (e.g. 256MB instances on 16GB nodes).

As an optimisation, since the algorithm is designed to prune the
search space as quickly as possible, is by luck we find a good
solution early at a given depth, then the other solutions which would
result in a bigger delta (the number of changes) will not be
investigated, and the program will finish fast. Since this is random
and depends on where in the full solution space the good solution will
be, there are two options for cutting down the time needed:
.RS 4
.TP 3
\(em
\fB-l\fR makes any solution that has delta lower than its parameter
succeed instantly; the default value for this parameter is zero, so
once we find a "perfect" solution we finish early

.TP
\(em
\fB-L\fR makes any solution with delta higher than its parameter being
rejected instantly (and not descend on the search tree); this can
reduce the depth of the search tree, with sometimes significant
speedups; by default, this optimization is not used
.RE

The algorithm also has some other internal optimisations:
.RS 4
.TP 3
\(em
when choosing where to place an instance in phase two, there are
N*(N-1) possible primary/secondary options; however, if instead of
iterating over all p * s pairs, we first determine the set of primary
nodes that can hold this instance (without failing N+1), we can cut
(N-1) secondary placements for each primary node removed; and since
this applies at every iteration of phase 2 it linearly decreases the
solution space, and on full clusters, this can mean a four-five times
reductions of solution space

.TP
\(em
since the number of solutions is very high even for smaller depths (on
the test data, depth=4 results in 1.8M solutions) we can't compare
them at the end, so at each iteration in phase 2 we only promote the
best solution out of our own set of solutions

.TP
\(em
since the placement of instances can only increase the delta of the
solution (placing a new instance will add zero or more replace-disks
steps), it means the delta will only increase while recursing during
phase 2; therefore, if we know at one point that we have a current
delta that is equal or higher to the delta of the best solution so
far, we can abort the recursion; this cuts a tremendous number of
branches; further promotion of the best solution from one removal set
to another can cut entire removal sets after a few recursions

.RE

.SH OPTIONS
The options that can be passed to the program are as follows:
.TP
.B -C, --print-commands
Print the command list at the end of the run. Without this, the
program will only show a shorter, but cryptic output.
.TP
.B -p, --print-nodes
Prints the before and after node status, in a format designed to allow
the user to understand the node's most important parameters.

The node list will contain these informations:
.RS
.TP
.B F
a character denoting the status of the node, with '-' meaning an
offline node, '*' meaning N+1 failure and blank meaning a good node
.TP
.B Name
the node name
.TP
.B t_mem
the total node memory
.TP
.B n_mem
the memory used by the node itself
.TP
.B i_mem
the memory used by instances
.TP
.B x_mem
amount memory which seems to be in use but cannot be determined why or
by which instance; usually this means that the hypervisor has some
overhead or that there are other reporting errors
.TP
.B f_mem
the free node memory
.TP
.B r_mem
the reserved node memory, which is the amount of free memory needed
for N+1 compliance
.TP
.B t_dsk
total disk
.TP
.B f_dsk
free disk
.TP
.B pri
number of primary instances
.TP
.B sec
number of secondary instances
.TP
.B p_fmem
percent of free memory
.TP
.B p_fdsk
percent of free disk
.RE

.TP
.BI "-n" nodefile ", --nodes=" nodefile
The name of the file holding node information (if not collecting via
RAPI), instead of the default \fInodes\fR file (but see below how to
customize the default value via the environment).

.TP
.BI "-i" instancefile ", --instances=" instancefile
The name of the file holding instance information (if not collecting
via RAPI), instead of the default \fIinstances\fR file (but see below
how to customize the default value via the environment).

.TP
.BI "-m" cluster
Collect data not from files but directly from the
.I cluster
given as an argument via RAPI. If the argument doesn't contain a colon
(:), then it is converted into a fully-built URL via prepending
https:// and appending the default RAPI port, otherwise it's
considered a fully-specified URL and is used unchanged.

.TP
.BI "-d" DEPTH ", --depth=" DEPTH
Start the algorithm directly at depth \fID\fR, so that we don't
examine lower depth. This will be faster if we know a solution is not
found a lower depths, and thus it's unneeded to search them.

.TP
.BI "-l" MIN-DELTA ", --min-delta=" MIN-DELTA
If we find a solution with delta lower or equal to \fIMIN-DELTA\fR,
consider this a success and don't examine further.

.TP
.BI "-L" MAX-DELTA ", --max-delta=" MAX-DELTA
If while computing a solution, it's intermediate delta is already
higher or equal to \fIMAX-DELTA\fR, consider this a failure and abort
(as if N+1 checks have failed).

.TP
.B -V, --version
Just show the program version and exit.

.SH EXIT STATUS

The exist status of the command will be zero, unless for some reason
the algorithm fatally failed (e.g. wrong node or instance data).

.SH ENVIRONMENT

If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
present in the environment, they will override the default names for
the nodes and instances files. These will have of course no effect
when RAPI is used.

.SH BUGS

The program does not check its input data for consistency, and aborts
with cryptic errors messages in this case.

The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
on input data which has such instances.

The algorithm doesn't know when it won't be possible to reach N+1
compliance at all, and will happily churn CPU for ages without
realising it won't reach a solution.

The algorithm is too slow.

The output format is not easily scriptable, and the program should
feed moves directly into Ganeti (either via RAPI or via a gnt-debug
input file).

.SH SEE ALSO
.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
.BR gnt-node "(8)"