Statistics
| Branch: | Tag: | Revision:

## root / hn1.1 @ e015b554

 1 ```.TH HN1 1 2009-03-23 htools "Ganeti H-tools" ``` ```.SH NAME ``` ```hn1 \- N+1 fixer for Ganeti ``` ```.SH SYNOPSIS ``` ```.B hn1 ``` ```.B "[-C]" ``` ```.B "[-p]" ``` ```.B "[-o]" ``` ```.BI "[ -m " cluster "]" ``` ```.BI "[-n " nodes-file " ]" ``` ```.BI "[ -i " instances-file "]" ``` ```.BI "[-d " depth "]" ``` ```.BI "[-r " max-removals "]" ``` ```.BI "[-L " max-delta "]" ``` ```.BI "[-l " min-delta "]" ``` ```.B hn1 ``` ```.B --version ``` ```.SH DESCRIPTION ``` ```hn1 is a cluster N+1 fixer that tries to compute the minimum number of ``` ```moves needed for getting all nodes to be N+1 compliant. ``` ```The algorithm is designed to be a 'perfect' algorithm, so that we ``` ```always examine the entire solution space until we find the minimum ``` ```solution. The algorithm can be tweaked via the \fB-d\fR, \fB-r\fR, ``` ```\fB-L\fR and \fB-l\fR options. ``` ```By default, the program will show the solution in a somewhat cryptic ``` ```format; for getting the actual Ganeti command list, use the \fB-C\fR ``` ```option. ``` ```\fBNote:\fR this program is somewhat deprecated; \fBhbal(1)\fR gives ``` ```usually much faster results, and a better cluster. It is recommended ``` ```to use this program only when \fBhbal\fR doesn't give a N+1 compliant ``` ```cluster. ``` ```.SS ALGORITHM ``` ```The algorithm works in multiple rounds, of increasing \fIdepth\fR, ``` ```until we have a solution. ``` ```First, before starting the solution computation, we compute all the ``` ```N+1-fail nodes and the instances they hold. These instances are ``` ```candidate for replacement (and only these!). ``` ```The program start then with \fIdepth\fR one (unless overridden via the ``` ```\fB-d\fR option), and at each round: ``` ```.RS 4 ``` ```.TP 3 ``` ```\(em ``` ```it tries to remove from the cluster as many instances as the current ``` ```depth in order to make the cluster N+1 compliant ``` ```.TP ``` ```\(em ``` ```then, for each of the possible instance combinations that allow this ``` ```(unless the total size is reduced via the \fB-r\fR option), it tries ``` ```to put them back on the cluster while maintaining N+1 compliance ``` ```.RE ``` ```It might be that at a given round, the results are: ``` ```.RS 4 ``` ```.TP 3 ``` ```\(em ``` ```no instance combination that can be put back; this means it is not ``` ```possible to make the cluster N+1 compliant with this number of ``` ```instances being moved, so we increase the depth and go on to the next ``` ```round ``` ```.TP ``` ```\(em ``` ```one or more successful result, in which case we take the one that has ``` ```as few changes as possible (by change meaning a replace-disks needed) ``` ```.RE ``` ```The main problem with the algorithm is that, being an exhaustive ``` ```search, the CPU time required grows very very quickly based on ``` ```depth. On a 20-node, 80-instances cluster, depths up to 5-6 are ``` ```quickly computed, and depth 10 could already take days. ``` ```The main factors that influence the run time are: ``` ```.RS 4 ``` ```.TP 3 ``` ```\(em ``` ```the removal depth; for each increase with one of the depth, we grow ``` ```the solution space by the number of nodes squared (since a new ``` ```instance can live any two nodes as primary/secondary, therefore ``` ```(almost) N times N); i.e., depth=1 will create a N^2 solution space, ``` ```depth two will make this N^4, depth three will be N^6, etc. ``` ```.TP ``` ```\(em ``` ```the removal depth again; for each increase in the depth, there will be ``` ```more valid removal sets, and the space of solutions increases linearly ``` ```with the number of removal sets ``` ```.RE ``` ```Therefore, the smaller the depth the faster the algorithm will be; it doesn't ``` ```seem like this algorithm will work for clusters of 100 nodes and many many ``` ```small instances (e.g. 256MB instances on 16GB nodes). ``` ```As an optimisation, since the algorithm is designed to prune the ``` ```search space as quickly as possible, is by luck we find a good ``` ```solution early at a given depth, then the other solutions which would ``` ```result in a bigger delta (the number of changes) will not be ``` ```investigated, and the program will finish fast. Since this is random ``` ```and depends on where in the full solution space the good solution will ``` ```be, there are two options for cutting down the time needed: ``` ```.RS 4 ``` ```.TP 3 ``` ```\(em ``` ```\fB-l\fR makes any solution that has delta lower than its parameter ``` ```succeed instantly; the default value for this parameter is zero, so ``` ```once we find a "perfect" solution we finish early ``` ```.TP ``` ```\(em ``` ```\fB-L\fR makes any solution with delta higher than its parameter being ``` ```rejected instantly (and not descend on the search tree); this can ``` ```reduce the depth of the search tree, with sometimes significant ``` ```speedups; by default, this optimization is not used ``` ```.RE ``` ```The algorithm also has some other internal optimisations: ``` ```.RS 4 ``` ```.TP 3 ``` ```\(em ``` ```when choosing where to place an instance in phase two, there are ``` ```N*(N-1) possible primary/secondary options; however, if instead of ``` ```iterating over all p * s pairs, we first determine the set of primary ``` ```nodes that can hold this instance (without failing N+1), we can cut ``` ```(N-1) secondary placements for each primary node removed; and since ``` ```this applies at every iteration of phase 2 it linearly decreases the ``` ```solution space, and on full clusters, this can mean a four-five times ``` ```reductions of solution space ``` ```.TP ``` ```\(em ``` ```since the number of solutions is very high even for smaller depths (on ``` ```the test data, depth=4 results in 1.8M solutions) we can't compare ``` ```them at the end, so at each iteration in phase 2 we only promote the ``` ```best solution out of our own set of solutions ``` ```.TP ``` ```\(em ``` ```since the placement of instances can only increase the delta of the ``` ```solution (placing a new instance will add zero or more replace-disks ``` ```steps), it means the delta will only increase while recursing during ``` ```phase 2; therefore, if we know at one point that we have a current ``` ```delta that is equal or higher to the delta of the best solution so ``` ```far, we can abort the recursion; this cuts a tremendous number of ``` ```branches; further promotion of the best solution from one removal set ``` ```to another can cut entire removal sets after a few recursions ``` ```.RE ``` ```.SH OPTIONS ``` ```The options that can be passed to the program are as follows: ``` ```.TP ``` ```.B -C, --print-commands ``` ```Print the command list at the end of the run. Without this, the ``` ```program will only show a shorter, but cryptic output. ``` ```.TP ``` ```.B -p, --print-nodes ``` ```Prints the before and after node status, in a format designed to allow ``` ```the user to understand the node's most important parameters. ``` ```The node list will contain these informations: ``` ```.RS ``` ```.TP ``` ```.B F ``` ```a character denoting the status of the node, with '-' meaning an ``` ```offline node, '*' meaning N+1 failure and blank meaning a good node ``` ```.TP ``` ```.B Name ``` ```the node name ``` ```.TP ``` ```.B t_mem ``` ```the total node memory ``` ```.TP ``` ```.B n_mem ``` ```the memory used by the node itself ``` ```.TP ``` ```.B i_mem ``` ```the memory used by instances ``` ```.TP ``` ```.B x_mem ``` ```amount memory which seems to be in use but cannot be determined why or ``` ```by which instance; usually this means that the hypervisor has some ``` ```overhead or that there are other reporting errors ``` ```.TP ``` ```.B f_mem ``` ```the free node memory ``` ```.TP ``` ```.B r_mem ``` ```the reserved node memory, which is the amount of free memory needed ``` ```for N+1 compliance ``` ```.TP ``` ```.B t_dsk ``` ```total disk ``` ```.TP ``` ```.B f_dsk ``` ```free disk ``` ```.TP ``` ```.B pri ``` ```number of primary instances ``` ```.TP ``` ```.B sec ``` ```number of secondary instances ``` ```.TP ``` ```.B p_fmem ``` ```percent of free memory ``` ```.TP ``` ```.B p_fdsk ``` ```percent of free disk ``` ```.RE ``` ```.TP ``` ```.BI "-n" nodefile ", --nodes=" nodefile ``` ```The name of the file holding node information (if not collecting via ``` ```RAPI), instead of the default \fInodes\fR file (but see below how to ``` ```customize the default value via the environment). ``` ```.TP ``` ```.BI "-i" instancefile ", --instances=" instancefile ``` ```The name of the file holding instance information (if not collecting ``` ```via RAPI), instead of the default \fIinstances\fR file (but see below ``` ```how to customize the default value via the environment). ``` ```.TP ``` ```.BI "-m" cluster ``` ```Collect data not from files but directly from the ``` ```.I cluster ``` ```given as an argument via RAPI. If the argument doesn't contain a colon ``` ```(:), then it is converted into a fully-built URL via prepending ``` ```https:// and appending the default RAPI port, otherwise it's ``` ```considered a fully-specified URL and is used unchanged. ``` ```.TP ``` ```.BI "-d" DEPTH ", --depth=" DEPTH ``` ```Start the algorithm directly at depth \fID\fR, so that we don't ``` ```examine lower depth. This will be faster if we know a solution is not ``` ```found a lower depths, and thus it's unneeded to search them. ``` ```.TP ``` ```.BI "-l" MIN-DELTA ", --min-delta=" MIN-DELTA ``` ```If we find a solution with delta lower or equal to \fIMIN-DELTA\fR, ``` ```consider this a success and don't examine further. ``` ```.TP ``` ```.BI "-L" MAX-DELTA ", --max-delta=" MAX-DELTA ``` ```If while computing a solution, it's intermediate delta is already ``` ```higher or equal to \fIMAX-DELTA\fR, consider this a failure and abort ``` ```(as if N+1 checks have failed). ``` ```.TP ``` ```.B -V, --version ``` ```Just show the program version and exit. ``` ```.SH EXIT STATUS ``` ```The exist status of the command will be zero, unless for some reason ``` ```the algorithm fatally failed (e.g. wrong node or instance data). ``` ```.SH ENVIRONMENT ``` ```If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are ``` ```present in the environment, they will override the default names for ``` ```the nodes and instances files. These will have of course no effect ``` ```when RAPI is used. ``` ```.SH BUGS ``` ```The program does not check its input data for consistency, and aborts ``` ```with cryptic errors messages in this case. ``` ```The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes ``` ```on input data which has such instances. ``` ```The algorithm doesn't know when it won't be possible to reach N+1 ``` ```compliance at all, and will happily churn CPU for ages without ``` ```realising it won't reach a solution. ``` ```The algorithm is too slow. ``` ```The output format is not easily scriptable, and the program should ``` ```feed moves directly into Ganeti (either via RAPI or via a gnt-debug ``` ```input file). ``` ```.SH SEE ALSO ``` ```.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), " ``` ```.BR gnt-node "(8)" ```