X-Git-Url: https://code.grnet.gr/git/ganeti-local/blobdiff_plain/352806f7dc06a05ca51430926c1b3fa21d2daa0d..0a8dd21dd36b6083b539773fea19f88e360211ec:/hn1.1

diff --git a/hn1.1 b/hn1.1
index f3920ee..b35117e 100644
--- a/hn1.1
+++ b/hn1.1
@@ -1,4 +1,4 @@
-.TH HN1 1 2009-03-14 htools "Ganeti H-tools"
+.TH HN1 1 2009-03-23 htools "Ganeti H-tools"
 .SH NAME
 hn1 \- N+1 fixer for Ganeti
@@ -47,38 +47,113 @@ candidate for replacement (and only these!).
 The program then starts with \fIdepth\fR one (unless overridden via
 the \fB-d\fR option), and at each round:
-  - it tries to remove from the cluster as many instances as the
-    current depth in order to make the cluster N+1 compliant
-  - then, for each of the possible instance combinations that allow
-    this (unless the total size is reduced via the \fB-r\fR option),
-    it tries to put them back on the cluster while maintaining N+1
-    compliance
+.RS 4
+.TP 3
+\(em
+it tries to remove from the cluster as many instances as the current
+depth in order to make the cluster N+1 compliant
+
+.TP
+\(em
+then, for each of the possible instance combinations that allow this
+(unless the total size is reduced via the \fB-r\fR option), it tries
+to put them back on the cluster while maintaining N+1 compliance
+.RE

 It might be that at a given round, the results are:
-  - no instance combination that can be put back; this means it is not
-    possible to make the cluster N+1 compliant with this number of
-    instances being moved, so we increase the depth and go on to the
-    next round
-  - one or more successful result, in which case we take the one that
-    has as few changes as possible (by change meaning a replace-disks
-    needed)
+.RS 4
+.TP 3
+\(em
+no instance combination that can be put back; this means it is not
+possible to make the cluster N+1 compliant with this number of
+instances being moved, so we increase the depth and go on to the next
+round
+.TP
+\(em
+one or more successful results, in which case we take the one that has
+as few changes as possible (by change meaning a replace-disks needed)
+.RE

 The main problem with the algorithm is that, being an exhaustive
 search, the CPU time required grows very quickly with the depth. On a
 20-node, 80-instance cluster, depths up to 5-6 are quickly computed,
 and depth 10 could already take days.

-Since the algorithm is designed to prune the search space as quickly
-as possible, is by luck we find a good solution early at a given
-depth, then the other solutions which would result in a bigger delta
-(the number of changes) will not be investigated, and the program will
-finish fast. Since this is random and depends on where in the full
-solution space the good solution will be, there are two options for
-cutting down the time needed:
-  - \fB-l\fR makes any solution that has delta lower than its
-    parameter succeed instantly
-  - \fB-L\fR makes any solution with delta higher than its parameter
-    being rejected instantly (and not descend on the search tree)
+The main factors that influence the run time are:
+.RS 4
+.TP 3
+\(em
+the removal depth; for each increase of the depth by one, we grow the
+solution space by the number of nodes squared (since a new instance
+can live on any two nodes as primary/secondary, therefore (almost) N
+times N); i.e., depth=1 will create an N^2 solution space, depth two
+will make this N^4, depth three will be N^6, etc.
+
+.TP
+\(em
+the removal depth again; for each increase in the depth, there will be
+more valid removal sets, and the space of solutions increases linearly
+with the number of removal sets
+.RE
+
+Therefore, the smaller the depth, the faster the algorithm will be; it
+doesn't seem like this algorithm will work for clusters of 100 nodes
+and many small instances (e.g. 256MB instances on 16GB nodes).
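The depth-driven growth described above can be illustrated with a small back-of-the-envelope sketch (plain Python, not part of hn1; the helper name is made up for illustration):

```python
# Rough size of the phase-2 placement space for an n-node cluster at a
# given removal depth: each removed instance can be placed back on any
# of the n*(n-1) ordered primary/secondary node pairs, i.e. roughly
# n^2 placements per depth level.
def placement_space(nodes, depth):
    return (nodes * (nodes - 1)) ** depth

# On a 20-node cluster, depth 1 gives 380 candidate placements, while
# depth 3 already gives about 5.5e7 - hence the rapid growth in CPU time.
print(placement_space(20, 1))  # 380
print(placement_space(20, 3))  # 54872000
```

This ignores the (linear) growth in the number of removal sets mentioned above, so it understates the true work, but it shows why small depths remain tractable while larger ones do not.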
+
+As an optimisation, since the algorithm is designed to prune the
+search space as quickly as possible, if by luck we find a good
+solution early at a given depth, then the other solutions which would
+result in a bigger delta (the number of changes) will not be
+investigated, and the program will finish fast. Since this is random
+and depends on where in the full solution space the good solution
+lies, there are two options for cutting down the time needed:
+.RS 4
+.TP 3
+\(em
+\fB-l\fR makes any solution that has delta lower than its parameter
+succeed instantly; the default value for this parameter is zero, so
+once we find a "perfect" solution we finish early
+
+.TP
+\(em
+\fB-L\fR makes any solution with delta higher than its parameter be
+rejected instantly (and not descend into the search tree); this can
+reduce the depth of the search tree, with sometimes significant
+speedups; by default, this optimisation is not used
+.RE
+
+The algorithm also has some other internal optimisations:
+.RS 4
+.TP 3
+\(em
+when choosing where to place an instance in phase two, there are
+N*(N-1) possible primary/secondary options; however, if instead of
+iterating over all p * s pairs, we first determine the set of primary
+nodes that can hold this instance (without failing N+1), we can cut
+(N-1) secondary placements for each primary node removed; and since
+this applies at every iteration of phase 2, it linearly decreases the
+solution space, and on full clusters this can mean a four- to
+five-fold reduction of the solution space
+
+.TP
+\(em
+since the number of solutions is very high even for smaller depths (on
+the test data, depth=4 results in 1.8M solutions), we can't compare
+them all at the end, so at each iteration in phase 2 we only promote
+the best solution out of our own set of solutions
+
+.TP
+\(em
+since the placement of instances can only increase the delta of the
+solution (placing a new instance will add zero or more replace-disks
+steps), the delta will only increase while recursing during phase 2;
+therefore, if we know at one point that the current delta is equal to
+or higher than the delta of the best solution so far, we can abort the
+recursion; this cuts a tremendous number of branches; further
+promotion of the best solution from one removal set to another can cut
+entire removal sets after a few recursions
+
+.RE

 .SH OPTIONS
 The options that can be passed to the program are as follows:

@@ -92,41 +167,75 @@ Prints the before and after node status, in a format designed to
 allow the user to understand the node's most important
 parameters. The node list will contain this information:
-  - a character denoting the status of the node, with '-' meaning an
-    offline node, '*' meaning N+1 failure and blank meaning a good
-    node
-  - the node name
-  - the total node memory
-  - the free node memory
-  - the reserved node memory, which is the amount of free memory
-    needed for N+1 compliance
-  - total disk
-  - free disk
-  - number of primary instances
-  - number of secondary instances
-  - percent of free memory
-  - percent of free disk
+.RS
+.TP
+.B F
+a character denoting the status of the node, with '-' meaning an
+offline node, '*' meaning N+1 failure and blank meaning a good node
+.TP
+.B Name
+the node name
+.TP
+.B t_mem
+the total node memory
+.TP
+.B n_mem
+the memory used by the node itself
+.TP
+.B i_mem
+the memory used by instances
+.TP
+.B x_mem
+the amount of memory which seems to be in use but for which we cannot
+determine why or by which instance; usually this means that the
+hypervisor has some overhead or that there are other reporting errors
+.TP
+.B f_mem
+the free node memory
+.TP
+.B r_mem
+the reserved node memory, which is the amount of free memory needed
+for N+1 compliance
+.TP
+.B t_dsk
+total disk
+.TP
+.B f_dsk
+free disk
+.TP
+.B pri
+number of primary instances
+.TP
+.B sec
+number of secondary instances
+.TP
+.B p_fmem
+percent of free memory
+.TP
+.B p_fdsk
+percent of free disk
+.RE

 .TP
 .BI "-n" nodefile ", --nodes=" nodefile
 The name of the file holding node information (if not collecting via
-RAPI), instead of the default
-.I nodes
-file.
+RAPI), instead of the default \fInodes\fR file (but see below how to
+customize the default value via the environment).

 .TP
 .BI "-i" instancefile ", --instances=" instancefile
 The name of the file holding instance information (if not collecting
-via RAPI), instead of the default
-.I instances
-file.
+via RAPI), instead of the default \fIinstances\fR file (but see below
+how to customize the default value via the environment).

 .TP
 .BI "-m" cluster
 Collect data not from files but directly from the
 .I cluster
-given as an argument via RAPI. This work for both Ganeti 1.2 and
-Ganeti 2.0.
+given as an argument via RAPI. If the argument doesn't contain a colon
+(:), then it is converted into a fully-built URL by prepending
+https:// and appending the default RAPI port; otherwise it's
+considered a fully-specified URL and is used unchanged.

 .TP
 .BI "-d" DEPTH ", --depth=" DEPTH
@@ -154,11 +263,21 @@ Just show the program version and exit.

 The exit status of the command will be zero, unless for some reason
 the algorithm fatally failed (e.g. wrong node or instance data).

+.SH ENVIRONMENT
+
+If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
+present in the environment, they will override the default names for
+the nodes and instances files. These will, of course, have no effect
+when RAPI is used.
+
 .SH BUGS

 The program does not check its input data for consistency, and aborts
 with cryptic error messages in this case.

+The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
+on input data which has such instances.
+
 The algorithm doesn't know when it won't be possible to reach N+1
 compliance at all, and will happily churn CPU for ages without
 realising it won't reach a solution.

@@ -170,4 +289,15 @@ feed moves directly into Ganeti (either via RAPI or via a
 gnt-debug input file).
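The \fB-m\fR argument handling described in OPTIONS can be sketched as follows (a hypothetical Python rendition for illustration only, not the tool's actual implementation; the port value used here is an assumption, not taken from this page):

```python
def build_rapi_url(cluster, default_port=5080):
    """Sketch of the -m argument rule: an argument without a colon is
    turned into a full URL by prepending https:// and appending a
    default RAPI port (5080 is an assumed value); anything containing
    a colon is treated as a fully-specified URL and left unchanged."""
    if ":" not in cluster:
        return "https://%s:%d" % (cluster, default_port)
    return cluster

print(build_rapi_url("cluster1.example.com"))
print(build_rapi_url("https://cluster1.example.com:5080"))
```

Note that the colon test means even a bare "host:port" argument is passed through unchanged, so the https:// prefix is only added for plain cluster names.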
 .SH SEE ALSO
-hbal(1), ganeti(7), gnt-instance(8), gnt-node(8)
+.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
+.BR gnt-node "(8)"
+
+.SH "COPYRIGHT"
+.PP
+Copyright (C) 2009 Google Inc. Permission is granted to copy,
+distribute and/or modify under the terms of the GNU General Public
+License as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+.PP
+On Debian systems, the complete text of the GNU General Public License
+can be found in /usr/share/common-licenses/GPL.