code.grnet.gr Git - ganeti-local/blob - hn1.1

   1 .TH HN1 1 2009-03-23 htools "Ganeti H-tools"
   2 .SH NAME
   3 hn1 \- N+1 fixer for Ganeti
   4
   5 .SH SYNOPSIS
   6 .B hn1
   7 .B "[-C]"
   8 .B "[-p]"
   9 .B "[-o]"
  10 .BI "[ -m " cluster "]"
  11 .BI "[-n " nodes-file " ]"
  12 .BI "[ -i " instances-file "]"
  13 .BI "[-d " depth "]"
  14 .BI "[-r " max-removals "]"
  15 .BI "[-L " max-delta "]"
  16 .BI "[-l " min-delta "]"
  17
  18 .B hn1
  19 .B --version
  20
  21 .SH DESCRIPTION
  22 hn1 is a cluster N+1 fixer that tries to compute the minimum number of
  23 moves needed for getting all nodes to be N+1 compliant.
  24
  25 The algorithm is designed to be a 'perfect' algorithm, so that we
  26 always examine the entire solution space until we find the minimum
  27 solution. The algorithm can be tweaked via the \fB-d\fR, \fB-r\fR,
  28 \fB-L\fR and \fB-l\fR options.
  29
  30 By default, the program will show the solution in a somewhat cryptic
  31 format; for getting the actual Ganeti command list, use the \fB-C\fR
  32 option.
  33
  34 \fBNote:\fR this program is somewhat deprecated; \fBhbal(1)\fR gives
  35 usually much faster results, and a better cluster. It is recommended
  36 to use this program only when \fBhbal\fR doesn't give a N+1 compliant
  37 cluster.
  38
  39 .SS ALGORITHM
  40
  41 The algorithm works in multiple rounds, of increasing \fIdepth\fR,
  42 until we have a solution.
  43
  44 First, before starting the solution computation, we compute all the
  45 N+1-fail nodes and the instances they hold. These instances are
  46 candidate for replacement (and only these!).
  47
  48 The program start then with \fIdepth\fR one (unless overridden via the
  49 \fB-d\fR option), and at each round:
  50 .RS 4
  51 .TP 3
  52 \(em
  53 it tries to remove from the cluster as many instances as the current
  54 depth in order to make the cluster N+1 compliant
  55
  56 .TP
  57 \(em
  58 then, for each of the possible instance combinations that allow this
  59 (unless the total size is reduced via the \fB-r\fR option), it tries
  60 to put them back on the cluster while maintaining N+1 compliance
  61 .RE
  62
  63 It might be that at a given round, the results are:
  64 .RS 4
  65 .TP 3
  66 \(em
  67 no instance combination that can be put back; this means it is not
  68 possible to make the cluster N+1 compliant with this number of
  69 instances being moved, so we increase the depth and go on to the next
  70 round
  71 .TP
  72 \(em
  73 one or more successful result, in which case we take the one that has
  74 as few changes as possible (by change meaning a replace-disks needed)
  75 .RE
  76
  77 The main problem with the algorithm is that, being an exhaustive
  78 search, the CPU time required grows very very quickly based on
  79 depth. On a 20-node, 80-instances cluster, depths up to 5-6 are
  80 quickly computed, and depth 10 could already take days.
  81
  82 The main factors that influence the run time are:
  83 .RS 4
  84 .TP 3
  85 \(em
  86 the removal depth; for each increase with one of the depth, we grow
  87 the solution space by the number of nodes squared (since a new
  88 instance can live any two nodes as primary/secondary, therefore
  89 (almost) N times N); i.e., depth=1 will create a N^2 solution space,
  90 depth two will make this N^4, depth three will be N^6, etc.
  91
  92 .TP
  93 \(em
  94 the removal depth again; for each increase in the depth, there will be
  95 more valid removal sets, and the space of solutions increases linearly
  96 with the number of removal sets
  97 .RE
  98
  99 Therefore, the smaller the depth the faster the algorithm will be; it doesn't
 100 seem like this algorithm will work for clusters of 100 nodes and many many
 101 small instances (e.g. 256MB instances on 16GB nodes).
 102
 103 As an optimisation, since the algorithm is designed to prune the
 104 search space as quickly as possible, is by luck we find a good
 105 solution early at a given depth, then the other solutions which would
 106 result in a bigger delta (the number of changes) will not be
 107 investigated, and the program will finish fast. Since this is random
 108 and depends on where in the full solution space the good solution will
 109 be, there are two options for cutting down the time needed:
 110 .RS 4
 111 .TP 3
 112 \(em
 113 \fB-l\fR makes any solution that has delta lower than its parameter
 114 succeed instantly; the default value for this parameter is zero, so
 115 once we find a "perfect" solution we finish early
 116
 117 .TP
 118 \(em
 119 \fB-L\fR makes any solution with delta higher than its parameter being
 120 rejected instantly (and not descend on the search tree); this can
 121 reduce the depth of the search tree, with sometimes significant
 122 speedups; by default, this optimization is not used
 123 .RE
 124
 125 The algorithm also has some other internal optimisations:
 126 .RS 4
 127 .TP 3
 128 \(em
 129 when choosing where to place an instance in phase two, there are
 130 N*(N-1) possible primary/secondary options; however, if instead of
 131 iterating over all p * s pairs, we first determine the set of primary
 132 nodes that can hold this instance (without failing N+1), we can cut
 133 (N-1) secondary placements for each primary node removed; and since
 134 this applies at every iteration of phase 2 it linearly decreases the
 135 solution space, and on full clusters, this can mean a four-five times
 136 reductions of solution space
 137
 138 .TP
 139 \(em
 140 since the number of solutions is very high even for smaller depths (on
 141 the test data, depth=4 results in 1.8M solutions) we can't compare
 142 them at the end, so at each iteration in phase 2 we only promote the
 143 best solution out of our own set of solutions
 144
 145 .TP
 146 \(em
 147 since the placement of instances can only increase the delta of the
 148 solution (placing a new instance will add zero or more replace-disks
 149 steps), it means the delta will only increase while recursing during
 150 phase 2; therefore, if we know at one point that we have a current
 151 delta that is equal or higher to the delta of the best solution so
 152 far, we can abort the recursion; this cuts a tremendous number of
 153 branches; further promotion of the best solution from one removal set
 154 to another can cut entire removal sets after a few recursions
 155
 156 .RE
 157
 158 .SH OPTIONS
 159 The options that can be passed to the program are as follows:
 160 .TP
 161 .B -C, --print-commands
 162 Print the command list at the end of the run. Without this, the
 163 program will only show a shorter, but cryptic output.
 164 .TP
 165 .B -p, --print-nodes
 166 Prints the before and after node status, in a format designed to allow
 167 the user to understand the node's most important parameters.
 168
 169 The node list will contain these informations:
 170 .RS
 171 .TP
 172 .B F
 173 a character denoting the status of the node, with '-' meaning an
 174 offline node, '*' meaning N+1 failure and blank meaning a good node
 175 .TP
 176 .B Name
 177 the node name
 178 .TP
 179 .B t_mem
 180 the total node memory
 181 .TP
 182 .B n_mem
 183 the memory used by the node itself
 184 .TP
 185 .B i_mem
 186 the memory used by instances
 187 .TP
 188 .B x_mem
 189 amount memory which seems to be in use but cannot be determined why or
 190 by which instance; usually this means that the hypervisor has some
 191 overhead or that there are other reporting errors
 192 .TP
 193 .B f_mem
 194 the free node memory
 195 .TP
 196 .B r_mem
 197 the reserved node memory, which is the amount of free memory needed
 198 for N+1 compliance
 199 .TP
 200 .B t_dsk
 201 total disk
 202 .TP
 203 .B f_dsk
 204 free disk
 205 .TP
 206 .B pri
 207 number of primary instances
 208 .TP
 209 .B sec
 210 number of secondary instances
 211 .TP
 212 .B p_fmem
 213 percent of free memory
 214 .TP
 215 .B p_fdsk
 216 percent of free disk
 217 .RE
 218
 219 .TP
 220 .BI "-n" nodefile ", --nodes=" nodefile
 221 The name of the file holding node information (if not collecting via
 222 RAPI), instead of the default \fInodes\fR file (but see below how to
 223 customize the default value via the environment).
 224
 225 .TP
 226 .BI "-i" instancefile ", --instances=" instancefile
 227 The name of the file holding instance information (if not collecting
 228 via RAPI), instead of the default \fIinstances\fR file (but see below
 229 how to customize the default value via the environment).
 230
 231 .TP
 232 .BI "-m" cluster
 233 Collect data not from files but directly from the
 234 .I cluster
 235 given as an argument via RAPI. If the argument doesn't contain a colon
 236 (:), then it is converted into a fully-built URL via prepending
 237 https:// and appending the default RAPI port, otherwise it's
 238 considered a fully-specified URL and is used unchanged.
 239
 240 .TP
 241 .BI "-d" DEPTH ", --depth=" DEPTH
 242 Start the algorithm directly at depth \fID\fR, so that we don't
 243 examine lower depth. This will be faster if we know a solution is not
 244 found a lower depths, and thus it's unneeded to search them.
 245
 246 .TP
 247 .BI "-l" MIN-DELTA ", --min-delta=" MIN-DELTA
 248 If we find a solution with delta lower or equal to \fIMIN-DELTA\fR,
 249 consider this a success and don't examine further.
 250
 251 .TP
 252 .BI "-L" MAX-DELTA ", --max-delta=" MAX-DELTA
 253 If while computing a solution, it's intermediate delta is already
 254 higher or equal to \fIMAX-DELTA\fR, consider this a failure and abort
 255 (as if N+1 checks have failed).
 256
 257 .TP
 258 .B -V, --version
 259 Just show the program version and exit.
 260
 261 .SH EXIT STATUS
 262
 263 The exist status of the command will be zero, unless for some reason
 264 the algorithm fatally failed (e.g. wrong node or instance data).
 265
 266 .SH ENVIRONMENT
 267
 268 If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
 269 present in the environment, they will override the default names for
 270 the nodes and instances files. These will have of course no effect
 271 when RAPI is used.
 272
 273 .SH BUGS
 274
 275 The program does not check its input data for consistency, and aborts
 276 with cryptic errors messages in this case.
 277
 278 The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
 279 on input data which has such instances.
 280
 281 The algorithm doesn't know when it won't be possible to reach N+1
 282 compliance at all, and will happily churn CPU for ages without
 283 realising it won't reach a solution.
 284
 285 The algorithm is too slow.
 286
 287 The output format is not easily scriptable, and the program should
 288 feed moves directly into Ganeti (either via RAPI or via a gnt-debug
 289 input file).
 290
 291 .SH SEE ALSO
 292 .BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
 293 .BR gnt-node "(8)"
 294
 295 .SH "COPYRIGHT"
 296 .PP
 297 Copyright (C) 2009 Google Inc. Permission is granted to copy,
 298 distribute and/or modify under the terms of the GNU General Public
 299 License as published by the Free Software Foundation; either version 2
 300 of the License, or (at your option) any later version.
 301 .PP
 302 On Debian systems, the complete text of the GNU General Public License
 303 can be found in /usr/share/common-licenses/GPL.