.TH HBAL 1 2009-03-23 htools "Ganeti H-tools"
.SH NAME
hbal \- Cluster balancer for Ganeti

.SH SYNOPSIS
.B hbal
.B "[backend options...]"
.B "[algorithm options...]"
.B "[reporting options...]"

.B hbal
.B --version

.TP
Backend options:
.BI "[ -m " cluster " ]"
|
.BI "[ -L[" path "] [-X]]"
|
.BI "[ -t " data-file " ]"

.TP
Algorithm options:
.BI "[ --max-cpu " cpu-ratio " ]"
.BI "[ --min-disk " disk-ratio " ]"
.BI "[ -l " limit " ]"
.BI "[ -e " score " ]"
.BI "[ -O " name... " ]"
.B "[ --no-disk-moves ]"
.BI "[ -U " util-file " ]"

.TP
Reporting options:
.BI "[ -C[" file "] ]"
.BI "[ -p[" fields "] ]"
.B "[ --print-instances ]"
.B "[ -o ]"
.B "[ -v... | -q ]"


.SH DESCRIPTION
hbal is a cluster balancer that looks at the current state of the
cluster (nodes with their total and free disk, memory, etc.) and
instance placement and computes a series of steps designed to bring
the cluster into a better state.

The algorithm used is designed to be stable (i.e. it will give you the
same results when restarting it from the middle of the solution) and
reasonably fast. It is not, however, designed to be a perfect
algorithm: it is possible to make it go into a corner from which
it can find no improvement, because it looks only one "step" ahead.

By default, the program will show the solution incrementally as it is
computed, in a somewhat cryptic format; to get the actual Ganeti
command list, use the \fB-C\fR option.

.SS ALGORITHM

The program works in independent steps; at each step, we compute the
best instance move that lowers the cluster score.

The possible move types for an instance are combinations of
failover/migrate and replace-disks such that we change one of the
instance nodes, and the other one remains (but possibly with changed
role, e.g. from primary it becomes secondary). The list is:
.RS 4
.TP 3
\(em
failover (f)
.TP
\(em
replace secondary (r)
.TP
\(em
replace primary, a composite move (f, r, f)
.TP
\(em
failover and replace secondary, also composite (f, r)
.TP
\(em
replace secondary and failover, also composite (r, f)
.RE

We don't do the only remaining possibility of replacing both nodes
(r,f,r,f or the equivalent f,r,f,r) since this move needs an
exhaustive search over both candidate primary and secondary nodes, and
is O(n*n) in the number of nodes. Furthermore, it doesn't seem to
give better scores but will result in more disk replacements.
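.PP
As a rough illustration of the move list above, the following Python
sketch computes the resulting primary/secondary pair for each of the
five move types, given one candidate target node. It is not the hbal
implementation; the function name and the tuple representation are
assumptions made for this example only.
.in +4n
.nf
```python
def possible_moves(primary, secondary, target):
    """Map each move type to the resulting (primary, secondary) pair.

    The target is a candidate node that differs from both current nodes.
    """
    if target in (primary, secondary):
        raise ValueError("target must differ from the current nodes")
    return {
        # f: the two current nodes swap roles
        "failover": (secondary, primary),
        # r: the secondary moves to the target
        "replace secondary": (primary, target),
        # f, r, f: the primary moves to the target
        "replace primary": (target, secondary),
        # f, r: old secondary becomes primary, target becomes secondary
        "failover and replace secondary": (secondary, target),
        # r, f: target becomes primary, old primary becomes secondary
        "replace secondary and failover": (target, primary),
    }
```
.fi
.in
.PP
Each candidate target yields five possible placements per instance, so
a single step is linear in the number of nodes for one instance.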

.SS PLACEMENT RESTRICTIONS

At each step, we prevent an instance move if it would cause:

.RS 4
.TP 3
\(em
a node to go into N+1 failure state
.TP
\(em
an instance to move onto an offline node (offline nodes are either
read from the cluster or declared with \fI-O\fR)
.TP
\(em
an exclusion-tag based conflict (exclusion tags are read from the
cluster and/or defined via the \fI--exclusion-tags\fR option)
.TP
\(em
the max vcpu/pcpu ratio to be exceeded (configured via \fI--max-cpu\fR)
.TP
\(em
the min disk free percentage to go below the configured limit (configured
via \fI--min-disk\fR)
.RE

.SS CLUSTER SCORING

As noted above, the algorithm tries to minimise the cluster score at
each step. Currently this score is computed as a sum of the following
components:
.RS 4
.TP 3
\(em
standard deviation of the percent of free memory
.TP
\(em
standard deviation of the percent of reserved memory
.TP
\(em
standard deviation of the percent of free disk
.TP
\(em
count of nodes failing N+1 check
.TP
\(em
count of instances living (either as primary or secondary) on
offline nodes
.TP
\(em
count of instances living (as primary) on offline nodes; this differs
from the above metric by helping failover of such instances in 2-node
clusters
.TP
\(em
standard deviation of the ratio of virtual-to-physical cpus (for
primary instances of the node)
.TP
\(em
standard deviation of the dynamic load on the nodes, for cpus,
memory, disk and network
.RE

The free memory and free disk values help ensure that all nodes are
somewhat balanced in their resource usage. The reserved memory helps
to ensure that nodes are somewhat balanced in holding secondary
instances, and that no node keeps too much memory reserved for
N+1. And finally, the N+1 failure count helps guide the algorithm
towards eliminating N+1 failures, if possible.

Except for the N+1 failures and offline instances counts, we use the
standard deviation since when used with values within a fixed range
(we use percents expressed as values between zero and one) it gives
consistent results across all metrics (there are some small issues
related to different means, but it works generally well). The 'count'
type values will have higher score and thus will matter more for
balancing; thus these are better for hard constraints (like evacuating
nodes and fixing N+1 failures). For example, the offline instances
count (i.e. the number of instances living on offline nodes) will
cause the algorithm to actively move instances away from offline
nodes. This, coupled with the restriction on placement given by
offline nodes, will cause evacuation of such nodes.

The dynamic load values need to be read from an external file (Ganeti
doesn't supply them), and are computed for each node as: sum of
primary instance cpu load, sum of primary instance memory load, sum of
primary and secondary instance disk load (as DRBD generates write load
on secondary nodes too in the normal case, and also read load in
degraded scenarios), and sum of primary instance network load. One
way to generate these values for input to hbal would be to track the
output of "xm list" for the instances over a day, compute the delta
of the cpu values, and feed that via the \fI-U\fR option for all
instances (keeping the other metrics as one). For the algorithm to
work, all that is needed is that the values are consistent for a
metric across all instances (e.g. all instances use cpu% to report cpu
usage, and not something related to number of CPU seconds used if the
CPUs are different), and that they are normalised to between zero and
one. Note that it's recommended not to have zero as the load value for
any instance metric, since then secondary instances are not well
balanced.

On a perfectly balanced cluster (all nodes the same size, all
instances the same size and spread across the nodes equally), the
values for all metrics would be zero. This doesn't happen too often in
practice :)
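.PP
To make the shape of the score more concrete, here is a minimal Python
sketch that sums a reduced set of the components listed above (the
dynamic load term is left out). The dictionary keys, the use of the
population standard deviation and the function name are assumptions
for illustration; the real computation lives in the htools sources and
may differ in detail.
.in +4n
.nf
```python
from statistics import pstdev


def cluster_score(nodes):
    """Sum a reduced set of score components (dynamic load is left out).

    Each node is a dict with hypothetical keys: t_mem, f_mem, r_mem,
    t_dsk, f_dsk, pcpu, vcpu, pri, sec, fail_n1 and offline.
    """
    online = [n for n in nodes if not n["offline"]]
    offline = [n for n in nodes if n["offline"]]
    components = [
        # standard deviations of per-node ratios, online nodes only
        pstdev([n["f_mem"] / n["t_mem"] for n in online]),
        pstdev([n["r_mem"] / n["t_mem"] for n in online]),
        pstdev([n["f_dsk"] / n["t_dsk"] for n in online]),
        pstdev([n["vcpu"] / n["pcpu"] for n in online]),
        # plain counts, which weigh much more than the deviations
        sum(1 for n in online if n["fail_n1"]),
        sum(n["pri"] + n["sec"] for n in offline),
        sum(n["pri"] for n in offline),
    ]
    return float(sum(components))
```
.fi
.in
.PP
With identical online nodes every deviation is zero, so only the
count-type components can contribute to the result.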

.SS OFFLINE INSTANCES

Since current Ganeti versions do not report the memory used by offline
(down) instances, ignoring the run status of instances will cause
wrong calculations. For this reason, the algorithm subtracts the
memory size of down instances from the free node memory of their
primary node, in effect simulating the startup of such instances.

.SS EXCLUSION TAGS

The exclusion tags mechanism is designed to prevent instances which
run the same workload (e.g. two DNS servers) from landing on the same
node, which would make the respective node a SPOF for the given service.

It works by tagging instances with certain tags and then building
exclusion maps based on these. Which tags are actually used is
configured either via the command line (option \fI--exclusion-tags\fR)
or via adding them to the cluster tags:

.TP
.B --exclusion-tags=a,b
This will make all instance tags of the form \fIa:*\fR, \fIb:*\fR be
considered for the exclusion map

.TP
cluster tags \fBhtools:iextags:a\fR, \fBhtools:iextags:b\fR
This will make instance tags \fIa:*\fR, \fIb:*\fR be considered for
the exclusion map. More precisely, the suffix of cluster tags starting
with \fBhtools:iextags:\fR will become the prefix of the exclusion
tags.

.P
Both the above forms mean that two instances both having (e.g.) the
tag \fIa:foo\fR or \fIb:bar\fR won't end up on the same node.
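.PP
The tag matching described above can be sketched in Python as follows.
This is only an illustration of the rule, not the code used by hbal;
the function names are hypothetical.
.in +4n
.nf
```python
CLUSTER_TAG_PREFIX = "htools:iextags:"


def prefixes_from_cluster_tags(cluster_tags):
    """The suffix of each htools:iextags: cluster tag is a tag prefix."""
    return {
        tag[len(CLUSTER_TAG_PREFIX):]
        for tag in cluster_tags
        if tag.startswith(CLUSTER_TAG_PREFIX)
    }


def exclusion_tags(instance_tags, prefixes):
    """Keep only instance tags of the form prefix:something."""
    return {
        tag for tag in instance_tags
        if ":" in tag and tag.split(":", 1)[0] in prefixes
    }


def conflicts(tags_a, tags_b, prefixes):
    """True if two instances share an exclusion tag."""
    shared = exclusion_tags(tags_a, prefixes) & exclusion_tags(tags_b, prefixes)
    return bool(shared)
```
.fi
.in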

.SH OPTIONS
The options that can be passed to the program are as follows:
.TP
.B -C, --print-commands
Print the command list at the end of the run. Without this, the
program will only show a shorter, but cryptic output.

Note that the moves list will be split into independent steps, called
"jobsets", but only for visual inspection, not for actual
parallelisation. It is not possible to parallelise these directly when
executed via "gnt-instance" commands, since a compound command
(e.g. failover and replace\-disks) must be executed serially. Parallel
execution is only possible when using the Luxi backend and the
\fI-L\fR option.

The algorithm for splitting the moves into jobsets is by accumulating
moves until the next move is touching nodes already touched by the
current moves; this means we can't execute in parallel (due to
resource allocation in Ganeti) and thus we start a new jobset.
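.IP
The splitting rule can be sketched in a few lines of Python. The
representation of a move as a pair of an instance name and the set of
nodes it touches is an assumption for this example, as is the function
name.
.in +4n
.nf
```python
def split_jobsets(moves):
    """Group moves into jobsets.

    Each move is a pair (instance, nodes_touched). A new jobset starts
    as soon as a move touches a node used by the current jobset.
    """
    jobsets = []
    current = []
    touched = set()
    for instance, nodes in moves:
        if current and touched & set(nodes):
            jobsets.append(current)
            current, touched = [], set()
        current.append(instance)
        touched |= set(nodes)
    if current:
        jobsets.append(current)
    return jobsets
```
.fi
.in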

.TP
.B -p, --print-nodes
Prints the before and after node status, in a format designed to allow
the user to understand the node's most important parameters.

It is possible to customise the listed information by passing a
comma\(hyseparated list of field names to this option (the field list is
currently undocumented). By default, the node list will contain this
information:
.RS
.TP
.B F
a character denoting the status of the node, with '\-' meaning an
offline node, '*' meaning N+1 failure and blank meaning a good node
.TP
.B Name
the node name
.TP
.B t_mem
the total node memory
.TP
.B n_mem
the memory used by the node itself
.TP
.B i_mem
the memory used by instances
.TP
.B x_mem
amount of memory which seems to be in use but cannot be determined why or
by which instance; usually this means that the hypervisor has some
overhead or that there are other reporting errors
.TP
.B f_mem
the free node memory
.TP
.B r_mem
the reserved node memory, which is the amount of free memory needed
for N+1 compliance
.TP
.B t_dsk
total disk
.TP
.B f_dsk
free disk
.TP
.B pcpu
the number of physical cpus on the node
.TP
.B vcpu
the number of virtual cpus allocated to primary instances
.TP
.B pri
number of primary instances
.TP
.B sec
number of secondary instances
.TP
.B p_fmem
percent of free memory
.TP
.B p_fdsk
percent of free disk
.TP
.B r_cpu
ratio of virtual to physical cpus
.TP
.B lCpu
the dynamic CPU load (if the information is available)
.TP
.B lMem
the dynamic memory load (if the information is available)
.TP
.B lDsk
the dynamic disk load (if the information is available)
.TP
.B lNet
the dynamic net load (if the information is available)
.RE

.TP
.B --print-instances
Prints the before and after instance map. This is less useful than the
node status, but it can help in understanding instance moves.

.TP
.B -o, --oneline
Only shows a one\(hyline output from the program, designed for the case
when one wants to look at multiple clusters at once and check their
status.

The line will contain four fields:
.RS
.RS 4
.TP 3
\(em
initial cluster score
.TP
\(em
number of steps in the solution
.TP
\(em
final cluster score
.TP
\(em
improvement in the cluster score
.RE
.RE

.TP
.BI "-O " name
This option (which can be given multiple times) will mark nodes as
being \fIoffline\fR. This means a couple of things:
.RS
.RS 4
.TP 3
\(em
instances won't be placed on these nodes, not even temporarily;
e.g. the \fIreplace primary\fR move is not available if the secondary
node is offline, since this move requires a failover.
.TP
\(em
these nodes will not be included in the score calculation (except for
the count of instances on offline nodes)
.RE
Note that hbal will also mark as offline any nodes which are reported
by RAPI as such, or that have "?" in file\(hybased input in any numeric
fields.
.RE

.TP
.BI "-e " score ", --min-score=" score
This parameter denotes the minimum score we are happy with and alters
the computation in two ways:
.RS
.RS 4
.TP 3
\(em
if the initial cluster score is lower than this value, then we
don't enter the algorithm at all, and exit with success
.TP
\(em
during the iterative process, if we reach a score lower than this
value, we exit the algorithm
.RE
The default value of the parameter is currently \fI1e-9\fR (chosen
empirically).
.RE

.TP
.B --no-disk-moves
This parameter prevents hbal from using disk move (i.e. "gnt\-instance
replace\-disks") operations. This will result in a much quicker
balancing, but of course the improvements are limited. It is up to the
user to decide when to use one or the other.

.TP
.BI "-U " util-file
This parameter specifies a file holding instance dynamic utilisation
information that will be used to tweak the balancing algorithm to
equalise load on the nodes (as opposed to static resource usage). The
file is in the format "instance_name cpu_util mem_util disk_util
net_util" where the "_util" parameters are interpreted as numbers and
the instance name must match exactly the instance as read from
Ganeti. In case of unknown instance names, the program will abort.

If not given, the default values are one for all metrics and thus
dynamic utilisation has only one effect on the algorithm: the
equalisation of the secondary instances across nodes (this is the only
metric that is not tracked by another, dedicated value, and thus the
disk load of instances will cause secondary instance
equalisation). Note that a value of one will also slightly influence
the primary instance count, but that is already tracked via other
metrics and thus the influence of the dynamic utilisation will be
practically insignificant.
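.IP
A minimal Python sketch of reading such a file is shown below. It
follows the format described above; the function name, the handling of
blank lines and the treatment of instances that are missing from the
file are assumptions, not documented hbal behaviour.
.in +4n
.nf
```python
DEFAULT_UTIL = (1.0, 1.0, 1.0, 1.0)  # cpu, mem, disk, net


def load_utilisation(lines, known_instances):
    """Return {instance name: (cpu, mem, disk, net)} utilisation values.

    lines is an iterable of text lines, or None when no file was given.
    """
    # Without a file every instance keeps the default of one per metric.
    # Assumption: instances missing from the file keep that default too.
    util = {name: DEFAULT_UTIL for name in known_instances}
    for line in lines or []:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        name, values = fields[0], fields[1:]
        if name not in known_instances:
            raise ValueError("unknown instance: " + name)
        if len(values) != 4:
            raise ValueError("expected four values for " + name)
        util[name] = tuple(float(v) for v in values)
    return util
```
.fi
.in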

.TP
.BI "-t " datafile ", --text-data=" datafile
The name of the file holding node and instance information (if not
collecting via RAPI or LUXI). This or one of the other backends must
be selected.

.TP
.BI "-m " cluster
Collect data directly from the
.I cluster
given as an argument via RAPI. If the argument doesn't contain a colon
(:), then it is converted into a fully\(hybuilt URL by prepending
https:// and appending the default RAPI port, otherwise it's
considered a fully\(hyspecified URL and is used as\(hyis.
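.IP
The rule can be written down as a short Python sketch. The port number
used here (5080) is assumed to be the default RAPI port and should be
checked against the installed Ganeti version; the function name is
hypothetical.
.in +4n
.nf
```python
DEFAULT_RAPI_PORT = 5080  # assumption: verify for your installation


def rapi_url(cluster):
    """Expand a bare cluster name into a full RAPI URL."""
    if ":" in cluster:
        # Already a fully specified URL: use it unchanged.
        return cluster
    return "https://%s:%d" % (cluster, DEFAULT_RAPI_PORT)
```
.fi
.in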

.TP
.BI "-L[" path "]"
Collect data directly from the master daemon, which is to be contacted
via luxi (an internal Ganeti protocol). An optional \fIpath\fR
argument is interpreted as the path to the unix socket on which the
master daemon listens; otherwise, the default path used by Ganeti when
installed with \fI--localstatedir=/var\fR is used.

.TP
.B "-X"
When using the Luxi backend, hbal can also execute the given
commands. The execution method is to execute the individual jobsets
(see the \fI-C\fR option for details) in separate stages, aborting if
at any time a jobset doesn't have all jobs successful. Each step in
the balancing solution will be translated into exactly one Ganeti job
(having between one and three OpCodes), and all the steps in a jobset
will be executed in parallel. The jobsets themselves are executed
serially.

.TP
.BI "-l " N ", --max-length=" N
Restrict the solution to this length. This can be used for example to
automate the execution of the balancing.

.TP
.BI "--max-cpu " cpu-ratio
The maximum virtual\(hyto\(hyphysical cpu ratio, as a floating point
number. For example, specifying \fIcpu-ratio\fR as \fB2.5\fR means
that, for a 4\(hycpu machine, a maximum of 10 virtual cpus should be
allowed to be in use for primary instances.

.TP
.BI "--min-disk " disk-ratio
The minimum amount of free disk space remaining, as a floating point
number between zero and one. For example, specifying \fIdisk-ratio\fR
as \fB0.25\fR means that at least one quarter of disk space should be
left free on nodes. A value of one doesn't make sense though, as that
means no disk space can be used on the node.
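.IP
The two limits can be illustrated with a small Python check. The node
dictionary keys mirror the column names of the node listing, but the
function itself and the exact comparison operators are assumptions for
this sketch.
.in +4n
.nf
```python
def violates_limits(node, max_cpu=None, min_disk=None):
    """Check a node against the optional cpu and disk limits.

    node is a dict with the keys pcpu, vcpu, t_dsk and f_dsk.
    """
    if max_cpu is not None and node["vcpu"] / node["pcpu"] > max_cpu:
        return True
    if min_disk is not None and node["f_dsk"] / node["t_dsk"] < min_disk:
        return True
    return False
```
.fi
.in
.IP
With a ratio of 2.5, a 4\(hycpu node holding 10 virtual cpus is still
within the limit, while an eleventh virtual cpu would exceed it.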

.TP
.B -v, --verbose
Increase the output verbosity. Each usage of this option will increase
the verbosity (currently more than 2 doesn't make sense) from the
default of one.

.TP
.B -q, --quiet
Decrease the output verbosity. Each usage of this option will decrease
the verbosity (less than zero doesn't make sense) from the default of
one.

.TP
.B -V, --version
Just show the program version and exit.

.SH EXIT STATUS

The exit status of the command will be zero, unless for some reason
the algorithm fatally failed (e.g. wrong node or instance data).

.SH ENVIRONMENT

If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
present in the environment, they will override the default names for
the nodes and instances files. These will, of course, have no effect
when the RAPI or Luxi backends are used.

.SH BUGS

The program does not check its input data for consistency, and aborts
with cryptic error messages in this case.

The algorithm is not perfect.

The output format is not easily scriptable, and the program should
feed moves directly into Ganeti (either via RAPI or via a gnt\-debug
input file).

.SH EXAMPLE

Note that these examples are not for the latest version (they don't have
full node data).

.SS Default output

With the default options, the program shows each individual step and
the improvements it brings in cluster score:

.in +4n
.nf
.RB "$" " hbal"
Loaded 20 nodes, 80 instances
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
Initial score: 0.52329131
Trying to minimize the CV...
  1. instance14 node1:node10  => node16:node10 0.42109120 a=f r:node16 f
  2. instance54 node4:node15  => node16:node15 0.31904594 a=f r:node16 f
  3. instance4  node5:node2   => node2:node16  0.26611015 a=f r:node16
  4. instance48 node18:node20 => node2:node18  0.21361717 a=r:node2 f
  5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f
  6. instance89 node3:node20  => node2:node3   0.11005629 a=r:node2 f
  7. instance5  node6:node2   => node16:node6  0.05841589 a=r:node16 f
  8. instance94 node7:node20  => node20:node16 0.00658759 a=f r:node16
  9. instance44 node20:node2  => node2:node15  0.00438740 a=f r:node15
 10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16
 11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16
 12. instance19 node10:node11 => node10:node7  0.00336636 a=r:node7
 13. instance43 node12:node13 => node12:node1  0.00305681 a=r:node1
 14. instance1  node1:node2   => node1:node4   0.00263124 a=r:node4
 15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17
Cluster score improved from 0.52329131 to 0.00252594
.fi
.in

In the above output, we can see:
.RS 4
.TP 3
\(em
the input data (here from files) shows a cluster with 20 nodes and
80 instances
.TP
\(em
the cluster is not initially N+1 compliant
.TP
\(em
the initial score is 0.52329131
.RE

The step list follows, showing the instance, its initial
primary/secondary nodes, the new primary/secondary nodes, the cluster
score, and the actions taken in this step (with 'f' denoting
failover/migrate and 'r' denoting replace secondary).

Finally, the program shows the improvement in cluster score.

A more detailed output is obtained via the \fB-C\fR and \fB-p\fR options:

.in +4n
.nf
.RB "$" " hbal \-C \-p"
Loaded 20 nodes, 80 instances
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
Initial cluster status:
N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
 * node1  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node2  32762 31280 12000  1861  1026   0   8 0.95476 0.55179
 * node3  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node4  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node5  32762  1280  6000  1861   978   5   5 0.03907 0.52573
 * node6  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
 * node7  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
 * node10 32762  7280 12000  1861  1026   4   4 0.22221 0.55179
   node11 32762  7280  6000  1861   922   4   5 0.22221 0.49577
   node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node13 32762  7280  6000  1861   922   4   5 0.22221 0.49577
   node14 32762  7280  6000  1861   922   4   5 0.22221 0.49577
 * node15 32762  7280 12000  1861  1131   4   3 0.22221 0.60782
   node16 32762 31280     0  1861  1860   0   0 0.95476 1.00000
   node17 32762  7280  6000  1861  1106   5   3 0.22221 0.59479
 * node18 32762  1280  6000  1396   561   5   3 0.03907 0.40239
 * node19 32762  1280  6000  1861  1026   5   3 0.03907 0.55179
   node20 32762 13280 12000  1861   689   3   9 0.40535 0.37068

Initial score: 0.52329131
Trying to minimize the CV...
  1. instance14 node1:node10  => node16:node10 0.42109120 a=f r:node16 f
  2. instance54 node4:node15  => node16:node15 0.31904594 a=f r:node16 f
  3. instance4  node5:node2   => node2:node16  0.26611015 a=f r:node16
  4. instance48 node18:node20 => node2:node18  0.21361717 a=r:node2 f
  5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f
  6. instance89 node3:node20  => node2:node3   0.11005629 a=r:node2 f
  7. instance5  node6:node2   => node16:node6  0.05841589 a=r:node16 f
  8. instance94 node7:node20  => node20:node16 0.00658759 a=f r:node16
  9. instance44 node20:node2  => node2:node15  0.00438740 a=f r:node15
 10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16
 11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16
 12. instance19 node10:node11 => node10:node7  0.00336636 a=r:node7
 13. instance43 node12:node13 => node12:node1  0.00305681 a=r:node1
 14. instance1  node1:node2   => node1:node4   0.00263124 a=r:node4
 15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17
Cluster score improved from 0.52329131 to 0.00252594

Commands to run to reach the above solution:
echo step 1
echo gnt\-instance migrate instance14
echo gnt\-instance replace\-disks \-n node16 instance14
echo gnt\-instance migrate instance14
echo step 2
echo gnt\-instance migrate instance54
echo gnt\-instance replace\-disks \-n node16 instance54
echo gnt\-instance migrate instance54
echo step 3
echo gnt\-instance migrate instance4
echo gnt\-instance replace\-disks \-n node16 instance4
echo step 4
echo gnt\-instance replace\-disks \-n node2 instance48
echo gnt\-instance migrate instance48
echo step 5
echo gnt\-instance replace\-disks \-n node16 instance93
echo gnt\-instance migrate instance93
echo step 6
echo gnt\-instance replace\-disks \-n node2 instance89
echo gnt\-instance migrate instance89
echo step 7
echo gnt\-instance replace\-disks \-n node16 instance5
echo gnt\-instance migrate instance5
echo step 8
echo gnt\-instance migrate instance94
echo gnt\-instance replace\-disks \-n node16 instance94
echo step 9
echo gnt\-instance migrate instance44
echo gnt\-instance replace\-disks \-n node15 instance44
echo step 10
echo gnt\-instance replace\-disks \-n node16 instance62
echo step 11
echo gnt\-instance replace\-disks \-n node16 instance13
echo step 12
echo gnt\-instance replace\-disks \-n node7 instance19
echo step 13
echo gnt\-instance replace\-disks \-n node1 instance43
echo step 14
echo gnt\-instance replace\-disks \-n node4 instance1
echo step 15
echo gnt\-instance replace\-disks \-n node17 instance58

Final cluster status:
N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
   node1  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node2  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node3  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node4  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node5  32762  7280  6000  1861  1078   4   5 0.22221 0.57947
   node6  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node7  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node10 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node11 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node13 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node14 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
   node15 32762  7280  6000  1861  1031   4   4 0.22221 0.55408
   node16 32762  7280  6000  1861  1060   4   4 0.22221 0.57007
   node17 32762  7280  6000  1861  1006   5   4 0.22221 0.54105
   node18 32762  7280  6000  1396   761   4   2 0.22221 0.54570
   node19 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
   node20 32762 13280  6000  1861  1089   3   5 0.40535 0.58565

.fi
.in

Here we see, besides the step list, the initial and final cluster
status, with the final one showing all nodes being N+1 compliant, and
the command list to reach the final solution. In the initial listing,
we see which nodes are not N+1 compliant.

The algorithm is stable as long as each step above is fully completed,
e.g. in step 8, both the migrate and the replace\-disks are
done. Otherwise, if only the migrate is done, the input data is
changed in a way that the program will output a different solution
list (but hopefully will end in the same state).

.SH SEE ALSO
.BR hspace "(1), " hscan "(1), " hail "(1), "
.BR ganeti "(7), " gnt-instance "(8), " gnt-node "(8)"

.SH "COPYRIGHT"
.PP
Copyright (C) 2009 Google Inc. Permission is granted to copy,
distribute and/or modify under the terms of the GNU General Public
License as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
.PP
On Debian systems, the complete text of the GNU General Public License
can be found in /usr/share/common-licenses/GPL.