HBAL(1) Ganeti | Version @GANETI_VERSION@
=========================================

NAME
----

hbal \- Cluster balancer for Ganeti

SYNOPSIS
--------

**hbal** {backend options...} [algorithm options...] [reporting options...]

**hbal** --version


Backend options:

{ **-m** *cluster* | **-L[** *path* **] [-X]** | **-t** *data-file* }

Algorithm options:

**[ --max-cpu *cpu-ratio* ]**
**[ --min-disk *disk-ratio* ]**
**[ -l *limit* ]**
**[ -e *score* ]**
**[ -g *delta* ]** **[ --min-gain-limit *threshold* ]**
**[ -O *name...* ]**
**[ --no-disk-moves ]**
**[ --no-instance-moves ]**
**[ -U *util-file* ]**
**[ --evac-mode ]**
**[ --select-instances *inst...* ]**
**[ --exclude-instances *inst...* ]**

Reporting options:

**[ -C[ *file* ] ]**
**[ -p[ *fields* ] ]**
**[ --print-instances ]**
**[ -o ]**
**[ -v... | -q ]**


DESCRIPTION
-----------

hbal is a cluster balancer that looks at the current state of the
cluster (nodes with their total and free disk, memory, etc.) and
instance placement and computes a series of steps designed to bring
the cluster into a better state.

The algorithm used is designed to be stable (i.e. it will give you the
same results when restarting it from the middle of the solution) and
reasonably fast. It is not, however, designed to be a perfect
algorithm--it is possible to make it go into a corner from which
it can find no improvement, because it looks only one "step" ahead.

By default, the program will show the solution incrementally as it is
computed, in a somewhat cryptic format; for getting the actual Ganeti
command list, use the **-C** option.

ALGORITHM
~~~~~~~~~

The program works in independent steps; at each step, we compute the
best instance move that lowers the cluster score.

The possible move types for an instance are combinations of
failover/migrate and replace-disks such that we change one of the
instance nodes, and the other one remains (but possibly with changed
role, e.g. from primary it becomes secondary). The list is:

- failover (f)
- replace secondary (r)
- replace primary, a composite move (f, r, f)
- failover and replace secondary, also composite (f, r)
- replace secondary and failover, also composite (r, f)

We don't do the only remaining possibility of replacing both nodes
(r,f,r,f or the equivalent f,r,f,r) since this move needs an
exhaustive search over both candidate primary and secondary nodes, and
is O(n*n) in the number of nodes. Furthermore, it doesn't seem to
give better scores but will result in more disk replacements.

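The five move types listed above can be sketched as sequences of the
two primitive operations. This is only an illustration, not hbal's own
code: the ``apply_moves`` and ``candidate_placements`` names are
hypothetical, and a placement is modelled as a plain
``(primary, secondary)`` pair.

```python
from typing import Dict, Tuple

Placement = Tuple[str, str]  # (primary node, secondary node)


def apply_moves(placement: Placement, moves: str, target: str) -> Placement:
    """Apply a string of 'f' (failover) and 'r' (replace secondary) moves."""
    primary, secondary = placement
    for move in moves:
        if move == "f":
            # failover/migrate swaps the primary and secondary roles
            primary, secondary = secondary, primary
        elif move == "r":
            # replace-disks moves the secondary onto the target node
            secondary = target
        else:
            raise ValueError(f"unknown move {move!r}")
    return primary, secondary


def candidate_placements(placement: Placement, target: str) -> Dict[str, Placement]:
    """Return the placement reached by each of the five move types."""
    return {
        "failover": apply_moves(placement, "f", target),
        "replace secondary": apply_moves(placement, "r", target),
        "replace primary": apply_moves(placement, "frf", target),
        "failover and replace secondary": apply_moves(placement, "fr", target),
        "replace secondary and failover": apply_moves(placement, "rf", target),
    }
```

In every case exactly one of the two original nodes survives, which is
why each move type needs a search over a single candidate node only.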
PLACEMENT RESTRICTIONS
~~~~~~~~~~~~~~~~~~~~~~

At each step, we prevent an instance move if it would cause:

- a node to go into N+1 failure state
- an instance to move onto an offline node (offline nodes are either
  read from the cluster or declared with *-O*)
- an exclusion-tag based conflict (exclusion tags are read from the
  cluster and/or defined via the *--exclusion-tags* option)
- a max vcpu/pcpu ratio to be exceeded (configured via *--max-cpu*)
- min disk free percentage to go below the configured limit
  (configured via *--min-disk*)

CLUSTER SCORING
~~~~~~~~~~~~~~~

As said before, the algorithm tries to minimise the cluster score at
each step. Currently this score is computed as a sum of the following
components:

- standard deviation of the percent of free memory
- standard deviation of the percent of reserved memory
- standard deviation of the percent of free disk
- count of nodes failing N+1 check
- count of instances living (either as primary or secondary) on
  offline nodes
- count of instances living (as primary) on offline nodes; this
  differs from the above metric by helping failover of such instances
  in 2-node clusters
- standard deviation of the ratio of virtual-to-physical cpus (for
  primary instances of the node)
- standard deviation of the dynamic load on the nodes, for cpus,
  memory, disk and network

The free memory and free disk values help ensure that all nodes are
somewhat balanced in their resource usage. The reserved memory helps
to ensure that nodes are somewhat balanced in holding secondary
instances, and that no node keeps too much memory reserved for
N+1. And finally, the count of nodes failing the N+1 check helps guide
the algorithm towards eliminating N+1 failures, if possible.

Except for the N+1 failures and offline instances counts, we use the
standard deviation since when used with values within a fixed range
(we use percents expressed as values between zero and one) it gives
consistent results across all metrics (there are some small issues
related to different means, but it works generally well). The 'count'
type values will have higher score and thus will matter more for
balancing; thus these are better for hard constraints (like evacuating
nodes and fixing N+1 failures). For example, the offline instances
count (i.e. the number of instances living on offline nodes) will
cause the algorithm to actively move instances away from offline
nodes. This, coupled with the restriction on placement given by
offline nodes, will cause evacuation of such nodes.

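The way deviations and counts combine can be sketched as below. This
is a simplified illustration under stated assumptions, not hbal's
actual formula: the dynamic load components are left out, the
population standard deviation is assumed, and the ``cluster_score``
function and its dictionary keys are hypothetical names.

```python
from statistics import pstdev
from typing import Dict, List

# Per-node ratios; the key names are hypothetical. The three percentages
# lie between zero and one, the cpu ratio may exceed one.
METRICS = ("free_mem", "reserved_mem", "free_disk", "cpu_ratio")


def cluster_score(nodes: List[Dict[str, float]],
                  n1_failures: int = 0,
                  offline_instances: int = 0,
                  offline_primaries: int = 0) -> float:
    """Sum the per-metric standard deviations and the raw counts."""
    # Assumption: population standard deviation over all online nodes.
    spread = sum(pstdev([node[m] for node in nodes]) for m in METRICS)
    # Counts are added unscaled, so a single failure outweighs any
    # deviation of values confined to the zero-to-one range.
    return spread + n1_failures + offline_instances + offline_primaries
```

Because a deviation of values between zero and one can never exceed
0.5, one N+1 failure or one instance on an offline node always
dominates the balance terms for those percentages, which matches the
"hard constraint" role described above.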
The dynamic load values need to be read from an external file (Ganeti
doesn't supply them), and are computed for each node as: sum of
primary instance cpu load, sum of primary instance memory load, sum of
primary and secondary instance disk load (as DRBD generates write load
on secondary nodes too in normal case and in degraded scenarios also
read load), and sum of primary instance network load. An example of
how to generate these values for input to hbal would be to track ``xm
list`` for instances over a day, compute the delta of the cpu values,
and feed that via the *-U* option for all instances (keeping the other
metrics as one). For the algorithm to work, all that is
needed is that the values are consistent for a metric across all
instances (e.g. all instances use cpu% to report cpu usage, and not
something related to number of CPU seconds used if the CPUs are
different), and that they are normalised to between zero and one. Note
that it's recommended to not have zero as the load value for any
instance metric since then secondary instances are not well balanced.

On a perfectly balanced cluster (all nodes the same size, all
instances the same size and spread across the nodes equally), the
values for all metrics would be zero. This doesn't happen too often in
practice :)

OFFLINE INSTANCES
~~~~~~~~~~~~~~~~~

Since current Ganeti versions do not report the memory used by offline
(down) instances, ignoring the run status of instances will cause
wrong calculations. For this reason, the algorithm subtracts the
memory size of down instances from the free node memory of their
primary node, in effect simulating the startup of such instances.

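The adjustment described above amounts to a single subtraction per
node. A minimal sketch, assuming instances are given as hypothetical
``(memory_size, is_running)`` pairs for one primary node:

```python
from typing import Iterable, Tuple


def effective_free_memory(node_free_mem: int,
                          primary_instances: Iterable[Tuple[int, bool]]) -> int:
    """Free memory of a node as if all its down primary instances were started."""
    # Only instances that are down are subtracted; running instances are
    # already accounted for in the free memory the node reports.
    down_memory = sum(mem for mem, running in primary_instances if not running)
    return node_free_mem - down_memory
```

For example, a node reporting 8192 free with two down primary
instances of 2048 and 512 is treated as having 5632 free.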
EXCLUSION TAGS
~~~~~~~~~~~~~~

The exclusion tags mechanism is designed to prevent instances that
run the same workload (e.g. two DNS servers) from landing on the same
node, which would make the respective node a single point of failure
(SPOF) for the given service.

It works by tagging instances with certain tags and then building
exclusion maps based on these. Which tags are actually used is
configured either via the command line (option *--exclusion-tags*)
or via adding them to the cluster tags:

--exclusion-tags=a,b
  This will make all instance tags of the form *a:\**, *b:\** be
  considered for the exclusion map

cluster tags *htools:iextags:a*, *htools:iextags:b*
  This will make instance tags *a:\**, *b:\** be considered for the
  exclusion map. More precisely, the suffix of cluster tags starting
  with *htools:iextags:* will become the prefix of the exclusion tags.

Both the above forms mean that two instances both having (e.g.) the
tag *a:foo* or *b:bar* won't end up on the same node.

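The tag matching described above can be sketched as follows. The
function names are hypothetical and the sketch makes no distinction
between primary and secondary instances on a node, which is a
simplifying assumption rather than a statement about hbal's behaviour.

```python
from typing import Iterable, List, Set

CLUSTER_TAG_PREFIX = "htools:iextags:"


def prefixes_from_cluster_tags(cluster_tags: Iterable[str]) -> List[str]:
    """The suffix of each htools:iextags: cluster tag is an exclusion prefix."""
    return [tag[len(CLUSTER_TAG_PREFIX):] for tag in cluster_tags
            if tag.startswith(CLUSTER_TAG_PREFIX)]


def exclusion_tags(instance_tags: Iterable[str], prefixes: Iterable[str]) -> Set[str]:
    """Keep only the instance tags of the form prefix:*."""
    wanted = tuple(prefix + ":" for prefix in prefixes)
    return {tag for tag in instance_tags if tag.startswith(wanted)}


def conflicts(node_instances: Iterable[Iterable[str]],
              candidate: Iterable[str],
              prefixes: Iterable[str]) -> bool:
    """True if the candidate shares an exclusion tag with an instance on the node."""
    prefixes = list(prefixes)
    candidate_tags = exclusion_tags(candidate, prefixes)
    return any(candidate_tags & exclusion_tags(tags, prefixes)
               for tags in node_instances)
```

With prefixes ``a`` and ``b``, an instance tagged *a:foo* conflicts
with a node already holding another *a:foo* instance, while tags
outside the configured prefixes never cause a conflict.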
OPTIONS
-------

The options that can be passed to the program are as follows:

-C, --print-commands
  Print the command list at the end of the run. Without this, the
  program will only show a shorter, but cryptic output.

  Note that the moves list will be split into independent steps,
  called "jobsets", but only for visual inspection, not for actual
  parallelisation. It is not possible to parallelise these directly
  when executed via "gnt-instance" commands, since a compound command
  (e.g. failover and replace-disks) must be executed
  serially. Parallel execution is only possible when using the Luxi
  backend and the *-L* option.

  The algorithm for splitting the moves into jobsets is by
  accumulating moves until the next move is touching nodes already
  touched by the current moves; this means we can't execute in
  parallel (due to resource allocation in Ganeti) and thus we start a
  new jobset.

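The jobset splitting rule described above can be sketched like this. It
is an illustration only: ``split_jobsets`` is a hypothetical name, and
each move is reduced to a label plus the set of nodes it touches.

```python
from typing import Iterable, List, Set, Tuple


def split_jobsets(moves: Iterable[Tuple[str, Set[str]]]) -> List[List[str]]:
    """Group consecutive moves until one touches an already-touched node."""
    jobsets: List[List[str]] = []
    current: List[str] = []
    touched: Set[str] = set()
    for label, nodes in moves:
        if current and touched & nodes:
            # The move shares a node with the current jobset, so it cannot
            # run in parallel with it: close the jobset and start a new one.
            jobsets.append(current)
            current, touched = [], set()
        current.append(label)
        touched |= nodes
    if current:
        jobsets.append(current)
    return jobsets
```

The order of moves is preserved, so a later move that happens to be
independent of an earlier jobset is still not moved back into it.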
-p, --print-nodes
  Prints the before and after node status, in a format designed to
  allow the user to understand the node's most important parameters.

  It is possible to customise the listed information by passing a
  comma-separated list of field names to this option (the field list
  is currently undocumented), or to extend the default field list by
  prefixing the additional field list with a plus sign. By default,
  the node list will contain the following information:

  F
    a character denoting the status of the node, with '-' meaning an
    offline node, '*' meaning N+1 failure and blank meaning a good
    node

  Name
    the node name

  t_mem
    the total node memory

  n_mem
    the memory used by the node itself

  i_mem
    the memory used by instances

  x_mem
    amount of memory which seems to be in use, but for which it cannot
    be determined why or by which instance; usually this means that the
    hypervisor has some overhead or that there are other reporting
    errors

  f_mem
    the free node memory

  r_mem
    the reserved node memory, which is the amount of free memory
    needed for N+1 compliance

  t_dsk
    total disk

  f_dsk
    free disk

  pcpu
    the number of physical cpus on the node

  vcpu
    the number of virtual cpus allocated to primary instances

  pcnt
    number of primary instances

  scnt
    number of secondary instances

  p_fmem
    percent of free memory

  p_fdsk
    percent of free disk

  r_cpu
    ratio of virtual to physical cpus

  lCpu
    the dynamic CPU load (if the information is available)

  lMem
    the dynamic memory load (if the information is available)

  lDsk
    the dynamic disk load (if the information is available)

  lNet
    the dynamic net load (if the information is available)

--print-instances
  Prints the before and after instance map. This is less useful than
  the node status, but it can help in understanding instance moves.

-o, --oneline
  Only shows a one-line output from the program, designed for the case
  when one wants to look at multiple clusters at once and check their
  status.

  The line will contain four fields:

  - initial cluster score
  - number of steps in the solution
  - final cluster score
  - improvement in the cluster score

-O *name*
  This option (which can be given multiple times) will mark nodes as
  being *offline*. This means a couple of things:

  - instances won't be placed on these nodes, not even temporarily;
    e.g. the *replace primary* move is not available if the secondary
    node is offline, since this move requires a failover.
  - these nodes will not be included in the score calculation (except
    for the count of instances on offline nodes)

  Note that the algorithm will also mark as offline any nodes which
  are reported by RAPI as such, or that have "?" in file-based input in
  any numeric fields.

-e *score*, --min-score=*score*
  This parameter denotes the minimum score we are happy with and alters
  the computation in two ways:

  - if the initial cluster score is lower than this value, then we
    don't enter the algorithm at all, and exit with success
  - during the iterative process, if we reach a score lower than this
    value, we exit the algorithm

  The default value of the parameter is currently ``1e-9`` (chosen
  empirically).

-g *delta*, --min-gain=*delta*
  Since the balancing algorithm can sometimes result in just very tiny
  improvements, that bring less gain than they cost in relocation
  time, this parameter (defaulting to 0.01) represents the minimum
  gain we require during a step, to continue balancing.

--min-gain-limit=*threshold*
  The above min-gain option will only take effect if the cluster score
  is already below *threshold* (defaults to 0.1). The rationale behind
  this setting is that at high cluster scores (badly balanced
  clusters), we don't want to abort the rebalance too quickly, as
  later gains might still be significant. However, under the
  threshold, the total remaining gain is at most the threshold value,
  so we can exit early.

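How the three thresholds above interact can be sketched as a single
stop test. This is one reading of the descriptions, not hbal's code:
``should_continue`` is a hypothetical name, and applying the
min-gain-limit test to the score reached after the step (rather than
before it) is an assumption.

```python
def should_continue(prev_score: float,
                    new_score: float,
                    min_score: float = 1e-9,
                    min_gain: float = 0.01,
                    min_gain_limit: float = 0.1) -> bool:
    """Decide whether balancing goes on after a step (defaults as documented)."""
    if new_score < min_score:
        # -e / --min-score: the cluster is already good enough
        return False
    gain = prev_score - new_score
    if new_score < min_gain_limit and gain < min_gain:
        # -g / --min-gain only applies once the score is under
        # --min-gain-limit (assumption: the score after the step is used)
        return False
    return True
```

With the defaults, a step from 0.5 to 0.495 continues even though the
gain is small, since the score is still above 0.1, while a step from
0.05 to 0.045 stops the run.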
--no-disk-moves
  This parameter prevents hbal from using disk move
  (i.e. "gnt-instance replace-disks") operations. This will result in
  a much quicker balancing, but of course the improvements are
  limited. It is up to the user to decide when to use one or another.

--no-instance-moves
  This parameter prevents hbal from using instance moves
  (i.e. "gnt-instance migrate/failover") operations. This will only use
  the slow disk-replacement operations, and will also provide a worse
  balance, but can be useful if moving instances around is deemed unsafe
  or not preferred.

--evac-mode
  This parameter restricts the list of instances considered for moving
  to the ones living on offline/drained nodes. It can be used as a
  (bulk) replacement for Ganeti's own *gnt-node evacuate*, with the
  note that it doesn't guarantee full evacuation.

--select-instances=*instances*
  This parameter marks the given instances (as a comma-separated list)
  as the only ones being moved during the rebalance.

--exclude-instances=*instances*
  This parameter excludes the given instances (as a comma-separated
  list) from being moved during the rebalance.

-U *util-file*
  This parameter specifies a file holding instance dynamic utilisation
  information that will be used to tweak the balancing algorithm to
  equalise load on the nodes (as opposed to static resource
  usage). The file is in the format "instance_name cpu_util mem_util
  disk_util net_util" where the "_util" parameters are interpreted as
  numbers and the instance name must match exactly the instance as
  read from Ganeti. In case of unknown instance names, the program
  will abort.

  If not given, the default values are one for all metrics and thus
  dynamic utilisation has only one effect on the algorithm: the
  equalisation of the secondary instances across nodes (this is the
  only metric that is not tracked by another, dedicated value, and
  thus the disk load of instances will cause secondary instance
  equalisation). Note that a value of one will also slightly influence
  the primary instance count, but that is already tracked via other
  metrics and thus the influence of the dynamic utilisation will be
  practically insignificant.

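Reading a file in the format described above could look like the
following sketch. The function names are hypothetical, whitespace
separation is assumed from the format string, and giving instances
that are missing from the file a value of one is an assumption (the
text only states the default for the case where no file is given).

```python
from typing import Dict, Iterable

Utilisation = Dict[str, float]


def parse_util_line(line: str):
    """Split one 'instance_name cpu_util mem_util disk_util net_util' line."""
    name, cpu, mem, disk, net = line.split()
    return name, {"cpu": float(cpu), "mem": float(mem),
                  "disk": float(disk), "net": float(net)}


def load_utilisation(lines: Iterable[str],
                     known_instances: Iterable[str]) -> Dict[str, Utilisation]:
    """Build a per-instance utilisation map, aborting on unknown names."""
    # Assumption: instances not listed in the file keep the value of one.
    util = {name: {"cpu": 1.0, "mem": 1.0, "disk": 1.0, "net": 1.0}
            for name in known_instances}
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        name, values = parse_util_line(line)
        if name not in util:
            raise ValueError(f"unknown instance {name!r} in utilisation data")
        util[name] = values
    return util
```

A caller would typically pass an open file object as ``lines`` and the
instance names read from the cluster as ``known_instances``.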
-t *datafile*, --text-data=*datafile*
  The name of the file holding node and instance information (if not
  collecting via RAPI or LUXI). This or one of the other backends must
  be selected.

-S *filename*, --save-cluster=*filename*
  If given, the state of the cluster before the balancing is saved to
  the given file plus the extension "original"
  (i.e. *filename*.original), and the state at the end of the
  balancing is saved to the given file plus the extension "balanced"
  (i.e. *filename*.balanced). This allows re-feeding the cluster state
  to either hbal itself or for example hspace.

-m *cluster*
  Collect data directly from the *cluster* given as an argument via
  RAPI. If the argument doesn't contain a colon (:), then it is
  converted into a fully-built URL via prepending ``https://`` and
  appending the default RAPI port, otherwise it's considered a
  fully-specified URL and is used as-is.

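The argument handling described for *-m* can be sketched as below. The
``rapi_url`` name is hypothetical, and the port value of 5080 is an
assumption about the default RAPI port that should be checked against
the installed Ganeti version.

```python
DEFAULT_RAPI_PORT = 5080  # assumption: check against your Ganeti installation


def rapi_url(cluster: str, port: int = DEFAULT_RAPI_PORT) -> str:
    """Turn a bare cluster name into a RAPI URL; leave full URLs untouched."""
    if ":" in cluster:
        # Contains a colon: treated as a fully-specified URL and used as-is
        return cluster
    return f"https://{cluster}:{port}"
```

Note that, under this rule, a bare ``host:port`` pair also counts as
fully specified, so no scheme is added to it.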
-L [*path*]
  Collect data directly from the master daemon, which is to be
  contacted via luxi (an internal Ganeti protocol). An optional
  *path* argument is interpreted as the path to the unix socket on
  which the master daemon listens; otherwise, the default path used by
  Ganeti when installed with *--localstatedir=/var* is used.

-X
  When using the Luxi backend, hbal can also execute the given
  commands. The execution method is to execute the individual jobsets
  (see the *-C* option for details) in separate stages, aborting if at
  any time a jobset doesn't have all jobs successful. Each step in the
  balancing solution will be translated into exactly one Ganeti job
  (having between one and three OpCodes), and all the steps in a
  jobset will be executed in parallel. The jobsets themselves are
  executed serially.

-l *N*, --max-length=*N*
  Restrict the solution to this length. This can be used for example
  to automate the execution of the balancing.

--max-cpu=*cpu-ratio*
  The maximum virtual to physical cpu ratio, as a floating point number
  greater than or equal to one. For example, specifying *cpu-ratio* as
  **2.5** means that, for a 4-cpu machine, a maximum of 10 virtual cpus
  should be allowed to be in use for primary instances. A value of
  exactly one means there will be no over-subscription of CPU (except
  for the CPU time used by the node itself), and values below one do not
  make sense, as that means other resources (e.g. disk) won't be fully
  utilised due to CPU restrictions.

--min-disk=*disk-ratio*
  The minimum amount of free disk space remaining, as a floating point
  number. For example, specifying *disk-ratio* as **0.25** means that
  at least one quarter of disk space should be left free on nodes.

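The two ratio limits above reduce to simple comparisons on the node
state that a move would produce. A minimal sketch, with hypothetical
function and parameter names that loosely follow the *pcpu*, *vcpu*,
*t_dsk* and *f_dsk* fields of the node listing:

```python
import math
from typing import Optional


def max_vcpus(pcpu: int, cpu_ratio: float) -> int:
    """Largest number of virtual cpus allowed for primary instances."""
    return math.floor(pcpu * cpu_ratio)


def within_limits(pcpu: int, vcpu: int, t_dsk: float, f_dsk: float,
                  max_cpu: Optional[float] = None,
                  min_disk: Optional[float] = None) -> bool:
    """Check the state a node would have after a move against both limits."""
    if max_cpu is not None and vcpu / pcpu > max_cpu:
        return False  # vcpu/pcpu ratio would be exceeded
    if min_disk is not None and f_dsk / t_dsk < min_disk:
        return False  # free disk would drop below the configured share
    return True
```

For the example in the text, ``max_vcpus(4, 2.5)`` gives 10, and a node
with 4 physical cpus is rejected once a move would bring it to 11
virtual cpus.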
-G *uuid*, --group=*uuid*
  On a multi-group cluster, select this group for
  processing. Otherwise hbal will abort, since it cannot balance
  multiple groups at the same time.

-v, --verbose
  Increase the output verbosity. Each usage of this option will
  increase the verbosity (currently more than 2 doesn't make sense)
  from the default of one.

-q, --quiet
  Decrease the output verbosity. Each usage of this option will
  decrease the verbosity (less than zero doesn't make sense) from the
  default of one.

-V, --version
  Just show the program version and exit.

EXIT STATUS
-----------

The exit status of the command will be zero, unless for some reason
the algorithm fatally failed (e.g. wrong node or instance data), or
(in case of job execution) any job has failed.

BUGS
----

The program does not check its input data for consistency, and aborts
with cryptic error messages if the data is inconsistent.

The algorithm is not perfect.

The output format is not easily scriptable, and the program should
feed moves directly into Ganeti (either via RAPI or via a gnt-debug
input file).

EXAMPLE
-------

Note that these examples are not for the latest version (they don't
have full node data).

Default output
~~~~~~~~~~~~~~

With the default options, the program shows each individual step and
the improvements it brings in cluster score::

    $ hbal
    Loaded 20 nodes, 80 instances
    Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
    Initial score: 0.52329131
    Trying to minimize the CV...
      1. instance14 node1:node10  => node16:node10 0.42109120 a=f r:node16 f
      2. instance54 node4:node15  => node16:node15 0.31904594 a=f r:node16 f
      3. instance4  node5:node2   => node2:node16  0.26611015 a=f r:node16
      4. instance48 node18:node20 => node2:node18  0.21361717 a=r:node2 f
      5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f
      6. instance89 node3:node20  => node2:node3   0.11005629 a=r:node2 f
      7. instance5  node6:node2   => node16:node6  0.05841589 a=r:node16 f
      8. instance94 node7:node20  => node20:node16 0.00658759 a=f r:node16
      9. instance44 node20:node2  => node2:node15  0.00438740 a=f r:node15
     10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16
     11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16
     12. instance19 node10:node11 => node10:node7  0.00336636 a=r:node7
     13. instance43 node12:node13 => node12:node1  0.00305681 a=r:node1
     14. instance1  node1:node2   => node1:node4   0.00263124 a=r:node4
     15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17
    Cluster score improved from 0.52329131 to 0.00252594

In the above output, we can see:

- the input data (here from files) shows a cluster with 20 nodes and
  80 instances
- the cluster is not initially N+1 compliant
- the initial score is 0.52329131

The step list follows, showing the instance, its initial
primary/secondary nodes, the new primary/secondary nodes, the cluster
score, and the actions taken in this step (with 'f' denoting
failover/migrate and 'r' denoting replace secondary).

Finally, the program shows the improvement in cluster score.

A more detailed output is obtained via the *-C* and *-p* options::

    $ hbal
    Loaded 20 nodes, 80 instances
    Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy.
    Initial cluster status:
    N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
     * node1  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
       node2  32762 31280 12000  1861  1026   0   8 0.95476 0.55179
     * node3  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
     * node4  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
     * node5  32762  1280  6000  1861   978   5   5 0.03907 0.52573
     * node6  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
     * node7  32762  1280  6000  1861  1026   5   3 0.03907 0.55179
       node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
     * node10 32762  7280 12000  1861  1026   4   4 0.22221 0.55179
       node11 32762  7280  6000  1861   922   4   5 0.22221 0.49577
       node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node13 32762  7280  6000  1861   922   4   5 0.22221 0.49577
       node14 32762  7280  6000  1861   922   4   5 0.22221 0.49577
     * node15 32762  7280 12000  1861  1131   4   3 0.22221 0.60782
       node16 32762 31280     0  1861  1860   0   0 0.95476 1.00000
       node17 32762  7280  6000  1861  1106   5   3 0.22221 0.59479
     * node18 32762  1280  6000  1396   561   5   3 0.03907 0.40239
     * node19 32762  1280  6000  1861  1026   5   3 0.03907 0.55179
       node20 32762 13280 12000  1861   689   3   9 0.40535 0.37068

    Initial score: 0.52329131
    Trying to minimize the CV...
      1. instance14 node1:node10  => node16:node10 0.42109120 a=f r:node16 f
      2. instance54 node4:node15  => node16:node15 0.31904594 a=f r:node16 f
      3. instance4  node5:node2   => node2:node16  0.26611015 a=f r:node16
      4. instance48 node18:node20 => node2:node18  0.21361717 a=r:node2 f
      5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f
      6. instance89 node3:node20  => node2:node3   0.11005629 a=r:node2 f
      7. instance5  node6:node2   => node16:node6  0.05841589 a=r:node16 f
      8. instance94 node7:node20  => node20:node16 0.00658759 a=f r:node16
      9. instance44 node20:node2  => node2:node15  0.00438740 a=f r:node15
     10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16
     11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16
     12. instance19 node10:node11 => node10:node7  0.00336636 a=r:node7
     13. instance43 node12:node13 => node12:node1  0.00305681 a=r:node1
     14. instance1  node1:node2   => node1:node4   0.00263124 a=r:node4
     15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17
    Cluster score improved from 0.52329131 to 0.00252594

    Commands to run to reach the above solution:
      echo step 1
      echo gnt-instance migrate instance14
      echo gnt-instance replace-disks -n node16 instance14
      echo gnt-instance migrate instance14
      echo step 2
      echo gnt-instance migrate instance54
      echo gnt-instance replace-disks -n node16 instance54
      echo gnt-instance migrate instance54
      echo step 3
      echo gnt-instance migrate instance4
      echo gnt-instance replace-disks -n node16 instance4
      echo step 4
      echo gnt-instance replace-disks -n node2 instance48
      echo gnt-instance migrate instance48
      echo step 5
      echo gnt-instance replace-disks -n node16 instance93
      echo gnt-instance migrate instance93
      echo step 6
      echo gnt-instance replace-disks -n node2 instance89
      echo gnt-instance migrate instance89
      echo step 7
      echo gnt-instance replace-disks -n node16 instance5
      echo gnt-instance migrate instance5
      echo step 8
      echo gnt-instance migrate instance94
      echo gnt-instance replace-disks -n node16 instance94
      echo step 9
      echo gnt-instance migrate instance44
      echo gnt-instance replace-disks -n node15 instance44
      echo step 10
      echo gnt-instance replace-disks -n node16 instance62
      echo step 11
      echo gnt-instance replace-disks -n node16 instance13
      echo step 12
      echo gnt-instance replace-disks -n node7 instance19
      echo step 13
      echo gnt-instance replace-disks -n node1 instance43
      echo step 14
      echo gnt-instance replace-disks -n node4 instance1
      echo step 15
      echo gnt-instance replace-disks -n node17 instance58

    Final cluster status:
    N1 Name   t_mem f_mem r_mem t_dsk f_dsk pri sec  p_fmem  p_fdsk
       node1  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node2  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node3  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node4  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node5  32762  7280  6000  1861  1078   4   5 0.22221 0.57947
       node6  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node7  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node8  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node9  32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node10 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node11 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
       node12 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node13 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
       node14 32762  7280  6000  1861  1022   4   4 0.22221 0.54951
       node15 32762  7280  6000  1861  1031   4   4 0.22221 0.55408
       node16 32762  7280  6000  1861  1060   4   4 0.22221 0.57007
       node17 32762  7280  6000  1861  1006   5   4 0.22221 0.54105
       node18 32762  7280  6000  1396   761   4   2 0.22221 0.54570
       node19 32762  7280  6000  1861  1026   4   4 0.22221 0.55179
       node20 32762 13280  6000  1861  1089   3   5 0.40535 0.58565

Here we see, besides the step list, the initial and final cluster
status, with the final one showing all nodes being N+1 compliant, and
the command list to reach the final solution. In the initial listing,
we see which nodes are not N+1 compliant.

The algorithm is stable as long as each step above is fully completed,
e.g. in step 8, both the migrate and the replace-disks are
done. Otherwise, if only the migrate is done, the input data is
changed in a way that the program will output a different solution
list (but hopefully will end in the same state).

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: