Revision ab0521f9
/dev/null | ||
---|---|---|
1 |
.TH HAIL 1 2009-03-23 htools "Ganeti H-tools" |
|
2 |
.SH NAME |
|
3 |
hail \- Ganeti IAllocator plugin |
|
4 |
|
|
5 |
.SH SYNOPSIS |
|
6 |
.B hail |
|
7 |
.I "input-file" |
|
8 |
|
|
9 |
.B hail |
|
10 |
.B --version |
|
11 |
|
|
12 |
.SH DESCRIPTION |
|
13 |
hail is a Ganeti IAllocator plugin that allows automatic instance |
|
14 |
placement and automatic instance secondary node replacement using the |
|
15 |
same algorithm as \fBhbal\fR(1). |
|
16 |
|
|
17 |
The program takes input via a JSON\(hyfile containing current cluster |
|
18 |
state and the request details, and outputs (on stdout) a JSON\(hyformatted |
|
19 |
response. In case of critical failures, the error message is printed |
|
20 |
on stderr and the exit code is changed to show failure. |
|
21 |
|
|
22 |
.SS ALGORITHM |
|
23 |
|
|
24 |
The program uses a simplified version of the hbal algorithm. |
|
25 |
|
|
26 |
For relocations, we try to change the secondary node of the instance |
|
27 |
to all the valid other nodes; the node which results in the best |
|
28 |
cluster score is chosen. |
|
29 |
|
|
30 |
For single\(hynode allocations (non\(hymirrored instances), again we |
|
31 |
select the node which, when chosen as the primary node, gives the best |
|
32 |
score. |
|
33 |
|
|
34 |
For dual\(hynode allocations (mirrored instances), we choose the best |
|
35 |
pair; this is the only choice where the algorithm is non\(hytrivial |
|
36 |
with regard to cluster size. |
|
37 |
|
|
38 |
For node evacuations (\fImulti-evacuate\fR mode), we iterate over all |
|
39 |
instances which live as secondaries on those nodes and try to relocate |
|
40 |
them using the single-instance relocation algorithm. |
|
41 |
|
|
42 |
In all cases, the cluster scoring is identical to the hbal algorithm. |
|
43 |
|
|
44 |
.SH CONFIGURATION |
|
45 |
|
|
46 |
For the tag-exclusion configuration (see the manpage of hbal for more |
|
47 |
details), the list of which instance tags to consider as exclusion |
|
48 |
tags will be read from the cluster tags, configured as follows: |
|
49 |
|
|
50 |
- get all cluster tags starting with \fBhtools:iextags:\fR |
|
51 |
|
|
52 |
- use their suffix as the prefix for exclusion tags |
|
53 |
|
|
54 |
For example, given a cluster tag like \fBhtools:iextags:service\fR, |
|
55 |
all instance tags of the form \fBservice:X\fR will be considered as |
|
56 |
exclusion tags, meaning that (e.g.) two instances which both have a |
|
57 |
tag \fBservice:foo\fR will not be placed on the same primary node. |
|
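The tag-exclusion rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the hail implementation: the function names are hypothetical, and tags are assumed to be plain strings.

```python
IEXTAGS_PREFIX = "htools:iextags:"


def exclusion_prefixes(cluster_tags):
    """Derive instance-tag prefixes from the cluster tags.

    The suffix of every cluster tag starting with htools:iextags:
    becomes the prefix (followed by a colon) of the exclusion tags.
    """
    return [tag[len(IEXTAGS_PREFIX):] + ":"
            for tag in cluster_tags
            if tag.startswith(IEXTAGS_PREFIX)]


def exclusion_tags(instance_tags, prefixes):
    """Keep only the instance tags that start with an exclusion prefix."""
    return {tag for tag in instance_tags
            if any(tag.startswith(p) for p in prefixes)}


def conflicts(tags_a, tags_b, prefixes):
    """True if two instances share an exclusion tag.

    Such instances should not be placed on the same primary node.
    """
    return bool(exclusion_tags(tags_a, prefixes)
                & exclusion_tags(tags_b, prefixes))
```

With the cluster tag htools:iextags:service, two instances that both carry service:foo conflict, while service:foo and service:bar do not.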
58 |
|
|
59 |
.SH EXIT STATUS |
|
60 |
|
|
61 |
The exit status of the command will be zero, unless for some reason |
|
62 |
the algorithm fatally failed (e.g. wrong node or instance data). |
|
63 |
|
|
64 |
.SH SEE ALSO |
|
65 |
.BR hbal "(1), " hspace "(1), " hscan "(1), " ganeti "(7), " |
|
66 |
.BR gnt-instance "(8), " gnt-node "(8)" |
|
67 |
|
|
68 |
.SH "COPYRIGHT" |
|
69 |
.PP |
|
70 |
Copyright (C) 2009 Google Inc. Permission is granted to copy, |
|
71 |
distribute and/or modify under the terms of the GNU General Public |
|
72 |
License as published by the Free Software Foundation; either version 2 |
|
73 |
of the License, or (at your option) any later version. |
|
74 |
.PP |
|
75 |
On Debian systems, the complete text of the GNU General Public License |
|
76 |
can be found in /usr/share/common-licenses/GPL. |
/dev/null | ||
---|---|---|
1 |
.TH HBAL 1 2009-03-23 htools "Ganeti H-tools" |
|
2 |
.SH NAME |
|
3 |
hbal \- Cluster balancer for Ganeti |
|
4 |
|
|
5 |
.SH SYNOPSIS |
|
6 |
.B hbal |
|
7 |
.B "[backend options...]" |
|
8 |
.B "[algorithm options...]" |
|
9 |
.B "[reporting options...]" |
|
10 |
|
|
11 |
.B hbal |
|
12 |
.B --version |
|
13 |
|
|
14 |
.TP |
|
15 |
Backend options: |
|
16 |
.BI "[ -m " cluster " ]" |
|
17 |
| |
|
18 |
.BI "[ -L[" path "] [-X]]" |
|
19 |
| |
|
20 |
.BI "[ -t " data-file " ]" |
|
21 |
|
|
22 |
.TP |
|
23 |
Algorithm options: |
|
24 |
.BI "[ --max-cpu " cpu-ratio " ]" |
|
25 |
.BI "[ --min-disk " disk-ratio " ]" |
|
26 |
.BI "[ -l " limit " ]" |
|
27 |
.BI "[ -e " score " ]" |
|
28 |
.BI "[ -g " delta " ] [ --min-gain-limit " threshold " ]" |
|
29 |
.BI "[ -O " name... " ]" |
|
30 |
.B "[ --no-disk-moves ]" |
|
31 |
.BI "[ -U " util-file " ]" |
|
32 |
.B "[ --evac-mode ]" |
|
33 |
.BI "[ --exclude-instances " inst... " ]" |
|
34 |
|
|
35 |
.TP |
|
36 |
Reporting options: |
|
37 |
.BI "[ -C[" file "] ]" |
|
38 |
.BI "[ -p[" fields "] ]" |
|
39 |
.B "[ --print-instances ]" |
|
40 |
.B "[ -o ]" |
|
41 |
.B "[ -v... | -q ]" |
|
42 |
|
|
43 |
|
|
44 |
.SH DESCRIPTION |
|
45 |
hbal is a cluster balancer that looks at the current state of the |
|
46 |
cluster (nodes with their total and free disk, memory, etc.) and |
|
47 |
instance placement and computes a series of steps designed to bring |
|
48 |
the cluster into a better state. |
|
49 |
|
|
50 |
The algorithm used is designed to be stable (i.e. it will give you the |
|
51 |
same results when restarting it from the middle of the solution) and |
|
52 |
reasonably fast. It is not, however, designed to be a perfect |
|
53 |
algorithm \(em it is possible to make it go into a corner from which |
|
54 |
it can find no improvement, because it looks only one "step" ahead. |
|
55 |
|
|
56 |
By default, the program will show the solution incrementally as it is |
|
57 |
computed, in a somewhat cryptic format; for getting the actual Ganeti |
|
58 |
command list, use the \fB-C\fR option. |
|
59 |
|
|
60 |
.SS ALGORITHM |
|
61 |
|
|
62 |
The program works in independent steps; at each step, we compute the |
|
63 |
best instance move that lowers the cluster score. |
|
64 |
|
|
65 |
The possible move types for an instance are combinations of |
|
66 |
failover/migrate and replace-disks such that we change one of the |
|
67 |
instance nodes, and the other one remains (but possibly with changed |
|
68 |
role, e.g. from primary it becomes secondary). The list is: |
|
69 |
.RS 4 |
|
70 |
.TP 3 |
|
71 |
\(em |
|
72 |
failover (f) |
|
73 |
.TP |
|
74 |
\(em |
|
75 |
replace secondary (r) |
|
76 |
.TP |
|
77 |
\(em |
|
78 |
replace primary, a composite move (f, r, f) |
|
79 |
.TP |
|
80 |
\(em |
|
81 |
failover and replace secondary, also composite (f, r) |
|
82 |
.TP |
|
83 |
\(em |
|
84 |
replace secondary and failover, also composite (r, f) |
|
85 |
.RE |
|
86 |
|
|
87 |
We don't do the only remaining possibility of replacing both nodes |
|
88 |
(r,f,r,f or the equivalent f,r,f,r) since this move needs an |
|
89 |
exhaustive search over both candidate primary and secondary nodes, and |
|
90 |
is O(n*n) in the number of nodes. Furthermore, it doesn't seem to |
|
91 |
give better scores but will result in more disk replacements. |
|
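The effect of each move type on an instance's (primary, secondary) node pair can be sketched as follows. This is an illustrative Python model, not hbal's code; the function names and the tuple encoding of actions are assumptions made for the example.

```python
def apply_actions(primary, secondary, actions):
    """Apply a list of actions to a (primary, secondary) node pair.

    "f" swaps the two roles (failover/migrate); ("r", node) replaces
    the secondary node with the given node.
    """
    for action in actions:
        if action == "f":
            primary, secondary = secondary, primary
        else:
            kind, new_node = action
            if kind != "r":
                raise ValueError("unknown action: %r" % (action,))
            secondary = new_node
    return primary, secondary


def candidate_moves(new_node):
    """The five move types considered for one candidate target node."""
    r = ("r", new_node)
    return {
        "failover": ["f"],
        "replace secondary": [r],
        "replace primary": ["f", r, "f"],
        "failover and replace secondary": ["f", r],
        "replace secondary and failover": [r, "f"],
    }
```

In every case one of the two original nodes is kept, which is why a single candidate node is enough and the search stays linear in the number of nodes.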
92 |
|
|
93 |
.SS PLACEMENT RESTRICTIONS |
|
94 |
|
|
95 |
At each step, we prevent an instance move if it would cause: |
|
96 |
|
|
97 |
.RS 4 |
|
98 |
.TP 3 |
|
99 |
\(em |
|
100 |
a node to go into N+1 failure state |
|
101 |
.TP |
|
102 |
\(em |
|
103 |
an instance to move onto an offline node (offline nodes are either |
|
104 |
read from the cluster or declared with \fI-O\fR) |
|
105 |
.TP |
|
106 |
\(em |
|
107 |
an exclusion-tag based conflict (exclusion tags are read from the |
|
108 |
cluster and/or defined via the \fI--exclusion-tags\fR option) |
|
109 |
.TP |
|
110 |
\(em |
|
111 |
a max vcpu/pcpu ratio to be exceeded (configured via \fI--max-cpu\fR) |
|
112 |
.TP |
|
113 |
\(em |
|
114 |
min disk free percentage to go below the configured limit (configured |
|
115 |
via \fI--min-disk\fR) |
|
116 |
|
|
117 |
.SS CLUSTER SCORING |
|
118 |
|
|
119 |
As said before, the algorithm tries to minimise the cluster score at |
|
120 |
each step. Currently this score is computed as a sum of the following |
|
121 |
components: |
|
122 |
.RS 4 |
|
123 |
.TP 3 |
|
124 |
\(em |
|
125 |
standard deviation of the percent of free memory |
|
126 |
.TP |
|
127 |
\(em |
|
128 |
standard deviation of the percent of reserved memory |
|
129 |
.TP |
|
130 |
\(em |
|
131 |
standard deviation of the percent of free disk |
|
132 |
.TP |
|
133 |
\(em |
|
134 |
count of nodes failing N+1 check |
|
135 |
.TP |
|
136 |
\(em |
|
137 |
count of instances living (either as primary or secondary) on |
|
138 |
offline nodes |
|
139 |
.TP |
|
140 |
\(em |
|
141 |
count of instances living (as primary) on offline nodes; this differs |
|
142 |
from the above metric by helping failover of such instances in 2-node |
|
143 |
clusters |
|
144 |
.TP |
|
145 |
\(em |
|
146 |
standard deviation of the ratio of virtual-to-physical cpus (for |
|
147 |
primary instances of the node) |
|
148 |
.TP |
|
149 |
\(em |
|
150 |
standard deviation of the dynamic load on the nodes, for cpus, |
|
151 |
memory, disk and network |
|
152 |
.RE |
|
153 |
|
|
154 |
The free memory and free disk values help ensure that all nodes are |
|
155 |
somewhat balanced in their resource usage. The reserved memory helps |
|
156 |
to ensure that nodes are somewhat balanced in holding secondary |
|
157 |
instances, and that no node keeps too much memory reserved for |
|
158 |
N+1. And finally, the N+1 failure count helps guide the algorithm towards |
|
159 |
eliminating N+1 failures, if possible. |
|
160 |
|
|
161 |
Except for the N+1 failures and offline instances counts, we use the |
|
162 |
standard deviation since when used with values within a fixed range |
|
163 |
(we use percents expressed as values between zero and one) it gives |
|
164 |
consistent results across all metrics (there are some small issues |
|
165 |
related to different means, but it works generally well). The 'count' |
|
166 |
type values will have higher score and thus will matter more for |
|
167 |
balancing; thus these are better for hard constraints (like evacuating |
|
168 |
nodes and fixing N+1 failures). For example, the offline instances |
|
169 |
count (i.e. the number of instances living on offline nodes) will |
|
170 |
cause the algorithm to actively move instances away from offline |
|
171 |
nodes. This, coupled with the restriction on placement given by |
|
172 |
offline nodes, will cause evacuation of such nodes. |
|
173 |
|
|
174 |
The dynamic load values need to be read from an external file (Ganeti |
|
175 |
doesn't supply them), and are computed for each node as: sum of |
|
176 |
primary instance cpu load, sum of primary instance memory load, sum of |
|
177 |
primary and secondary instance disk load (as DRBD generates write load |
|
178 |
on secondary nodes too in normal case and in degraded scenarios also |
|
179 |
read load), and sum of primary instance network load. An example of |
|
180 |
how to generate these values for input to hbal would be to track "xm |
|
181 |
list" for instances over a day, compute the delta of the cpu |
|
182 |
values, and feed that via the \fI-U\fR option for all instances (and |
|
183 |
keep the other metrics as one). For the algorithm to work, all that is |
|
184 |
needed is that the values are consistent for a metric across all |
|
185 |
instances (e.g. all instances use cpu% to report cpu usage, and not |
|
186 |
something related to number of CPU seconds used if the CPUs are |
|
187 |
different), and that they are normalised to between zero and one. Note |
|
188 |
that it's recommended to not have zero as the load value for any |
|
189 |
instance metric since then secondary instances are not well balanced. |
|
190 |
|
|
191 |
On a perfectly balanced cluster (all nodes the same size, all |
|
192 |
instances the same size and spread across the nodes equally), the |
|
193 |
values for all metrics would be zero. This doesn't happen too often in |
|
194 |
practice :) |
|
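A simplified version of this scoring can be sketched in Python. It is a rough model only: the dictionary keys are hypothetical, the population standard deviation is an assumption (the text does not say which variant is used), and the dynamic load components are left out.

```python
from statistics import pstdev


def cluster_score(nodes):
    """Sum of spread metrics and failure counts; lower is better.

    Each node is a dict with hypothetical keys: p_fmem, p_rmem and
    p_fdsk (ratios between zero and one), r_cpu, the booleans n1_fail
    and offline, and the instance counts pri_count and sec_count.
    """
    online = [n for n in nodes if not n["offline"]]
    offline = [n for n in nodes if n["offline"]]
    score = 0.0
    # Spread metrics are computed over online nodes only.
    for key in ("p_fmem", "p_rmem", "p_fdsk", "r_cpu"):
        score += pstdev([n[key] for n in online]) if online else 0.0
    # Count-type metrics weigh more than the standard deviations.
    score += sum(1 for n in online if n["n1_fail"])
    score += sum(n["pri_count"] + n["sec_count"] for n in offline)
    score += sum(n["pri_count"] for n in offline)
    return score
```

Identical nodes give a score of zero, while each N+1 failure or each instance on an offline node adds at least one, which dominates the sub-unit standard deviation terms.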
195 |
|
|
196 |
.SS OFFLINE INSTANCES |
|
197 |
|
|
198 |
Since current Ganeti versions do not report the memory used by offline |
|
199 |
(down) instances, ignoring the run status of instances will cause |
|
200 |
wrong calculations. For this reason, the algorithm subtracts the |
|
201 |
memory size of down instances from the free node memory of their |
|
202 |
primary node, in effect simulating the startup of such instances. |
|
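The adjustment amounts to a simple subtraction, sketched here in Python. The function name and the (memory, running) pair representation are assumptions made for the example.

```python
def effective_free_memory(node_free_mem, primary_instances):
    """Free memory of a node after 'starting' its down primary instances.

    primary_instances is a list of (memory_size, is_running) pairs for
    the instances that have this node as their primary node.
    """
    down_mem = sum(mem for mem, running in primary_instances if not running)
    return node_free_mem - down_mem
```

For example, a node reporting 4096 free with down primary instances of sizes 512 and 256 is treated as having 3328 free.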
203 |
|
|
204 |
.SS EXCLUSION TAGS |
|
205 |
|
|
206 |
The exclusion tags mechanism is designed to prevent instances which |
|
207 |
run the same workload (e.g. two DNS servers) from landing on the same node, |
|
208 |
which would make the respective node a SPOF for the given service. |
|
209 |
|
|
210 |
It works by tagging instances with certain tags and then building |
|
211 |
exclusion maps based on these. Which tags are actually used is |
|
212 |
configured either via the command line (option \fI--exclusion-tags\fR) |
|
213 |
or via adding them to the cluster tags: |
|
214 |
|
|
215 |
.TP |
|
216 |
.B --exclusion-tags=a,b |
|
217 |
This will make all instance tags of the form \fIa:*\fR, \fIb:*\fR be |
|
218 |
considered for the exclusion map |
|
219 |
|
|
220 |
.TP |
|
221 |
cluster tags \fBhtools:iextags:a\fR, \fBhtools:iextags:b\fR |
|
222 |
This will make instance tags \fIa:*\fR, \fIb:*\fR be considered for |
|
223 |
the exclusion map. More precisely, the suffix of cluster tags starting |
|
224 |
with \fBhtools:iextags:\fR will become the prefix of the exclusion |
|
225 |
tags. |
|
226 |
|
|
227 |
.P |
|
228 |
Both the above forms mean that two instances both having (e.g.) the |
|
229 |
tag \fIa:foo\fR or \fIb:bar\fR won't end up on the same node. |
|
230 |
|
|
231 |
.SH OPTIONS |
|
232 |
The options that can be passed to the program are as follows: |
|
233 |
.TP |
|
234 |
.B -C, --print-commands |
|
235 |
Print the command list at the end of the run. Without this, the |
|
236 |
program will only show a shorter, but cryptic output. |
|
237 |
|
|
238 |
Note that the moves list will be split into independent steps, called |
|
239 |
"jobsets", but only for visual inspection, not for actual |
|
240 |
parallelisation. It is not possible to parallelise these directly when |
|
241 |
executed via "gnt-instance" commands, since a compound command |
|
242 |
(e.g. failover and replace\-disks) must be executed serially. Parallel |
|
243 |
execution is only possible when using the Luxi backend and the |
|
244 |
\fI-L\fR option. |
|
245 |
|
|
246 |
The algorithm for splitting the moves into jobsets is by accumulating |
|
247 |
moves until the next move is touching nodes already touched by the |
|
248 |
current moves; this means we can't execute in parallel (due to |
|
249 |
resource allocation in Ganeti) and thus we start a new jobset. |
|
250 |
|
|
251 |
.TP |
|
252 |
.B -p, --print-nodes |
|
253 |
Prints the before and after node status, in a format designed to allow |
|
254 |
the user to understand the node's most important parameters. |
|
255 |
|
|
256 |
It is possible to customise the listed information by passing a |
|
257 |
comma\(hyseparated list of field names to this option (the field list |
|
258 |
is currently undocumented), or to extend the default field list by |
|
259 |
prefixing the additional field list with a plus sign. By default, the |
|
260 |
node list will contain the following information: |
|
261 |
.RS |
|
262 |
.TP |
|
263 |
.B F |
|
264 |
a character denoting the status of the node, with '\-' meaning an |
|
265 |
offline node, '*' meaning N+1 failure and blank meaning a good node |
|
266 |
.TP |
|
267 |
.B Name |
|
268 |
the node name |
|
269 |
.TP |
|
270 |
.B t_mem |
|
271 |
the total node memory |
|
272 |
.TP |
|
273 |
.B n_mem |
|
274 |
the memory used by the node itself |
|
275 |
.TP |
|
276 |
.B i_mem |
|
277 |
the memory used by instances |
|
278 |
.TP |
|
279 |
.B x_mem |
|
280 |
amount of memory which seems to be in use but cannot be determined why or |
|
281 |
by which instance; usually this means that the hypervisor has some |
|
282 |
overhead or that there are other reporting errors |
|
283 |
.TP |
|
284 |
.B f_mem |
|
285 |
the free node memory |
|
286 |
.TP |
|
287 |
.B r_mem |
|
288 |
the reserved node memory, which is the amount of free memory needed |
|
289 |
for N+1 compliance |
|
290 |
.TP |
|
291 |
.B t_dsk |
|
292 |
total disk |
|
293 |
.TP |
|
294 |
.B f_dsk |
|
295 |
free disk |
|
296 |
.TP |
|
297 |
.B pcpu |
|
298 |
the number of physical cpus on the node |
|
299 |
.TP |
|
300 |
.B vcpu |
|
301 |
the number of virtual cpus allocated to primary instances |
|
302 |
.TP |
|
303 |
.B pcnt |
|
304 |
number of primary instances |
|
305 |
.TP |
|
306 |
.B scnt |
|
307 |
number of secondary instances |
|
308 |
.TP |
|
309 |
.B p_fmem |
|
310 |
percent of free memory |
|
311 |
.TP |
|
312 |
.B p_fdsk |
|
313 |
percent of free disk |
|
314 |
.TP |
|
315 |
.B r_cpu |
|
316 |
ratio of virtual to physical cpus |
|
317 |
.TP |
|
318 |
.B lCpu |
|
319 |
the dynamic CPU load (if the information is available) |
|
320 |
.TP |
|
321 |
.B lMem |
|
322 |
the dynamic memory load (if the information is available) |
|
323 |
.TP |
|
324 |
.B lDsk |
|
325 |
the dynamic disk load (if the information is available) |
|
326 |
.TP |
|
327 |
.B lNet |
|
328 |
the dynamic net load (if the information is available) |
|
329 |
.RE |
|
330 |
|
|
331 |
.TP |
|
332 |
.B --print-instances |
|
333 |
Prints the before and after instance map. This is less useful than the |
|
334 |
node status, but it can help in understanding instance moves. |
|
335 |
|
|
336 |
.TP |
|
337 |
.B -o, --oneline |
|
338 |
Only shows a one\(hyline output from the program, designed for the case |
|
339 |
when one wants to look at multiple clusters at once and check their |
|
340 |
status. |
|
341 |
|
|
342 |
The line will contain four fields: |
|
343 |
.RS |
|
344 |
.RS 4 |
|
345 |
.TP 3 |
|
346 |
\(em |
|
347 |
initial cluster score |
|
348 |
.TP |
|
349 |
\(em |
|
350 |
number of steps in the solution |
|
351 |
.TP |
|
352 |
\(em |
|
353 |
final cluster score |
|
354 |
.TP |
|
355 |
\(em |
|
356 |
improvement in the cluster score |
|
357 |
.RE |
|
358 |
.RE |
|
359 |
|
|
360 |
.TP |
|
361 |
.BI "-O " name |
|
362 |
This option (which can be given multiple times) will mark nodes as |
|
363 |
being \fIoffline\fR. This means a couple of things: |
|
364 |
.RS |
|
365 |
.RS 4 |
|
366 |
.TP 3 |
|
367 |
\(em |
|
368 |
instances won't be placed on these nodes, not even temporarily; |
|
369 |
e.g. the \fIreplace primary\fR move is not available if the secondary |
|
370 |
node is offline, since this move requires a failover. |
|
371 |
.TP |
|
372 |
\(em |
|
373 |
these nodes will not be included in the score calculation (except for |
|
374 |
the percentage of instances on offline nodes) |
|
375 |
.RE |
|
376 |
Note that hbal will also mark as offline any nodes which are reported |
|
377 |
by RAPI as such, or that have "?" in file\(hybased input in any numeric |
|
378 |
fields. |
|
379 |
.RE |
|
380 |
|
|
381 |
.TP |
|
382 |
.BI "-e" score ", --min-score=" score |
|
383 |
This parameter denotes the minimum score we are happy with and alters |
|
384 |
the computation in two ways: |
|
385 |
.RS |
|
386 |
.RS 4 |
|
387 |
.TP 3 |
|
388 |
\(em |
|
389 |
if the cluster has the initial score lower than this value, then we |
|
390 |
don't enter the algorithm at all, and exit with success |
|
391 |
.TP |
|
392 |
\(em |
|
393 |
during the iterative process, if we reach a score lower than this |
|
394 |
value, we exit the algorithm |
|
395 |
.RE |
|
396 |
The default value of the parameter is currently \fI1e-9\fR (chosen |
|
397 |
empirically). |
|
398 |
.RE |
|
399 |
|
|
400 |
.TP |
|
401 |
.BI "-g" delta ", --min-gain=" delta |
|
402 |
Since the balancing algorithm can sometimes result in just very tiny |
|
403 |
improvements, that bring less gain than they cost in relocation time, |
|
404 |
this parameter (defaulting to 0.01) represents the minimum gain we |
|
405 |
require during a step, to continue balancing. |
|
406 |
|
|
407 |
.TP |
|
408 |
.BI "--min-gain-limit=" threshold |
|
409 |
The above min-gain option will only take effect if the cluster score |
|
410 |
is already below \fIthreshold\fR (defaults to 0.1). The rationale |
|
411 |
behind this setting is that at high cluster scores (badly balanced |
|
412 |
clusters), we don't want to abort the rebalance too quickly, as later |
|
413 |
gains might still be significant. However, under the threshold, the |
|
414 |
total gain is only the threshold value, so we can exit early. |
|
415 |
|
|
416 |
.TP |
|
417 |
.BI "--no-disk-moves" |
|
418 |
This parameter prevents hbal from using disk move (i.e. "gnt\-instance |
|
419 |
replace\-disks") operations. This will result in a much quicker |
|
420 |
balancing, but of course the improvements are limited. It is up to the |
|
421 |
user to decide when to use one or another. |
|
422 |
|
|
423 |
.TP |
|
424 |
.B "--evac-mode" |
|
425 |
This parameter restricts the list of instances considered for moving |
|
426 |
to the ones living on offline/drained nodes. It can be used as a |
|
427 |
(bulk) replacement for Ganeti's own \fIgnt-node evacuate\fR, with the |
|
428 |
note that it doesn't guarantee full evacuation. |
|
429 |
|
|
430 |
.TP |
|
431 |
.BI "--exclude-instances " instances |
|
432 |
This parameter excludes the given instances (as a comma-separated list) |
|
433 |
from being moved during the rebalance. |
|
434 |
|
|
435 |
.TP |
|
436 |
.BI "-U" util-file |
|
437 |
This parameter specifies a file holding instance dynamic utilisation |
|
438 |
information that will be used to tweak the balancing algorithm to |
|
439 |
equalise load on the nodes (as opposed to static resource usage). The |
|
440 |
file is in the format "instance_name cpu_util mem_util disk_util |
|
441 |
net_util" where the "_util" parameters are interpreted as numbers and |
|
442 |
the instance name must match exactly the instance as read from |
|
443 |
Ganeti. In case of unknown instance names, the program will abort. |
|
444 |
|
|
445 |
If not given, the default values are one for all metrics and thus |
|
446 |
dynamic utilisation has only one effect on the algorithm: the |
|
447 |
equalisation of the secondary instances across nodes (this is the only |
|
448 |
metric that is not tracked by another, dedicated value, and thus the |
|
449 |
disk load of instances will cause secondary instance |
|
450 |
equalisation). Note that a value of one will also influence slightly the |
|
451 |
primary instance count, but that is already tracked via other metrics |
|
452 |
and thus the influence of the dynamic utilisation will be practically |
|
453 |
insignificant. |
|
454 |
|
|
455 |
.TP |
|
456 |
.BI "-t" datafile ", --text-data=" datafile |
|
457 |
The name of the file holding node and instance information (if not |
|
458 |
collecting via RAPI or LUXI). This or one of the other backends must |
|
459 |
be selected. |
|
460 |
|
|
461 |
.TP |
|
462 |
.BI "-S" datafile ", --save-cluster=" datafile |
|
463 |
If given, the state of the cluster at the end of the balancing is |
|
464 |
saved to the given file. This allows re-feeding the cluster state to |
|
465 |
either hbal itself or for example hspace. |
|
466 |
|
|
467 |
.TP |
|
468 |
.BI "-m" cluster |
|
469 |
Collect data directly from the |
|
470 |
.I cluster |
|
471 |
given as an argument via RAPI. If the argument doesn't contain a colon |
|
472 |
(:), then it is converted into a fully\(hybuilt URL via prepending |
|
473 |
https:// and appending the default RAPI port, otherwise it's |
|
474 |
considered a fully\(hyspecified URL and is used as\(hyis. |
|
475 |
|
|
476 |
.TP |
|
477 |
.BI "-L[" path "]" |
|
478 |
Collect data directly from the master daemon, which is to be contacted |
|
479 |
via luxi (an internal Ganeti protocol). An optional \fIpath\fR |
|
480 |
argument is interpreted as the path to the unix socket on which the |
|
481 |
master daemon listens; otherwise, the default path used by ganeti when |
|
482 |
installed with \fI--localstatedir=/var\fR is used. |
|
483 |
|
|
484 |
.TP |
|
485 |
.B "-X" |
|
486 |
When using the Luxi backend, hbal can also execute the given |
|
487 |
commands. The execution method is to execute the individual jobsets |
|
488 |
(see the \fI-C\fR option for details) in separate stages, aborting if |
|
489 |
at any time a jobset doesn't have all jobs successful. Each step in |
|
490 |
the balancing solution will be translated into exactly one Ganeti job |
|
491 |
(having between one and three OpCodes), and all the steps in a jobset |
|
492 |
will be executed in parallel. The jobsets themselves are executed |
|
493 |
serially. |
|
494 |
|
|
495 |
.TP |
|
496 |
.BI "-l" N ", --max-length=" N |
|
497 |
Restrict the solution to this length. This can be used for example to |
|
498 |
automate the execution of the balancing. |
|
499 |
|
|
500 |
.TP |
|
501 |
.BI "--max-cpu " cpu-ratio |
|
502 |
The maximum virtual\(hyto\(hyphysical cpu ratio, as a floating point |
|
503 |
number. For example, specifying \fIcpu-ratio\fR |
|
504 |
as \fB2.5\fR means that, for a 4\(hycpu machine, a maximum of 10 |
|
505 |
virtual cpus should be allowed to be in use for primary instances. A |
|
506 |
value of one doesn't make sense though, as that means no disk space |
|
507 |
can be used on it. |
|
508 |
|
|
509 |
.TP |
|
510 |
.BI "--min-disk " disk-ratio |
|
511 |
The minimum amount of free disk space remaining, as a floating point |
|
512 |
number. For example, specifying \fIdisk-ratio\fR as \fB0.25\fR means |
|
513 |
that at least one quarter of disk space should be left free on nodes. |
|
514 |
|
|
515 |
.TP |
|
516 |
.B -v, --verbose |
|
517 |
Increase the output verbosity. Each usage of this option will increase |
|
518 |
the verbosity (currently more than 2 doesn't make sense) from the |
|
519 |
default of one. |
|
520 |
|
|
521 |
.TP |
|
522 |
.B -q, --quiet |
|
523 |
Decrease the output verbosity. Each usage of this option will decrease |
|
524 |
the verbosity (less than zero doesn't make sense) from the default of |
|
525 |
one. |
|
526 |
|
|
527 |
.TP |
|
528 |
.B -V, --version |
|
529 |
Just show the program version and exit. |
|
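The jobset splitting described under the -C option can be sketched in Python. This is an illustration of the accumulation rule only; the function name and the representation of a move as an (instance, nodes touched) pair are assumptions.

```python
def split_jobsets(moves):
    """Group moves into jobsets.

    Each move is an (instance, nodes_touched) pair. Moves accumulate in
    the current jobset until the next move touches a node already
    touched by the current jobset; that move then starts a new jobset.
    """
    jobsets = []
    current = []
    touched = set()
    for instance, nodes in moves:
        nodes = set(nodes)
        if current and touched & nodes:
            jobsets.append(current)
            current, touched = [], set()
        current.append(instance)
        touched |= nodes
    if current:
        jobsets.append(current)
    return jobsets
```

Moves inside one jobset touch disjoint sets of nodes, which is the property that lets the Luxi backend run them in parallel while the jobsets themselves run serially.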
530 |
|
|
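The interaction between the -e, -g, --min-gain-limit and -l options can be summarised by a small stop-condition sketch in Python. The function is hypothetical and reflects one reading of the option descriptions above (in particular, that the gain check uses the score before the step); the default values are the ones quoted in the text.

```python
def should_continue(prev_score, new_score, steps_done,
                    min_score=1e-9, min_gain=0.01, min_gain_limit=0.1,
                    max_length=None):
    """Decide whether balancing should go on after a step.

    prev_score and new_score are the cluster scores before and after
    the step just computed; steps_done counts the steps so far.
    """
    # -l: the solution is restricted to a maximum length.
    if max_length is not None and steps_done >= max_length:
        return False
    # -e: a score below the minimum score is good enough.
    if new_score < min_score:
        return False
    # -g only takes effect once the score is already below the
    # --min-gain-limit threshold.
    gain = prev_score - new_score
    if prev_score < min_gain_limit and gain < min_gain:
        return False
    return True
```

On a badly balanced cluster (score above the threshold) small gains do not stop the run, whereas below the threshold a step that gains less than the minimum gain ends it.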
531 |
.SH EXIT STATUS |
|
532 |
|
|
533 |
The exit status of the command will be zero, unless for some reason |
|
534 |
the algorithm fatally failed (e.g. wrong node or instance data). |
|
535 |
|
|
536 |
.SH ENVIRONMENT |
|
537 |
|
|
538 |
If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are |
|
539 |
present in the environment, they will override the default names for |
|
540 |
the nodes and instances files. These will have of course no effect |
|
541 |
when the RAPI or Luxi backends are used. |
|
542 |
|
|
543 |
.SH BUGS |
|
544 |
|
|
545 |
The program does not check its input data for consistency, and aborts |
|
546 |
with cryptic error messages in this case. |
|
547 |
|
|
548 |
The algorithm is not perfect. |
|
549 |
|
|
550 |
The output format is not easily scriptable, and the program should |
|
551 |
feed moves directly into Ganeti (either via RAPI or via a gnt\-debug |
|
552 |
input file). |
|
553 |
|
|
554 |
.SH EXAMPLE |
|
555 |
|
|
556 |
Note that these examples are not for the latest version (they don't have |
|
557 |
full node data). |
|
558 |
|
|
559 |
.SS Default output |
|
560 |
|
|
561 |
With the default options, the program shows each individual step and |
|
562 |
the improvements it brings in cluster score: |
|
563 |
|
|
564 |
.in +4n |
|
565 |
.nf |
|
566 |
.RB "$" " hbal" |
|
567 |
Loaded 20 nodes, 80 instances |
|
568 |
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy. |
|
569 |
Initial score: 0.52329131 |
|
570 |
Trying to minimize the CV... |
|
571 |
1. instance14 node1:node10 => node16:node10 0.42109120 a=f r:node16 f |
|
572 |
2. instance54 node4:node15 => node16:node15 0.31904594 a=f r:node16 f |
|
573 |
3. instance4 node5:node2 => node2:node16 0.26611015 a=f r:node16 |
|
574 |
4. instance48 node18:node20 => node2:node18 0.21361717 a=r:node2 f |
|
575 |
5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f |
|
576 |
6. instance89 node3:node20 => node2:node3 0.11005629 a=r:node2 f |
|
577 |
7. instance5 node6:node2 => node16:node6 0.05841589 a=r:node16 f |
|
578 |
8. instance94 node7:node20 => node20:node16 0.00658759 a=f r:node16 |
|
579 |
9. instance44 node20:node2 => node2:node15 0.00438740 a=f r:node15 |
|
580 |
10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16 |
|
581 |
11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16 |
|
582 |
12. instance19 node10:node11 => node10:node7 0.00336636 a=r:node7 |
|
583 |
13. instance43 node12:node13 => node12:node1 0.00305681 a=r:node1 |
|
584 |
14. instance1 node1:node2 => node1:node4 0.00263124 a=r:node4 |
|
585 |
15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17 |
|
586 |
Cluster score improved from 0.52329131 to 0.00252594 |
|
587 |
.fi |
|
588 |
.in |
|
589 |
|
|
590 |
In the above output, we can see: |
|
591 |
- the input data (here from files) shows a cluster with 20 nodes and |
|
592 |
80 instances |
|
593 |
- the cluster is not initially N+1 compliant |
|
594 |
- the initial score is 0.52329131 |
|
595 |
|
|
596 |
The step list follows, showing the instance, its initial |
|
597 |
primary/secondary nodes, the new primary/secondary nodes, the resulting cluster score, |
|
598 |
and the actions taken in this step (with 'f' denoting failover/migrate |
|
599 |
and 'r' denoting replace secondary). |
|
600 |
|
|
601 |
Finally, the program shows the improvement in cluster score. |
|
602 |
|
|
603 |
A more detailed output is obtained via the \fB-C\fR and \fB-p\fR options: |
|
604 |
|
|
605 |
.in +4n |
|
606 |
.nf |
|
607 |
.RB "$" " hbal \-C \-p" |
|
608 |
Loaded 20 nodes, 80 instances |
|
609 |
Cluster is not N+1 happy, continuing but no guarantee that the cluster will end N+1 happy. |
|
610 |
Initial cluster status: |
|
611 |
N1 Name t_mem f_mem r_mem t_dsk f_dsk pri sec p_fmem p_fdsk |
|
612 |
* node1 32762 1280 6000 1861 1026 5 3 0.03907 0.55179 |
|
613 |
node2 32762 31280 12000 1861 1026 0 8 0.95476 0.55179 |
|
614 |
* node3 32762 1280 6000 1861 1026 5 3 0.03907 0.55179 |
|
615 |
* node4 32762 1280 6000 1861 1026 5 3 0.03907 0.55179 |
|
616 |
* node5 32762 1280 6000 1861 978 5 5 0.03907 0.52573 |
|
617 |
* node6 32762 1280 6000 1861 1026 5 3 0.03907 0.55179 |
|
618 |
* node7 32762 1280 6000 1861 1026 5 3 0.03907 0.55179 |
|
619 |
node8 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
620 |
node9 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
621 |
* node10 32762 7280 12000 1861 1026 4 4 0.22221 0.55179 |
|
622 |
node11 32762 7280 6000 1861 922 4 5 0.22221 0.49577 |
|
623 |
node12 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
624 |
node13 32762 7280 6000 1861 922 4 5 0.22221 0.49577 |
|
625 |
node14 32762 7280 6000 1861 922 4 5 0.22221 0.49577 |
|
626 |
* node15 32762 7280 12000 1861 1131 4 3 0.22221 0.60782 |
|
627 |
node16 32762 31280 0 1861 1860 0 0 0.95476 1.00000 |
|
628 |
node17 32762 7280 6000 1861 1106 5 3 0.22221 0.59479 |
|
629 |
* node18 32762 1280 6000 1396 561 5 3 0.03907 0.40239 |
|
630 |
* node19 32762 1280 6000 1861 1026 5 3 0.03907 0.55179 |
|
631 |
node20 32762 13280 12000 1861 689 3 9 0.40535 0.37068 |
|
632 |
|
|
633 |
Initial score: 0.52329131 |
|
634 |
Trying to minimize the CV... |
|
635 |
1. instance14 node1:node10 => node16:node10 0.42109120 a=f r:node16 f |
|
636 |
2. instance54 node4:node15 => node16:node15 0.31904594 a=f r:node16 f |
|
637 |
3. instance4 node5:node2 => node2:node16 0.26611015 a=f r:node16 |
|
638 |
4. instance48 node18:node20 => node2:node18 0.21361717 a=r:node2 f |
|
639 |
5. instance93 node19:node18 => node16:node19 0.16166425 a=r:node16 f |
|
640 |
6. instance89 node3:node20 => node2:node3 0.11005629 a=r:node2 f |
|
641 |
7. instance5 node6:node2 => node16:node6 0.05841589 a=r:node16 f |
|
642 |
8. instance94 node7:node20 => node20:node16 0.00658759 a=f r:node16 |
|
643 |
9. instance44 node20:node2 => node2:node15 0.00438740 a=f r:node15 |
|
644 |
10. instance62 node14:node18 => node14:node16 0.00390087 a=r:node16 |
|
645 |
11. instance13 node11:node14 => node11:node16 0.00361787 a=r:node16 |
|
646 |
12. instance19 node10:node11 => node10:node7 0.00336636 a=r:node7 |
|
647 |
13. instance43 node12:node13 => node12:node1 0.00305681 a=r:node1 |
|
648 |
14. instance1 node1:node2 => node1:node4 0.00263124 a=r:node4 |
|
649 |
15. instance58 node19:node20 => node19:node17 0.00252594 a=r:node17 |
|
650 |
Cluster score improved from 0.52329131 to 0.00252594 |
|
651 |
|
|
652 |
Commands to run to reach the above solution: |
|
653 |
echo step 1 |
|
654 |
echo gnt\-instance migrate instance14 |
|
655 |
echo gnt\-instance replace\-disks \-n node16 instance14 |
|
656 |
echo gnt\-instance migrate instance14 |
|
657 |
echo step 2 |
|
658 |
echo gnt\-instance migrate instance54 |
|
659 |
echo gnt\-instance replace\-disks \-n node16 instance54 |
|
660 |
echo gnt\-instance migrate instance54 |
|
661 |
echo step 3 |
|
662 |
echo gnt\-instance migrate instance4 |
|
663 |
echo gnt\-instance replace\-disks \-n node16 instance4 |
|
664 |
echo step 4 |
|
665 |
echo gnt\-instance replace\-disks \-n node2 instance48 |
|
666 |
echo gnt\-instance migrate instance48 |
|
667 |
echo step 5 |
|
668 |
echo gnt\-instance replace\-disks \-n node16 instance93 |
|
669 |
echo gnt\-instance migrate instance93 |
|
670 |
echo step 6 |
|
671 |
echo gnt\-instance replace\-disks \-n node2 instance89 |
|
672 |
echo gnt\-instance migrate instance89 |
|
673 |
echo step 7 |
|
674 |
echo gnt\-instance replace\-disks \-n node16 instance5 |
|
675 |
echo gnt\-instance migrate instance5 |
|
676 |
echo step 8 |
|
677 |
echo gnt\-instance migrate instance94 |
|
678 |
echo gnt\-instance replace\-disks \-n node16 instance94 |
|
679 |
echo step 9 |
|
680 |
echo gnt\-instance migrate instance44 |
|
681 |
echo gnt\-instance replace\-disks \-n node15 instance44 |
|
682 |
echo step 10 |
|
683 |
echo gnt\-instance replace\-disks \-n node16 instance62 |
|
684 |
echo step 11 |
|
685 |
echo gnt\-instance replace\-disks \-n node16 instance13 |
|
686 |
echo step 12 |
|
687 |
echo gnt\-instance replace\-disks \-n node7 instance19 |
|
688 |
echo step 13 |
|
689 |
echo gnt\-instance replace\-disks \-n node1 instance43 |
|
690 |
echo step 14 |
|
691 |
echo gnt\-instance replace\-disks \-n node4 instance1 |
|
692 |
echo step 15 |
|
693 |
echo gnt\-instance replace\-disks \-n node17 instance58 |
|
694 |
|
|
695 |
Final cluster status: |
|
696 |
N1 Name t_mem f_mem r_mem t_dsk f_dsk pri sec p_fmem p_fdsk |
|
697 |
node1 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
698 |
node2 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
699 |
node3 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
700 |
node4 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
701 |
node5 32762 7280 6000 1861 1078 4 5 0.22221 0.57947 |
|
702 |
node6 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
703 |
node7 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
704 |
node8 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
705 |
node9 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
706 |
node10 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
707 |
node11 32762 7280 6000 1861 1022 4 4 0.22221 0.54951 |
|
708 |
node12 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
709 |
node13 32762 7280 6000 1861 1022 4 4 0.22221 0.54951 |
|
710 |
node14 32762 7280 6000 1861 1022 4 4 0.22221 0.54951 |
|
711 |
node15 32762 7280 6000 1861 1031 4 4 0.22221 0.55408 |
|
712 |
node16 32762 7280 6000 1861 1060 4 4 0.22221 0.57007 |
|
713 |
node17 32762 7280 6000 1861 1006 5 4 0.22221 0.54105 |
|
714 |
node18 32762 7280 6000 1396 761 4 2 0.22221 0.54570 |
|
715 |
node19 32762 7280 6000 1861 1026 4 4 0.22221 0.55179 |
|
716 |
node20 32762 13280 6000 1861 1089 3 5 0.40535 0.58565 |
|
717 |
|
|
718 |
.fi |
|
719 |
.in |
|
720 |
|
|
721 |
Here we see, besides the step list, the initial and final cluster |
|
722 |
status, with the final one showing all nodes being N+1 compliant, and |
|
723 |
the command list to reach the final solution. In the initial listing, |
|
724 |
we see which nodes are not N+1 compliant. |
|
725 |
|
|
726 |
The algorithm is stable as long as each step above is fully completed, |
|
727 |
e.g. in step 8, both the migrate and the replace\-disks are |
|
728 |
done. Otherwise, if only the migrate is done, the input data is |
|
729 |
changed in a way that the program will output a different solution |
|
730 |
list (but hopefully will end in the same state). |
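The relation between the operation column of the step list and the command list can be sketched as follows. This is only a reading of the example above, not a documented grammar: it assumes that "f" maps to a migrate and "r:NODE" maps to a replace-disks onto NODE, and the function name commands_for_move is hypothetical.

```python
def commands_for_move(instance: str, ops: str) -> list[str]:
    """Translate an operation string such as "a=f r:node16 f" into commands.

    Assumption (inferred from the example listing only): "f" becomes a
    migrate, and "r:NODE" becomes a replace-disks onto NODE.
    """
    if ops.startswith("a="):
        ops = ops[2:]  # drop the "a=" marker shown in the step list
    commands = []
    for op in ops.split():
        if op == "f":
            commands.append(f"gnt-instance migrate {instance}")
        elif op.startswith("r:"):
            commands.append(
                f"gnt-instance replace-disks -n {op[2:]} {instance}"
            )
        else:
            raise ValueError(f"unrecognised operation: {op!r}")
    return commands


if __name__ == "__main__":
    # Step 1 of the example: migrate, replace the secondary, migrate back.
    for command in commands_for_move("instance14", "a=f r:node16 f"):
        print(command)
```

All commands returned for one step belong together, which matches the note above that a step must be fully completed for the solution to stay stable.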
|
731 |
|
|
732 |
.SH SEE ALSO |
|
733 |
.BR hspace "(1), " hscan "(1), " hail "(1), " |
|
734 |
.BR ganeti "(7), " gnt-instance "(8), " gnt-node "(8)" |
|
735 |
|
|
736 |
.SH "COPYRIGHT" |
|
737 |
.PP |
|
738 |
Copyright (C) 2009 Google Inc. Permission is granted to copy, |
|
739 |
distribute and/or modify under the terms of the GNU General Public |
|
740 |
License as published by the Free Software Foundation; either version 2 |
|
741 |
of the License, or (at your option) any later version. |
|
742 |
.PP |
|
743 |
On Debian systems, the complete text of the GNU General Public License |
|
744 |
can be found in /usr/share/common-licenses/GPL. |
/dev/null | ||
---|---|---|
1 |
.TH HSCAN 1 2009-03-23 htools "Ganeti H-tools" |
|
2 |
.SH NAME |
|
3 |
hscan \- Scan clusters via RAPI and save node/instance data |
|
4 |
|
|
5 |
.SH SYNOPSIS |
|
6 |
.B hscan |
|
7 |
.B "[-p]" |
|
8 |
.B "[--no-headers]" |
|
9 |
.BI "[-d " path "]" |
|
10 |
.I cluster... |
|
11 |
|
|
12 |
.B hscan |
|
13 |
.B --version |
|
14 |
|
|
15 |
.SH DESCRIPTION |
|
16 |
hscan is a tool for scanning clusters via RAPI and saving their data |
|
17 |
in the input format used by |
|
18 |
.BR hbal "(1) and " hspace "(1)." |
|
19 |
It will also show a one\(hyline score for each cluster scanned or, if |
|
20 |
desired, the cluster state as shown by the \fB-p\fR option to the other |
|
21 |
tools. |
|
22 |
|
|
23 |
For each cluster, one file named \fIcluster\fB.data\fR will be generated |
|
24 |
holding the node and instance data. This file can then be used in |
|
25 |
\fBhbal\fR(1) or \fBhspace\fR(1) via the \fB-t\fR option. In case the |
|
26 |
cluster name contains slashes (as can happen when the cluster is a |
|
27 |
fully-specified URL), these will be replaced with underscores. |
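As a rough illustration of the naming rule above, the following sketch derives the data file name from a cluster argument. The function name data_file_name is hypothetical and not part of hscan; only the slash-to-underscore replacement and the .data suffix come from the text, and the example URL is made up.

```python
def data_file_name(cluster: str) -> str:
    """Return the data file name described above for a cluster argument.

    Assumption: only slashes are rewritten; every other character
    (including colons in a URL) is kept unchanged.
    """
    return cluster.replace("/", "_") + ".data"


if __name__ == "__main__":
    print(data_file_name("cluster1"))
    print(data_file_name("https://cluster2.example.com:5080"))
```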
|
28 |
|
|
29 |
The one\(hyline output for each cluster will show the following: |
|
30 |
.RS |
|
31 |
.TP |
|
32 |
.B Name |
|
33 |
The name of the cluster (or the IP address that was given, etc.) |
|
34 |
.TP |
|
35 |
.B Nodes |
|
36 |
The number of nodes in the cluster |
|
37 |
.TP |
|
38 |
.B Inst |
|
39 |
The number of instances in the cluster |
|
40 |
.TP |
|
41 |
.B BNode |
|
42 |
The number of nodes failing N+1 |
|
43 |
.TP |
|
44 |
.B BInst |
|
45 |
The number of instances living on N+1\(hyfailed nodes |
|
46 |
.TP |
|
47 |
.B t_mem |
|
48 |
Total memory in the cluster |
|
49 |
.TP |
|
50 |
.B f_mem |
|
51 |
Free memory in the cluster |
|
52 |
.TP |
|
53 |
.B t_disk |
|
54 |
Total disk in the cluster |
|
55 |
.TP |
|
56 |
.B f_disk |
|
57 |
Free disk space in the cluster |
|
58 |
.TP |
|
59 |
.B Score |
|
60 |
The score of the cluster, as would be reported by \fBhbal\fR(1) if |
|
61 |
run on the generated data files. |
|
62 |
|
|
63 |
.RE |
|
64 |
|
|
65 |
In case of errors while collecting data, all fields after the name of |
|
66 |
the cluster are replaced with the error display. |
|
67 |
|
|
68 |
.B Note: |
|
69 |
this output format is not yet final so it should not be used for |
|
70 |
scripting yet. |
|
71 |
|
|
72 |
.SH OPTIONS |
|
73 |
The options that can be passed to the program are as follows: |
|
74 |
|
|
75 |
.TP |
|
76 |
.B -p, --print-nodes |
|
77 |
Prints the node status for each cluster after the cluster's one\(hyline |
|
78 |
status display, in a format designed to allow the user to understand |
|
79 |
the node's most important parameters. For details, see the man page |
|
80 |
for \fBhbal\fR(1). |
|
81 |
|
|
82 |
.TP |
|
83 |
.BI "-d " path |
|
84 |
Save the node and instance data for each cluster under \fIpath\fR, |
|
85 |
instead of the current directory. |
|
86 |
|
|
87 |
.TP |
|
88 |
.B -V, --version |
|
89 |
Just show the program version and exit. |
|
90 |
|
|
91 |
.SH EXIT STATUS |
|
92 |
|
|
93 |
The exit status of the command will be zero, unless for some reason |
|
94 |
loading the input data failed fatally (e.g. wrong node or instance |
|
95 |
data). |
|
96 |
|
|
97 |
.SH BUGS |
|
98 |
|
|
99 |
The program does not check its input data for consistency, and aborts |
|
100 |
with cryptic error messages in this case. |
|
101 |
|
|
102 |
.SH EXAMPLE |
|
103 |
|
|
104 |
.in +4n |
|
105 |
.nf |
|
106 |
.RB "$ " "hscan cluster1" |
|
107 |
Name Nodes Inst BNode BInst t_mem f_mem t_disk f_disk Score |
|
108 |
cluster1 2 2 0 0 1008 652 255 253 0.24404762 |
|
109 |
.RB "$ " "ls -l cluster1.data" |
|
110 |
\-rw\-r\-\-r\-\- 1 root root 364 2009\-03\-23 07:26 cluster1.data |
|
111 |
.fi |
|
112 |
.in |
|
113 |
|
|
114 |
.SH SEE ALSO |
|
115 |
.BR hbal "(1), " hspace "(1), " hail "(1), " |
|
116 |
.BR ganeti "(7), " gnt-instance "(8), " gnt-node "(8)" |
|
117 |
|
|
118 |
.SH "COPYRIGHT" |
|
119 |
.PP |
|
120 |
Copyright (C) 2009 Google Inc. Permission is granted to copy, |
|
121 |
distribute and/or modify under the terms of the GNU General Public |
|
122 |
License as published by the Free Software Foundation; either version 2 |
|
123 |
of the License, or (at your option) any later version. |
|
124 |
.PP |
|
125 |
On Debian systems, the complete text of the GNU General Public License |
|
126 |
can be found in /usr/share/common-licenses/GPL. |
/dev/null | ||
---|---|---|
1 |
.TH HSPACE 1 2009-06-01 htools "Ganeti H-tools" |
|
2 |
.SH NAME |
|
3 |
hspace \- Cluster space analyzer for Ganeti |
|
4 |
|
|
5 |
.SH SYNOPSIS |
|
6 |
.B hspace |
|
7 |
.B "[backend options...]" |
|
8 |
.B "[algorithm options...]" |
|
9 |
.B "[request options...]" |
|
10 |
.BI "[ -p[" fields "] ]" |
|
11 |
.B "[-v... | -q]" |
|
12 |
|
|
13 |
.B hspace |
|
14 |
.B --version |
|
15 |
|
|
16 |
.TP |
|
17 |
Backend options: |
|
18 |
.BI " -m " cluster |
|
19 |
| |
|
20 |
.BI " -L[" path "]" |
|
21 |
| |
|
22 |
.BI " -t " data-file |
|
23 |
| |
|
24 |
.BI " --simulate " spec |
|
25 |
|
|
26 |
.TP |
|
27 |
Algorithm options: |
|
28 |
.BI "[ --max-cpu " cpu-ratio " ]" |
|
29 |
.BI "[ --min-disk " disk-ratio " ]" |
|
30 |
.BI "[ -O " name... " ]" |
|
31 |
|
|
32 |
.TP |
|
33 |
Request options: |
|
34 |
.BI "[--memory " mem "]" |
|
35 |
.BI "[--disk " disk "]" |
|
36 |
.BI "[--req-nodes " req-nodes "]" |
|
37 |
.BI "[--vcpus " vcpus "]" |
|
38 |
.BI "[--tiered-alloc " spec "]" |
|
39 |
|
|
40 |
|
|
41 |
.SH DESCRIPTION |
|
42 |
hspace computes how many additional instances can be fit on a cluster, |
|
43 |
while maintaining N+1 status. |
|
44 |
|
|
45 |
The program will try to place instances, all of the same size, on the |
|
46 |
cluster, until no further N+1 compliant allocation is |

47 |
possible. It uses the exact same allocation algorithm as the hail |
|
48 |
iallocator plugin. |
|
49 |
|
|
50 |
The output of the program is designed to be interpreted as a shell |
|
51 |
fragment (or parsed as a \fIkey=value\fR file). Options which extend |
|
52 |
the output (e.g. \-p, \-v) will output the additional information on |
|
53 |
stderr (such that the stdout is still parseable). |
|
54 |
|
|
55 |
The following keys are available in the output of the script (all |
|
56 |
prefixed with \fIHTS_\fR): |
|
57 |
.TP |
|
58 |
.I SPEC_MEM, SPEC_DSK, SPEC_CPU, SPEC_RQN |
|
59 |
These represent the specifications of the instance model used for |
|
60 |
allocation (the memory, disk, cpu, requested nodes). |
|
61 |
|
|
62 |
.TP |
|
63 |
.I CLUSTER_MEM, CLUSTER_DSK, CLUSTER_CPU, CLUSTER_NODES |
|
64 |
These represent the total memory, disk, CPU count and total nodes in |
|
65 |
the cluster. |
|
66 |
|
|
67 |
.TP |
|
68 |
.I INI_SCORE, FIN_SCORE |
|
69 |
These are the initial (current) and final cluster score (see the hbal |
|
70 |
man page for details about the scoring algorithm). |
|
71 |
|
|
72 |
.TP |
|
73 |
.I INI_INST_CNT, FIN_INST_CNT |
|
74 |
The initial and final instance count. |
|
75 |
|
|
76 |
.TP |
|
77 |
.I INI_MEM_FREE, FIN_MEM_FREE |
|
78 |
The initial and final total free memory in the cluster (but this |
|
79 |
doesn't necessarily mean available for use). |
|
80 |
|
|
81 |
.TP |
|
82 |
.I INI_MEM_AVAIL, FIN_MEM_AVAIL |
|
83 |
The initial and final total available memory for allocation in the |
|
84 |
cluster. If allocating redundant instances, new instances could |
|
85 |
increase the reserved memory so it doesn't necessarily mean the |
|
86 |
entirety of this memory can be used for new instance allocations. |
|
87 |
|
|
88 |
.TP |
|
89 |
.I INI_MEM_RESVD, FIN_MEM_RESVD |
|
90 |
The initial and final reserved memory (for redundancy/N+1 purposes). |
|
91 |
|
|
92 |
.TP |
|
93 |
.I INI_MEM_INST, FIN_MEM_INST |
|
94 |
The initial and final memory used for instances (actual runtime used |
|
95 |
RAM). |
|
96 |
|
|
97 |
.TP |
|
98 |
.I INI_MEM_OVERHEAD, FIN_MEM_OVERHEAD |
|
99 |
The initial and final memory overhead \(em memory used for the node |
|
100 |
itself and unaccounted memory (e.g. due to hypervisor overhead). |
|
101 |
|
|
102 |
.TP |
|
103 |
.I INI_MEM_EFF, FIN_MEM_EFF |
|
104 |
The initial and final memory efficiency, represented as instance |
|
105 |
memory divided by total memory. |
|
106 |
|
|
107 |
.TP |
|
108 |
.I INI_DSK_FREE, INI_DSK_AVAIL, INI_DSK_RESVD, INI_DSK_INST, INI_DSK_EFF |
|
109 |
Initial disk stats, similar to the memory ones. |
|
110 |
|
|
111 |
.TP |
|
112 |
.I FIN_DSK_FREE, FIN_DSK_AVAIL, FIN_DSK_RESVD, FIN_DSK_INST, FIN_DSK_EFF |
|
113 |
Final disk stats, similar to the memory ones. |
|
114 |
|
|
115 |
.TP |
|
116 |
.I INI_CPU_INST, FIN_CPU_INST |
|
117 |
Initial and final number of virtual CPUs used by instances. |
|
118 |
|
|
119 |
.TP |
|
120 |
.I INI_CPU_EFF, FIN_CPU_EFF |
|
121 |
The initial and final CPU efficiency, represented as the count of |
|
122 |
virtual instance CPUs divided by the total physical CPU count. |
|
123 |
|
|
124 |
.TP |
|
125 |
.I INI_MNODE_MEM_AVAIL, FIN_MNODE_MEM_AVAIL |
|
126 |
The initial and final maximum per\(hynode available memory. This is not |
|
127 |
very useful as a metric but can give an impression of the status of |
|
128 |
the nodes; as an example, this value restricts the maximum instance |
|
129 |
size that can be still created on the cluster. |
|
130 |
|
|
131 |
.TP |
|
132 |
.I INI_MNODE_DSK_AVAIL, FIN_MNODE_DSK_AVAIL |
|
133 |
Like the above but for disk. |
|
134 |
|
|
135 |
.TP |
|
136 |
.I TSPEC |
|
137 |
If the tiered allocation mode has been enabled, this parameter holds |
|
138 |
the pairs of specifications and counts of instances that can be |
|
139 |
created in this mode. The value of the key is a space\(hyseparated list |
|
140 |
of values; each value is of the form \fImemory,disk,vcpu=count\fR |
|
141 |
where the memory, disk and vcpu are the values for the current spec, |
|
142 |
and count is how many instances of this spec can be created. A |
|
143 |
complete value for this variable could be: \fB4096,102400,2=225 |
|
144 |
2560,102400,2=20 512,102400,2=21\fR. |
|
145 |
|
|
146 |
.TP |
|
147 |
.I KM_USED_CPU, KM_USED_NPU, KM_USED_MEM, KM_USED_DSK |
|
148 |
These represent the metrics of used resources at the start of the |
|
149 |
computation (only for tiered allocation mode). The NPU value is |
|
150 |
"normalized" CPU count, i.e. the number of virtual CPUs divided by the |
|
151 |
maximum ratio of the virtual to physical CPUs. |
|
152 |
|
|
153 |
.TP |
|
154 |
.I KM_POOL_CPU, KM_POOL_NPU, KM_POOL_MEM, KM_POOL_DSK |
|
155 |
These represent the total resources allocated during the tiered |
|
156 |
allocation process. In effect, they represent how much is readily |
|
157 |
available for allocation. |
|
158 |
|
|
159 |
.TP |
|
160 |
.I KM_UNAV_CPU, KM_UNAV_NPU, KM_UNAV_MEM, KM_UNAV_DSK |
|
161 |
These represent the resources left over (either free as in |
|
162 |
unallocable or allocable on their own) after the tiered allocation has |
|
163 |
been completed. They better represent the actual unallocable |
|
164 |
resources, because some other resource has been exhausted. For |
|
165 |
example, the cluster might still have 100GiB disk free, but with no |
|
166 |
memory left for instances, we cannot allocate another instance, so in |
|
167 |
effect the disk space is unallocable. Note that the CPUs here |
|
168 |
represent instance virtual CPUs, and in case the \fI--max-cpu\fR |
|
169 |
option hasn't been specified this will be \-1. |
|
170 |
|
|
171 |
.TP |
|
172 |
.I ALLOC_USAGE |
|
173 |
The current usage, represented as the initial number of instances divided |

174 |
by the final number of instances. |
|
175 |
|
|
176 |
.TP |
|
177 |
.I ALLOC_COUNT |
|
178 |
The number of instances allocated (delta between FIN_INST_CNT and |
|
179 |
INI_INST_CNT). |
|
180 |
|
|
181 |
.TP |
|
182 |
.I ALLOC_FAIL*_CNT |
|
183 |
For the last attempt at allocations (which would have increased |

184 |
FIN_INST_CNT by one, if it had succeeded), this is the count of the |
|
185 |
failure reasons per failure type; currently defined are FAILMEM, |
|
186 |
FAILDISK and FAILCPU which represent errors due to not enough memory, |
|
187 |
disk and CPUs, and FAILN1 which represents a non N+1 compliant cluster |
|
188 |
on which we can't allocate instances at all. |
|
189 |
|
|
190 |
.TP |
|
191 |
.I ALLOC_FAIL_REASON |
|
192 |
The reason for most of the failures, being one of the above FAIL* |
|
193 |
strings. |
|
194 |
|
|
195 |
.TP |
|
196 |
.I OK |
|
197 |
A marker representing the successful end of the computation, and |
|
198 |
having value "1". If this key is not present in the output it means |
|
199 |
that the computation failed and any values present should not be |
|
200 |
relied upon. |
|
201 |
|
|
202 |
.PP |
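Because the output is a list of key=value lines, a consumer can be sketched as below. This is a minimal sketch, not part of hspace: the function names are hypothetical, the sample text is invented for illustration, and stripping shell quotes from values is an assumption about how multi-word values are emitted.

```python
def parse_hspace_output(text: str) -> dict[str, str]:
    """Collect HTS_-prefixed KEY=value lines into a dict (prefix removed)."""
    values = {}
    for raw_line in text.splitlines():
        key, sep, value = raw_line.strip().partition("=")
        if not sep or not key.startswith("HTS_"):
            continue  # skip blank lines and anything that is not an HTS_ key
        # Assumption: values may be wrapped in shell quotes; strip them.
        values[key[len("HTS_"):]] = value.strip("'\"")
    return values


def computation_succeeded(values: dict[str, str]) -> bool:
    """The OK key with value "1" marks a successful end of the computation."""
    return values.get("OK") == "1"


# Hypothetical sample output, invented for illustration only.
sample = """\
HTS_INI_INST_CNT=10
HTS_FIN_INST_CNT=25
HTS_ALLOC_COUNT=15
HTS_OK=1
"""

if __name__ == "__main__":
    parsed = parse_hspace_output(sample)
    if computation_succeeded(parsed):
        print(parsed["ALLOC_COUNT"])
    else:
        print("computation failed; values should not be relied upon")
```

Checking the OK marker first follows the rule above that no other value should be trusted when it is missing.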
|
203 |
|
|
204 |
If the tiered allocation mode is enabled, then many of the INI_/FIN_ |
|
205 |
metrics will be also displayed with a TRL_ prefix, and denote the |
|
206 |
cluster status at the end of the tiered allocation run. |
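The TSPEC value described above can be split into its specifications with a few lines of code. The sketch below assumes exactly the memory,disk,vcpu=count form given in the key description; the TierSpec type and the parse_tspec function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class TierSpec(NamedTuple):
    memory: int  # memory of the spec, as printed by hspace
    disk: int    # disk of the spec, as printed by hspace
    vcpu: int    # virtual CPU count of the spec
    count: int   # how many instances of this spec can be created


def parse_tspec(value: str) -> list[TierSpec]:
    """Split a space-separated TSPEC value into one TierSpec per entry."""
    specs = []
    for item in value.split():
        spec, sep, count = item.partition("=")
        if not sep:
            raise ValueError(f"missing '=count' in {item!r}")
        memory, disk, vcpu = (int(part) for part in spec.split(","))
        specs.append(TierSpec(memory, disk, vcpu, int(count)))
    return specs


if __name__ == "__main__":
    # The example value from the TSPEC description.
    tiers = parse_tspec("4096,102400,2=225 2560,102400,2=20 512,102400,2=21")
    for tier in tiers:
        print(tier)
    print(sum(tier.count for tier in tiers))
```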
|
207 |
|
|
208 |
.SH OPTIONS |
|
209 |
The options that can be passed to the program are as follows: |
|
210 |
|
|
211 |
.TP |
|
212 |
.BI "--memory " mem |
|
213 |
The memory size of the instances to be placed (defaults to 4GiB). |
|
214 |
|
|
215 |
.TP |
|
216 |
.BI "--disk " disk |
|
217 |
The disk size of the instances to be placed (defaults to 100GiB). |
|
218 |
|
|
219 |
.TP |
|
220 |
.BI "--req-nodes " num-nodes |
|
221 |
The number of nodes for the instances; the default of two means |
|
222 |
mirrored instances, while passing one means plain type instances. |
|
223 |
|
|
224 |
.TP |
|
225 |
.BI "--vcpus " vcpus |
|
226 |
The number of VCPUs of the instances to be placed (defaults to 1). |
|
227 |
|
|
228 |
.TP |
|
229 |
.BI "--max-cpu " cpu-ratio |
|
230 |
The maximum virtual\(hyto\(hyphysical cpu ratio, as a floating point |

231 |
number. For example, specifying \fIcpu-ratio\fR as \fB2.5\fR means |

232 |
that, for a 4\(hycpu machine, a maximum of 10 virtual cpus should be |

233 |
allowed to be in use for primary instances. |

234 |
|

235 |
.TP |

236 |
.BI "--min-disk " disk-ratio |

237 |
The minimum amount of free disk space remaining, as a floating point |

238 |
number between zero and one. For example, specifying |

239 |
\fIdisk-ratio\fR as \fB0.25\fR means that at least one quarter of disk |

240 |
space should be left free on nodes. A value of one doesn't make sense |

241 |
though, as that means no disk space can be used on the node. |
|
242 |
|
|
243 |
.TP |
|
244 |
.B -p, --print-nodes |
|
245 |
Prints the before and after node status, in a format designed to allow |
|
246 |
the user to understand the node's most important parameters. |
|
247 |
|
|
248 |
It is possible to customise the listed information by passing a |
|
249 |
comma\(hyseparated list of field names to this option (the field list |
|
250 |
is currently undocumented), or to extend the default field list by |
|
251 |
prefixing the additional field list with a plus sign. By default, the |
|
252 |
node list will contain the following information: |
|
253 |
.RS |
|
254 |
.TP |
|
255 |
.B F |
|
256 |
a character denoting the status of the node, with '\-' meaning an |
|
257 |
offline node, '*' meaning N+1 failure and blank meaning a good node |
|
258 |
.TP |
|
259 |
.B Name |
|
260 |
the node name |
|
261 |
.TP |
|
262 |
.B t_mem |
|
263 |
the total node memory |
|
264 |
.TP |
|
265 |
.B n_mem |
|
266 |
the memory used by the node itself |
|
267 |
.TP |
|
268 |
.B i_mem |
|
269 |
the memory used by instances |
|
270 |
.TP |
|
271 |
.B x_mem |
|
272 |
amount of memory which seems to be in use but for which it cannot be |

273 |
determined why or by which instance; usually this means that the |

274 |
hypervisor has some overhead or that there are other reporting errors |
|
275 |
.TP |
|
276 |
.B f_mem |
|
277 |
the free node memory |
|
278 |
.TP |
|
279 |
.B r_mem |
|
280 |
the reserved node memory, which is the amount of free memory needed |
|
281 |
for N+1 compliance |
|
282 |
.TP |
|
283 |
.B t_dsk |
|
284 |
total disk |
|
285 |
.TP |
|
286 |
.B f_dsk |
|
287 |
free disk |
|
288 |
.TP |
|
289 |
.B pcpu |
|
290 |
the number of physical cpus on the node |
|
291 |
.TP |
|
292 |
.B vcpu |
|
293 |
the number of virtual cpus allocated to primary instances |
|
294 |
.TP |
|
295 |
.B pcnt |
|
296 |
number of primary instances |
|
297 |
.TP |
|
298 |
.B scnt |
|
299 |
number of secondary instances |
|
300 |
.TP |
|
301 |
.B p_fmem |
|
302 |
percent of free memory |
|
303 |
.TP |
|
304 |
.B p_fdsk |
|
305 |
percent of free disk |
|
306 |
.TP |
|
307 |
.B r_cpu |
|
308 |
ratio of virtual to physical cpus |
|
309 |
.TP |
|
310 |
.B lCpu |
|
311 |
the dynamic CPU load (if the information is available) |
|
312 |
.TP |
|
313 |
.B lMem |
|
314 |
the dynamic memory load (if the information is available) |
|
315 |
.TP |
|
316 |
.B lDsk |
|
317 |
the dynamic disk load (if the information is available) |
|
318 |
.TP |
|
319 |
.B lNet |
|
320 |
the dynamic net load (if the information is available) |
|
321 |
.RE |
|
322 |
|
|
323 |
.TP |
|
324 |
.BI "-O " name |
|
325 |
This option (which can be given multiple times) will mark nodes as |
|
326 |
being \fIoffline\fR, and instances won't be placed on these nodes. |
|
327 |
|
|
328 |
Note that hspace will also mark as offline any nodes which are |
|
329 |
reported by RAPI as such, or that have "?" in file\(hybased input in any |
|
330 |
numeric fields. |
|
331 |
.RE |
|
332 |
|
|
333 |
.TP |
|
334 |
.BI "-t" datafile ", --text-data=" datafile |
|
335 |
The name of the file holding node and instance information (if not |
|
336 |
collecting via RAPI or LUXI). This or one of the other backends must |
|
337 |
be selected. |
|
338 |
|
|
339 |
.TP |
|
340 |
.BI "-S" filename ", --save-cluster=" filename |
|
341 |
If given, the state of the cluster at the end of the allocation is |
|
342 |
saved to a file named \fIfilename.alloc\fR, and if tiered allocation |
|
343 |
is enabled, the state after tiered allocation will be saved to |
|
344 |
\fIfilename.tiered\fR. This allows re-feeding the cluster state to |
|
345 |
either hspace itself (with different parameters) or for example hbal. |
|
346 |
|
|
347 |
.TP |
|
348 |
.BI "-m" cluster |
|
349 |
Collect data directly from the |
|
350 |
.I cluster |
|
351 |
given as an argument via RAPI. If the argument doesn't contain a colon |
|
352 |
(:), then it is converted into a fully\(hybuilt URL via prepending |
|
353 |
https:// and appending the default RAPI port, otherwise it's |
|
354 |
considered a fully\(hyspecified URL and is used as\(hyis. |
|
355 |
|
|
356 |
.TP |
|
357 |
.BI "-L[" path "]" |
|
358 |
Collect data directly from the master daemon, which is to be contacted |
|
359 |
via luxi (an internal Ganeti protocol). An optional \fIpath\fR |
|
360 |
argument is interpreted as the path to the unix socket on which the |
|
361 |
master daemon listens; otherwise, the default path used by ganeti when |
|
362 |
installed with \fI--localstatedir=/var\fR is used. |
|
363 |
|
|
364 |
.TP |
|
365 |
.BI "--simulate " description |
|
366 |
Instead of using actual data, build an empty cluster given a node |
|
367 |
description. The \fIdescription\fR parameter must be a |
|
368 |
comma\(hyseparated list of four elements, describing in order: |
|
369 |
|
|
370 |
.RS |
|
371 |
|
|
372 |
.RS |
|
373 |
.TP |
|
374 |
the number of nodes in the cluster |
|
375 |
|
|
376 |
.TP |
|
377 |
the disk size of the nodes, in mebibytes |
|
378 |
|
|
379 |
.TP |
|
380 |
the memory size of the nodes, in mebibytes |
|
381 |
|
|
382 |
.TP |
|
383 |
the cpu core count for the nodes |
|
384 |
|
|
385 |
.RE |
|
386 |
|
|
387 |
An example description would be \fB20,102400,16384,4\fR describing a |
|
388 |
20\(hynode cluster where each node has 100GiB of disk space, 16GiB of |
|
389 |
memory and 4 CPU cores. Note that all nodes must have the same specs |
|
390 |
currently. |
|
391 |
|
|
392 |
.RE |
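The four-element description can be unpacked as in the following sketch. It only mirrors the field order listed above (node count, disk in mebibytes, memory in mebibytes, CPU cores); the SimulatedCluster type and the parse_simulate function are hypothetical names and not part of hspace.

```python
from typing import NamedTuple


class SimulatedCluster(NamedTuple):
    nodes: int       # number of nodes in the cluster
    disk_mib: int    # disk size of each node, in mebibytes
    memory_mib: int  # memory size of each node, in mebibytes
    cpus: int        # cpu core count of each node


def parse_simulate(description: str) -> SimulatedCluster:
    """Parse a --simulate description such as "20,102400,16384,4"."""
    parts = description.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected four elements, got {len(parts)}")
    return SimulatedCluster(*(int(part) for part in parts))


if __name__ == "__main__":
    cluster = parse_simulate("20,102400,16384,4")
    # Convert mebibytes to gibibytes for display.
    print(cluster.nodes, cluster.disk_mib // 1024, cluster.memory_mib // 1024,
          cluster.cpus)
```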
|
393 |
|
|
394 |
.TP |
|
395 |
.BI "--tiered-alloc " spec |
|
396 |
Besides the standard, fixed\(hysize allocation, also do a tiered |
|
397 |
allocation scheme where the algorithm starts from the given |
|
398 |
specification and allocates until there is no more space; then it |
|
399 |
decreases the specification and tries the allocation again. The |
|
400 |
decrease is done on the metric that last failed during allocation. The |
|
401 |
specification given is similar to the \fI--simulate\fR option and it |
|
402 |
holds: |
|
403 |
|
|
404 |
.RS |
|
405 |
|
|
406 |
.RS |
|
407 |
|
|
408 |
.TP |
|
409 |
the disk size of the instance |
|
410 |
|
|
411 |
.TP |
|
412 |
the memory size of the instance |
|
413 |
|
|
414 |
.TP |
|
415 |
the vcpu count for the instance |
|
416 |
|
|
417 |
.RE |
|
418 |
|
|
419 |
An example description would be \fB10240,8192,2\fR describing a |

420 |
starting specification of 10GiB of disk space, 8GiB of memory |

421 |
and 2 VCPUs. |
|
422 |
|
|
423 |
Also note that the normal allocation and the tiered allocation are |
|
424 |
independent, and both start from the initial cluster state; as such, |
|
425 |
the instance counts for these two modes are not related to one another. |
|
426 |
|
|
427 |
.RE |
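The tiered scheme described above (allocate until something fails, shrink the spec on the failing metric, try again) can be illustrated with a toy model. This is a deliberately simplified sketch and not the hspace implementation: it treats the cluster as one pool of resources instead of individual nodes, ignores N+1 checks, and the per-metric decrease steps and the order in which metrics are checked are assumptions made for the example.

```python
def tiered_alloc(free, spec, steps):
    """Toy tiered allocation over a single resource pool.

    free, spec and steps are dicts keyed by "disk", "mem" and "cpu".
    Returns a list of ((disk, mem, cpu), count) pairs, one per tier
    that managed to allocate at least one instance.
    """
    if any(step <= 0 for step in steps.values()):
        raise ValueError("decrease steps must be positive")
    free, spec = dict(free), dict(spec)  # work on copies
    result = []
    while all(value > 0 for value in spec.values()):
        count = 0
        while True:
            # First metric that no longer fits (check order is an assumption).
            failed = next(
                (m for m in ("mem", "disk", "cpu") if free[m] < spec[m]), None
            )
            if failed is not None:
                break
            for metric in spec:
                free[metric] -= spec[metric]
            count += 1
        if count:
            result.append(((spec["disk"], spec["mem"], spec["cpu"]), count))
        # Decrease the spec on the metric that last failed.
        spec[failed] -= steps[failed]
    return result


if __name__ == "__main__":
    pool = {"disk": 30720, "mem": 12288, "cpu": 8}
    start = {"disk": 10240, "mem": 8192, "cpu": 2}
    decrease = {"disk": 1024, "mem": 4096, "cpu": 1}
    for tier, count in tiered_alloc(pool, start, decrease):
        print(tier, count)
```

In this toy run memory is exhausted first, so only the memory part of the spec shrinks between tiers, which mirrors the rule that the decrease applies to the metric that last failed.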
|
428 |
|
|
429 |
.TP |
|
430 |
.B -v, --verbose |
|
431 |
Increase the output verbosity. Each usage of this option will increase |
|
432 |
the verbosity (currently more than 2 doesn't make sense) from the |
|
433 |
default of one. At verbosity 2 the location of the new instances is |
|
434 |
shown on standard error. |
|
435 |
|
|
436 |
.TP |
|
437 |
.B -q, --quiet |
|
438 |
Decrease the output verbosity. Each usage of this option will decrease |
|
439 |
the verbosity (less than zero doesn't make sense) from the default of |
|
440 |
one. |
|
441 |
|
|
442 |
.TP |
|
443 |
.B -V, --version |
|
444 |
Just show the program version and exit. |
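The arithmetic behind the ratio options above, and behind the normalized CPU (NPU) values in the output keys, can be checked with a small sketch. The helper names are hypothetical; the formulas are taken directly from the option descriptions (virtual CPUs allowed = physical CPUs times cpu-ratio, disk to keep free = total disk times disk-ratio, NPU = virtual CPUs divided by the maximum ratio).

```python
def max_virtual_cpus(physical_cpus: int, cpu_ratio: float) -> float:
    """Upper bound on virtual CPUs for primary instances under --max-cpu."""
    return physical_cpus * cpu_ratio


def min_free_disk(total_disk: float, disk_ratio: float) -> float:
    """Amount of disk that must stay free on a node under --min-disk."""
    return total_disk * disk_ratio


def normalized_cpus(virtual_cpus: float, cpu_ratio: float) -> float:
    """Normalized CPU count: virtual CPUs divided by the maximum ratio."""
    return virtual_cpus / cpu_ratio


if __name__ == "__main__":
    # The example from the --max-cpu description: 4 cpus at ratio 2.5.
    print(max_virtual_cpus(4, 2.5))
    # One quarter of a 1000-unit disk must stay free at ratio 0.25.
    print(min_free_disk(1000, 0.25))
    # Ten virtual CPUs at ratio 2.5 count as four normalized CPUs.
    print(normalized_cpus(10, 2.5))
```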
|
445 |
|
|
446 |
.SH EXIT STATUS |
|
447 |
|
|
448 |
The exit status of the command will be zero, unless for some reason |
|
449 |
the algorithm fatally failed (e.g. wrong node or instance data). |
|
450 |
|
|
451 |
.SH BUGS |
|
452 |
|
|
453 |
The algorithm is highly dependent on the number of nodes; its runtime |
|
454 |
grows exponentially with this number, and as such is impractical for |
|
455 |
really big clusters. |
|
456 |
|
|
457 |
The algorithm doesn't rebalance the cluster or try to get the optimal |
|
458 |
fit; it just allocates in the best place for the current step, without |
|
459 |
taking into consideration the impact on future placements. |
|
460 |
|
|
461 |
.SH ENVIRONMENT |
|
462 |
|
|
463 |
If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are |
|
464 |
present in the environment, they will override the default names for |
|
465 |
the nodes and instances files. These will, of course, have no effect |
|
466 |
when the RAPI or Luxi backends are used. |
|
467 |
|
|
468 |
.SH SEE ALSO |
|
469 |
.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), " |
|
470 |
.BR gnt-node "(8)" |
|
471 |
|
|
472 |
.SH "COPYRIGHT" |
|
473 |
.PP |
|
474 |
Copyright (C) 2009 Google Inc. Permission is granted to copy, |
|
475 |
distribute and/or modify under the terms of the GNU General Public |
|
476 |
License as published by the Free Software Foundation; either version 2 |
|
477 |
of the License, or (at your option) any later version. |
|
478 |
.PP |
|
479 |
On Debian systems, the complete text of the GNU General Public License |
|
480 |
can be found in /usr/share/common-licenses/GPL. |