.TH HN1 1 2009-03-23 htools "Ganeti H-tools"
.SH NAME
hn1 \- N+1 fixer for Ganeti

.SH SYNOPSIS
.B hn1
.B "[-C]"
.B "[-p]"
.B "[-o]"
.BI "[ -m " cluster "]"
.BI "[-n " nodes-file " ]"
.BI "[ -i " instances-file "]"
.BI "[-d " depth "]"
.BI "[-r " max-removals "]"
.BI "[-L " max-delta "]"
.BI "[-l " min-delta "]"

.B hn1
.B --version

.SH DESCRIPTION
hn1 is a cluster N+1 fixer that tries to compute the minimum number of
moves needed to make all nodes N+1 compliant.

The algorithm is designed to be a 'perfect' algorithm: it always
examines the entire solution space until it finds the minimum
solution. The algorithm can be tweaked via the \fB-d\fR, \fB-r\fR,
\fB-L\fR and \fB-l\fR options.

By default, the program will show the solution in a somewhat cryptic
format; to get the actual Ganeti command list, use the \fB-C\fR
option.

\fBNote:\fR this program is somewhat deprecated; \fBhbal\fR(1) usually
gives much faster results and a better cluster. It is recommended to
use this program only when \fBhbal\fR doesn't give an N+1 compliant
cluster.

.SS ALGORITHM

The algorithm works in multiple rounds of increasing \fIdepth\fR,
until we have a solution.

First, before starting the solution computation, we compute all the
N+1-fail nodes and the instances they hold. These instances are
candidates for replacement (and only these!).

The program then starts with \fIdepth\fR one (unless overridden via
the \fB-d\fR option), and at each round (see the sketch after the two
lists below):
.RS 4
.TP 3
\(em
it tries to remove from the cluster a number of instances equal to the
current depth, in order to make the cluster N+1 compliant

.TP
\(em
then, for each of the possible instance combinations that allow this
(unless their total number is reduced via the \fB-r\fR option), it
tries to put them back on the cluster while maintaining N+1 compliance
.RE

At a given round, the results can be:
.RS 4
.TP 3
\(em
no instance combination that can be put back; this means it is not
possible to make the cluster N+1 compliant with this number of
instances being moved, so we increase the depth and go on to the next
round

.TP
\(em
one or more successful results, in which case we take the one that has
as few changes as possible (a change meaning a needed replace-disks)
.RE
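.PP
The per-round loop described above can be summarised with the
following rough sketch (Python-like pseudocode, not the actual
implementation; the helpers \fIremoval_sets\fR, \fIplacements\fR and
\fIdelta\fR are purely illustrative):
.RS 4
.nf
def solve(cluster, depth):
    while True:
        best = None
        # phase 1: every set of 'depth' instances whose removal
        # makes the cluster N+1 compliant
        for removed in removal_sets(cluster, depth):
            # phase 2: try to put the removed instances back while
            # keeping the cluster N+1 compliant
            for solution in placements(cluster, removed):
                if best is None or delta(solution) < delta(best):
                    best = solution
        if best is not None:
            return best    # the solution with the fewest changes
        depth += 1         # nothing worked at this depth; go deeper
.fi
.RE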

The main problem with the algorithm is that, being an exhaustive
search, the CPU time required grows very quickly with the depth. On a
20-node, 80-instance cluster, depths up to 5-6 are quickly computed,
and depth 10 could already take days.

The main factors that influence the run time are:
.RS 4
.TP 3
\(em
the removal depth; for each increase of the depth by one, we grow the
solution space by the number of nodes squared (since a new instance
can live on any two nodes as primary/secondary, therefore (almost) N
times N); i.e., depth one will create an N^2 solution space, depth two
will make this N^4, depth three will be N^6, etc. (a worked example
follows below)

.TP
\(em
the removal depth again; for each increase in the depth, there will be
more valid removal sets, and the space of solutions increases linearly
with the number of removal sets
.RE

Therefore, the smaller the depth, the faster the algorithm will be; it
doesn't seem like this algorithm will work for clusters of 100 nodes
and many small instances (e.g. 256MB instances on 16GB nodes).
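.PP
As a rough worked example of this growth: on the 20-node cluster
mentioned above, a single placement has about 20 * 19 = 380 possible
primary/secondary pairs, so depth three already means on the order of
20^6 (64 million) placements per removal set before any pruning.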

As an optimisation, since the algorithm is designed to prune the
search space as quickly as possible, if by luck we find a good
solution early at a given depth, then the other solutions which would
result in a bigger delta (the number of changes) will not be
investigated, and the program will finish fast. Since this is random
and depends on where in the full solution space the good solution
lies, there are two options for cutting down the time needed (see the
sketch after this list):
.RS 4
.TP 3
\(em
\fB-l\fR makes any solution with delta lower than or equal to its
parameter succeed instantly; the default value for this parameter is
zero, so once we find a "perfect" solution we finish early

.TP
\(em
\fB-L\fR makes any solution with delta higher than or equal to its
parameter be rejected instantly (without descending further into the
search tree); this can reduce the depth of the search tree, with
sometimes significant speedups; by default, this optimisation is not
used
.RE
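.PP
A rough illustration of how these two cut-offs plug into the phase 2
recursion (again Python-like pseudocode with made-up helper names, not
the actual implementation):
.RS 4
.nf
def place(cluster, remaining, cur_delta, best, min_delta, max_delta):
    # -L / --max-delta: give up on this branch once its delta is
    # already too high
    if max_delta is not None and cur_delta >= max_delta:
        return best
    if not remaining:
        # all removed instances placed back; keep the best delta
        return cur_delta if best is None else min(best, cur_delta)
    for p, s in valid_placements(cluster, remaining[0]):
        extra = placement_delta(cluster, remaining[0], p, s)
        best = place(cluster, remaining[1:], cur_delta + extra,
                     best, min_delta, max_delta)
        # -l / --min-delta: stop searching once a good-enough
        # solution has been found
        if best is not None and best <= min_delta:
            return best
    return best
.fi
.RE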

The algorithm also has some other internal optimisations (the first of
which is sketched after this list):
.RS 4
.TP 3
\(em
when choosing where to place an instance in phase two, there are
N*(N-1) possible primary/secondary options; however, if instead of
iterating over all p * s pairs we first determine the set of primary
nodes that can hold this instance (without failing N+1), we can cut
(N-1) secondary placements for each primary node removed; and since
this applies at every iteration of phase 2, it linearly decreases the
solution space, and on full clusters this can mean a four to five
times reduction of the solution space

.TP
\(em
since the number of solutions is very high even for smaller depths (on
the test data, depth=4 results in 1.8M solutions), we can't compare
them all at the end, so at each iteration in phase 2 we only promote
the best solution out of our own set of solutions

.TP
\(em
since the placement of instances can only increase the delta of the
solution (placing a new instance will add zero or more replace-disks
steps), the delta will only increase while recursing during phase 2;
therefore, if at some point the current delta is equal to or higher
than the delta of the best solution so far, we can abort the
recursion; this cuts a tremendous number of branches; further
promotion of the best solution from one removal set to another can
cut entire removal sets after a few recursions

.RE
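.PP
The first optimisation above (filtering primary nodes before looking
at secondaries) can be pictured roughly as follows (Python-like
pseudocode with made-up helper names, not the actual implementation):
.RS 4
.nf
def valid_placements(cluster, instance):
    # Instead of looping over all N*(N-1) (primary, secondary) pairs,
    # first keep only the primaries that can hold the instance without
    # failing N+1; every rejected primary removes its N-1 secondary
    # placements from the search space in one step.
    primaries = [n for n in cluster.nodes
                 if can_hold_as_primary(n, instance)]
    for p in primaries:
        for s in cluster.nodes:
            if s is not p and can_hold_as_secondary(s, instance, p):
                yield p, s
.fi
.RE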

.SH OPTIONS
The options that can be passed to the program are as follows:
.TP
.B -C, --print-commands
Print the command list at the end of the run. Without this, the
program will only show a shorter, but cryptic output.
.TP
.B -p, --print-nodes
Prints the before and after node status, in a format designed to allow
the user to understand the node's most important parameters.

The node list will contain these fields:
.RS
.TP
.B F
a character denoting the status of the node, with '-' meaning an
offline node, '*' meaning N+1 failure and blank meaning a good node
.TP
.B Name
the node name
.TP
.B t_mem
the total node memory
.TP
.B n_mem
the memory used by the node itself
.TP
.B i_mem
the memory used by instances
.TP
.B x_mem
the amount of memory which seems to be in use but for which it cannot
be determined why or by which instance; usually this means that the
hypervisor has some overhead or that there are other reporting errors
.TP
.B f_mem
the free node memory
.TP
.B r_mem
the reserved node memory, which is the amount of free memory needed
for N+1 compliance
.TP
.B t_dsk
total disk
.TP
.B f_dsk
free disk
.TP
.B pri
number of primary instances
.TP
.B sec
number of secondary instances
.TP
.B p_fmem
percent of free memory
.TP
.B p_fdsk
percent of free disk
.RE

.TP
.BI "-n" nodefile ", --nodes=" nodefile
The name of the file holding node information (if not collecting via
RAPI), instead of the default \fInodes\fR file (but see below how to
customize the default value via the environment).

.TP
.BI "-i" instancefile ", --instances=" instancefile
The name of the file holding instance information (if not collecting
via RAPI), instead of the default \fIinstances\fR file (but see below
how to customize the default value via the environment).

.TP
.BI "-m" cluster
Collect data not from files but directly from the
.I cluster
given as an argument via RAPI. This works for both Ganeti 1.2 and
Ganeti 2.0.

.TP
.BI "-d" DEPTH ", --depth=" DEPTH
Start the algorithm directly at depth \fIDEPTH\fR, so that we don't
examine lower depths. This will be faster if we know a solution is not
found at lower depths, and thus searching them is unnecessary.

.TP
.BI "-l" MIN-DELTA ", --min-delta=" MIN-DELTA
If we find a solution with delta lower than or equal to
\fIMIN-DELTA\fR, consider this a success and don't examine further.

.TP
.BI "-L" MAX-DELTA ", --max-delta=" MAX-DELTA
If, while computing a solution, its intermediate delta is already
higher than or equal to \fIMAX-DELTA\fR, consider this a failure and
abort (as if the N+1 checks had failed).

.TP
.B -V, --version
Just show the program version and exit.

.SH EXIT STATUS

The exit status of the command will be zero, unless for some reason
the algorithm failed fatally (e.g. wrong node or instance data).

.SH ENVIRONMENT

If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
present in the environment, they will override the default names for
the nodes and instances files. They will of course have no effect
when RAPI is used.

.SH BUGS

The program does not check its input data for consistency, and aborts
with cryptic error messages in this case.

The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
on input data which has such instances.

The algorithm doesn't know when it won't be possible to reach N+1
compliance at all, and will happily churn CPU for ages without
realising it won't reach a solution.

The algorithm is too slow.

The output format is not easily scriptable, and the program should
feed moves directly into Ganeti (either via RAPI or via a gnt-debug
input file).

.SH SEE ALSO
.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
.BR gnt-node "(8)"