Statistics
| Branch: | Tag: | Revision:

root / hn1.1 @ e015b554

History | View | Annotate | Download (9.2 kB)

1 d0003b35 Iustin Pop
.TH HN1 1 2009-03-23 htools "Ganeti H-tools"
2 b0045e4d Iustin Pop
.SH NAME
3 b0045e4d Iustin Pop
hn1 \- N+1 fixer for Ganeti
4 b0045e4d Iustin Pop
5 b0045e4d Iustin Pop
.SH SYNOPSIS
6 b0045e4d Iustin Pop
.B hn1
7 b0045e4d Iustin Pop
.B "[-C]"
8 b0045e4d Iustin Pop
.B "[-p]"
9 b0045e4d Iustin Pop
.B "[-o]"
10 b0045e4d Iustin Pop
.BI "[ -m " cluster "]"
11 b0045e4d Iustin Pop
.BI "[-n " nodes-file " ]"
12 b0045e4d Iustin Pop
.BI "[ -i " instances-file "]"
13 b0045e4d Iustin Pop
.BI "[-d " depth "]"
14 b0045e4d Iustin Pop
.BI "[-r " max-removals "]"
15 b0045e4d Iustin Pop
.BI "[-L " max-delta "]"
16 b0045e4d Iustin Pop
.BI "[-l " min-delta "]"
17 b0045e4d Iustin Pop
18 b0045e4d Iustin Pop
.B hn1
19 b0045e4d Iustin Pop
.B --version
20 b0045e4d Iustin Pop
21 b0045e4d Iustin Pop
.SH DESCRIPTION
22 b0045e4d Iustin Pop
hn1 is a cluster N+1 fixer that tries to compute the minimum number of
23 b0045e4d Iustin Pop
moves needed for getting all nodes to be N+1 compliant.
24 b0045e4d Iustin Pop
25 b0045e4d Iustin Pop
The algorithm is designed to be a 'perfect' algorithm, so that we
26 b0045e4d Iustin Pop
always examine the entire solution space until we find the minimum
27 b0045e4d Iustin Pop
solution. The algorithm can be tweaked via the \fB-d\fR, \fB-r\fR,
28 b0045e4d Iustin Pop
\fB-L\fR and \fB-l\fR options.
29 b0045e4d Iustin Pop
30 b0045e4d Iustin Pop
By default, the program will show the solution in a somewhat cryptic
31 b0045e4d Iustin Pop
format; for getting the actual Ganeti command list, use the \fB-C\fR
32 b0045e4d Iustin Pop
option.
33 b0045e4d Iustin Pop
34 b0045e4d Iustin Pop
\fBNote:\fR this program is somewhat deprecated; \fBhbal(1)\fR gives
35 b0045e4d Iustin Pop
usually much faster results, and a better cluster. It is recommended
36 b0045e4d Iustin Pop
to use this program only when \fBhbal\fR doesn't give a N+1 compliant
37 b0045e4d Iustin Pop
cluster.
38 b0045e4d Iustin Pop
39 b0045e4d Iustin Pop
.SS ALGORITHM
40 b0045e4d Iustin Pop
41 b0045e4d Iustin Pop
The algorithm works in multiple rounds, of increasing \fIdepth\fR,
42 b0045e4d Iustin Pop
until we have a solution.
43 b0045e4d Iustin Pop
44 b0045e4d Iustin Pop
First, before starting the solution computation, we compute all the
45 b0045e4d Iustin Pop
N+1-fail nodes and the instances they hold. These instances are
46 b0045e4d Iustin Pop
candidate for replacement (and only these!).
47 b0045e4d Iustin Pop
48 b0045e4d Iustin Pop
The program start then with \fIdepth\fR one (unless overridden via the
49 b0045e4d Iustin Pop
\fB-d\fR option), and at each round:
50 d0003b35 Iustin Pop
.RS 4
51 d0003b35 Iustin Pop
.TP 3
52 d0003b35 Iustin Pop
\(em
53 d0003b35 Iustin Pop
it tries to remove from the cluster as many instances as the current
54 d0003b35 Iustin Pop
depth in order to make the cluster N+1 compliant
55 d0003b35 Iustin Pop
56 d0003b35 Iustin Pop
.TP
57 d0003b35 Iustin Pop
\(em
58 d0003b35 Iustin Pop
then, for each of the possible instance combinations that allow this
59 d0003b35 Iustin Pop
(unless the total size is reduced via the \fB-r\fR option), it tries
60 d0003b35 Iustin Pop
to put them back on the cluster while maintaining N+1 compliance
61 d0003b35 Iustin Pop
.RE
62 b0045e4d Iustin Pop
63 b0045e4d Iustin Pop
It might be that at a given round, the results are:
64 d0003b35 Iustin Pop
.RS 4
65 d0003b35 Iustin Pop
.TP 3
66 d0003b35 Iustin Pop
\(em
67 d0003b35 Iustin Pop
no instance combination that can be put back; this means it is not
68 d0003b35 Iustin Pop
possible to make the cluster N+1 compliant with this number of
69 d0003b35 Iustin Pop
instances being moved, so we increase the depth and go on to the next
70 d0003b35 Iustin Pop
round
71 d0003b35 Iustin Pop
.TP
72 d0003b35 Iustin Pop
\(em
73 d0003b35 Iustin Pop
one or more successful result, in which case we take the one that has
74 d0003b35 Iustin Pop
as few changes as possible (by change meaning a replace-disks needed)
75 d0003b35 Iustin Pop
.RE
76 b0045e4d Iustin Pop
77 b0045e4d Iustin Pop
The main problem with the algorithm is that, being an exhaustive
78 b0045e4d Iustin Pop
search, the CPU time required grows very very quickly based on
79 b0045e4d Iustin Pop
depth. On a 20-node, 80-instances cluster, depths up to 5-6 are
80 b0045e4d Iustin Pop
quickly computed, and depth 10 could already take days.
81 b0045e4d Iustin Pop
82 d0003b35 Iustin Pop
The main factors that influence the run time are:
83 d0003b35 Iustin Pop
.RS 4
84 d0003b35 Iustin Pop
.TP 3
85 d0003b35 Iustin Pop
\(em
86 d0003b35 Iustin Pop
the removal depth; for each increase with one of the depth, we grow
87 d0003b35 Iustin Pop
the solution space by the number of nodes squared (since a new
88 d0003b35 Iustin Pop
instance can live any two nodes as primary/secondary, therefore
89 d0003b35 Iustin Pop
(almost) N times N); i.e., depth=1 will create a N^2 solution space,
90 d0003b35 Iustin Pop
depth two will make this N^4, depth three will be N^6, etc.
91 d0003b35 Iustin Pop
92 d0003b35 Iustin Pop
.TP
93 d0003b35 Iustin Pop
\(em
94 d0003b35 Iustin Pop
the removal depth again; for each increase in the depth, there will be
95 d0003b35 Iustin Pop
more valid removal sets, and the space of solutions increases linearly
96 d0003b35 Iustin Pop
with the number of removal sets
97 d0003b35 Iustin Pop
.RE
98 d0003b35 Iustin Pop
99 d0003b35 Iustin Pop
Therefore, the smaller the depth the faster the algorithm will be; it doesn't
100 d0003b35 Iustin Pop
seem like this algorithm will work for clusters of 100 nodes and many many
101 d0003b35 Iustin Pop
small instances (e.g. 256MB instances on 16GB nodes).
102 d0003b35 Iustin Pop
103 d0003b35 Iustin Pop
As an optimisation, since the algorithm is designed to prune the
104 d0003b35 Iustin Pop
search space as quickly as possible, is by luck we find a good
105 d0003b35 Iustin Pop
solution early at a given depth, then the other solutions which would
106 d0003b35 Iustin Pop
result in a bigger delta (the number of changes) will not be
107 d0003b35 Iustin Pop
investigated, and the program will finish fast. Since this is random
108 d0003b35 Iustin Pop
and depends on where in the full solution space the good solution will
109 d0003b35 Iustin Pop
be, there are two options for cutting down the time needed:
110 d0003b35 Iustin Pop
.RS 4
111 d0003b35 Iustin Pop
.TP 3
112 d0003b35 Iustin Pop
\(em
113 d0003b35 Iustin Pop
\fB-l\fR makes any solution that has delta lower than its parameter
114 d0003b35 Iustin Pop
succeed instantly; the default value for this parameter is zero, so
115 d0003b35 Iustin Pop
once we find a "perfect" solution we finish early
116 d0003b35 Iustin Pop
117 d0003b35 Iustin Pop
.TP
118 d0003b35 Iustin Pop
\(em
119 d0003b35 Iustin Pop
\fB-L\fR makes any solution with delta higher than its parameter being
120 d0003b35 Iustin Pop
rejected instantly (and not descend on the search tree); this can
121 d0003b35 Iustin Pop
reduce the depth of the search tree, with sometimes significant
122 d0003b35 Iustin Pop
speedups; by default, this optimization is not used
123 d0003b35 Iustin Pop
.RE
124 d0003b35 Iustin Pop
125 d0003b35 Iustin Pop
The algorithm also has some other internal optimisations:
126 d0003b35 Iustin Pop
.RS 4
127 d0003b35 Iustin Pop
.TP 3
128 d0003b35 Iustin Pop
\(em
129 d0003b35 Iustin Pop
when choosing where to place an instance in phase two, there are
130 d0003b35 Iustin Pop
N*(N-1) possible primary/secondary options; however, if instead of
131 d0003b35 Iustin Pop
iterating over all p * s pairs, we first determine the set of primary
132 d0003b35 Iustin Pop
nodes that can hold this instance (without failing N+1), we can cut
133 d0003b35 Iustin Pop
(N-1) secondary placements for each primary node removed; and since
134 d0003b35 Iustin Pop
this applies at every iteration of phase 2 it linearly decreases the
135 d0003b35 Iustin Pop
solution space, and on full clusters, this can mean a four-five times
136 d0003b35 Iustin Pop
reductions of solution space
137 d0003b35 Iustin Pop
138 d0003b35 Iustin Pop
.TP
139 d0003b35 Iustin Pop
\(em
140 d0003b35 Iustin Pop
since the number of solutions is very high even for smaller depths (on
141 d0003b35 Iustin Pop
the test data, depth=4 results in 1.8M solutions) we can't compare
142 d0003b35 Iustin Pop
them at the end, so at each iteration in phase 2 we only promote the
143 d0003b35 Iustin Pop
best solution out of our own set of solutions
144 d0003b35 Iustin Pop
145 d0003b35 Iustin Pop
.TP
146 d0003b35 Iustin Pop
\(em
147 d0003b35 Iustin Pop
since the placement of instances can only increase the delta of the
148 d0003b35 Iustin Pop
solution (placing a new instance will add zero or more replace-disks
149 d0003b35 Iustin Pop
steps), it means the delta will only increase while recursing during
150 d0003b35 Iustin Pop
phase 2; therefore, if we know at one point that we have a current
151 d0003b35 Iustin Pop
delta that is equal or higher to the delta of the best solution so
152 d0003b35 Iustin Pop
far, we can abort the recursion; this cuts a tremendous number of
153 d0003b35 Iustin Pop
branches; further promotion of the best solution from one removal set
154 d0003b35 Iustin Pop
to another can cut entire removal sets after a few recursions
155 d0003b35 Iustin Pop
156 d0003b35 Iustin Pop
.RE
157 b0045e4d Iustin Pop
158 b0045e4d Iustin Pop
.SH OPTIONS
159 b0045e4d Iustin Pop
The options that can be passed to the program are as follows:
160 b0045e4d Iustin Pop
.TP
161 b0045e4d Iustin Pop
.B -C, --print-commands
162 b0045e4d Iustin Pop
Print the command list at the end of the run. Without this, the
163 b0045e4d Iustin Pop
program will only show a shorter, but cryptic output.
164 b0045e4d Iustin Pop
.TP
165 b0045e4d Iustin Pop
.B -p, --print-nodes
166 b0045e4d Iustin Pop
Prints the before and after node status, in a format designed to allow
167 b0045e4d Iustin Pop
the user to understand the node's most important parameters.
168 b0045e4d Iustin Pop
169 b0045e4d Iustin Pop
The node list will contain these informations:
170 d2ac5526 Iustin Pop
.RS
171 d2ac5526 Iustin Pop
.TP
172 d2ac5526 Iustin Pop
.B F
173 d2ac5526 Iustin Pop
a character denoting the status of the node, with '-' meaning an
174 d2ac5526 Iustin Pop
offline node, '*' meaning N+1 failure and blank meaning a good node
175 d2ac5526 Iustin Pop
.TP
176 d2ac5526 Iustin Pop
.B Name
177 d2ac5526 Iustin Pop
the node name
178 d2ac5526 Iustin Pop
.TP
179 d2ac5526 Iustin Pop
.B t_mem
180 d2ac5526 Iustin Pop
the total node memory
181 d2ac5526 Iustin Pop
.TP
182 d2ac5526 Iustin Pop
.B n_mem
183 d2ac5526 Iustin Pop
the memory used by the node itself
184 d2ac5526 Iustin Pop
.TP
185 d2ac5526 Iustin Pop
.B i_mem
186 d2ac5526 Iustin Pop
the memory used by instances
187 d2ac5526 Iustin Pop
.TP
188 d2ac5526 Iustin Pop
.B x_mem
189 d2ac5526 Iustin Pop
amount memory which seems to be in use but cannot be determined why or
190 d2ac5526 Iustin Pop
by which instance; usually this means that the hypervisor has some
191 d2ac5526 Iustin Pop
overhead or that there are other reporting errors
192 d2ac5526 Iustin Pop
.TP
193 d2ac5526 Iustin Pop
.B f_mem
194 d2ac5526 Iustin Pop
the free node memory
195 d2ac5526 Iustin Pop
.TP
196 d2ac5526 Iustin Pop
.B r_mem
197 d2ac5526 Iustin Pop
the reserved node memory, which is the amount of free memory needed
198 d2ac5526 Iustin Pop
for N+1 compliance
199 d2ac5526 Iustin Pop
.TP
200 d2ac5526 Iustin Pop
.B t_dsk
201 d2ac5526 Iustin Pop
total disk
202 d2ac5526 Iustin Pop
.TP
203 d2ac5526 Iustin Pop
.B f_dsk
204 d2ac5526 Iustin Pop
free disk
205 d2ac5526 Iustin Pop
.TP
206 d2ac5526 Iustin Pop
.B pri
207 d2ac5526 Iustin Pop
number of primary instances
208 d2ac5526 Iustin Pop
.TP
209 d2ac5526 Iustin Pop
.B sec
210 d2ac5526 Iustin Pop
number of secondary instances
211 d2ac5526 Iustin Pop
.TP
212 d2ac5526 Iustin Pop
.B p_fmem
213 d2ac5526 Iustin Pop
percent of free memory
214 d2ac5526 Iustin Pop
.TP
215 d2ac5526 Iustin Pop
.B p_fdsk
216 d2ac5526 Iustin Pop
percent of free disk
217 d2ac5526 Iustin Pop
.RE
218 b0045e4d Iustin Pop
219 b0045e4d Iustin Pop
.TP
220 b0045e4d Iustin Pop
.BI "-n" nodefile ", --nodes=" nodefile
221 b0045e4d Iustin Pop
The name of the file holding node information (if not collecting via
222 7b255913 Iustin Pop
RAPI), instead of the default \fInodes\fR file (but see below how to
223 7b255913 Iustin Pop
customize the default value via the environment).
224 b0045e4d Iustin Pop
225 b0045e4d Iustin Pop
.TP
226 b0045e4d Iustin Pop
.BI "-i" instancefile ", --instances=" instancefile
227 b0045e4d Iustin Pop
The name of the file holding instance information (if not collecting
228 7b255913 Iustin Pop
via RAPI), instead of the default \fIinstances\fR file (but see below
229 7b255913 Iustin Pop
how to customize the default value via the environment).
230 b0045e4d Iustin Pop
231 b0045e4d Iustin Pop
.TP
232 b0045e4d Iustin Pop
.BI "-m" cluster
233 b0045e4d Iustin Pop
Collect data not from files but directly from the
234 b0045e4d Iustin Pop
.I cluster
235 e015b554 Iustin Pop
given as an argument via RAPI. If the argument doesn't contain a colon
236 e015b554 Iustin Pop
(:), then it is converted into a fully-built URL via prepending
237 e015b554 Iustin Pop
https:// and appending the default RAPI port, otherwise it's
238 e015b554 Iustin Pop
considered a fully-specified URL and is used unchanged.
239 b0045e4d Iustin Pop
240 b0045e4d Iustin Pop
.TP
241 b0045e4d Iustin Pop
.BI "-d" DEPTH ", --depth=" DEPTH
242 b0045e4d Iustin Pop
Start the algorithm directly at depth \fID\fR, so that we don't
243 b0045e4d Iustin Pop
examine lower depth. This will be faster if we know a solution is not
244 b0045e4d Iustin Pop
found a lower depths, and thus it's unneeded to search them.
245 b0045e4d Iustin Pop
246 b0045e4d Iustin Pop
.TP
247 b0045e4d Iustin Pop
.BI "-l" MIN-DELTA ", --min-delta=" MIN-DELTA
248 b0045e4d Iustin Pop
If we find a solution with delta lower or equal to \fIMIN-DELTA\fR,
249 b0045e4d Iustin Pop
consider this a success and don't examine further.
250 b0045e4d Iustin Pop
251 b0045e4d Iustin Pop
.TP
252 b0045e4d Iustin Pop
.BI "-L" MAX-DELTA ", --max-delta=" MAX-DELTA
253 b0045e4d Iustin Pop
If while computing a solution, it's intermediate delta is already
254 b0045e4d Iustin Pop
higher or equal to \fIMAX-DELTA\fR, consider this a failure and abort
255 b0045e4d Iustin Pop
(as if N+1 checks have failed).
256 b0045e4d Iustin Pop
257 b0045e4d Iustin Pop
.TP
258 b0045e4d Iustin Pop
.B -V, --version
259 b0045e4d Iustin Pop
Just show the program version and exit.
260 b0045e4d Iustin Pop
261 b0045e4d Iustin Pop
.SH EXIT STATUS
262 b0045e4d Iustin Pop
263 b0045e4d Iustin Pop
The exist status of the command will be zero, unless for some reason
264 b0045e4d Iustin Pop
the algorithm fatally failed (e.g. wrong node or instance data).
265 b0045e4d Iustin Pop
266 7b255913 Iustin Pop
.SH ENVIRONMENT
267 7b255913 Iustin Pop
268 7b255913 Iustin Pop
If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
269 7b255913 Iustin Pop
present in the environment, they will override the default names for
270 7b255913 Iustin Pop
the nodes and instances files. These will have of course no effect
271 7b255913 Iustin Pop
when RAPI is used.
272 7b255913 Iustin Pop
273 b0045e4d Iustin Pop
.SH BUGS
274 b0045e4d Iustin Pop
275 b0045e4d Iustin Pop
The program does not check its input data for consistency, and aborts
276 b0045e4d Iustin Pop
with cryptic errors messages in this case.
277 b0045e4d Iustin Pop
278 d0003b35 Iustin Pop
The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
279 d0003b35 Iustin Pop
on input data which has such instances.
280 d0003b35 Iustin Pop
281 b0045e4d Iustin Pop
The algorithm doesn't know when it won't be possible to reach N+1
282 b0045e4d Iustin Pop
compliance at all, and will happily churn CPU for ages without
283 b0045e4d Iustin Pop
realising it won't reach a solution.
284 b0045e4d Iustin Pop
285 b0045e4d Iustin Pop
The algorithm is too slow.
286 b0045e4d Iustin Pop
287 b0045e4d Iustin Pop
The output format is not easily scriptable, and the program should
288 b0045e4d Iustin Pop
feed moves directly into Ganeti (either via RAPI or via a gnt-debug
289 b0045e4d Iustin Pop
input file).
290 b0045e4d Iustin Pop
291 b0045e4d Iustin Pop
.SH SEE ALSO
292 d2ac5526 Iustin Pop
.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
293 d2ac5526 Iustin Pop
.BR gnt-node "(8)"