.TH HN1 1 2009-03-23 htools "Ganeti H-tools"
.SH NAME
hn1 \- N+1 fixer for Ganeti

.SH SYNOPSIS
.B hn1
.B "[-C]"
.B "[-p]"
.B "[-o]"
.BI "[-m " cluster "]"
.BI "[-n " nodes-file "]"
.BI "[-i " instances-file "]"
.BI "[-d " depth "]"
.BI "[-r " max-removals "]"
.BI "[-L " max-delta "]"
.BI "[-l " min-delta "]"

.B hn1
.B --version

.SH DESCRIPTION
hn1 is a cluster N+1 fixer that tries to compute the minimum number of
moves needed to make all nodes N+1 compliant.

The algorithm is designed to be a 'perfect' algorithm: it always
examines the entire solution space until it finds the minimum
solution. The algorithm can be tweaked via the \fB-d\fR, \fB-r\fR,
\fB-L\fR and \fB-l\fR options.

By default, the program shows the solution in a somewhat cryptic
format; to get the actual Ganeti command list, use the \fB-C\fR
option.

\fBNote:\fR this program is somewhat deprecated; \fBhbal\fR(1) usually
gives much faster results, and a better cluster. It is recommended to
use this program only when \fBhbal\fR doesn't give an N+1 compliant
cluster.

.SS ALGORITHM

The algorithm works in multiple rounds, of increasing \fIdepth\fR,
until a solution is found.

First, before starting the solution computation, we compute all the
N+1-fail nodes and the instances they hold. These instances (and only
these) are candidates for replacement.

The program then starts with \fIdepth\fR one (unless overridden via
the \fB-d\fR option), and at each round:
.RS 4
.TP 3
\(em
it tries to remove from the cluster a number of instances equal to the
current depth, in order to make the cluster N+1 compliant

.TP
\(em
then, for each of the possible instance combinations that allow this
(unless their total number is reduced via the \fB-r\fR option), it
tries to put them back on the cluster while maintaining N+1 compliance
.RE
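The round structure above can be sketched in Python on a deliberately
simplified toy model (hn1 itself is written differently and uses
Ganeti's real N+1 check; every helper and the capacity model below are
illustrative assumptions only):

```python
from itertools import combinations

# Toy model: a "cluster" is a dict of node name -> list of instance
# memory sizes; N+1 compliance is crudely approximated as: every node
# can still absorb the largest instance hosted on any *other* node.
CAPACITY = 10  # assumed per-node capacity, not Ganeti's model

def n1_compliant(nodes):
    return all(
        sum(insts) + max((max(peer_insts, default=0)
                          for peer, peer_insts in nodes.items()
                          if peer != name),
                         default=0) <= CAPACITY
        for name, insts in nodes.items())

def removed(nodes, picks):
    out = {n: list(v) for n, v in nodes.items()}
    for node, size in picks:          # take the picked instances out
        out[node].remove(size)
    return out

def place_back(nodes, sizes):
    """Phase 2 (greedy here, exhaustive in hn1): re-place the removed
    instances keeping N+1; return the number of moves, or None."""
    nodes = {n: list(v) for n, v in nodes.items()}
    moves = 0
    for size in sizes:
        for node in sorted(nodes, key=lambda n: sum(nodes[n])):
            nodes[node].append(size)
            if n1_compliant(nodes):
                moves += 1
                break
            nodes[node].pop()
        else:
            return None               # this instance fits nowhere
    return moves

def solve(nodes, candidates, max_depth=3):
    for depth in range(1, max_depth + 1):   # one round per depth
        best = None
        for picks in combinations(candidates, depth):
            stripped = removed(nodes, picks)
            if not n1_compliant(stripped):
                continue              # removal alone must restore N+1
            delta = place_back(stripped, [s for _, s in picks])
            if delta is not None and (best is None or delta < best):
                best = delta
        if best is not None:
            return depth, best        # solution found at this depth
    return None
```

Here node \fIa\fR fails N+1, so its instances are the candidates;
`solve({'a': [6, 3], 'b': [2], 'c': [1]}, [('a', 6), ('a', 3)])`
finds a depth-one, one-move solution, returning `(1, 1)`.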

At a given round, the results may be:
.RS 4
.TP 3
\(em
no instance combination can be put back; this means it is not possible
to make the cluster N+1 compliant with this number of instances being
moved, so we increase the depth and go on to the next round
.TP
\(em
one or more successful results, in which case we take the one with as
few changes as possible (a change being a needed replace-disks
operation)
.RE

The main problem with the algorithm is that, being an exhaustive
search, the CPU time required grows very quickly with depth. On a
20-node, 80-instance cluster, depths up to 5-6 are computed quickly,
while depth 10 could already take days.

The main factors that influence the run time are:
.RS 4
.TP 3
\(em
the removal depth; for each increase of one in the depth, the solution
space grows by the number of nodes squared (since a new instance can
live on any two nodes as primary/secondary, therefore (almost) N times
N); i.e., depth one gives an N^2 solution space, depth two makes this
N^4, depth three N^6, and so on

.TP
\(em
the removal depth again; for each increase in the depth, there will be
more valid removal sets, and the space of solutions increases linearly
with the number of removal sets
.RE

Therefore, the smaller the depth, the faster the algorithm; it is
unlikely that this algorithm will work for clusters of 100 nodes and
many small instances (e.g. 256MB instances on 16GB nodes).
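The growth above is easy to quantify with a back-of-the-envelope
sketch (placements only; removal sets, which add the linear factor,
are ignored):

```python
def placement_space(n_nodes, depth):
    # Each re-placed instance picks an ordered (primary, secondary)
    # pair, i.e. roughly N * (N - 1) choices, so ~N^2 per depth level.
    return (n_nodes * (n_nodes - 1)) ** depth

# On a 20-node cluster:
for depth in (1, 2, 3):
    print(depth, placement_space(20, depth))
```

This prints 380, 144400 and 54872000 placements for depths one to
three, which is why even modest depths quickly become intractable.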

As an optimisation, the algorithm is designed to prune the search
space as quickly as possible: if by luck we find a good solution early
at a given depth, the other solutions (which would result in a bigger
delta, i.e. number of changes) are not investigated, and the program
finishes fast. Since this is random and depends on where in the full
solution space the good solution lies, there are two options for
cutting down the time needed:
.RS 4
.TP 3
\(em
\fB-l\fR makes any solution that has a delta lower than or equal to
its parameter succeed instantly; the default value for this parameter
is zero, so once we find a "perfect" solution we finish early

.TP
\(em
\fB-L\fR makes any solution with a delta higher than or equal to its
parameter be rejected instantly (without descending down the search
tree); this can reduce the depth of the search tree, with sometimes
significant speedups; by default, this optimisation is not used
.RE
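The effect of the two cutoffs can be illustrated on a generic
depth-first search (a toy stand-in, not hn1's actual solver):
`min_delta` plays the role of \fB-l\fR and `max_delta` the role of
\fB-L\fR.

```python
def best_delta(costs_per_step, min_delta=0, max_delta=None):
    """Toy depth-first search over per-step cost choices.

    min_delta mimics -l: accept instantly once a solution this cheap
    is found.  max_delta mimics -L: prune any branch whose partial
    delta already reached it.  Returns (best, nodes_visited)."""
    best, visited = None, 0

    def recurse(step, delta):
        nonlocal best, visited
        visited += 1
        if max_delta is not None and delta >= max_delta:
            return False               # -L: reject this branch outright
        if best is not None and delta >= best:
            return False               # delta can only grow: prune
        if step == len(costs_per_step):
            best = delta               # complete solution found
            return delta <= min_delta  # -l: good enough, stop searching
        return any(recurse(step + 1, delta + cost)
                   for cost in costs_per_step[step])

    recurse(0, 0)
    return best, visited
```

With per-step costs `[[2, 0], [1, 0]]` the plain search visits 7
nodes, while `max_delta=1` prunes that to 5 without changing the best
delta found.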

The algorithm also has some other internal optimisations:
.RS 4
.TP 3
\(em
when choosing where to place an instance in phase two, there are
N*(N-1) possible primary/secondary options; however, if instead of
iterating over all p * s pairs we first determine the set of primary
nodes that can hold this instance (without failing N+1), we can cut
(N-1) secondary placements for each primary node removed; since this
applies at every iteration of phase 2, it linearly decreases the
solution space, and on full clusters this can mean a four- to
five-fold reduction of the solution space

.TP
\(em
since the number of solutions is very high even for smaller depths (on
the test data, depth=4 results in 1.8M solutions) we can't compare
them all at the end, so at each iteration in phase 2 we only promote
the best solution out of our own set of solutions

.TP
\(em
since the placement of instances can only increase the delta of the
solution (placing a new instance adds zero or more replace-disks
steps), the delta can only grow while recursing during phase 2;
therefore, if at some point the current delta is equal to or higher
than the delta of the best solution so far, we can abort the
recursion; this cuts a tremendous number of branches; further
promotion of the best solution from one removal set to another can cut
entire removal sets after a few recursions

.RE
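The first optimisation above (filtering primaries before enumerating
pairs) can be sketched as follows; `fits_as_primary` is a hypothetical
stand-in for the real N+1 feasibility check:

```python
def placement_pairs(nodes, fits_as_primary):
    nodes = list(nodes)
    # Naive phase-2 enumeration: all N * (N - 1) ordered pairs.
    naive = [(p, s) for p in nodes for s in nodes if s != p]
    # Optimised: one feasibility check per primary; each rejected
    # primary removes its N - 1 secondary placements in a single stroke.
    primaries = [p for p in nodes if fits_as_primary(p)]
    pruned = [(p, s) for p in primaries for s in nodes if s != p]
    return len(naive), len(pruned)

# E.g. on 10 nodes where only 3 can act as primary:
print(placement_pairs(range(10), lambda p: p < 3))
```

This prints `(90, 27)`: filtering 7 of the 10 primaries cut 63 of the
90 candidate placements with only 10 feasibility checks.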

.SH OPTIONS
The options that can be passed to the program are as follows:
.TP
.B -C, --print-commands
Print the command list at the end of the run. Without this, the
program will only show a shorter, but cryptic, output.
.TP
.B -p, --print-nodes
Prints the before and after node status, in a format designed to allow
the user to understand the node's most important parameters.

The node list will contain this information:
.RS
.TP
.B F
a character denoting the status of the node, with '-' meaning an
offline node, '*' meaning N+1 failure and blank meaning a good node
.TP
.B Name
the node name
.TP
.B t_mem
the total node memory
.TP
.B n_mem
the memory used by the node itself
.TP
.B i_mem
the memory used by instances
.TP
.B x_mem
the amount of memory which seems to be in use but cannot be attributed
to any instance; usually this means that the hypervisor has some
overhead or that there are other reporting errors
.TP
.B f_mem
the free node memory
.TP
.B r_mem
the reserved node memory, which is the amount of free memory needed
for N+1 compliance
.TP
.B t_dsk
total disk
.TP
.B f_dsk
free disk
.TP
.B pri
number of primary instances
.TP
.B sec
number of secondary instances
.TP
.B p_fmem
percent of free memory
.TP
.B p_fdsk
percent of free disk
.RE

.TP
.BI "-n " nodefile ", --nodes=" nodefile
The name of the file holding node information (if not collecting via
RAPI), instead of the default \fInodes\fR file (but see below how to
customize the default value via the environment).

.TP
.BI "-i " instancefile ", --instances=" instancefile
The name of the file holding instance information (if not collecting
via RAPI), instead of the default \fIinstances\fR file (but see below
how to customize the default value via the environment).

.TP
.BI "-m " cluster
Collect data not from files but directly from the
.I cluster
given as an argument, via RAPI. If the argument doesn't contain a
colon (:), it is converted into a fully-built URL by prepending
https:// and appending the default RAPI port; otherwise it is
considered a fully-specified URL and used unchanged.
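The colon rule can be sketched as below; the port value is an
assumption for this sketch (verify the default RAPI port against your
installation), and `build_rapi_url` is a hypothetical helper name, not
part of hn1:

```python
DEFAULT_RAPI_PORT = 5080  # assumed default, not taken from hn1 itself

def build_rapi_url(cluster):
    # A bare name (no colon) becomes a fully-built https URL...
    if ":" not in cluster:
        return "https://%s:%d" % (cluster, DEFAULT_RAPI_PORT)
    # ...anything already containing a colon is used unchanged.
    return cluster

print(build_rapi_url("head.example.com"))
print(build_rapi_url("https://head.example.com:5080"))
```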

.TP
.BI "-d " DEPTH ", --depth=" DEPTH
Start the algorithm directly at depth \fIDEPTH\fR, so that lower
depths are not examined. This is faster if we know a solution is not
found at lower depths, making it unnecessary to search them.

.TP
.BI "-l " MIN-DELTA ", --min-delta=" MIN-DELTA
If we find a solution with a delta lower than or equal to
\fIMIN-DELTA\fR, consider this a success and don't examine further.

.TP
.BI "-L " MAX-DELTA ", --max-delta=" MAX-DELTA
If, while computing a solution, its intermediate delta is already
higher than or equal to \fIMAX-DELTA\fR, consider this a failure and
abort (as if the N+1 checks had failed).

.TP
.B -V, --version
Just show the program version and exit.

.SH EXIT STATUS

The exit status of the command will be zero, unless for some reason
the algorithm failed fatally (e.g. wrong node or instance data).

.SH ENVIRONMENT

If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are
present in the environment, they will override the default names for
the nodes and instances files. These of course have no effect when
RAPI is used.

.SH BUGS

The program does not check its input data for consistency, and in
such cases aborts with cryptic error messages.

The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
on input data which contains such instances.

The algorithm doesn't know when it won't be possible to reach N+1
compliance at all, and will happily churn CPU for ages without
realising it won't reach a solution.

The algorithm is too slow.

The output format is not easily scriptable; the program should feed
moves directly into Ganeti (either via RAPI or via a gnt-debug input
file).

.SH SEE ALSO
.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
.BR gnt-node "(8)"

.SH "COPYRIGHT"
.PP
Copyright (C) 2009 Google Inc. Permission is granted to copy,
distribute and/or modify under the terms of the GNU General Public
License as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
.PP
On Debian systems, the complete text of the GNU General Public License
can be found in /usr/share/common-licenses/GPL.