.TH HN1 1 2009-03-23 htools "Ganeti H-tools"
.SH NAME
hn1 \- N+1 fixer for Ganeti

.SH SYNOPSIS
.B hn1
.B "[-C]"
.B "[-p]"
.B "[-o]"
.BI "[-m " cluster "]"
.BI "[-n " nodes-file "]"
.BI "[-i " instances-file "]"
.BI "[-d " depth "]"
.BI "[-r " max-removals "]"
.BI "[-L " max-delta "]"
.BI "[-l " min-delta "]"

.B hn1
.B --version

.SH DESCRIPTION
hn1 is a cluster N+1 fixer that tries to compute the minimum number of
moves needed to make all nodes N+1 compliant.

The algorithm is designed to be exhaustive ('perfect'): the entire
solution space is examined until the minimal solution is found. The
search can be tweaked via the \fB-d\fR, \fB-r\fR, \fB-L\fR and
\fB-l\fR options.

By default, the program shows the solution in a somewhat cryptic
format; to get the actual Ganeti command list, use the \fB-C\fR
option.

\fBNote:\fR this program is somewhat deprecated; \fBhbal\fR(1) usually
gives much faster results and a better cluster. It is recommended to
use this program only when \fBhbal\fR doesn't produce an N+1 compliant
cluster.
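.PP
For example, a typical run that reads the default \fInodes\fR and
\fIinstances\fR files (gathered beforehand, for example with
\fBhscan\fR(1)) and prints the resulting Ganeti command list could
look like:
.PP
.RS 4
.nf
hn1 -C -n nodes -i instances
.fi
.RE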

.SS ALGORITHM

The algorithm works in multiple rounds of increasing \fIdepth\fR,
until a solution is found.

First, before starting the solution computation, we compute all the
N+1-failing nodes and the instances they hold. These instances (and
only these) are candidates for replacement.

The program then starts with \fIdepth\fR one (unless overridden via
the \fB-d\fR option), and at each round:
.RS 4
.TP 3
\(em
it tries to remove from the cluster as many instances as the current
depth, in order to make the cluster N+1 compliant

.TP
\(em
then, for each of the possible instance combinations that allow this
(unless their total number is reduced via the \fB-r\fR option), it
tries to put them back on the cluster while maintaining N+1 compliance
.RE

At a given round, the results can be:
.RS 4
.TP 3
\(em
no instance combination can be put back; this means it is not possible
to make the cluster N+1 compliant with this number of instances being
moved, so we increase the depth and go on to the next round

.TP
\(em
one or more successful results, in which case we take the one that has
as few changes as possible (a change meaning a needed replace-disks
operation)
.RE

The main problem with the algorithm is that, being an exhaustive
search, the required CPU time grows very quickly with the depth. On a
20-node, 80-instance cluster, depths up to 5-6 are computed quickly,
while depth 10 could already take days.

The main factors that influence the run time are:
.RS 4
.TP 3
\(em
the removal depth; for each increase of the depth by one, the solution
space grows by the number of nodes squared (since a new instance can
live on any two nodes as primary/secondary, hence (almost) N times N);
i.e., depth one creates an N^2 solution space, depth two makes this
N^4, depth three N^6, and so on

.TP
\(em
the removal depth again; for each increase in the depth, there will be
more valid removal sets, and the space of solutions increases linearly
with the number of removal sets
.RE

Therefore, the smaller the depth, the faster the algorithm will be; it
is unlikely that this algorithm will work for clusters of 100 nodes
with many small instances (e.g. 256MB instances on 16GB nodes).
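.PP
As a rough illustration of this growth (taking the 20-node cluster
above, counting placements per removal set only, and ignoring the
reductions from the optimisations described below):
.PP
.RS 4
.nf
depth 1:  20^2  =           400 placements
depth 3:  20^6  =    64,000,000 placements
depth 5:  20^10 = ~1.0 * 10^13  placements
.fi
.RE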

As an optimisation, the algorithm is designed to prune the search
space as quickly as possible: if by luck we find a good solution early
at a given depth, then the other solutions which would result in a
bigger delta (the number of changes) are not investigated, and the
program finishes fast. Since this depends on where in the full
solution space the good solution happens to be, there are two options
for cutting down the time needed (see the example below):
.RS 4
.TP 3
\(em
\fB-l\fR makes any solution that has a delta lower than or equal to
its parameter succeed instantly; the default value for this parameter
is zero, so once we find a "perfect" solution we finish early

.TP
\(em
\fB-L\fR makes any solution with a delta higher than its parameter be
rejected instantly (without descending further into the search tree);
this can reduce the depth of the search tree, sometimes with
significant speedups; by default, this optimisation is not used
.RE
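.PP
For example, with arbitrary cut-off values chosen purely for
illustration, accepting any solution with a delta of at most 1 and
pruning any partial solution whose delta reaches 4 could be requested
as:
.PP
.RS 4
.nf
hn1 -C -l 1 -L 4 -n nodes -i instances
.fi
.RE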

The algorithm also has some other internal optimisations:
.RS 4
.TP 3
\(em
when choosing where to place an instance in phase two, there are
N*(N-1) possible primary/secondary options; however, if instead of
iterating over all p * s pairs we first determine the set of primary
nodes that can hold this instance (without failing N+1), we can cut
(N-1) secondary placements for each primary node removed; and since
this applies at every iteration of phase 2, it linearly decreases the
solution space, and on full clusters this can mean a four- to
five-fold reduction of the solution space

.TP
\(em
since the number of solutions is very high even for small depths (on
the test data, depth=4 results in 1.8M solutions), we can't compare
them all at the end, so at each iteration in phase 2 we only promote
the best solution out of our own set of solutions

.TP
\(em
since the placement of instances can only increase the delta of the
solution (placing a new instance adds zero or more replace-disks
steps), the delta can only grow while recursing during phase 2;
therefore, if at some point the current delta is equal to or higher
than the delta of the best solution so far, we can abort the
recursion; this cuts a tremendous number of branches; furthermore,
promoting the best solution from one removal set to the next can cut
entire removal sets after a few recursions
.RE

.SH OPTIONS
The options that can be passed to the program are as follows:
.TP
.B -C, --print-commands
Print the command list at the end of the run. Without this, the
program will only show a shorter, but cryptic, output.
.TP
.B -p, --print-nodes
Print the node status before and after the run, in a format designed
to let the user understand the node's most important parameters.

The node list contains the following fields:
.RS
.TP
.B F
a character denoting the status of the node, with '-' meaning an
offline node, '*' meaning N+1 failure and blank meaning a good node
.TP
.B Name
the node name
.TP
.B t_mem
the total node memory
.TP
.B n_mem
the memory used by the node itself
.TP
.B i_mem
the memory used by instances
.TP
.B x_mem
the amount of memory which seems to be in use but for which it cannot
be determined why or by which instance; usually this means that the
hypervisor has some overhead or that there are other reporting errors
.TP
.B f_mem
the free node memory
.TP
.B r_mem
the reserved node memory, which is the amount of free memory needed
for N+1 compliance
.TP
.B t_dsk
total disk
.TP
.B f_dsk
free disk
.TP
.B pri
number of primary instances
.TP
.B sec
number of secondary instances
.TP
.B p_fmem
percent of free memory
.TP
.B p_fdsk
percent of free disk
.RE

.TP
.BI "-n " nodefile ", --nodes=" nodefile
The name of the file holding node information (if not collecting via
RAPI), instead of the default \fInodes\fR file (but see below how to
customize the default value via the environment).

.TP
.BI "-i " instancefile ", --instances=" instancefile
The name of the file holding instance information (if not collecting
via RAPI), instead of the default \fIinstances\fR file (but see below
how to customize the default value via the environment).

.TP
.BI "-m " cluster
Collect data not from files but directly from the
.I cluster
given as an argument via RAPI. If the argument doesn't contain a colon
(:), then it is converted into a fully-built URL by prepending
https:// and appending the default RAPI port; otherwise it is
considered a fully-specified URL and is used unchanged.
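.sp
For example (with \fIcluster.example.com\fR standing in for the actual
cluster master name), collecting the data over RAPI and printing the
command list could look like:
.sp
.RS 4
.nf
hn1 -C -m cluster.example.com
.fi
.RE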

.TP
.BI "-d " DEPTH ", --depth=" DEPTH
Start the algorithm directly at depth \fIDEPTH\fR, so that lower
depths are not examined. This is faster if we know that no solution
exists at lower depths, and searching them would be wasted effort.
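.sp
For example, if earlier runs have shown that no solution exists below
depth three on a given cluster (an assumption made purely for
illustration), the search can start directly there:
.sp
.RS 4
.nf
hn1 -C -d 3
.fi
.RE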

.TP
.BI "-l " MIN-DELTA ", --min-delta=" MIN-DELTA
If we find a solution with a delta lower than or equal to
\fIMIN-DELTA\fR, consider this a success and don't search any further.

.TP
.BI "-L " MAX-DELTA ", --max-delta=" MAX-DELTA
If, while computing a solution, its intermediate delta is already
higher than or equal to \fIMAX-DELTA\fR, consider this a failure and
abort (as if the N+1 checks had failed).
257 |
.TP |

258 |
.B -V, --version |

259 |
Just show the program version and exit. |

260 | |

261 |
.SH EXIT STATUS |

262 | |

263 |
The exist status of the command will be zero, unless for some reason |

264 |
the algorithm fatally failed (e.g. wrong node or instance data). |

265 | |

266 |
.SH ENVIRONMENT |

267 | |

268 |
If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are |

269 |
present in the environment, they will override the default names for |

270 |
the nodes and instances files. These will have of course no effect |

271 |
when RAPI is used. |

272 | |
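.PP
For example, to point the program at data files in a non-default
location (the paths below are placeholders), one could run:
.PP
.RS 4
.nf
HTOOLS_NODES=/tmp/nodes HTOOLS_INSTANCES=/tmp/instances hn1 -C
.fi
.RE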

.SH BUGS

The program does not check its input data for consistency, and in such
cases aborts with cryptic error messages.

The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes
on input data which contains such instances.

The algorithm cannot detect that N+1 compliance is impossible to
reach, and will happily churn CPU for ages without realising it will
never find a solution.

The algorithm is too slow.

The output format is not easily scriptable, and the program should
feed moves directly into Ganeti (either via RAPI or via a gnt-debug
input file).

.SH SEE ALSO
.BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
.BR gnt-node "(8)"