## root / README @ e4f08c46

History | View | Annotate | Download (10.8 kB)

1 |
Cluster tools (h-aneti?) |
---|---|

2 |
======================== |

3 | |

4 |
These are some simple cluster tools for fixing common problems. Right now N+1 |

5 |
and rebalancing are included. |

6 | |

7 |
.. contents:: |

8 | |

9 |
Cluster N+1 solver |

10 |
------------------ |

11 | |

12 |
This program runs a very simple brute force algorithm over the instance |

13 |
placement space in order to determine the shortest number of replace-disks |

14 |
needed to fix the cluster. Note this means we won't get a balanced cluster, |

15 |
just one that passes N+1 checks. |

16 | |

17 |
Also note that the set of all instance placements on a 20/80 cluster is |

18 |
(20*19)^80, that is ~10^200, so... |

19 | |

20 |
Algorithm |

21 |
+++++++++ |

22 | |

23 |
The algorithm is a simple two-phase process. |

24 | |

25 |
In phase 1 we determine the removal set, that is the set of instances that when |

26 |
removed completely from the cluster, make it healthy again. The instances that |

27 |
can go into the set are all the primary and secondary instances of the failing |

28 |
nodes. The result from this phase is actually a list - we compute all sets of |

29 |
the same minimum length. |

30 | |

31 |
So basically we aim to determine here: what is the minimum number of instances |

32 |
that need to be removed (this is called the removal depth) and which are the |

33 |
actual combinations that fit (called the list of removal sets). |

34 | |

35 |
In phase 2, for each removal set computed in the previous phase, we take the |

36 |
removed instances and try to determine where we can put them so that the |

37 |
cluster is still passing N+1 checks. From this list of possible solutions |

38 |
(called the list of solutions), we compute the one that has the smallest delta |

39 |
from the original state (the delta is the number of replace disks that needs to |

40 |
be run) and chose this as the final solution. |

41 | |

42 |
Implementation |

43 |
++++++++++++++ |

44 | |

45 |
Of course, a naive implementation based on the above description will run for |

46 |
long periods of time, so the implementation has to be smart in order to prune |

47 |
the solution space as eagerly as possible. |

48 | |

49 |
In the following, we use as example a set of test data (a cluster with 20 |

50 |
nodes, 80 instances that has 5 nodes failing N+1 checks for a total of 12 |

51 |
warnings). |

52 | |

53 |
On this set, the minimum depth is 4 (anything below fails), and for this depth |

54 |
the current version of the algorithm generates 5 removal sets; a previous |

55 |
version of the first phase generated a slightly different set of instances, with |

56 |
two removal sets. For the original version of the algorithm: |

57 | |

58 |
- the first, non-optimized implementation computed a solution of delta=4 in 30 |

59 |
minutes on server-class CPUs and was still running when aborted 10 minutes |

60 |
later |

61 |
- the intermediate optimized version computed the whole solution space and |

62 |
found a delta=3 solution in around 10 seconds on a laptop-class CPU (total |

63 |
number of solutions ~600k) |

64 |
- latest version on server CPUs (which actually computes more removal sets) |

65 |
computes depth=4 in less than a second and depth=5 in around 2 seconds, and |

66 |
depth=6 in less than 20 seconds; depth=8 takes under five minutes (this is |

67 |
10^10 bigger solution space) |

68 | |

69 |
Note that when (artificially) increasing the depth to 5 the number of removal |

70 |
sets grows fast (~3000) and a (again artificial) depth 6 generates 61k removal |

71 |
sets. Therefore, it is possible to restrict the number of solution sets |

72 |
examined via a command-line option. |

73 | |

74 |
The factors that influence the run time are: |

75 | |

76 |
- the removal depth; for each increase with one of the depth, we grow the |

77 |
solution space by the number of nodes squared (since a new instance can live |

78 |
any two nodes as primary/secondary, therefore (almost) N times N); i.e., |

79 |
depth=1 will create a N^2 solution space, depth two will make this N^4, |

80 |
depth three will be N^6, etc. |

81 |
- the removal depth again; for each increase in the depth, there will be more |

82 |
valid removal sets, and the space of solutions increases linearly with the |

83 |
number of removal sets |

84 | |

85 |
Therefore, the smaller the depth the faster the algorithm will be; it doesn't |

86 |
seem like this algorithm will work for clusters of 100 nodes and many many |

87 |
small instances (e.g. 256MB instances on 16GB nodes). |

88 | |

89 |
Currently applied optimizations: |

90 | |

91 |
- when choosing where to place an instance in phase two, there are N*(N-1) |

92 |
possible primary/secondary options; however, if instead of iterating over all |

93 |
p * s pairs, we first determine the set of primary nodes that can hold this |

94 |
instance (without failing N+1), we can cut (N-1) secondary placements for |

95 |
each primary node removed; and since this applies at every iteration of phase |

96 |
2 it linearly decreases the solution space, and on full clusters, this can |

97 |
mean a four-five times reductions of solution space |

98 |
- since the number of solutions is very high even for smaller depths (on the |

99 |
test data, depth=4 results in 1.8M solutions) we can't compare them at the |

100 |
end, so at each iteration in phase 2 we only promote the best solution out of |

101 |
our own set of solutions |

102 |
- since the placement of instances can only increase the delta of the solution |

103 |
(placing a new instance will add zero or more replace-disks steps), it means |

104 |
the delta will only increase while recursing during phase 2; therefore, if we |

105 |
know at one point that we have a current delta that is equal or higher to the |

106 |
delta of the best solution so far, we can abort the recursion; this cuts a |

107 |
tremendous number of branches; further promotion of the best solution from |

108 |
one removal set to another can cut entire removal sets after a few recursions |

109 | |

110 |
Command line usage |

111 |
++++++++++++++++++ |

112 | |

113 |
Synopsis:: |

114 | |

115 |
hn1 [-n NODES_FILE] [-i INSTANCES_FILE] [-d START_DEPTH] \ |

116 |
[-r MAX_REMOVALS] [-l MIN_DELTA] [-L MAX_DELTA] \ |

117 |
[-p] [-C] |

118 | |

119 |
The -n and -i options change the names of the input files. The -d option |

120 |
changes the start depth, as a higher depth can give (with a longer computation |

121 |
time) a solution with better delta. The -r option restricts at each depth the |

122 |
number of solutions considered - with r=1000 for example even depth=10 finishes |

123 |
in less than a second. |

124 | |

125 |
The -p option will show the cluster state after the solution is implemented, |

126 |
while the -C option will show the needed gnt-instance commands to implement |

127 |
it. |

128 | |

129 |
The -l (--min-delta) and -L (--max-delta) options restrict the solution in the |

130 |
following ways: |

131 | |

132 |
- min-delta will cause the search to abort early once we find a solution with |

133 |
delta less than or equal to this parameter; this can cause extremely fast |

134 |
results in case a desired solution is found quickly; the default value for |

135 |
this parameter is zero, so once we find a "perfect" solution we finish early |

136 |
- max-delta causes rejection of valid solution but which have delta higher |

137 |
than the value of this parameter; this can reduce the depth of the search |

138 |
tree, with sometimes significant speedups; by default, this optimization is |

139 |
not used |

140 | |

141 |
Individually or combined, these two parameters can (if there are any) very |

142 |
fast result; on our test data, depth=34 (max depth!) is solved in 2 seconds |

143 |
with min-delta=0/max-delta=1 (since there is such a solution), and the |

144 |
extremely low max-delta causes extreme pruning. |

145 | |

146 |
Cluster rebalancer |

147 |
------------------ |

148 | |

149 |
Compared to the N+1 solver, the rebalancer uses a very simple algorithm: |

150 |
repeatedly try to move each instance one step, so that the cluster score |

151 |
becomes better. We stop when no further move can improve the score. |

152 | |

153 |
The algorithm is divided into rounds (all identical): |

154 | |

155 |
#. Repeat for each instance: |

156 | |

157 |
#. Compute score after the potential failover of the instance |

158 | |

159 |
#. For each node that is different from the current primary/secondary |

160 | |

161 |
#. Compute score after replacing the primary with this new node |

162 | |

163 |
#. Compute score after replacing the secondary with this new node |

164 | |

165 | |

166 |
#. Out of this N*2+1 possible new scores (and their associated move) for |

167 |
this instance, we choose the one that is the best in terms of cluster |

168 |
score, and then proceed to the next instance |

169 | |

170 |
Since we don't compute all combinations of moves for instances (e.g. the first |

171 |
instance's all moves Cartesian product with second instance's all moves, etc.) |

172 |
but we proceed serially instance A, then B, then C, the total computations we |

173 |
make in one steps is simply N(number of nodes)*2+1 times I(number of instances), |

174 |
instead of (N*2+1)^I. So therefore the runtime for a round is trivial. |

175 | |

176 |
Further rounds are done, since the relocation of instances might offer better |

177 |
places for instances which we didn't move, or simply didn't move to the best |

178 |
place. It is possible to limit the rounds, but usually the algorithm finishes |

179 |
after a few rounds by itself. |

180 | |

181 |
Note that the cluster *must* be N+1 compliant before this algorithm is run, and |

182 |
will stay at each move N+1 compliant. Therefore, the final cluster will be N+1 |

183 |
compliant. |

184 | |

185 |
Single-round solutions |

186 |
++++++++++++++++++++++ |

187 | |

188 |
Single-round solutions have the very nice property that they are |

189 |
incrementally-valid. In other words, if you have a 10-step solution, at each |

190 |
step the cluster is both N+1 compliant and better than the previous step. |

191 | |

192 |
This means that you can stop at any point and you will have a better cluster. |

193 |
For this reason, single-round solutions are recommended in the common case of |

194 |
let's make this better. Multi-round solutions will be better though when adding |

195 |
a couple of new, empty nodes to the cluster due to the many relocations needed. |

196 | |

197 | |

198 |
Multi-round solutions |

199 |
+++++++++++++++++++++ |

200 | |

201 |
A multi-round solution (not for a single round), due to de-duplication of moves |

202 |
(i.e. just put the instance directly in its final place, and not move it five |

203 |
times around) loses both these properties. It might be that it's not possible to |

204 |
directly put the instance on the final nodes. So it can be possible that yes, |

205 |
the cluster is happy in the final solution and nice, but you cannot do the steps |

206 |
in the shown order. Solving this (via additional instance move(s)) is left to |

207 |
the user. |

208 | |

209 |
Command line usage |

210 |
++++++++++++++++++ |

211 | |

212 |
Synopsis:: |

213 | |

214 |
hbal [-n NODES_FILE] [-i INSTANCES_FILE] \ |

215 |
[-r MAX_ROUNDS] \ |

216 |
[-p] [-C] |

217 | |

218 |
The -n and -i options change the names of the input files. The -r option |

219 |
restricts the maximum number of rounds (and is more of safety measure). |

220 | |

221 |
The -p option will show the cluster state after the solution is implemented, |

222 |
while the -C option will show the needed gnt-instance commands to implement |

223 |
it. |

224 | |

225 |
Integration with Ganeti |

226 |
----------------------- |

227 | |

228 |
The programs needs only the output of the node list and instance list. That is, |

229 |
they need the following two commands to be run:: |

230 | |

231 |
gnt-node list -oname,mtotal,mfree,dtotal,dfree,pinst_list,sinst_list \ |

232 |
--separator '|' --no-headers > nodes |

233 |
gnt-instance list -oname,admin_ram,sda_size \ |

234 |
--separator '|' --no-head > instances |

235 | |

236 |
These two files should be saved under the names of 'nodes' and 'instances'. |

237 | |

238 |
When run, the programs will show some informational messages and output the |

239 |
chosen solution, in the form of a list of instance name and chosen |

240 |
primary/secondary nodes. The user then needs to run the necessary commands to |

241 |
get the instances to live on those nodes. |

242 | |

243 |
Note that sda_size is less than the total disk size of an instance by 4352 |

244 |
MiB, so if disk space is at a premium the calculation could be wrong; in this |

245 |
case, please adjust the values manually. |