Statistics
| Branch: | Tag: | Revision:

root / README @ 01f6a5d2

History | View | Annotate | Download (10.8 kB)

1 e4f08c46 Iustin Pop
Cluster tools (h-aneti?)
2 e4f08c46 Iustin Pop
========================
3 e4f08c46 Iustin Pop
4 e4f08c46 Iustin Pop
These are some simple cluster tools for fixing common problems. Right now N+1
5 e4f08c46 Iustin Pop
and rebalancing are included.
6 e4f08c46 Iustin Pop
7 e4f08c46 Iustin Pop
.. contents::
8 e4f08c46 Iustin Pop
9 e4f08c46 Iustin Pop
Cluster N+1 solver
10 e4f08c46 Iustin Pop
------------------
11 e4f08c46 Iustin Pop
12 e4f08c46 Iustin Pop
This program runs a very simple brute force algorithm over the instance
13 e4f08c46 Iustin Pop
placement space in order to determine the shortest number of replace-disks
14 e4f08c46 Iustin Pop
needed to fix the cluster. Note this means we won't get a balanced cluster,
15 e4f08c46 Iustin Pop
just one that passes N+1 checks.
16 e4f08c46 Iustin Pop
17 e4f08c46 Iustin Pop
Also note that the set of all instance placements on a 20/80 cluster is
18 e4f08c46 Iustin Pop
(20*19)^80, that is ~10^200, so...
19 e4f08c46 Iustin Pop
20 e4f08c46 Iustin Pop
Algorithm
21 e4f08c46 Iustin Pop
+++++++++
22 e4f08c46 Iustin Pop
23 e4f08c46 Iustin Pop
The algorithm is a simple two-phase process.
24 e4f08c46 Iustin Pop
25 e4f08c46 Iustin Pop
In phase 1 we determine the removal set, that is the set of instances that when
26 e4f08c46 Iustin Pop
removed completely from the cluster, make it healthy again. The instances that
27 e4f08c46 Iustin Pop
can go into the set are all the primary and secondary instances of the failing
28 e4f08c46 Iustin Pop
nodes. The result from this phase is actually a list - we compute all sets of
29 e4f08c46 Iustin Pop
the same minimum length.
30 e4f08c46 Iustin Pop
31 e4f08c46 Iustin Pop
So basically we aim to determine here: what is the minimum number of instances
32 e4f08c46 Iustin Pop
that need to be removed (this is called the removal depth) and which are the
33 01f6a5d2 Iustin Pop
actual combinations that fit (called the list of removal sets).
34 e4f08c46 Iustin Pop
35 e4f08c46 Iustin Pop
In phase 2, for each removal set computed in the previous phase, we take the
36 e4f08c46 Iustin Pop
removed instances and try to determine where we can put them so that the
37 e4f08c46 Iustin Pop
cluster is still passing N+1 checks. From this list of possible solutions
38 e4f08c46 Iustin Pop
(called the list of solutions), we compute the one that has the smallest delta
39 e4f08c46 Iustin Pop
from the original state (the delta is the number of replace disks that needs to
40 e4f08c46 Iustin Pop
be run) and chose this as the final solution.
41 e4f08c46 Iustin Pop
42 e4f08c46 Iustin Pop
Implementation
43 e4f08c46 Iustin Pop
++++++++++++++
44 e4f08c46 Iustin Pop
45 e4f08c46 Iustin Pop
Of course, a naive implementation based on the above description will run for
46 e4f08c46 Iustin Pop
long periods of time, so the implementation has to be smart in order to prune
47 e4f08c46 Iustin Pop
the solution space as eagerly as possible.
48 e4f08c46 Iustin Pop
49 e4f08c46 Iustin Pop
In the following, we use as example a set of test data (a cluster with 20
50 e4f08c46 Iustin Pop
nodes, 80 instances that has 5 nodes failing N+1 checks for a total of 12
51 e4f08c46 Iustin Pop
warnings).
52 e4f08c46 Iustin Pop
53 e4f08c46 Iustin Pop
On this set, the minimum depth is 4 (anything below fails), and for this depth
54 e4f08c46 Iustin Pop
the current version of the algorithm generates 5 removal sets; a previous
55 e4f08c46 Iustin Pop
version of the first phase generated a slightly different set of instances, with
56 e4f08c46 Iustin Pop
two removal sets. For the original version of the algorithm:
57 e4f08c46 Iustin Pop
58 e4f08c46 Iustin Pop
- the first, non-optimized implementation computed a solution of delta=4 in 30
59 e4f08c46 Iustin Pop
  minutes on server-class CPUs and was still running when aborted 10 minutes
60 e4f08c46 Iustin Pop
  later
61 e4f08c46 Iustin Pop
- the intermediate optimized version computed the whole solution space and
62 e4f08c46 Iustin Pop
  found a delta=3 solution in around 10 seconds on a laptop-class CPU (total
63 e4f08c46 Iustin Pop
  number of solutions ~600k)
64 e4f08c46 Iustin Pop
- latest version on server CPUs (which actually computes more removal sets)
65 e4f08c46 Iustin Pop
  computes depth=4 in less than a second and depth=5 in around 2 seconds, and
66 e4f08c46 Iustin Pop
  depth=6 in less than 20 seconds; depth=8 takes under five minutes (this is
67 e4f08c46 Iustin Pop
  10^10 bigger solution space)
68 e4f08c46 Iustin Pop
69 e4f08c46 Iustin Pop
Note that when (artificially) increasing the depth to 5 the number of removal
70 e4f08c46 Iustin Pop
sets grows fast (~3000) and a (again artificial) depth 6 generates 61k removal
71 e4f08c46 Iustin Pop
sets. Therefore, it is possible to restrict the number of solution sets
72 e4f08c46 Iustin Pop
examined via a command-line option.
73 e4f08c46 Iustin Pop
74 e4f08c46 Iustin Pop
The factors that influence the run time are:
75 e4f08c46 Iustin Pop
76 e4f08c46 Iustin Pop
- the removal depth; for each increase with one of the depth, we grow the
77 e4f08c46 Iustin Pop
  solution space by the number of nodes squared (since a new instance can live
78 e4f08c46 Iustin Pop
  any two nodes as primary/secondary, therefore (almost) N times N); i.e.,
79 e4f08c46 Iustin Pop
  depth=1 will create a N^2 solution space, depth two will make this N^4,
80 e4f08c46 Iustin Pop
  depth three will be N^6, etc.
81 e4f08c46 Iustin Pop
- the removal depth again; for each increase in the depth, there will be more
82 e4f08c46 Iustin Pop
  valid removal sets, and the space of solutions increases linearly with the
83 e4f08c46 Iustin Pop
  number of removal sets
84 e4f08c46 Iustin Pop
85 e4f08c46 Iustin Pop
Therefore, the smaller the depth the faster the algorithm will be; it doesn't
86 e4f08c46 Iustin Pop
seem like this algorithm will work for clusters of 100 nodes and many many
87 e4f08c46 Iustin Pop
small instances (e.g. 256MB instances on 16GB nodes).
88 e4f08c46 Iustin Pop
89 e4f08c46 Iustin Pop
Currently applied optimizations:
90 e4f08c46 Iustin Pop
91 e4f08c46 Iustin Pop
- when choosing where to place an instance in phase two, there are N*(N-1)
92 e4f08c46 Iustin Pop
  possible primary/secondary options; however, if instead of iterating over all
93 e4f08c46 Iustin Pop
  p * s pairs, we first determine the set of primary nodes that can hold this
94 e4f08c46 Iustin Pop
  instance (without failing N+1), we can cut (N-1) secondary placements for
95 e4f08c46 Iustin Pop
  each primary node removed; and since this applies at every iteration of phase
96 e4f08c46 Iustin Pop
  2 it linearly decreases the solution space, and on full clusters, this can
97 e4f08c46 Iustin Pop
  mean a four-five times reductions of solution space
98 e4f08c46 Iustin Pop
- since the number of solutions is very high even for smaller depths (on the
99 e4f08c46 Iustin Pop
  test data, depth=4 results in 1.8M solutions) we can't compare them at the
100 e4f08c46 Iustin Pop
  end, so at each iteration in phase 2 we only promote the best solution out of
101 e4f08c46 Iustin Pop
  our own set of solutions
102 e4f08c46 Iustin Pop
- since the placement of instances can only increase the delta of the solution
103 e4f08c46 Iustin Pop
  (placing a new instance will add zero or more replace-disks steps), it means
104 e4f08c46 Iustin Pop
  the delta will only increase while recursing during phase 2; therefore, if we
105 e4f08c46 Iustin Pop
  know at one point that we have a current delta that is equal or higher to the
106 e4f08c46 Iustin Pop
  delta of the best solution so far, we can abort the recursion; this cuts a
107 e4f08c46 Iustin Pop
  tremendous number of branches; further promotion of the best solution from
108 e4f08c46 Iustin Pop
  one removal set to another can cut entire removal sets after a few recursions
109 e4f08c46 Iustin Pop
110 e4f08c46 Iustin Pop
Command line usage
111 e4f08c46 Iustin Pop
++++++++++++++++++
112 e4f08c46 Iustin Pop
113 e4f08c46 Iustin Pop
Synopsis::
114 e4f08c46 Iustin Pop
115 e4f08c46 Iustin Pop
    hn1 [-n NODES_FILE] [-i INSTANCES_FILE] [-d START_DEPTH] \
116 e4f08c46 Iustin Pop
        [-r MAX_REMOVALS] [-l MIN_DELTA] [-L MAX_DELTA] \
117 e4f08c46 Iustin Pop
        [-p] [-C]
118 e4f08c46 Iustin Pop
119 e4f08c46 Iustin Pop
The -n and -i options change the names of the input files. The -d option
120 e4f08c46 Iustin Pop
changes the start depth, as a higher depth can give (with a longer computation
121 e4f08c46 Iustin Pop
time) a solution with better delta. The -r option restricts at each depth the
122 e4f08c46 Iustin Pop
number of solutions considered - with r=1000 for example even depth=10 finishes
123 e4f08c46 Iustin Pop
in less than a second.
124 e4f08c46 Iustin Pop
125 e4f08c46 Iustin Pop
The -p option will show the cluster state after the solution is implemented,
126 e4f08c46 Iustin Pop
while the -C option will show the needed gnt-instance commands to implement
127 e4f08c46 Iustin Pop
it.
128 e4f08c46 Iustin Pop
129 e4f08c46 Iustin Pop
The -l (--min-delta) and -L (--max-delta) options restrict the solution in the
130 e4f08c46 Iustin Pop
following ways:
131 e4f08c46 Iustin Pop
132 e4f08c46 Iustin Pop
- min-delta will cause the search to abort early once we find a solution with
133 e4f08c46 Iustin Pop
  delta less than or equal to this parameter; this can cause extremely fast
134 e4f08c46 Iustin Pop
  results in case a desired solution is found quickly; the default value for
135 e4f08c46 Iustin Pop
  this parameter is zero, so once we find a "perfect" solution we finish early
136 e4f08c46 Iustin Pop
- max-delta causes rejection of valid solution but which have delta higher
137 e4f08c46 Iustin Pop
  than the value of this parameter; this can reduce the depth of the search
138 e4f08c46 Iustin Pop
  tree, with sometimes significant speedups; by default, this optimization is
139 e4f08c46 Iustin Pop
  not used
140 e4f08c46 Iustin Pop
141 e4f08c46 Iustin Pop
Individually or combined, these two parameters can (if there are any) very
142 e4f08c46 Iustin Pop
fast result; on our test data, depth=34 (max depth!) is solved in 2 seconds
143 e4f08c46 Iustin Pop
with min-delta=0/max-delta=1 (since there is such a solution), and the
144 e4f08c46 Iustin Pop
extremely low max-delta causes extreme pruning.
145 e4f08c46 Iustin Pop
146 e4f08c46 Iustin Pop
Cluster rebalancer
147 e4f08c46 Iustin Pop
------------------
148 e4f08c46 Iustin Pop
149 e4f08c46 Iustin Pop
Compared to the N+1 solver, the rebalancer uses a very simple algorithm:
150 e4f08c46 Iustin Pop
repeatedly try to move each instance one step, so that the cluster score
151 e4f08c46 Iustin Pop
becomes better. We stop when no further move can improve the score.
152 e4f08c46 Iustin Pop
153 e4f08c46 Iustin Pop
The algorithm is divided into rounds (all identical):
154 e4f08c46 Iustin Pop
155 e4f08c46 Iustin Pop
#. Repeat for each instance:
156 e4f08c46 Iustin Pop
157 e4f08c46 Iustin Pop
    #. Compute score after the potential failover of the instance
158 e4f08c46 Iustin Pop
159 e4f08c46 Iustin Pop
    #. For each node that is different from the current primary/secondary
160 e4f08c46 Iustin Pop
161 e4f08c46 Iustin Pop
        #. Compute score after replacing the primary with this new node
162 e4f08c46 Iustin Pop
163 e4f08c46 Iustin Pop
        #. Compute score after replacing the secondary with this new node
164 e4f08c46 Iustin Pop
165 e4f08c46 Iustin Pop
166 e4f08c46 Iustin Pop
    #. Out of this N*2+1 possible new scores (and their associated move) for
167 e4f08c46 Iustin Pop
       this instance, we choose the one that is the best in terms of cluster
168 e4f08c46 Iustin Pop
       score, and then proceed to the next instance
169 e4f08c46 Iustin Pop
170 e4f08c46 Iustin Pop
Since we don't compute all combinations of moves for instances (e.g. the first
171 e4f08c46 Iustin Pop
instance's all moves Cartesian product with second instance's all moves, etc.)
172 e4f08c46 Iustin Pop
but we proceed serially instance A, then B, then C, the total computations we
173 e4f08c46 Iustin Pop
make in one steps is simply N(number of nodes)*2+1 times I(number of instances),
174 e4f08c46 Iustin Pop
instead of (N*2+1)^I. So therefore the runtime for a round is trivial.
175 e4f08c46 Iustin Pop
176 e4f08c46 Iustin Pop
Further rounds are done, since the relocation of instances might offer better
177 e4f08c46 Iustin Pop
places for instances which we didn't move, or simply didn't move to the best
178 e4f08c46 Iustin Pop
place. It is possible to limit the rounds, but usually the algorithm finishes
179 e4f08c46 Iustin Pop
after a few rounds by itself.
180 e4f08c46 Iustin Pop
181 e4f08c46 Iustin Pop
Note that the cluster *must* be N+1 compliant before this algorithm is run, and
182 e4f08c46 Iustin Pop
will stay at each move N+1 compliant. Therefore, the final cluster will be N+1
183 e4f08c46 Iustin Pop
compliant.
184 e4f08c46 Iustin Pop
185 e4f08c46 Iustin Pop
Single-round solutions
186 e4f08c46 Iustin Pop
++++++++++++++++++++++
187 e4f08c46 Iustin Pop
188 e4f08c46 Iustin Pop
Single-round solutions have the very nice property that they are
189 e4f08c46 Iustin Pop
incrementally-valid. In other words, if you have a 10-step solution, at each
190 e4f08c46 Iustin Pop
step the cluster is both N+1 compliant and better than the previous step.
191 e4f08c46 Iustin Pop
192 e4f08c46 Iustin Pop
This means that you can stop at any point and you will have a better cluster.
193 e4f08c46 Iustin Pop
For this reason, single-round solutions are recommended in the common case of
194 e4f08c46 Iustin Pop
let's make this better. Multi-round solutions will be better though when adding
195 e4f08c46 Iustin Pop
a couple of new, empty nodes to the cluster due to the many relocations needed.
196 e4f08c46 Iustin Pop
197 e4f08c46 Iustin Pop
198 e4f08c46 Iustin Pop
Multi-round solutions
199 e4f08c46 Iustin Pop
+++++++++++++++++++++
200 e4f08c46 Iustin Pop
201 e4f08c46 Iustin Pop
A multi-round solution (not for a single round), due to de-duplication of moves
202 e4f08c46 Iustin Pop
(i.e. just put the instance directly in its final place, and not move it five
203 e4f08c46 Iustin Pop
times around) loses both these properties. It might be that it's not possible to
204 e4f08c46 Iustin Pop
directly put the instance on the final nodes. So it can be possible that yes,
205 e4f08c46 Iustin Pop
the cluster is happy in the final solution and nice, but you cannot do the steps
206 e4f08c46 Iustin Pop
in the shown order. Solving this (via additional instance move(s)) is left to
207 e4f08c46 Iustin Pop
the user.
208 e4f08c46 Iustin Pop
209 e4f08c46 Iustin Pop
Command line usage
210 e4f08c46 Iustin Pop
++++++++++++++++++
211 e4f08c46 Iustin Pop
212 e4f08c46 Iustin Pop
Synopsis::
213 e4f08c46 Iustin Pop
214 e4f08c46 Iustin Pop
    hbal [-n NODES_FILE] [-i INSTANCES_FILE] \
215 e4f08c46 Iustin Pop
         [-r MAX_ROUNDS] \
216 e4f08c46 Iustin Pop
         [-p] [-C]
217 e4f08c46 Iustin Pop
218 e4f08c46 Iustin Pop
The -n and -i options change the names of the input files. The -r option
219 e4f08c46 Iustin Pop
restricts the maximum number of rounds (and is more of safety measure).
220 e4f08c46 Iustin Pop
221 e4f08c46 Iustin Pop
The -p option will show the cluster state after the solution is implemented,
222 e4f08c46 Iustin Pop
while the -C option will show the needed gnt-instance commands to implement
223 e4f08c46 Iustin Pop
it.
224 e4f08c46 Iustin Pop
225 e4f08c46 Iustin Pop
Integration with Ganeti
226 e4f08c46 Iustin Pop
-----------------------
227 e4f08c46 Iustin Pop
228 e4f08c46 Iustin Pop
The programs needs only the output of the node list and instance list. That is,
229 e4f08c46 Iustin Pop
they need the following two commands to be run::
230 e4f08c46 Iustin Pop
231 01f6a5d2 Iustin Pop
    gnt-node list -oname,mtotal,mfree,dtotal,dfree \
232 e4f08c46 Iustin Pop
      --separator '|' --no-headers > nodes
233 01f6a5d2 Iustin Pop
    gnt-instance list -oname,admin_ram,sda_size,pnode,snodes \
234 e4f08c46 Iustin Pop
      --separator '|' --no-head > instances
235 e4f08c46 Iustin Pop
236 e4f08c46 Iustin Pop
These two files should be saved under the names of 'nodes' and 'instances'.
237 e4f08c46 Iustin Pop
238 e4f08c46 Iustin Pop
When run, the programs will show some informational messages and output the
239 e4f08c46 Iustin Pop
chosen solution, in the form of a list of instance name and chosen
240 e4f08c46 Iustin Pop
primary/secondary nodes. The user then needs to run the necessary commands to
241 e4f08c46 Iustin Pop
get the instances to live on those nodes.
242 e4f08c46 Iustin Pop
243 e4f08c46 Iustin Pop
Note that sda_size is less than the total disk size of an instance by 4352
244 e4f08c46 Iustin Pop
MiB, so if disk space is at a premium the calculation could be wrong; in this
245 e4f08c46 Iustin Pop
case, please adjust the values manually.