root / README @ 01f6a5d2
History | View | Annotate | Download (10.8 kB)
1 | e4f08c46 | Iustin Pop | Cluster tools (h-aneti?) |
---|---|---|---|
2 | e4f08c46 | Iustin Pop | ======================== |
3 | e4f08c46 | Iustin Pop | |
4 | e4f08c46 | Iustin Pop | These are some simple cluster tools for fixing common problems. Right now N+1 |
5 | e4f08c46 | Iustin Pop | and rebalancing are included. |
6 | e4f08c46 | Iustin Pop | |
7 | e4f08c46 | Iustin Pop | .. contents:: |
8 | e4f08c46 | Iustin Pop | |
9 | e4f08c46 | Iustin Pop | Cluster N+1 solver |
10 | e4f08c46 | Iustin Pop | ------------------ |
11 | e4f08c46 | Iustin Pop | |
12 | e4f08c46 | Iustin Pop | This program runs a very simple brute force algorithm over the instance |
13 | e4f08c46 | Iustin Pop | placement space in order to determine the shortest number of replace-disks |
14 | e4f08c46 | Iustin Pop | needed to fix the cluster. Note this means we won't get a balanced cluster, |
15 | e4f08c46 | Iustin Pop | just one that passes N+1 checks. |
16 | e4f08c46 | Iustin Pop | |
17 | e4f08c46 | Iustin Pop | Also note that the set of all instance placements on a 20/80 cluster is |
18 | e4f08c46 | Iustin Pop | (20*19)^80, that is ~10^200, so... |
19 | e4f08c46 | Iustin Pop | |
20 | e4f08c46 | Iustin Pop | Algorithm |
21 | e4f08c46 | Iustin Pop | +++++++++ |
22 | e4f08c46 | Iustin Pop | |
23 | e4f08c46 | Iustin Pop | The algorithm is a simple two-phase process. |
24 | e4f08c46 | Iustin Pop | |
25 | e4f08c46 | Iustin Pop | In phase 1 we determine the removal set, that is the set of instances that when |
26 | e4f08c46 | Iustin Pop | removed completely from the cluster, make it healthy again. The instances that |
27 | e4f08c46 | Iustin Pop | can go into the set are all the primary and secondary instances of the failing |
28 | e4f08c46 | Iustin Pop | nodes. The result from this phase is actually a list - we compute all sets of |
29 | e4f08c46 | Iustin Pop | the same minimum length. |
30 | e4f08c46 | Iustin Pop | |
31 | e4f08c46 | Iustin Pop | So basically we aim to determine here: what is the minimum number of instances |
32 | e4f08c46 | Iustin Pop | that need to be removed (this is called the removal depth) and which are the |
33 | 01f6a5d2 | Iustin Pop | actual combinations that fit (called the list of removal sets). |
34 | e4f08c46 | Iustin Pop | |
35 | e4f08c46 | Iustin Pop | In phase 2, for each removal set computed in the previous phase, we take the |
36 | e4f08c46 | Iustin Pop | removed instances and try to determine where we can put them so that the |
37 | e4f08c46 | Iustin Pop | cluster is still passing N+1 checks. From this list of possible solutions |
38 | e4f08c46 | Iustin Pop | (called the list of solutions), we compute the one that has the smallest delta |
39 | e4f08c46 | Iustin Pop | from the original state (the delta is the number of replace disks that needs to |
40 | e4f08c46 | Iustin Pop | be run) and chose this as the final solution. |
41 | e4f08c46 | Iustin Pop | |
42 | e4f08c46 | Iustin Pop | Implementation |
43 | e4f08c46 | Iustin Pop | ++++++++++++++ |
44 | e4f08c46 | Iustin Pop | |
45 | e4f08c46 | Iustin Pop | Of course, a naive implementation based on the above description will run for |
46 | e4f08c46 | Iustin Pop | long periods of time, so the implementation has to be smart in order to prune |
47 | e4f08c46 | Iustin Pop | the solution space as eagerly as possible. |
48 | e4f08c46 | Iustin Pop | |
49 | e4f08c46 | Iustin Pop | In the following, we use as example a set of test data (a cluster with 20 |
50 | e4f08c46 | Iustin Pop | nodes, 80 instances that has 5 nodes failing N+1 checks for a total of 12 |
51 | e4f08c46 | Iustin Pop | warnings). |
52 | e4f08c46 | Iustin Pop | |
53 | e4f08c46 | Iustin Pop | On this set, the minimum depth is 4 (anything below fails), and for this depth |
54 | e4f08c46 | Iustin Pop | the current version of the algorithm generates 5 removal sets; a previous |
55 | e4f08c46 | Iustin Pop | version of the first phase generated a slightly different set of instances, with |
56 | e4f08c46 | Iustin Pop | two removal sets. For the original version of the algorithm: |
57 | e4f08c46 | Iustin Pop | |
58 | e4f08c46 | Iustin Pop | - the first, non-optimized implementation computed a solution of delta=4 in 30 |
59 | e4f08c46 | Iustin Pop | minutes on server-class CPUs and was still running when aborted 10 minutes |
60 | e4f08c46 | Iustin Pop | later |
61 | e4f08c46 | Iustin Pop | - the intermediate optimized version computed the whole solution space and |
62 | e4f08c46 | Iustin Pop | found a delta=3 solution in around 10 seconds on a laptop-class CPU (total |
63 | e4f08c46 | Iustin Pop | number of solutions ~600k) |
64 | e4f08c46 | Iustin Pop | - latest version on server CPUs (which actually computes more removal sets) |
65 | e4f08c46 | Iustin Pop | computes depth=4 in less than a second and depth=5 in around 2 seconds, and |
66 | e4f08c46 | Iustin Pop | depth=6 in less than 20 seconds; depth=8 takes under five minutes (this is |
67 | e4f08c46 | Iustin Pop | 10^10 bigger solution space) |
68 | e4f08c46 | Iustin Pop | |
69 | e4f08c46 | Iustin Pop | Note that when (artificially) increasing the depth to 5 the number of removal |
70 | e4f08c46 | Iustin Pop | sets grows fast (~3000) and a (again artificial) depth 6 generates 61k removal |
71 | e4f08c46 | Iustin Pop | sets. Therefore, it is possible to restrict the number of solution sets |
72 | e4f08c46 | Iustin Pop | examined via a command-line option. |
73 | e4f08c46 | Iustin Pop | |
74 | e4f08c46 | Iustin Pop | The factors that influence the run time are: |
75 | e4f08c46 | Iustin Pop | |
76 | e4f08c46 | Iustin Pop | - the removal depth; for each increase with one of the depth, we grow the |
77 | e4f08c46 | Iustin Pop | solution space by the number of nodes squared (since a new instance can live |
78 | e4f08c46 | Iustin Pop | any two nodes as primary/secondary, therefore (almost) N times N); i.e., |
79 | e4f08c46 | Iustin Pop | depth=1 will create a N^2 solution space, depth two will make this N^4, |
80 | e4f08c46 | Iustin Pop | depth three will be N^6, etc. |
81 | e4f08c46 | Iustin Pop | - the removal depth again; for each increase in the depth, there will be more |
82 | e4f08c46 | Iustin Pop | valid removal sets, and the space of solutions increases linearly with the |
83 | e4f08c46 | Iustin Pop | number of removal sets |
84 | e4f08c46 | Iustin Pop | |
85 | e4f08c46 | Iustin Pop | Therefore, the smaller the depth the faster the algorithm will be; it doesn't |
86 | e4f08c46 | Iustin Pop | seem like this algorithm will work for clusters of 100 nodes and many many |
87 | e4f08c46 | Iustin Pop | small instances (e.g. 256MB instances on 16GB nodes). |
88 | e4f08c46 | Iustin Pop | |
89 | e4f08c46 | Iustin Pop | Currently applied optimizations: |
90 | e4f08c46 | Iustin Pop | |
91 | e4f08c46 | Iustin Pop | - when choosing where to place an instance in phase two, there are N*(N-1) |
92 | e4f08c46 | Iustin Pop | possible primary/secondary options; however, if instead of iterating over all |
93 | e4f08c46 | Iustin Pop | p * s pairs, we first determine the set of primary nodes that can hold this |
94 | e4f08c46 | Iustin Pop | instance (without failing N+1), we can cut (N-1) secondary placements for |
95 | e4f08c46 | Iustin Pop | each primary node removed; and since this applies at every iteration of phase |
96 | e4f08c46 | Iustin Pop | 2 it linearly decreases the solution space, and on full clusters, this can |
97 | e4f08c46 | Iustin Pop | mean a four-five times reductions of solution space |
98 | e4f08c46 | Iustin Pop | - since the number of solutions is very high even for smaller depths (on the |
99 | e4f08c46 | Iustin Pop | test data, depth=4 results in 1.8M solutions) we can't compare them at the |
100 | e4f08c46 | Iustin Pop | end, so at each iteration in phase 2 we only promote the best solution out of |
101 | e4f08c46 | Iustin Pop | our own set of solutions |
102 | e4f08c46 | Iustin Pop | - since the placement of instances can only increase the delta of the solution |
103 | e4f08c46 | Iustin Pop | (placing a new instance will add zero or more replace-disks steps), it means |
104 | e4f08c46 | Iustin Pop | the delta will only increase while recursing during phase 2; therefore, if we |
105 | e4f08c46 | Iustin Pop | know at one point that we have a current delta that is equal or higher to the |
106 | e4f08c46 | Iustin Pop | delta of the best solution so far, we can abort the recursion; this cuts a |
107 | e4f08c46 | Iustin Pop | tremendous number of branches; further promotion of the best solution from |
108 | e4f08c46 | Iustin Pop | one removal set to another can cut entire removal sets after a few recursions |
109 | e4f08c46 | Iustin Pop | |
110 | e4f08c46 | Iustin Pop | Command line usage |
111 | e4f08c46 | Iustin Pop | ++++++++++++++++++ |
112 | e4f08c46 | Iustin Pop | |
113 | e4f08c46 | Iustin Pop | Synopsis:: |
114 | e4f08c46 | Iustin Pop | |
115 | e4f08c46 | Iustin Pop | hn1 [-n NODES_FILE] [-i INSTANCES_FILE] [-d START_DEPTH] \ |
116 | e4f08c46 | Iustin Pop | [-r MAX_REMOVALS] [-l MIN_DELTA] [-L MAX_DELTA] \ |
117 | e4f08c46 | Iustin Pop | [-p] [-C] |
118 | e4f08c46 | Iustin Pop | |
119 | e4f08c46 | Iustin Pop | The -n and -i options change the names of the input files. The -d option |
120 | e4f08c46 | Iustin Pop | changes the start depth, as a higher depth can give (with a longer computation |
121 | e4f08c46 | Iustin Pop | time) a solution with better delta. The -r option restricts at each depth the |
122 | e4f08c46 | Iustin Pop | number of solutions considered - with r=1000 for example even depth=10 finishes |
123 | e4f08c46 | Iustin Pop | in less than a second. |
124 | e4f08c46 | Iustin Pop | |
125 | e4f08c46 | Iustin Pop | The -p option will show the cluster state after the solution is implemented, |
126 | e4f08c46 | Iustin Pop | while the -C option will show the needed gnt-instance commands to implement |
127 | e4f08c46 | Iustin Pop | it. |
128 | e4f08c46 | Iustin Pop | |
129 | e4f08c46 | Iustin Pop | The -l (--min-delta) and -L (--max-delta) options restrict the solution in the |
130 | e4f08c46 | Iustin Pop | following ways: |
131 | e4f08c46 | Iustin Pop | |
132 | e4f08c46 | Iustin Pop | - min-delta will cause the search to abort early once we find a solution with |
133 | e4f08c46 | Iustin Pop | delta less than or equal to this parameter; this can cause extremely fast |
134 | e4f08c46 | Iustin Pop | results in case a desired solution is found quickly; the default value for |
135 | e4f08c46 | Iustin Pop | this parameter is zero, so once we find a "perfect" solution we finish early |
136 | e4f08c46 | Iustin Pop | - max-delta causes rejection of valid solution but which have delta higher |
137 | e4f08c46 | Iustin Pop | than the value of this parameter; this can reduce the depth of the search |
138 | e4f08c46 | Iustin Pop | tree, with sometimes significant speedups; by default, this optimization is |
139 | e4f08c46 | Iustin Pop | not used |
140 | e4f08c46 | Iustin Pop | |
141 | e4f08c46 | Iustin Pop | Individually or combined, these two parameters can (if there are any) very |
142 | e4f08c46 | Iustin Pop | fast result; on our test data, depth=34 (max depth!) is solved in 2 seconds |
143 | e4f08c46 | Iustin Pop | with min-delta=0/max-delta=1 (since there is such a solution), and the |
144 | e4f08c46 | Iustin Pop | extremely low max-delta causes extreme pruning. |
145 | e4f08c46 | Iustin Pop | |
146 | e4f08c46 | Iustin Pop | Cluster rebalancer |
147 | e4f08c46 | Iustin Pop | ------------------ |
148 | e4f08c46 | Iustin Pop | |
149 | e4f08c46 | Iustin Pop | Compared to the N+1 solver, the rebalancer uses a very simple algorithm: |
150 | e4f08c46 | Iustin Pop | repeatedly try to move each instance one step, so that the cluster score |
151 | e4f08c46 | Iustin Pop | becomes better. We stop when no further move can improve the score. |
152 | e4f08c46 | Iustin Pop | |
153 | e4f08c46 | Iustin Pop | The algorithm is divided into rounds (all identical): |
154 | e4f08c46 | Iustin Pop | |
155 | e4f08c46 | Iustin Pop | #. Repeat for each instance: |
156 | e4f08c46 | Iustin Pop | |
157 | e4f08c46 | Iustin Pop | #. Compute score after the potential failover of the instance |
158 | e4f08c46 | Iustin Pop | |
159 | e4f08c46 | Iustin Pop | #. For each node that is different from the current primary/secondary |
160 | e4f08c46 | Iustin Pop | |
161 | e4f08c46 | Iustin Pop | #. Compute score after replacing the primary with this new node |
162 | e4f08c46 | Iustin Pop | |
163 | e4f08c46 | Iustin Pop | #. Compute score after replacing the secondary with this new node |
164 | e4f08c46 | Iustin Pop | |
165 | e4f08c46 | Iustin Pop | |
166 | e4f08c46 | Iustin Pop | #. Out of this N*2+1 possible new scores (and their associated move) for |
167 | e4f08c46 | Iustin Pop | this instance, we choose the one that is the best in terms of cluster |
168 | e4f08c46 | Iustin Pop | score, and then proceed to the next instance |
169 | e4f08c46 | Iustin Pop | |
170 | e4f08c46 | Iustin Pop | Since we don't compute all combinations of moves for instances (e.g. the first |
171 | e4f08c46 | Iustin Pop | instance's all moves Cartesian product with second instance's all moves, etc.) |
172 | e4f08c46 | Iustin Pop | but we proceed serially instance A, then B, then C, the total computations we |
173 | e4f08c46 | Iustin Pop | make in one steps is simply N(number of nodes)*2+1 times I(number of instances), |
174 | e4f08c46 | Iustin Pop | instead of (N*2+1)^I. So therefore the runtime for a round is trivial. |
175 | e4f08c46 | Iustin Pop | |
176 | e4f08c46 | Iustin Pop | Further rounds are done, since the relocation of instances might offer better |
177 | e4f08c46 | Iustin Pop | places for instances which we didn't move, or simply didn't move to the best |
178 | e4f08c46 | Iustin Pop | place. It is possible to limit the rounds, but usually the algorithm finishes |
179 | e4f08c46 | Iustin Pop | after a few rounds by itself. |
180 | e4f08c46 | Iustin Pop | |
181 | e4f08c46 | Iustin Pop | Note that the cluster *must* be N+1 compliant before this algorithm is run, and |
182 | e4f08c46 | Iustin Pop | will stay at each move N+1 compliant. Therefore, the final cluster will be N+1 |
183 | e4f08c46 | Iustin Pop | compliant. |
184 | e4f08c46 | Iustin Pop | |
185 | e4f08c46 | Iustin Pop | Single-round solutions |
186 | e4f08c46 | Iustin Pop | ++++++++++++++++++++++ |
187 | e4f08c46 | Iustin Pop | |
188 | e4f08c46 | Iustin Pop | Single-round solutions have the very nice property that they are |
189 | e4f08c46 | Iustin Pop | incrementally-valid. In other words, if you have a 10-step solution, at each |
190 | e4f08c46 | Iustin Pop | step the cluster is both N+1 compliant and better than the previous step. |
191 | e4f08c46 | Iustin Pop | |
192 | e4f08c46 | Iustin Pop | This means that you can stop at any point and you will have a better cluster. |
193 | e4f08c46 | Iustin Pop | For this reason, single-round solutions are recommended in the common case of |
194 | e4f08c46 | Iustin Pop | let's make this better. Multi-round solutions will be better though when adding |
195 | e4f08c46 | Iustin Pop | a couple of new, empty nodes to the cluster due to the many relocations needed. |
196 | e4f08c46 | Iustin Pop | |
197 | e4f08c46 | Iustin Pop | |
198 | e4f08c46 | Iustin Pop | Multi-round solutions |
199 | e4f08c46 | Iustin Pop | +++++++++++++++++++++ |
200 | e4f08c46 | Iustin Pop | |
201 | e4f08c46 | Iustin Pop | A multi-round solution (not for a single round), due to de-duplication of moves |
202 | e4f08c46 | Iustin Pop | (i.e. just put the instance directly in its final place, and not move it five |
203 | e4f08c46 | Iustin Pop | times around) loses both these properties. It might be that it's not possible to |
204 | e4f08c46 | Iustin Pop | directly put the instance on the final nodes. So it can be possible that yes, |
205 | e4f08c46 | Iustin Pop | the cluster is happy in the final solution and nice, but you cannot do the steps |
206 | e4f08c46 | Iustin Pop | in the shown order. Solving this (via additional instance move(s)) is left to |
207 | e4f08c46 | Iustin Pop | the user. |
208 | e4f08c46 | Iustin Pop | |
209 | e4f08c46 | Iustin Pop | Command line usage |
210 | e4f08c46 | Iustin Pop | ++++++++++++++++++ |
211 | e4f08c46 | Iustin Pop | |
212 | e4f08c46 | Iustin Pop | Synopsis:: |
213 | e4f08c46 | Iustin Pop | |
214 | e4f08c46 | Iustin Pop | hbal [-n NODES_FILE] [-i INSTANCES_FILE] \ |
215 | e4f08c46 | Iustin Pop | [-r MAX_ROUNDS] \ |
216 | e4f08c46 | Iustin Pop | [-p] [-C] |
217 | e4f08c46 | Iustin Pop | |
218 | e4f08c46 | Iustin Pop | The -n and -i options change the names of the input files. The -r option |
219 | e4f08c46 | Iustin Pop | restricts the maximum number of rounds (and is more of safety measure). |
220 | e4f08c46 | Iustin Pop | |
221 | e4f08c46 | Iustin Pop | The -p option will show the cluster state after the solution is implemented, |
222 | e4f08c46 | Iustin Pop | while the -C option will show the needed gnt-instance commands to implement |
223 | e4f08c46 | Iustin Pop | it. |
224 | e4f08c46 | Iustin Pop | |
225 | e4f08c46 | Iustin Pop | Integration with Ganeti |
226 | e4f08c46 | Iustin Pop | ----------------------- |
227 | e4f08c46 | Iustin Pop | |
228 | e4f08c46 | Iustin Pop | The programs needs only the output of the node list and instance list. That is, |
229 | e4f08c46 | Iustin Pop | they need the following two commands to be run:: |
230 | e4f08c46 | Iustin Pop | |
231 | 01f6a5d2 | Iustin Pop | gnt-node list -oname,mtotal,mfree,dtotal,dfree \ |
232 | e4f08c46 | Iustin Pop | --separator '|' --no-headers > nodes |
233 | 01f6a5d2 | Iustin Pop | gnt-instance list -oname,admin_ram,sda_size,pnode,snodes \ |
234 | e4f08c46 | Iustin Pop | --separator '|' --no-head > instances |
235 | e4f08c46 | Iustin Pop | |
236 | e4f08c46 | Iustin Pop | These two files should be saved under the names of 'nodes' and 'instances'. |
237 | e4f08c46 | Iustin Pop | |
238 | e4f08c46 | Iustin Pop | When run, the programs will show some informational messages and output the |
239 | e4f08c46 | Iustin Pop | chosen solution, in the form of a list of instance name and chosen |
240 | e4f08c46 | Iustin Pop | primary/secondary nodes. The user then needs to run the necessary commands to |
241 | e4f08c46 | Iustin Pop | get the instances to live on those nodes. |
242 | e4f08c46 | Iustin Pop | |
243 | e4f08c46 | Iustin Pop | Note that sda_size is less than the total disk size of an instance by 4352 |
244 | e4f08c46 | Iustin Pop | MiB, so if disk space is at a premium the calculation could be wrong; in this |
245 | e4f08c46 | Iustin Pop | case, please adjust the values manually. |