root / hn1.1 @ 7b255913
History | View | Annotate | Download (9.1 kB)
1 | d0003b35 | Iustin Pop | .TH HN1 1 2009-03-23 htools "Ganeti H-tools" |
---|---|---|---|
2 | b0045e4d | Iustin Pop | .SH NAME |
3 | b0045e4d | Iustin Pop | hn1 \- N+1 fixer for Ganeti |
4 | b0045e4d | Iustin Pop | |
5 | b0045e4d | Iustin Pop | .SH SYNOPSIS |
6 | b0045e4d | Iustin Pop | .B hn1 |
7 | b0045e4d | Iustin Pop | .B "[-C]" |
8 | b0045e4d | Iustin Pop | .B "[-p]" |
9 | b0045e4d | Iustin Pop | .B "[-o]" |
10 | b0045e4d | Iustin Pop | .BI "[ -m " cluster "]" |
11 | b0045e4d | Iustin Pop | .BI "[-n " nodes-file " ]" |
12 | b0045e4d | Iustin Pop | .BI "[ -i " instances-file "]" |
13 | b0045e4d | Iustin Pop | .BI "[-d " depth "]" |
14 | b0045e4d | Iustin Pop | .BI "[-r " max-removals "]" |
15 | b0045e4d | Iustin Pop | .BI "[-L " max-delta "]" |
16 | b0045e4d | Iustin Pop | .BI "[-l " min-delta "]" |
17 | b0045e4d | Iustin Pop | |
18 | b0045e4d | Iustin Pop | .B hn1 |
19 | b0045e4d | Iustin Pop | .B --version |
20 | b0045e4d | Iustin Pop | |
21 | b0045e4d | Iustin Pop | .SH DESCRIPTION |
22 | b0045e4d | Iustin Pop | hn1 is a cluster N+1 fixer that tries to compute the minimum number of |
23 | b0045e4d | Iustin Pop | moves needed for getting all nodes to be N+1 compliant. |
24 | b0045e4d | Iustin Pop | |
25 | b0045e4d | Iustin Pop | The algorithm is designed to be a 'perfect' algorithm, so that we |
26 | b0045e4d | Iustin Pop | always examine the entire solution space until we find the minimum |
27 | b0045e4d | Iustin Pop | solution. The algorithm can be tweaked via the \fB-d\fR, \fB-r\fR, |
28 | b0045e4d | Iustin Pop | \fB-L\fR and \fB-l\fR options. |
29 | b0045e4d | Iustin Pop | |
30 | b0045e4d | Iustin Pop | By default, the program will show the solution in a somewhat cryptic |
31 | b0045e4d | Iustin Pop | format; for getting the actual Ganeti command list, use the \fB-C\fR |
32 | b0045e4d | Iustin Pop | option. |
33 | b0045e4d | Iustin Pop | |
34 | b0045e4d | Iustin Pop | \fBNote:\fR this program is somewhat deprecated; \fBhbal(1)\fR gives |
35 | b0045e4d | Iustin Pop | usually much faster results, and a better cluster. It is recommended |
36 | b0045e4d | Iustin Pop | to use this program only when \fBhbal\fR doesn't give a N+1 compliant |
37 | b0045e4d | Iustin Pop | cluster. |
38 | b0045e4d | Iustin Pop | |
39 | b0045e4d | Iustin Pop | .SS ALGORITHM |
40 | b0045e4d | Iustin Pop | |
41 | b0045e4d | Iustin Pop | The algorithm works in multiple rounds, of increasing \fIdepth\fR, |
42 | b0045e4d | Iustin Pop | until we have a solution. |
43 | b0045e4d | Iustin Pop | |
44 | b0045e4d | Iustin Pop | First, before starting the solution computation, we compute all the |
45 | b0045e4d | Iustin Pop | N+1-fail nodes and the instances they hold. These instances are |
46 | b0045e4d | Iustin Pop | candidate for replacement (and only these!). |
47 | b0045e4d | Iustin Pop | |
48 | b0045e4d | Iustin Pop | The program start then with \fIdepth\fR one (unless overridden via the |
49 | b0045e4d | Iustin Pop | \fB-d\fR option), and at each round: |
50 | d0003b35 | Iustin Pop | .RS 4 |
51 | d0003b35 | Iustin Pop | .TP 3 |
52 | d0003b35 | Iustin Pop | \(em |
53 | d0003b35 | Iustin Pop | it tries to remove from the cluster as many instances as the current |
54 | d0003b35 | Iustin Pop | depth in order to make the cluster N+1 compliant |
55 | d0003b35 | Iustin Pop | |
56 | d0003b35 | Iustin Pop | .TP |
57 | d0003b35 | Iustin Pop | \(em |
58 | d0003b35 | Iustin Pop | then, for each of the possible instance combinations that allow this |
59 | d0003b35 | Iustin Pop | (unless the total size is reduced via the \fB-r\fR option), it tries |
60 | d0003b35 | Iustin Pop | to put them back on the cluster while maintaining N+1 compliance |
61 | d0003b35 | Iustin Pop | .RE |
62 | b0045e4d | Iustin Pop | |
63 | b0045e4d | Iustin Pop | It might be that at a given round, the results are: |
64 | d0003b35 | Iustin Pop | .RS 4 |
65 | d0003b35 | Iustin Pop | .TP 3 |
66 | d0003b35 | Iustin Pop | \(em |
67 | d0003b35 | Iustin Pop | no instance combination that can be put back; this means it is not |
68 | d0003b35 | Iustin Pop | possible to make the cluster N+1 compliant with this number of |
69 | d0003b35 | Iustin Pop | instances being moved, so we increase the depth and go on to the next |
70 | d0003b35 | Iustin Pop | round |
71 | d0003b35 | Iustin Pop | .TP |
72 | d0003b35 | Iustin Pop | \(em |
73 | d0003b35 | Iustin Pop | one or more successful result, in which case we take the one that has |
74 | d0003b35 | Iustin Pop | as few changes as possible (by change meaning a replace-disks needed) |
75 | d0003b35 | Iustin Pop | .RE |
76 | b0045e4d | Iustin Pop | |
77 | b0045e4d | Iustin Pop | The main problem with the algorithm is that, being an exhaustive |
78 | b0045e4d | Iustin Pop | search, the CPU time required grows very very quickly based on |
79 | b0045e4d | Iustin Pop | depth. On a 20-node, 80-instances cluster, depths up to 5-6 are |
80 | b0045e4d | Iustin Pop | quickly computed, and depth 10 could already take days. |
81 | b0045e4d | Iustin Pop | |
82 | d0003b35 | Iustin Pop | The main factors that influence the run time are: |
83 | d0003b35 | Iustin Pop | .RS 4 |
84 | d0003b35 | Iustin Pop | .TP 3 |
85 | d0003b35 | Iustin Pop | \(em |
86 | d0003b35 | Iustin Pop | the removal depth; for each increase with one of the depth, we grow |
87 | d0003b35 | Iustin Pop | the solution space by the number of nodes squared (since a new |
88 | d0003b35 | Iustin Pop | instance can live any two nodes as primary/secondary, therefore |
89 | d0003b35 | Iustin Pop | (almost) N times N); i.e., depth=1 will create a N^2 solution space, |
90 | d0003b35 | Iustin Pop | depth two will make this N^4, depth three will be N^6, etc. |
91 | d0003b35 | Iustin Pop | |
92 | d0003b35 | Iustin Pop | .TP |
93 | d0003b35 | Iustin Pop | \(em |
94 | d0003b35 | Iustin Pop | the removal depth again; for each increase in the depth, there will be |
95 | d0003b35 | Iustin Pop | more valid removal sets, and the space of solutions increases linearly |
96 | d0003b35 | Iustin Pop | with the number of removal sets |
97 | d0003b35 | Iustin Pop | .RE |
98 | d0003b35 | Iustin Pop | |
99 | d0003b35 | Iustin Pop | Therefore, the smaller the depth the faster the algorithm will be; it doesn't |
100 | d0003b35 | Iustin Pop | seem like this algorithm will work for clusters of 100 nodes and many many |
101 | d0003b35 | Iustin Pop | small instances (e.g. 256MB instances on 16GB nodes). |
102 | d0003b35 | Iustin Pop | |
103 | d0003b35 | Iustin Pop | As an optimisation, since the algorithm is designed to prune the |
104 | d0003b35 | Iustin Pop | search space as quickly as possible, is by luck we find a good |
105 | d0003b35 | Iustin Pop | solution early at a given depth, then the other solutions which would |
106 | d0003b35 | Iustin Pop | result in a bigger delta (the number of changes) will not be |
107 | d0003b35 | Iustin Pop | investigated, and the program will finish fast. Since this is random |
108 | d0003b35 | Iustin Pop | and depends on where in the full solution space the good solution will |
109 | d0003b35 | Iustin Pop | be, there are two options for cutting down the time needed: |
110 | d0003b35 | Iustin Pop | .RS 4 |
111 | d0003b35 | Iustin Pop | .TP 3 |
112 | d0003b35 | Iustin Pop | \(em |
113 | d0003b35 | Iustin Pop | \fB-l\fR makes any solution that has delta lower than its parameter |
114 | d0003b35 | Iustin Pop | succeed instantly; the default value for this parameter is zero, so |
115 | d0003b35 | Iustin Pop | once we find a "perfect" solution we finish early |
116 | d0003b35 | Iustin Pop | |
117 | d0003b35 | Iustin Pop | .TP |
118 | d0003b35 | Iustin Pop | \(em |
119 | d0003b35 | Iustin Pop | \fB-L\fR makes any solution with delta higher than its parameter being |
120 | d0003b35 | Iustin Pop | rejected instantly (and not descend on the search tree); this can |
121 | d0003b35 | Iustin Pop | reduce the depth of the search tree, with sometimes significant |
122 | d0003b35 | Iustin Pop | speedups; by default, this optimization is not used |
123 | d0003b35 | Iustin Pop | .RE |
124 | d0003b35 | Iustin Pop | |
125 | d0003b35 | Iustin Pop | The algorithm also has some other internal optimisations: |
126 | d0003b35 | Iustin Pop | .RS 4 |
127 | d0003b35 | Iustin Pop | .TP 3 |
128 | d0003b35 | Iustin Pop | \(em |
129 | d0003b35 | Iustin Pop | when choosing where to place an instance in phase two, there are |
130 | d0003b35 | Iustin Pop | N*(N-1) possible primary/secondary options; however, if instead of |
131 | d0003b35 | Iustin Pop | iterating over all p * s pairs, we first determine the set of primary |
132 | d0003b35 | Iustin Pop | nodes that can hold this instance (without failing N+1), we can cut |
133 | d0003b35 | Iustin Pop | (N-1) secondary placements for each primary node removed; and since |
134 | d0003b35 | Iustin Pop | this applies at every iteration of phase 2 it linearly decreases the |
135 | d0003b35 | Iustin Pop | solution space, and on full clusters, this can mean a four-five times |
136 | d0003b35 | Iustin Pop | reductions of solution space |
137 | d0003b35 | Iustin Pop | |
138 | d0003b35 | Iustin Pop | .TP |
139 | d0003b35 | Iustin Pop | \(em |
140 | d0003b35 | Iustin Pop | since the number of solutions is very high even for smaller depths (on |
141 | d0003b35 | Iustin Pop | the test data, depth=4 results in 1.8M solutions) we can't compare |
142 | d0003b35 | Iustin Pop | them at the end, so at each iteration in phase 2 we only promote the |
143 | d0003b35 | Iustin Pop | best solution out of our own set of solutions |
144 | d0003b35 | Iustin Pop | |
145 | d0003b35 | Iustin Pop | .TP |
146 | d0003b35 | Iustin Pop | \(em |
147 | d0003b35 | Iustin Pop | since the placement of instances can only increase the delta of the |
148 | d0003b35 | Iustin Pop | solution (placing a new instance will add zero or more replace-disks |
149 | d0003b35 | Iustin Pop | steps), it means the delta will only increase while recursing during |
150 | d0003b35 | Iustin Pop | phase 2; therefore, if we know at one point that we have a current |
151 | d0003b35 | Iustin Pop | delta that is equal or higher to the delta of the best solution so |
152 | d0003b35 | Iustin Pop | far, we can abort the recursion; this cuts a tremendous number of |
153 | d0003b35 | Iustin Pop | branches; further promotion of the best solution from one removal set |
154 | d0003b35 | Iustin Pop | to another can cut entire removal sets after a few recursions |
155 | d0003b35 | Iustin Pop | |
156 | d0003b35 | Iustin Pop | .RE |
157 | b0045e4d | Iustin Pop | |
158 | b0045e4d | Iustin Pop | .SH OPTIONS |
159 | b0045e4d | Iustin Pop | The options that can be passed to the program are as follows: |
160 | b0045e4d | Iustin Pop | .TP |
161 | b0045e4d | Iustin Pop | .B -C, --print-commands |
162 | b0045e4d | Iustin Pop | Print the command list at the end of the run. Without this, the |
163 | b0045e4d | Iustin Pop | program will only show a shorter, but cryptic output. |
164 | b0045e4d | Iustin Pop | .TP |
165 | b0045e4d | Iustin Pop | .B -p, --print-nodes |
166 | b0045e4d | Iustin Pop | Prints the before and after node status, in a format designed to allow |
167 | b0045e4d | Iustin Pop | the user to understand the node's most important parameters. |
168 | b0045e4d | Iustin Pop | |
169 | b0045e4d | Iustin Pop | The node list will contain these informations: |
170 | d2ac5526 | Iustin Pop | .RS |
171 | d2ac5526 | Iustin Pop | .TP |
172 | d2ac5526 | Iustin Pop | .B F |
173 | d2ac5526 | Iustin Pop | a character denoting the status of the node, with '-' meaning an |
174 | d2ac5526 | Iustin Pop | offline node, '*' meaning N+1 failure and blank meaning a good node |
175 | d2ac5526 | Iustin Pop | .TP |
176 | d2ac5526 | Iustin Pop | .B Name |
177 | d2ac5526 | Iustin Pop | the node name |
178 | d2ac5526 | Iustin Pop | .TP |
179 | d2ac5526 | Iustin Pop | .B t_mem |
180 | d2ac5526 | Iustin Pop | the total node memory |
181 | d2ac5526 | Iustin Pop | .TP |
182 | d2ac5526 | Iustin Pop | .B n_mem |
183 | d2ac5526 | Iustin Pop | the memory used by the node itself |
184 | d2ac5526 | Iustin Pop | .TP |
185 | d2ac5526 | Iustin Pop | .B i_mem |
186 | d2ac5526 | Iustin Pop | the memory used by instances |
187 | d2ac5526 | Iustin Pop | .TP |
188 | d2ac5526 | Iustin Pop | .B x_mem |
189 | d2ac5526 | Iustin Pop | amount memory which seems to be in use but cannot be determined why or |
190 | d2ac5526 | Iustin Pop | by which instance; usually this means that the hypervisor has some |
191 | d2ac5526 | Iustin Pop | overhead or that there are other reporting errors |
192 | d2ac5526 | Iustin Pop | .TP |
193 | d2ac5526 | Iustin Pop | .B f_mem |
194 | d2ac5526 | Iustin Pop | the free node memory |
195 | d2ac5526 | Iustin Pop | .TP |
196 | d2ac5526 | Iustin Pop | .B r_mem |
197 | d2ac5526 | Iustin Pop | the reserved node memory, which is the amount of free memory needed |
198 | d2ac5526 | Iustin Pop | for N+1 compliance |
199 | d2ac5526 | Iustin Pop | .TP |
200 | d2ac5526 | Iustin Pop | .B t_dsk |
201 | d2ac5526 | Iustin Pop | total disk |
202 | d2ac5526 | Iustin Pop | .TP |
203 | d2ac5526 | Iustin Pop | .B f_dsk |
204 | d2ac5526 | Iustin Pop | free disk |
205 | d2ac5526 | Iustin Pop | .TP |
206 | d2ac5526 | Iustin Pop | .B pri |
207 | d2ac5526 | Iustin Pop | number of primary instances |
208 | d2ac5526 | Iustin Pop | .TP |
209 | d2ac5526 | Iustin Pop | .B sec |
210 | d2ac5526 | Iustin Pop | number of secondary instances |
211 | d2ac5526 | Iustin Pop | .TP |
212 | d2ac5526 | Iustin Pop | .B p_fmem |
213 | d2ac5526 | Iustin Pop | percent of free memory |
214 | d2ac5526 | Iustin Pop | .TP |
215 | d2ac5526 | Iustin Pop | .B p_fdsk |
216 | d2ac5526 | Iustin Pop | percent of free disk |
217 | d2ac5526 | Iustin Pop | .RE |
218 | b0045e4d | Iustin Pop | |
219 | b0045e4d | Iustin Pop | .TP |
220 | b0045e4d | Iustin Pop | .BI "-n" nodefile ", --nodes=" nodefile |
221 | b0045e4d | Iustin Pop | The name of the file holding node information (if not collecting via |
222 | 7b255913 | Iustin Pop | RAPI), instead of the default \fInodes\fR file (but see below how to |
223 | 7b255913 | Iustin Pop | customize the default value via the environment). |
224 | b0045e4d | Iustin Pop | |
225 | b0045e4d | Iustin Pop | .TP |
226 | b0045e4d | Iustin Pop | .BI "-i" instancefile ", --instances=" instancefile |
227 | b0045e4d | Iustin Pop | The name of the file holding instance information (if not collecting |
228 | 7b255913 | Iustin Pop | via RAPI), instead of the default \fIinstances\fR file (but see below |
229 | 7b255913 | Iustin Pop | how to customize the default value via the environment). |
230 | b0045e4d | Iustin Pop | |
231 | b0045e4d | Iustin Pop | .TP |
232 | b0045e4d | Iustin Pop | .BI "-m" cluster |
233 | b0045e4d | Iustin Pop | Collect data not from files but directly from the |
234 | b0045e4d | Iustin Pop | .I cluster |
235 | b0045e4d | Iustin Pop | given as an argument via RAPI. This work for both Ganeti 1.2 and |
236 | b0045e4d | Iustin Pop | Ganeti 2.0. |
237 | b0045e4d | Iustin Pop | |
238 | b0045e4d | Iustin Pop | .TP |
239 | b0045e4d | Iustin Pop | .BI "-d" DEPTH ", --depth=" DEPTH |
240 | b0045e4d | Iustin Pop | Start the algorithm directly at depth \fID\fR, so that we don't |
241 | b0045e4d | Iustin Pop | examine lower depth. This will be faster if we know a solution is not |
242 | b0045e4d | Iustin Pop | found a lower depths, and thus it's unneeded to search them. |
243 | b0045e4d | Iustin Pop | |
244 | b0045e4d | Iustin Pop | .TP |
245 | b0045e4d | Iustin Pop | .BI "-l" MIN-DELTA ", --min-delta=" MIN-DELTA |
246 | b0045e4d | Iustin Pop | If we find a solution with delta lower or equal to \fIMIN-DELTA\fR, |
247 | b0045e4d | Iustin Pop | consider this a success and don't examine further. |
248 | b0045e4d | Iustin Pop | |
249 | b0045e4d | Iustin Pop | .TP |
250 | b0045e4d | Iustin Pop | .BI "-L" MAX-DELTA ", --max-delta=" MAX-DELTA |
251 | b0045e4d | Iustin Pop | If while computing a solution, it's intermediate delta is already |
252 | b0045e4d | Iustin Pop | higher or equal to \fIMAX-DELTA\fR, consider this a failure and abort |
253 | b0045e4d | Iustin Pop | (as if N+1 checks have failed). |
254 | b0045e4d | Iustin Pop | |
255 | b0045e4d | Iustin Pop | .TP |
256 | b0045e4d | Iustin Pop | .B -V, --version |
257 | b0045e4d | Iustin Pop | Just show the program version and exit. |
258 | b0045e4d | Iustin Pop | |
259 | b0045e4d | Iustin Pop | .SH EXIT STATUS |
260 | b0045e4d | Iustin Pop | |
261 | b0045e4d | Iustin Pop | The exist status of the command will be zero, unless for some reason |
262 | b0045e4d | Iustin Pop | the algorithm fatally failed (e.g. wrong node or instance data). |
263 | b0045e4d | Iustin Pop | |
264 | 7b255913 | Iustin Pop | .SH ENVIRONMENT |
265 | 7b255913 | Iustin Pop | |
266 | 7b255913 | Iustin Pop | If the variables \fBHTOOLS_NODES\fR and \fBHTOOLS_INSTANCES\fR are |
267 | 7b255913 | Iustin Pop | present in the environment, they will override the default names for |
268 | 7b255913 | Iustin Pop | the nodes and instances files. These will have of course no effect |
269 | 7b255913 | Iustin Pop | when RAPI is used. |
270 | 7b255913 | Iustin Pop | |
271 | b0045e4d | Iustin Pop | .SH BUGS |
272 | b0045e4d | Iustin Pop | |
273 | b0045e4d | Iustin Pop | The program does not check its input data for consistency, and aborts |
274 | b0045e4d | Iustin Pop | with cryptic errors messages in this case. |
275 | b0045e4d | Iustin Pop | |
276 | d0003b35 | Iustin Pop | The algorithm doesn't deal with non-\fBdrbd\fR instances, and chokes |
277 | d0003b35 | Iustin Pop | on input data which has such instances. |
278 | d0003b35 | Iustin Pop | |
279 | b0045e4d | Iustin Pop | The algorithm doesn't know when it won't be possible to reach N+1 |
280 | b0045e4d | Iustin Pop | compliance at all, and will happily churn CPU for ages without |
281 | b0045e4d | Iustin Pop | realising it won't reach a solution. |
282 | b0045e4d | Iustin Pop | |
283 | b0045e4d | Iustin Pop | The algorithm is too slow. |
284 | b0045e4d | Iustin Pop | |
285 | b0045e4d | Iustin Pop | The output format is not easily scriptable, and the program should |
286 | b0045e4d | Iustin Pop | feed moves directly into Ganeti (either via RAPI or via a gnt-debug |
287 | b0045e4d | Iustin Pop | input file). |
288 | b0045e4d | Iustin Pop | |
289 | b0045e4d | Iustin Pop | .SH SEE ALSO |
290 | d2ac5526 | Iustin Pop | .BR hbal "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), " |
291 | d2ac5526 | Iustin Pop | .BR gnt-node "(8)" |