Ganeti automatic instance allocation
====================================

Documents Ganeti version 2.0

.. contents::

Introduction
------------

Currently in Ganeti the admin has to specify the exact locations for
an instance's node(s). This prevents a completely automatic node
evacuation, and is in general a nuisance.

The *iallocator* framework will enable automatic placement via
external scripts, which allows customization of the cluster layout per
the site's requirements.

User-visible changes
~~~~~~~~~~~~~~~~~~~~

There are two parts of the Ganeti operation that are impacted by the
auto-allocation: how the cluster knows what the allocator algorithms
are and how the admin uses these in creating instances.

An allocation algorithm is just the filename of a program installed in
a defined list of directories.

Cluster configuration
~~~~~~~~~~~~~~~~~~~~~

At configure time, the list of the directories can be selected via the
``--with-iallocator-search-path=LIST`` option, where *LIST* is a
comma-separated list of directories. If not given, this defaults to
``$libdir/ganeti/iallocators``, i.e. for an installation under
``/usr``, this will be ``/usr/lib/ganeti/iallocators``.

Ganeti will then search for allocator scripts in the configured list,
using the first one whose filename matches the one given by the user.
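
For instance, a site that ships its own allocators could put a local
directory ahead of the stock one (the paths below are illustrative,
not defaults):

```shell
# Illustrative only: search a site-local directory first, then the
# standard location; adjust the paths to the actual installation.
./configure --with-iallocator-search-path=/usr/local/lib/ganeti/iallocators,/usr/lib/ganeti/iallocators
```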

Command line interface changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The node selection options in instance add and instance replace-disks
can be replaced by the new ``--iallocator=NAME`` option (shortened to
``-I``), which will cause the auto-assignment of nodes with the
passed iallocator. The selected node(s) will be shown as part of the
command output.

IAllocator API
--------------

The protocol for communication between Ganeti and an allocator script
will be the following:

#. Ganeti launches the program with a single argument, a filename that
   contains a JSON-encoded structure (the input message)

#. if the script finishes with an exit code different from zero, it is
   considered a general failure and the full output will be reported to
   the users; this can be the case when the allocator can't parse the
   input message

#. if the allocator finishes with exit code zero, it is expected to
   output (on its stdout) a JSON-encoded structure (the response)
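To make the three steps concrete, here is a minimal sketch of the
calling side in Python (the function name is ours; Ganeti's actual
implementation differs in its details):

```python
import json
import subprocess
import tempfile

def run_iallocator(script, input_message):
    """Sketch of the caller's side of the protocol: write the input
    message to a file, pass that filename as the script's single
    argument, and parse the JSON response from the script's stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".json") as tmp:
        json.dump(input_message, tmp)
        tmp.flush()
        proc = subprocess.run([script, tmp.name],
                              capture_output=True, text=True)
        if proc.returncode != 0:
            # any non-zero exit code is a general failure; the full
            # output is reported back to the user
            raise RuntimeError("allocator failed: " + proc.stdout +
                               proc.stderr)
        return json.loads(proc.stdout)
```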

Input message
~~~~~~~~~~~~~

The input message will be the JSON encoding of a dictionary containing
the following:

version
  the version of the protocol; this document
  specifies version 2

cluster_name
  the cluster name

cluster_tags
  the list of cluster tags

enabled_hypervisors
  the list of enabled hypervisors

request
  a dictionary containing the request data:

  type
    the request type; this can be either ``allocate`` or ``relocate``;
    the ``allocate`` request is used when a new instance needs to be
    placed on the cluster, while the ``relocate`` request is used when
    an existing instance needs to be moved within the cluster

  name
    the name of the instance; if the request is a relocation, then
    this name will be found in the list of instances (see below),
    otherwise it is the FQDN of the new instance

  required_nodes
    how many nodes should the algorithm return; while this information
    can be deduced from the instance's disk template, it's better if
    this computation is left to Ganeti as then allocator scripts are
    less sensitive to changes to the disk templates

  disk_space_total
    the total disk space that will be used by this instance on the
    (new) nodes; again, this information can be computed from the list
    of instance disks and its template type, but Ganeti is better
    suited to compute it
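On the script side, the common fields above can be extracted with a
few lines of Python; this is a sketch with deliberately simple error
handling, not Ganeti code:

```python
import json
import sys

def parse_common_request(path):
    """Load the input message from the file named on the command line
    and return the fields shared by both request types."""
    with open(path) as f:
        msg = json.load(f)
    req = msg["request"]
    if req["type"] not in ("allocate", "relocate"):
        # unknown type: exit non-zero so Ganeti reports a failure
        sys.exit("unknown request type: %r" % req["type"])
    return (req["type"], req["name"],
            req["required_nodes"], req["disk_space_total"])
```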

If the request is an allocation, then there are extra fields in the
request dictionary:

  disks
    list of dictionaries holding the disk definitions for this
    instance (in the order they are exported to the hypervisor):

    mode
      either ``r`` or ``w`` denoting if the disk is read-only or
      writable

    size
      the size of this disk in mebibytes

  nics
    a list of dictionaries holding the network interfaces for this
    instance, containing:

    ip
      the IP address that Ganeti knows for this instance, or null

    mac
      the MAC address for this interface

    bridge
      the bridge to which this interface will be connected

  vcpus
    the number of VCPUs for the instance

  disk_template
    the disk template for the instance

  memory
    the memory size for the instance

  os
    the OS type for the instance

  tags
    the list of the instance's tags

  hypervisor
    the hypervisor of this instance


If the request is of type relocate, then there is one more entry in
the request dictionary, named ``relocate_from``, and it contains a
list of nodes to move the instance away from; note that with Ganeti
2.0, this list will always contain a single node, the current
secondary of the instance.

instances
  a dictionary with the data for the currently existing instances on
  the cluster, indexed by instance name; the contents are similar to
  the instance definitions for the allocate mode, with the addition
  of:

  admin_up
    whether this instance is set to run (but not the actual status of
    the instance)

  nodes
    list of nodes on which this instance is placed; the primary node
    of the instance is always the first one

nodes
  dictionary with the data for the nodes in the cluster, indexed by
  the node name; the dict contains:

  total_disk
    the total disk size of this node (mebibytes)

  free_disk
    the free disk space on the node

  total_memory
    the total memory size

  free_memory
    free memory on the node; note that currently this does not take
    into account the instances which are down on the node

  total_cpus
    the physical number of CPUs present on the machine; depending on
    the hypervisor, this might or might not be equal to how many CPUs
    the node operating system sees

  primary_ip
    the primary IP address of the node

  secondary_ip
    the secondary IP address of the node (the one used for the DRBD
    replication); note that this can be the same as the primary one

  tags
    list with the tags of the node

  master_candidate
    a boolean flag denoting whether this node is a master candidate

  drained
    a boolean flag denoting whether this node is being drained

  offline
    a boolean flag denoting whether this node is offline

  i_pri_memory
    total memory required by primary instances

  i_pri_up_memory
    total memory required by running primary instances

No allocations should be made on nodes having either the ``drained``
or ``offline`` flags set. More details about these node status flags
are available in the manpage *ganeti(7)*.
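
As a sketch of how a script might honor these flags and the
free-resource figures when pre-filtering candidate nodes (the function
name and its parameters are ours, taken from the instance's demands in
the request):

```python
def usable_nodes(nodes, memory, disk_space_total):
    """Return the names of nodes that may receive an allocation:
    neither drained nor offline, and with enough free memory and
    free disk for the instance."""
    result = []
    for name, info in nodes.items():
        if info.get("drained") or info.get("offline"):
            continue  # never allocate on drained or offline nodes
        if info["free_memory"] < memory:
            continue  # not enough free memory for the instance
        if info["free_disk"] < disk_space_total:
            continue  # not enough free disk for the instance
        result.append(name)
    return sorted(result)
```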


Response message
~~~~~~~~~~~~~~~~

The response message is much simpler than the input one. It is also a
dict having three keys:

success
  a boolean value denoting if the allocation was successful or not

info
  a string with information from the scripts; if the allocation fails,
  this will be shown to the user

nodes
  the list of nodes computed by the algorithm; even if the algorithm
  failed (i.e. success is false), this must be returned as an empty
  list; also note that the length of this list must equal the
  ``required_nodes`` entry in the input message, otherwise Ganeti
  will consider the result as failed
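
A small helper can enforce these rules on the script side before the
response is printed; this is an illustrative sketch, not part of the
API:

```python
def make_response(success, info, nodes, required_nodes):
    """Build the response dictionary: an unsuccessful result carries
    an empty node list, and a successful one must return exactly
    required_nodes names or Ganeti will treat the result as failed."""
    if not success:
        nodes = []
    elif len(nodes) != required_nodes:
        # demote a malformed success into an explicit failure
        return make_response(False,
                             "internal error: got %d nodes, need %d"
                             % (len(nodes), required_nodes),
                             [], required_nodes)
    return {"success": success, "info": info, "nodes": nodes}
```

A script would then ``print(json.dumps(...))`` the resulting
dictionary on its stdout and exit with code zero.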

Examples
--------

Input messages to scripts
~~~~~~~~~~~~~~~~~~~~~~~~~

Input message, new instance allocation::

  {
    "cluster_tags": [],
    "request": {
      "required_nodes": 2,
      "name": "instance3.example.com",
      "tags": [
        "type:test",
        "owner:foo"
      ],
      "type": "allocate",
      "disks": [
        {
          "mode": "w",
          "size": 1024
        },
        {
          "mode": "w",
          "size": 2048
        }
      ],
      "nics": [
        {
          "ip": null,
          "mac": "00:11:22:33:44:55",
          "bridge": null
        }
      ],
      "vcpus": 1,
      "disk_template": "drbd",
      "memory": 2048,
      "disk_space_total": 3328,
      "os": "etch-image"
    },
    "cluster_name": "cluster1.example.com",
    "instances": {
      "instance1.example.com": {
        "tags": [],
        "should_run": false,
        "disks": [
          {
            "mode": "w",
            "size": 64
          },
          {
            "mode": "w",
            "size": 512
          }
        ],
        "nics": [
          {
            "ip": null,
            "mac": "aa:00:00:00:60:bf",
            "bridge": "xen-br0"
          }
        ],
        "vcpus": 1,
        "disk_template": "plain",
        "memory": 128,
        "nodes": [
          "node1.example.com"
        ],
        "os": "etch-image"
      },
      "instance2.example.com": {
        "tags": [],
        "should_run": false,
        "disks": [
          {
            "mode": "w",
            "size": 512
          },
          {
            "mode": "w",
            "size": 256
          }
        ],
        "nics": [
          {
            "ip": null,
            "mac": "aa:00:00:55:f8:38",
            "bridge": "xen-br0"
          }
        ],
        "vcpus": 1,
        "disk_template": "drbd",
        "memory": 512,
        "nodes": [
          "node2.example.com",
          "node3.example.com"
        ],
        "os": "etch-image"
      }
    },
    "version": 1,
    "nodes": {
      "node1.example.com": {
        "total_disk": 858276,
        "primary_ip": "192.168.1.1",
        "secondary_ip": "192.168.2.1",
        "tags": [],
        "free_memory": 3505,
        "free_disk": 856740,
        "total_memory": 4095
      },
      "node2.example.com": {
        "total_disk": 858240,
        "primary_ip": "192.168.1.3",
        "secondary_ip": "192.168.2.3",
        "tags": ["test"],
        "free_memory": 3505,
        "free_disk": 848320,
        "total_memory": 4095
      },
      "node3.example.com": {
        "total_disk": 572184,
        "primary_ip": "192.168.1.3",
        "secondary_ip": "192.168.2.3",
        "tags": [],
        "free_memory": 3505,
        "free_disk": 570648,
        "total_memory": 4095
      }
    }
  }

Input message, relocation. Since only the request entry in the input
message is changed, we show only this changed entry::

  "request": {
    "relocate_from": [
      "node3.example.com"
    ],
    "required_nodes": 1,
    "type": "relocate",
    "name": "instance2.example.com",
    "disk_space_total": 832
  },


Response messages
~~~~~~~~~~~~~~~~~

Successful response message::

  {
    "info": "Allocation successful",
    "nodes": [
      "node2.example.com",
      "node1.example.com"
    ],
    "success": true
  }

Failed response message::

  {
    "info": "Can't find a suitable node for position 2 (already selected: node2.example.com)",
    "nodes": [],
    "success": false
  }

Command line messages
~~~~~~~~~~~~~~~~~~~~~

::

  # gnt-instance add -t plain -m 2g --os-size 1g --swap-size 512m --iallocator dumb-allocator -o etch-image instance3
  Selected nodes for the instance: node1.example.com
  * creating instance disks...
  [...]

  # gnt-instance add -t plain -m 3400m --os-size 1g --swap-size 512m --iallocator dumb-allocator -o etch-image instance4
  Failure: prerequisites not met for this operation:
  Can't compute nodes using iallocator 'dumb-allocator': Can't find a suitable node for position 1 (already selected: )

  # gnt-instance add -t drbd -m 1400m --os-size 1g --swap-size 512m --iallocator dumb-allocator -o etch-image instance5
  Failure: prerequisites not met for this operation:
  Can't compute nodes using iallocator 'dumb-allocator': Can't find a suitable node for position 2 (already selected: node1.example.com)