Ganeti automatic instance allocation
====================================

Documents Ganeti version 2.11

.. contents::

Introduction
------------

Currently in Ganeti the admin has to specify the exact locations for
an instance's node(s). This prevents a completely automatic node
evacuation, and is in general a nuisance.

The *iallocator* framework will enable automatic placement via
external scripts, which allows customization of the cluster layout per
the site's requirements.

User-visible changes
~~~~~~~~~~~~~~~~~~~~

Two parts of Ganeti's operation are impacted by auto-allocation: how
the cluster knows which allocator algorithms are available, and how
the admin selects one when creating instances.

An allocation algorithm is just the filename of a program installed in
a defined list of directories.

Cluster configuration
~~~~~~~~~~~~~~~~~~~~~

At configure time, the list of directories can be selected via the
``--with-iallocator-search-path=LIST`` option, where *LIST* is a
comma-separated list of directories. If not given, this defaults to
``$libdir/ganeti/iallocators``, i.e. for an installation under
``/usr``, this will be ``/usr/lib/ganeti/iallocators``.

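For example, to have Ganeti search the default directory plus a
site-local one (the second path here is purely illustrative)::

  ./configure --with-iallocator-search-path=/usr/lib/ganeti/iallocators,/srv/ganeti/iallocators
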
Ganeti will then search for the allocator script in the configured
list, using the first one whose filename matches the name given by the
user.

Command line interface changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The node selection options in instance add and instance replace disks
can be replaced by the new ``--iallocator=NAME`` option (shortened to
``-I``), which causes the node(s) to be assigned automatically by the
given iallocator. The selected node(s) will be shown as part of the
command output.

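For instance (mirroring the `Command line messages`_ examples below,
with an invented instance name)::

  # gnt-instance add -t plain --os-size 1g -I hail -o debootstrap+default instance3
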
IAllocator API
--------------

The protocol for communication between Ganeti and an allocator script
is the following:

#. Ganeti launches the program with a single argument, a filename that
   contains a JSON-encoded structure (the input message)

#. if the script finishes with an exit code different from zero, it is
   considered a general failure and the full output will be reported to
   the users; this can be the case when the allocator can't parse the
   input message

#. if the allocator finishes with exit code zero, it is expected to
   output (on its stdout) a JSON-encoded structure (the response)

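As an illustration of this calling convention, here is a minimal
allocator skeleton in Python; it implements no placement logic and
simply declines every request::

  #!/usr/bin/env python
  # Minimal iallocator skeleton: Ganeti passes the path of the input
  # message as the only argument; the response goes to stdout.
  import json
  import sys

  def main():
    try:
      with open(sys.argv[1]) as fd:
        request = json.load(fd)
    except (IndexError, IOError, ValueError) as err:
      # Exiting with a non-zero code signals a general failure and the
      # output is reported back to the user.
      sys.stderr.write("Can't parse input message: %s\n" % err)
      sys.exit(1)
    # Exit code zero plus a JSON document on stdout: here we always
    # return a well-formed failure, with an empty result list.
    json.dump({
      "success": False,
      "info": ("request type %r not supported by this skeleton" %
               request["request"]["type"]),
      "result": [],
    }, sys.stdout)

  if __name__ == "__main__":
    main()
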
Input message
~~~~~~~~~~~~~

The input message will be the JSON encoding of a dictionary containing
all the required information to perform the operation. We explain the
contents of this dictionary in two parts: common information that every
type of operation requires, and operation-specific information.

Common information
++++++++++++++++++

All input dictionaries to the IAllocator must carry the following keys:

version
  the version of the protocol; this document
  specifies version 2

cluster_name
  the cluster name

cluster_tags
  the list of cluster tags

enabled_hypervisors
  the list of enabled hypervisors

ipolicy
  the cluster-wide instance policy (for information; the per-node group
  values take precedence and should be used instead)

request
  a dictionary containing the details of the request; the keys vary
  depending on the type of operation that's being requested, as
  explained in `Operation-specific input`_ below.

nodegroups
  a dictionary with the data for the cluster's node groups; it is keyed
  on the group UUID, and the values are a dictionary with the following
  keys:

  name
    the node group name
  alloc_policy
    the allocation policy of the node group (consult the semantics of
    this attribute in the :manpage:`gnt-group(8)` manpage)
  networks
    the list of network UUIDs this node group is connected to
  ipolicy
    the instance policy of the node group
  tags
    the list of node group tags

instances
  a dictionary with the data for the instances currently existing on
  the cluster, indexed by instance name; the contents are similar to
  the instance definitions for the allocate mode, with the addition of:

  admin_state
    whether this instance is set to run (but not the actual status of
    the instance)

  nodes
    list of nodes on which this instance is placed; the primary node
    of the instance is always the first one

nodes
  dictionary with the data for the nodes in the cluster, indexed by
  the node name; the dict contains [*]_ :

  total_disk
    the total disk size of this node (mebibytes)

  free_disk
    the free disk space on the node

  total_memory
    the total memory size

  free_memory
    free memory on the node; note that currently this does not take
    into account the instances which are down on the node

  total_cpus
    the physical number of CPUs present on the machine; depending on
    the hypervisor, this might or might not be equal to how many CPUs
    the node operating system sees

  primary_ip
    the primary IP address of the node

  secondary_ip
    the secondary IP address of the node (the one used for the DRBD
    replication); note that this can be the same as the primary one

  tags
    list with the tags of the node

  master_candidate
    a boolean flag denoting whether this node is a master candidate

  drained
    a boolean flag denoting whether this node is being drained

  offline
    a boolean flag denoting whether this node is offline

  i_pri_memory
    total memory required by primary instances

  i_pri_up_memory
    total memory required by running primary instances

  group
    the node group that this node belongs to

  No allocations should be made on nodes having either the ``drained``
  or ``offline`` flag set. More details about these node status flags
  are available in the :manpage:`ganeti(7)` manpage.

.. [*] Note that no run-time data is present for offline, drained or
   non-vm_capable nodes; this means the keys total_memory,
   reserved_memory, free_memory, total_disk, free_disk, total_cpus,
   i_pri_memory and i_pri_up_memory will be absent

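A small sketch of how an allocator might validate these common keys
before dispatching on the request type; the ``LoadInputMessage`` helper
is hypothetical, and the key set is taken from the list above::

  import json

  REQUIRED_COMMON_KEYS = frozenset([
    "version", "cluster_name", "cluster_tags", "enabled_hypervisors",
    "ipolicy", "request", "nodegroups", "instances", "nodes",
  ])

  def LoadInputMessage(path):
    # Parse the input message and perform basic sanity checks.
    with open(path) as fd:
      data = json.load(fd)
    missing = REQUIRED_COMMON_KEYS - frozenset(data)
    if missing:
      raise ValueError("Missing keys: %s" % ", ".join(sorted(missing)))
    if data["version"] != 2:
      raise ValueError("Unsupported protocol version %s" % data["version"])
    return data
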
Operation-specific input
++++++++++++++++++++++++

All input dictionaries to the IAllocator carry, in the ``request``
dictionary, detailed information about the operation that's being
requested. The required keys vary depending on the type of operation,
as follows.

In all cases, it includes:

type
  the request type; this can be either ``allocate``, ``relocate``,
  ``change-group`` or ``node-evacuate``. The
  ``allocate`` request is used when a new instance needs to be placed
  on the cluster. The ``relocate`` request is used when an existing
  instance needs to be moved within its node group.

  The ``multi-evacuate`` protocol was used to request that the script
  compute the optimal relocate solution for all secondary instances
  of the given nodes. It is now deprecated and only needs to be
  implemented if backwards compatibility with Ganeti 2.4 and lower is
  needed.

  The ``change-group`` request is used to relocate multiple instances
  across multiple node groups. ``node-evacuate`` evacuates instances
  off their node(s). These are described in a separate :ref:`design
  document <multi-reloc-detailed-design>`.

  The ``multi-allocate`` request is used to allocate multiple
  instances on the cluster. Apart from that, the request is very
  similar to the ``allocate`` one. For more details see
  :doc:`Ganeti bulk create <design-bulk-create>`.

For both the allocate and relocate modes, the following extra keys are
needed in the ``request`` dictionary:

name
  the name of the instance; if the request is a relocation, then this
  name will be found in the list of instances (see below), otherwise
  it is the FQDN of the new instance; type *string*

required_nodes
  how many nodes the algorithm should return; while this information
  can be deduced from the instance's disk template, it's better if
  this computation is left to Ganeti as then allocator scripts are
  less sensitive to changes to the disk templates; type *integer*

disk_space_total
  the total disk space that will be used by this instance on the
  (new) nodes; again, this information can be computed from the list
  of instance disks and its template type, but Ganeti is better
  suited to compute it; type *integer*

.. pyassert::

  constants.DISK_ACCESS_SET == set([constants.DISK_RDONLY,
                                    constants.DISK_RDWR])

Allocation needs, in addition:

disks
  list of dictionaries holding the disk definitions for this
  instance (in the order they are exported to the hypervisor):

  mode
    either :pyeval:`constants.DISK_RDONLY` or
    :pyeval:`constants.DISK_RDWR` denoting if the disk is read-only or
    writable

  size
    the size of this disk in mebibytes

nics
  a list of dictionaries holding the network interfaces for this
  instance, containing:

  ip
    the IP address that Ganeti knows for this instance, or null

  mac
    the MAC address for this interface

  bridge
    the bridge to which this interface will be connected

vcpus
  the number of VCPUs for the instance

disk_template
  the disk template for the instance

memory
  the memory size for the instance

os
  the OS type for the instance

tags
  the list of the instance's tags

hypervisor
  the hypervisor of this instance

Relocation:

relocate_from
  a list of nodes to move the instance away from; for DRBD-based
  instances, this will contain a single node, the current secondary
  of the instance, whereas for shared-storage instances, this will
  also contain a single node, the current primary of the instance;
  type *list of strings*

As for ``node-evacuate``, it needs the following request arguments:

instances
  a list of instance names to evacuate; type *list of strings*

evac_mode
  specifies which instances to evacuate; one of ``primary-only``,
  ``secondary-only``, ``all``; type *string*

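Putting these together, the ``request`` dictionary of a
``node-evacuate`` operation could look like this (the instance names
are invented)::

  "request": {
    "type": "node-evacuate",
    "instances": ["instance1.example.com", "instance2.example.com"],
    "evac_mode": "primary-only"
  }
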
``change-group`` needs the following request arguments:

instances
  a list of instance names whose group to change; type
  *list of strings*

target_groups
  must either be the empty list, or contain a list of group UUIDs that
  should be considered for relocating instances to; type
  *list of strings*

``multi-allocate`` needs the following request arguments:

instances
  a list of request dicts

MonD data
+++++++++

Additional information is available from MonD. MonD's data collectors
provide information that can help an allocator script make better
decisions when allocating a new instance. MonD's information may also
be accessible from a mock file, mainly for testing purposes. The file
will be in JSON format and will present an array of :ref:`report
objects <monitoring-agent-format-of-the-report>`.

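A sketch of consuming such a mock file; the ``LoadMondReports`` helper
and the file name are arbitrary examples::

  import json

  def LoadMondReports(path):
    # The mock file holds a JSON array of report objects.
    with open(path) as fd:
      reports = json.load(fd)
    if not isinstance(reports, list):
      raise ValueError("Expected an array of report objects")
    return reports

  reports = LoadMondReports("mond-data.json")
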
Response message
~~~~~~~~~~~~~~~~

The response message is much simpler than the input one. It is also a
dict having three keys:

success
  a boolean value denoting if the allocation was successful or not

info
  a string with information from the script; if the allocation fails,
  this will be shown to the user

result
  the output of the algorithm; even if the algorithm failed
  (i.e. success is false), this must be returned as an empty list

  for allocate/relocate, this is the list of node(s) for the instance;
  note that the length of this list must equal the ``required_nodes``
  entry in the input message, otherwise Ganeti will consider the result
  as failed

  for the ``node-evacuate`` and ``change-group`` modes, this is a
  dictionary containing, among other information, a list of lists of
  serialized opcodes; see the :ref:`design document
  <multi-reloc-result>` for a detailed description

  for the ``multi-allocate`` mode, this is a tuple of two lists: the
  first element is a list of successful allocations, with the instance
  name as the first element of each entry and the node placement as
  the second; the second element is the list of instances whose
  allocation failed

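As an illustration of the ``multi-allocate`` result shape just
described (names invented; the exact nesting is an assumption based on
the description above)::

  "result": [
    [["instance3.example.com", ["node1.example.com", "node2.example.com"]]],
    ["instance4.example.com"]
  ]
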
.. note:: The current Ganeti version accepts either ``result`` or
   ``nodes`` as a backwards-compatibility measure (older versions only
   supported ``nodes``)

Examples
--------

Input messages to scripts
~~~~~~~~~~~~~~~~~~~~~~~~~

Input message, new instance allocation (common elements are listed this
time, but not included in further examples below)::

  {
    "version": 2,
    "cluster_name": "cluster1.example.com",
    "cluster_tags": [],
    "enabled_hypervisors": [
      "xen-pvm"
    ],
    "nodegroups": {
      "f4e06e0d-528a-4963-a5ad-10f3e114232d": {
        "name": "default",
        "alloc_policy": "preferred",
        "networks": ["net-uuid-1", "net-uuid-2"],
        "ipolicy": {
          "disk-templates": ["drbd", "plain"],
          "minmax": [
            {
              "max": {
                "cpu-count": 2,
                "disk-count": 8,
                "disk-size": 2048,
                "memory-size": 12800,
                "nic-count": 8,
                "spindle-use": 8
              },
              "min": {
                "cpu-count": 1,
                "disk-count": 1,
                "disk-size": 1024,
                "memory-size": 128,
                "nic-count": 1,
                "spindle-use": 1
              }
            }
          ],
          "spindle-ratio": 32.0,
          "std": {
            "cpu-count": 1,
            "disk-count": 1,
            "disk-size": 1024,
            "memory-size": 128,
            "nic-count": 1,
            "spindle-use": 1
          },
          "vcpu-ratio": 4.0
        },
        "tags": ["ng-tag-1", "ng-tag-2"]
      }
    },
    "instances": {
      "instance1.example.com": {
        "tags": [],
        "should_run": false,
        "disks": [
          {
            "mode": "w",
            "size": 64
          },
          {
            "mode": "w",
            "size": 512
          }
        ],
        "nics": [
          {
            "ip": null,
            "mac": "aa:00:00:00:60:bf",
            "bridge": "xen-br0"
          }
        ],
        "vcpus": 1,
        "disk_template": "plain",
        "memory": 128,
        "nodes": [
          "node1.example.com"
        ],
        "os": "debootstrap+default"
      },
      "instance2.example.com": {
        "tags": [],
        "should_run": false,
        "disks": [
          {
            "mode": "w",
            "size": 512
          },
          {
            "mode": "w",
            "size": 256
          }
        ],
        "nics": [
          {
            "ip": null,
            "mac": "aa:00:00:55:f8:38",
            "bridge": "xen-br0"
          }
        ],
        "vcpus": 1,
        "disk_template": "drbd",
        "memory": 512,
        "nodes": [
          "node2.example.com",
          "node3.example.com"
        ],
        "os": "debootstrap+default"
      }
    },
    "nodes": {
      "node1.example.com": {
        "total_disk": 858276,
        "primary_ip": "198.51.100.1",
        "secondary_ip": "192.0.2.1",
        "tags": [],
        "group": "f4e06e0d-528a-4963-a5ad-10f3e114232d",
        "free_memory": 3505,
        "free_disk": 856740,
        "total_memory": 4095
      },
      "node2.example.com": {
        "total_disk": 858240,
        "primary_ip": "198.51.100.2",
        "secondary_ip": "192.0.2.2",
        "tags": ["test"],
        "group": "f4e06e0d-528a-4963-a5ad-10f3e114232d",
        "free_memory": 3505,
        "free_disk": 848320,
        "total_memory": 4095
      },
      "node3.example.com": {
        "total_disk": 572184,
        "primary_ip": "198.51.100.3",
        "secondary_ip": "192.0.2.3",
        "tags": [],
        "group": "f4e06e0d-528a-4963-a5ad-10f3e114232d",
        "free_memory": 3505,
        "free_disk": 570648,
        "total_memory": 4095
      }
    },
    "request": {
      "type": "allocate",
      "name": "instance3.example.com",
      "required_nodes": 2,
      "disk_space_total": 3328,
      "disks": [
        {
          "mode": "w",
          "size": 1024
        },
        {
          "mode": "w",
          "size": 2048
        }
      ],
      "nics": [
        {
          "ip": null,
          "mac": "00:11:22:33:44:55",
          "bridge": null
        }
      ],
      "vcpus": 1,
      "disk_template": "drbd",
      "memory": 2048,
      "os": "debootstrap+default",
      "tags": [
        "type:test",
        "owner:foo"
      ],
      "hypervisor": "xen-pvm"
    }
  }

Input message, relocation::

  {
    "version": 2,
    ...
    "request": {
      "type": "relocate",
      "name": "instance2.example.com",
      "required_nodes": 1,
      "disk_space_total": 832,
      "relocate_from": [
        "node3.example.com"
      ]
    }
  }

Response messages
~~~~~~~~~~~~~~~~~

Successful response message::

  {
    "success": true,
    "info": "Allocation successful",
    "result": [
      "node2.example.com",
      "node1.example.com"
    ]
  }

Failed response message::

  {
    "success": false,
    "info": "Can't find a suitable node for position 2 (already selected: node2.example.com)",
    "result": []
  }

Successful node evacuation message::

  {
    "success": true,
    "info": "Request successful",
    "result": [
      [
        "instance1",
        "node3"
      ],
      [
        "instance2",
        "node1"
      ]
    ]
  }

Command line messages
~~~~~~~~~~~~~~~~~~~~~

::

  # gnt-instance add -t plain -m 2g --os-size 1g --swap-size 512m --iallocator hail -o debootstrap+default instance3
  Selected nodes for the instance: node1.example.com
  * creating instance disks...
  [...]

  # gnt-instance add -t plain -m 3400m --os-size 1g --swap-size 512m --iallocator hail -o debootstrap+default instance4
  Failure: prerequisites not met for this operation:
  Can't compute nodes using iallocator 'hail': Can't find a suitable node for position 1 (already selected: )

  # gnt-instance add -t drbd -m 1400m --os-size 1g --swap-size 512m --iallocator hail -o debootstrap+default instance5
  Failure: prerequisites not met for this operation:
  Can't compute nodes using iallocator 'hail': Can't find a suitable node for position 2 (already selected: node1.example.com)

Reference implementation
~~~~~~~~~~~~~~~~~~~~~~~~

Ganeti's default iallocator is "hail", which is available when the
"htools" component has been enabled at build time (see
:doc:`install-quick` for more details).

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: