   1 Ganeti automatic instance allocation
   2 ====================================
   3
   4 Documents Ganeti version 2.8
   5
   6 .. contents::
   7
   8 Introduction
   9 ------------
  10
  11 Currently in Ganeti the admin has to specify the exact locations for
  12 an instance's node(s). This prevents a completely automatic node
  13 evacuation, and is in general a nuisance.
  14
  15 The *iallocator* framework will enable automatic placement via
  16 external scripts, which allows customization of the cluster layout per
  17 the site's requirements.
  18
  19 User-visible changes
  20 ~~~~~~~~~~~~~~~~~~~~
  21
  22 Auto-allocation affects two parts of Ganeti's operation: how the
  23 cluster learns which allocator algorithms are available, and how the
  24 admin uses them when creating instances.
  25
  26 An allocation algorithm is just the filename of a program installed in
  27 a defined list of directories.
  28
  29 Cluster configuration
  30 ~~~~~~~~~~~~~~~~~~~~~
  31
  32 At configure time, the list of the directories can be selected via the
  33 ``--with-iallocator-search-path=LIST`` option, where *LIST* is a
  34 comma-separated list of directories. If not given, this defaults to
  35 ``$libdir/ganeti/iallocators``, i.e. for an installation under
  36 ``/usr``, this will be ``/usr/lib/ganeti/iallocators``.
  37
  38 Ganeti will then search the configured directories for the allocator
  39 script, using the first one whose filename matches the user's choice.
  40
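
The lookup described above can be sketched as follows. This is only an
illustration, not Ganeti's actual code: the function name
``find_iallocator`` and the rule that names containing path separators
are rejected are assumptions made for this example.

```python
import os


def find_iallocator(name, search_path):
    """Return the full path of the first executable called `name`, or None.

    `search_path` is a list of directories, checked in order.
    """
    # Assumption for this sketch: only bare filenames are accepted, so a
    # name such as "../hail" is rejected instead of being resolved.
    if os.path.basename(name) != name:
        return None
    for directory in search_path:
        candidate = os.path.join(directory, name)
        # The first regular, executable file with a matching name wins.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

For example, ``find_iallocator("hail", ["/usr/lib/ganeti/iallocators"])``
returns the script's full path if it is installed there, and ``None``
otherwise.
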
  41 Command line interface changes
  42 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  43
  44 The node selection options in instance add and instance replace disks
  45 can be replaced by the new ``--iallocator=NAME`` option (shortened to
  46 ``-I``), which causes nodes to be assigned automatically by the
  47 named iallocator. The selected node(s) will be shown as part of the
  48 command output.
  49
  50 IAllocator API
  51 --------------
  52
  53 The protocol for communication between Ganeti and an allocator script
  54 will be the following:
  55
  56 #. Ganeti launches the program with a single argument, a filename that
  57    contains a JSON-encoded structure (the input message)
  58
  59 #. if the script finishes with a non-zero exit code, it is
  60    considered a general failure and the full output will be reported to
  61    the users; this can be the case when the allocator can't parse the
  62    input message
  63
  64 #. if the allocator finishes with exit code zero, it is expected to
  65    output (on its stdout) a JSON-encoded structure (the response)
  66
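
The three protocol steps above can be sketched from the caller's side
as shown below. This is a minimal illustration, not Ganeti's own
implementation; the function name ``run_iallocator`` and the use of a
temporary file with a ``.json`` suffix are assumptions of this example.

```python
import json
import os
import subprocess
import tempfile


def run_iallocator(script_path, input_message):
    """Run an allocator script on `input_message`, return its response."""
    # Step 1: write the JSON-encoded input message to a file and pass
    # that filename as the program's single argument.
    fd, input_file = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(input_message, handle)
        proc = subprocess.run([script_path, input_file],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE,
                              universal_newlines=True)
    finally:
        os.unlink(input_file)
    # Step 2: a non-zero exit code is a general failure; the full output
    # is passed on so that it can be reported to the user.
    if proc.returncode != 0:
        raise RuntimeError("iallocator failed (exit code %d): %s%s"
                           % (proc.returncode, proc.stdout, proc.stderr))
    # Step 3: with exit code zero, stdout holds the JSON response.
    return json.loads(proc.stdout)
```
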
  67 Input message
  68 ~~~~~~~~~~~~~
  69
  70 The input message will be the JSON encoding of a dictionary containing
  71 all the required information to perform the operation. We explain the
  72 contents of this dictionary in two parts: common information that every
  73 type of operation requires, and operation-specific information.
  74
  75 Common information
  76 ++++++++++++++++++
  77
  78 All input dictionaries to the IAllocator must carry the following keys:
  79
  80 version
  81   the version of the protocol; this document
  82   specifies version 2
  83
  84 cluster_name
  85   the cluster name
  86
  87 cluster_tags
  88   the list of cluster tags
  89
  90 enabled_hypervisors
  91   the list of enabled hypervisors
  92
  93 ipolicy
  94   the cluster-wide instance policy (for information; the per-node group
  95   values take precedence and should be used instead)
  96
  97 request
  98   a dictionary containing the details of the request; the keys vary
  99   depending on the type of operation that's being requested, as
 100   explained in `Operation-specific input`_ below.
 101
 102 nodegroups
 103   a dictionary with the data for the cluster's node groups; it is keyed
 104   on the group UUID, and the values are a dictionary with the following
 105   keys:
 106
 107   name
 108     the node group name
 109   alloc_policy
 110     the allocation policy of the node group (consult the semantics of
 111     this attribute in the :manpage:`gnt-group(8)` manpage)
 112   networks
 113     the list of network UUIDs this node group is connected to
 114   ipolicy
 115     the instance policy of the node group
 116   tags
 117     the list of node group tags
 118
 119 instances
 120   a dictionary with the data for the instances currently existing on
 121   the cluster, indexed by instance name; the contents are similar to
 122   the instance definitions for the allocate mode, with the addition of:
 123
 124   admin_state
 125     if this instance is set to run (but not the actual status of the
 126     instance)
 127
 128   nodes
 129     list of nodes on which this instance is placed; the primary node
 130     of the instance is always the first one
 131
 132 nodes
 133   dictionary with the data for the nodes in the cluster, indexed by
 134   the node name; the dict contains [*]_ :
 135
 136   total_disk
 137     the total disk size of this node (mebibytes)
 138
 139   free_disk
 140     the free disk space on the node
 141
 142   total_memory
 143     the total memory size
 144
 145   free_memory
 146     free memory on the node; note that currently this does not take
 147     into account the instances which are down on the node
 148
 149   total_cpus
 150     the physical number of CPUs present on the machine; depending on
 151     the hypervisor, this might or might not be equal to how many CPUs
 152     the node operating system sees
 153
 154   primary_ip
 155     the primary IP address of the node
 156
 157   secondary_ip
 158     the secondary IP address of the node (the one used for the DRBD
 159     replication); note that this can be the same as the primary one
 160
 161   tags
 162     list with the tags of the node
 163
 164   master_candidate
 165     a boolean flag denoting whether this node is a master candidate
 166
 167   drained
 168     a boolean flag denoting whether this node is being drained
 169
 170   offline
 171     a boolean flag denoting whether this node is offline
 172
 173   i_pri_memory
 174     total memory required by primary instances
 175
 176   i_pri_up_memory
 177     total memory required by running primary instances
 178
 179   group
 180     the node group that this node belongs to
 180     the node group that this node belongs to
 181
 182   No allocations should be made on nodes having either the ``drained``
 183   or ``offline`` flags set. More details about these node status
 184   flags are available in the manpage :manpage:`ganeti(7)`.
 185
 186 .. [*] Note that no run-time data is present for offline, drained or
 187    non-vm_capable nodes; this means the keys ``total_memory``,
 188    ``reserved_memory``, ``free_memory``, ``total_disk``, ``free_disk``,
 189    ``total_cpus``, ``i_pri_memory`` and ``i_pri_up_memory`` will be absent
 190
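
An allocator script might therefore begin by discarding nodes that must
not receive allocations. The sketch below is one possible way to do
this; the helper name ``usable_nodes`` is hypothetical, and treating a
missing ``offline`` or ``drained`` key as false is an assumption of this
example (the sample input message later in this document omits both
flags).

```python
def usable_nodes(nodes):
    """Return the sorted names of nodes that may receive allocations.

    `nodes` is the ``nodes`` dictionary of the input message.
    """
    usable = []
    for name, data in nodes.items():
        # Nodes flagged as offline or drained are never candidates.
        if data.get("offline", False) or data.get("drained", False):
            continue
        # Nodes without run-time data cannot be evaluated, so this
        # sketch skips them as well.
        if "free_memory" not in data or "free_disk" not in data:
            continue
        usable.append(name)
    return sorted(usable)
```
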
 191 Operation-specific input
 192 ++++++++++++++++++++++++
 193
 194 All input dictionaries to the IAllocator carry, in the ``request``
 195 dictionary, detailed information about the operation that's being
 196 requested. The required keys vary depending on the type of operation, as
 197 follows.
 198
 199 In all cases, it includes:
 200
 201   type
 202     the request type; this can be ``allocate``, ``relocate``,
 203     ``change-group``, ``node-evacuate`` or ``multi-allocate``. The
 204     ``allocate`` request is used when a new instance needs to be placed
 205     on the cluster. The ``relocate`` request is used when an existing
 206     instance needs to be moved within its node group.
 207
 208     The ``multi-evacuate`` request type was formerly used to ask the
 209     script to compute the optimal relocation for all secondary
 210     instances of the given nodes. It is now deprecated and need only be
 211     implemented if backwards compatibility with Ganeti 2.4 and lower is
 212     needed.
 213
 214     The ``change-group`` request is used to relocate multiple instances
 215     across multiple node groups. ``node-evacuate`` evacuates instances
 216     off their node(s). These are described in a separate :ref:`design
 217     document <multi-reloc-detailed-design>`.
 218
 219     The ``multi-allocate`` request is used to allocate multiple
 220     instances on the cluster. Apart from that, the request is very
 221     similar to the ``allocate`` one. For more details look at
 222     :doc:`Ganeti bulk create <design-bulk-create>`.
 223
 224 For both allocate and relocate mode, the following extra keys are needed
 225 in the ``request`` dictionary:
 226
 227   name
 228     the name of the instance; if the request is a relocation, then this
 229     name will be found in the ``instances`` dictionary (see above);
 230     otherwise it is the FQDN of the new instance; type *string*
 231
 232   required_nodes
 233     how many nodes the algorithm should return; while this information
 234     can be deduced from the instance's disk template, it's better if
 235     this computation is left to Ganeti as then allocator scripts are
 236     less sensitive to changes to the disk templates; type *integer*
 237
 238   disk_space_total
 239     the total disk space that will be used by this instance on the
 240     (new) nodes; again, this information can be computed from the list
 241     of instance disks and its template type, but Ganeti is better
 242     suited to compute it; type *integer*
 243
 244 .. pyassert::
 245
 246    constants.DISK_ACCESS_SET == set([constants.DISK_RDONLY,
 247      constants.DISK_RDWR])
 248
 249 Allocation needs, in addition:
 250
 251   disks
 252     list of dictionaries holding the disk definitions for this
 253     instance (in the order they are exported to the hypervisor):
 254
 255     mode
 256       either :pyeval:`constants.DISK_RDONLY` or
 257       :pyeval:`constants.DISK_RDWR` denoting if the disk is read-only or
 258       writable
 259
 260     size
 261       the size of this disk in mebibytes
 262
 263   nics
 264     a list of dictionaries holding the network interfaces for this
 265     instance, containing:
 266
 267     ip
 268       the IP address that Ganeti knows for this instance, or null
 269
 270     mac
 271       the MAC address for this interface
 272
 273     bridge
 274       the bridge to which this interface will be connected
 275
 276   vcpus
 277     the number of VCPUs for the instance
 278
 279   disk_template
 280     the disk template for the instance
 281
 282   memory
 283     the memory size for the instance
 284
 285   os
 286     the OS type for the instance
 287
 288   tags
 289     the list of the instance's tags
 290
 291   hypervisor
 292     the hypervisor of this instance
 293
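
To show how the allocation keys relate to the per-node data, here is a
deliberately naive placement sketch. It is not the algorithm used by
any real allocator: the function name ``pick_nodes`` is hypothetical,
and it assumes that ``memory`` and ``disk_space_total`` must both fit
on every selected node. Instance policies, CPU ratios and node groups
are ignored.

```python
def pick_nodes(nodes, request):
    """Return `required_nodes` node names that fit the request, or None."""

    def fits(data):
        # A missing key defaults to -1, so nodes without run-time data
        # never fit (requested sizes are assumed to be non-negative).
        return (data.get("free_memory", -1) >= request["memory"]
                and data.get("free_disk", -1) >= request["disk_space_total"])

    fitting = [name for name, data in nodes.items() if fits(data)]
    # Prefer nodes with the most free memory; break ties by name so the
    # result is deterministic.
    fitting.sort(key=lambda name: (-nodes[name]["free_memory"], name))
    wanted = request["required_nodes"]
    if len(fitting) < wanted:
        return None
    return fitting[:wanted]
```
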
 294 Relocation:
 295
 296   relocate_from
 297     a list of nodes to move the instance away from; for DRBD-based
 298     instances, this will contain a single node, the current secondary
 299     of the instance, whereas for shared-storage instances, this will
 300     also contain a single node, the current primary of the instance;
 301     type *list of strings*
 302
 303 As for ``node-evacuate``, it needs the following request arguments:
 304
 305   instances
 306     a list of instance names to evacuate; type *list of strings*
 307
 308   evac_mode
 309     specify which instances to evacuate; one of ``primary-only``,
 310     ``secondary-only``, ``all``; type *string*
 311
 312 ``change-group`` needs the following request arguments:
 313
 314   instances
 315     a list of instance names whose group to change; type
 316     *list of strings*
 317
 318   target_groups
 319     must either be the empty list, or contain a list of group UUIDs that
 320     should be considered for relocating instances to; type
 321     *list of strings*
 322
 323 ``multi-allocate`` needs the following request arguments:
 324
 325   instances
 326     a list of request dicts
 327
 328 Response message
 329 ~~~~~~~~~~~~~~~~
 330
 331 The response message is much simpler than the input one. It is
 332 also a dict having three keys:
 333
 334 success
 335   a boolean value denoting if the allocation was successful or not
 336
 337 info
 338   a string with information from the scripts; if the allocation fails,
 339   this will be shown to the user
 340
 341 result
 342   the output of the algorithm; even if the algorithm failed
 343   (i.e. success is false), this must be returned as an empty list
 344
 345   for allocate/relocate, this is the list of node(s) for the instance;
 346   note that the length of this list must equal the ``required_nodes``
 347   entry in the input message, otherwise Ganeti will consider the result
 348   as failed
 349
 350   for the ``node-evacuate`` and ``change-group`` modes, this is a
 351   dictionary containing, among other information, a list of lists of
 352   serialized opcodes; see the :ref:`design document
 353   <multi-reloc-result>` for a detailed description
 354
 355   for the ``multi-allocate`` mode, this is a tuple of two lists; the
 356   first element of the tuple is the list of successful allocations,
 357   with the instance name as the first element of each entry and the
 358   node placement as the second. The second element of the tuple is the
 359   list of instances whose allocation failed.
 360
 361 .. note:: The current Ganeti version accepts either ``result`` or ``nodes``
 362    as a backwards-compatibility measure (older versions only supported
 363    ``nodes``)
 364
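
The rules for allocate/relocate responses can be sketched as a pair of
helpers, one for each side of the protocol. Both names
(``build_response`` and ``check_node_list``) are hypothetical, and the
sketch ignores the legacy ``nodes`` key mentioned in the note above.

```python
import json


def build_response(success, info, result=None):
    """Serialize a response message as an allocator script would print it."""
    return json.dumps({
        "success": bool(success),
        "info": info,
        # A failed request must still return an empty list as result.
        "result": result if success and result is not None else [],
    })


def check_node_list(response_text, required_nodes):
    """Decode an allocate/relocate response and return its node list."""
    response = json.loads(response_text)
    for key in ("success", "info", "result"):
        if key not in response:
            raise ValueError("missing key %r in response" % key)
    if not response["success"]:
        raise ValueError("allocation failed: %s" % response["info"])
    # The node list must be exactly as long as the request demanded.
    if len(response["result"]) != required_nodes:
        raise ValueError("expected %d node(s), got %d"
                         % (required_nodes, len(response["result"])))
    return response["result"]
```
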
 365 Examples
 366 --------
 367
 368 Input messages to scripts
 369 ~~~~~~~~~~~~~~~~~~~~~~~~~
 370
 371 Input message, new instance allocation (common elements are listed this
 372 time, but not included in further examples below)::
 373
 374   {
 375     "version": 2,
 376     "cluster_name": "cluster1.example.com",
 377     "cluster_tags": [],
 378     "enabled_hypervisors": [
 379       "xen-pvm"
 380     ],
 381     "nodegroups": {
 382       "f4e06e0d-528a-4963-a5ad-10f3e114232d": {
 383         "name": "default",
 384         "alloc_policy": "preferred",
 385         "networks": ["net-uuid-1", "net-uuid-2"],
 386         "ipolicy": {
 387           "disk-templates": ["drbd", "plain"],
 388           "minmax": [
 389             {
 390               "max": {
 391                 "cpu-count": 2,
 392                 "disk-count": 8,
 393                 "disk-size": 2048,
 394                 "memory-size": 12800,
 395                 "nic-count": 8,
 396                 "spindle-use": 8
 397               },
 398               "min": {
 399                 "cpu-count": 1,
 400                 "disk-count": 1,
 401                 "disk-size": 1024,
 402                 "memory-size": 128,
 403                 "nic-count": 1,
 404                 "spindle-use": 1
 405               }
 406             }
 407           ],
 408           "spindle-ratio": 32.0,
 409           "std": {
 410             "cpu-count": 1,
 411             "disk-count": 1,
 412             "disk-size": 1024,
 413             "memory-size": 128,
 414             "nic-count": 1,
 415             "spindle-use": 1
 416           },
 417           "vcpu-ratio": 4.0
 418         },
 419         "tags": ["ng-tag-1", "ng-tag-2"]
 420       }
 421     },
 422     "instances": {
 423       "instance1.example.com": {
 424         "tags": [],
 425         "should_run": false,
 426         "disks": [
 427           {
 428             "mode": "w",
 429             "size": 64
 430           },
 431           {
 432             "mode": "w",
 433             "size": 512
 434           }
 435         ],
 436         "nics": [
 437           {
 438             "ip": null,
 439             "mac": "aa:00:00:00:60:bf",
 440             "bridge": "xen-br0"
 441           }
 442         ],
 443         "vcpus": 1,
 444         "disk_template": "plain",
 445         "memory": 128,
 446         "nodes": [
 447           "node1.example.com"
 448         ],
 449         "os": "debootstrap+default"
 450       },
 451       "instance2.example.com": {
 452         "tags": [],
 453         "should_run": false,
 454         "disks": [
 455           {
 456             "mode": "w",
 457             "size": 512
 458           },
 459           {
 460             "mode": "w",
 461             "size": 256
 462           }
 463         ],
 464         "nics": [
 465           {
 466             "ip": null,
 467             "mac": "aa:00:00:55:f8:38",
 468             "bridge": "xen-br0"
 469           }
 470         ],
 471         "vcpus": 1,
 472         "disk_template": "drbd",
 473         "memory": 512,
 474         "nodes": [
 475           "node2.example.com",
 476           "node3.example.com"
 477         ],
 478         "os": "debootstrap+default"
 479       }
 480     },
 481     "nodes": {
 482       "node1.example.com": {
 483         "total_disk": 858276,
 484         "primary_ip": "198.51.100.1",
 485         "secondary_ip": "192.0.2.1",
 486         "tags": [],
 487         "group": "f4e06e0d-528a-4963-a5ad-10f3e114232d",
 488         "free_memory": 3505,
 489         "free_disk": 856740,
 490         "total_memory": 4095
 491       },
 492       "node2.example.com": {
 493         "total_disk": 858240,
 494         "primary_ip": "198.51.100.2",
 495         "secondary_ip": "192.0.2.2",
 496         "tags": ["test"],
 497         "group": "f4e06e0d-528a-4963-a5ad-10f3e114232d",
 498         "free_memory": 3505,
 499         "free_disk": 848320,
 500         "total_memory": 4095
 501       },
 502       "node3.example.com": {
 503         "total_disk": 572184,
 504         "primary_ip": "198.51.100.3",
 505         "secondary_ip": "192.0.2.3",
 506         "tags": [],
 507         "group": "f4e06e0d-528a-4963-a5ad-10f3e114232d",
 508         "free_memory": 3505,
 509         "free_disk": 570648,
 510         "total_memory": 4095
 511       }
 512     },
 513     "request": {
 514       "type": "allocate",
 515       "name": "instance3.example.com",
 516       "required_nodes": 2,
 517       "disk_space_total": 3328,
 518       "disks": [
 519         {
 520           "mode": "w",
 521           "size": 1024
 522         },
 523         {
 524           "mode": "w",
 525           "size": 2048
 526         }
 527       ],
 528       "nics": [
 529         {
 530           "ip": null,
 531           "mac": "00:11:22:33:44:55",
 532           "bridge": null
 533         }
 534       ],
 535       "vcpus": 1,
 536       "disk_template": "drbd",
 537       "memory": 2048,
 538       "os": "debootstrap+default",
 539       "tags": [
 540         "type:test",
 541         "owner:foo"
 542       ],
 543       "hypervisor": "xen-pvm"
 544     }
 545   }
 546
 547 Input message, relocation::
 548
 549   {
 550     "version": 2,
 551     ...
 552     "request": {
 553       "type": "relocate",
 554       "name": "instance2.example.com",
 555       "required_nodes": 1,
 556       "disk_space_total": 832,
 557       "relocate_from": [
 558         "node3.example.com"
 559       ]
 560     }
 561   }
 562
 563
 564 Response messages
 565 ~~~~~~~~~~~~~~~~~
 566 Successful response message::
 567
 568   {
 569     "success": true,
 570     "info": "Allocation successful",
 571     "result": [
 572       "node2.example.com",
 573       "node1.example.com"
 574     ]
 575   }
 576
 577 Failed response message::
 578
 579   {
 580     "success": false,
 581     "info": "Can't find a suitable node for position 2 (already selected: node2.example.com)",
 582     "result": []
 583   }
 584
 585 Successful node evacuation message::
 586
 587   {
 588     "success": true,
 589     "info": "Request successful",
 590     "result": [
 591       [
 592         "instance1",
 593         "node3"
 594       ],
 595       [
 596         "instance2",
 597         "node1"
 598       ]
 599     ]
 600   }
 601
 602
 603 Command line messages
 604 ~~~~~~~~~~~~~~~~~~~~~
 605 ::
 606
 607   # gnt-instance add -t plain -m 2g --os-size 1g --swap-size 512m --iallocator hail -o debootstrap+default instance3
 608   Selected nodes for the instance: node1.example.com
 609   * creating instance disks...
 610   [...]
 611
 612   # gnt-instance add -t plain -m 3400m --os-size 1g --swap-size 512m --iallocator hail -o debootstrap+default instance4
 613   Failure: prerequisites not met for this operation:
 614   Can't compute nodes using iallocator 'hail': Can't find a suitable node for position 1 (already selected: )
 615
 616   # gnt-instance add -t drbd -m 1400m --os-size 1g --swap-size 512m --iallocator hail -o debootstrap+default instance5
 617   Failure: prerequisites not met for this operation:
 618   Can't compute nodes using iallocator 'hail': Can't find a suitable node for position 2 (already selected: node1.example.com)
 619
 620 Reference implementation
 621 ~~~~~~~~~~~~~~~~~~~~~~~~
 622
 623 Ganeti's default iallocator is "hail" which is available when "htools"
 624 components have been enabled at build time (see :doc:`install-quick` for
 625 more details).
 626
 627 .. vim: set textwidth=72 :
 628 .. Local Variables:
 629 .. mode: rst
 630 .. fill-column: 72
 631 .. End: