========================
 Resource model changes
========================


Introduction
============

In order to manage virtual machines across the cluster, Ganeti needs to
understand the resources present on the nodes, the hardware and software
limitations of the nodes, and how much can be allocated safely on each
node. Some of these decisions are delegated to IAllocator plugins, for
easier site-level customisation.

Similarly, the HTools suite has an internal model that simulates the
hardware resource changes in response to Ganeti operations, in order to
provide both an IAllocator plugin and a tool for balancing the
cluster.

While currently the HTools model is much more advanced than Ganeti's,
neither one is flexible enough and both are heavily geared toward a
specific Xen model; they fail to work well with (e.g.) KVM or LXC, or
with Xen when :term:`tmem` is enabled. Furthermore, the set of metrics
contained in the models is limited to historic requirements and fails to
account for (e.g.) heterogeneity in the I/O performance of the nodes.

Current situation
=================

Ganeti
------

At this moment, Ganeti itself doesn't do any static modelling of the
cluster resources. It only does some runtime checks:

- when creating instances, for the (current) free disk space
- when starting instances, for the (current) free memory
- during cluster verify, for enough N+1 memory on the secondaries, based
  on the (current) free memory

Basically this model is a pure :term:`SoW` one, and it works well when
there are other instances/LVs on the nodes, as it allows Ganeti to deal
with ‘orphan’ resource usage, but on the other hand it has many issues,
described below.

HTools
------

Since HTools does a pure in-memory modelling of the cluster changes as
it executes the balancing or allocation steps, it had to introduce a
static (:term:`SoR`) cluster model.

The model is constructed from the node properties received from
Ganeti (hence it is essentially based on what Ganeti can export).

Disk
~~~~

For disk it consists of just the total (``tdsk``) and the free disk
space (``fdsk``); we don't directly track the used disk space. On top of
this, we compute and warn if the sum of disk sizes used by instances does
not match ``tdsk - fdsk``, but otherwise we do not track this
separately.

Memory
~~~~~~

For memory, the model is more complex and tracks some variables that
Ganeti itself doesn't compute. We start from the total (``tmem``), free
(``fmem``) and node memory (``nmem``) as supplied by Ganeti, and
additionally we track:

instance memory (``imem``)
    the total memory used by primary instances on the node, computed
    as the sum of instance memory

reserved memory (``rmem``)
    the memory reserved by peer nodes for N+1 redundancy; this memory is
    tracked per peer-node, and the maximum value out of the peer memory
    lists is the node's ``rmem``; when not using DRBD, this will be
    equal to zero

unaccounted memory (``xmem``)
    memory that cannot be accounted for via the Ganeti model; this is
    computed at startup as::

        tmem - imem - nmem - fmem

    and is presumed to remain constant irrespective of any instance
    moves

available memory (``amem``)
    this is simply ``fmem - rmem``, so unless we use DRBD, this will be
    equal to ``fmem``

``tmem``, ``nmem`` and ``xmem`` are presumed constant during the
instance moves, whereas the ``fmem``, ``imem``, ``rmem`` and ``amem``
values are updated according to the executed moves.
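
As a minimal sketch (not the actual HTools code), the derived values
listed above could be computed from the Ganeti-supplied ones as follows.
The function name ``derived_memory`` and the sample figures are
illustrative assumptions:

.. code-block:: python

    def derived_memory(tmem, fmem, nmem, instance_mem, peer_mem):
        """Compute imem, rmem, xmem and amem for one node.

        instance_mem: memory sizes of the node's primary instances.
        peer_mem: per-peer memory this node must hold for N+1 (DRBD only).
        """
        imem = sum(instance_mem)
        # the largest per-peer reservation is the node's rmem
        rmem = max(peer_mem.values(), default=0)
        # computed once at startup and presumed constant afterwards
        xmem = tmem - imem - nmem - fmem
        amem = fmem - rmem
        return {"imem": imem, "rmem": rmem, "xmem": xmem, "amem": amem}

    # Hypothetical node (values in MiB): two primaries, two DRBD peers
    node = derived_memory(tmem=16384, fmem=6144, nmem=1024,
                          instance_mem=[4096, 4096],
                          peer_mem={"node2": 2048, "node3": 1024})

Without DRBD the ``peer_mem`` mapping is empty, so ``rmem`` is zero and
``amem`` equals ``fmem``, as stated in the definitions.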

CPU
~~~

The CPU model is different from the disk/memory models, since it's the
only one where:

#. we do oversubscribe physical CPUs
#. and there is no natural limit for the number of VCPUs we can allocate

We therefore track the total number of VCPUs used on the node and the
number of physical CPUs, and we cap the vcpu-to-cpu ratio in order to
make this somewhat more similar to the other resources which are
limited.
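
The ratio cap can be pictured with a short sketch; the function name
``vcpu_ratio_ok`` and the numbers used are assumptions made for
illustration only, not taken from HTools:

.. code-block:: python

    def vcpu_ratio_ok(used_vcpus, new_vcpus, physical_cpus, max_ratio):
        """Return True if adding new_vcpus keeps the node within the cap."""
        return (used_vcpus + new_vcpus) / physical_cpus <= max_ratio

    # Hypothetical node with 8 physical CPUs and a cap of 4 VCPUs per CPU
    fits = vcpu_ratio_ok(used_vcpus=30, new_vcpus=2, physical_cpus=8,
                         max_ratio=4.0)
    too_many = vcpu_ratio_ok(used_vcpus=30, new_vcpus=3, physical_cpus=8,
                             max_ratio=4.0)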

Dynamic load
~~~~~~~~~~~~

There is also a model that deals with *dynamic load* values in
htools. As far as we know, it is not currently used with actual load
values, but it is active by default with unitary values for all
instances; it currently tracks these metrics:

- disk load
- memory load
- cpu load
- network load

Even though we do not assign real values to these load values, the fact
that we at least sum them means that the algorithm tries to equalise
these loads, and especially the network load, which is otherwise not
tracked at all. The practical result (due to a combination of these four
metrics) is that the number of secondaries will be balanced.

Limitations
-----------


There are unfortunately many limitations to the current model.

Memory
~~~~~~

The memory model doesn't work well in case of KVM. For Xen, the memory
for the node (i.e. ``dom0``) can be static or dynamic; we don't support
the latter case, but for the former case, the static value is configured
in Xen/kernel command line, and can be queried from Xen
itself. Therefore, Ganeti can query the hypervisor for the memory used
for the node; the same model was adopted for the chroot/KVM/LXC
hypervisors, but in these cases there's no natural value for the memory
used by the base OS/kernel, and we currently try to compute a value for
the node memory based on current consumption. This, being variable,
breaks the assumptions in both Ganeti and HTools.

This problem also shows for the free memory: if the free memory on the
node is not constant (Xen with :term:`tmem` auto-ballooning enabled), or
if the node and instance memory are pooled together (Linux-based
hypervisors like KVM and LXC), the current value of the free memory is
meaningless and cannot be used for instance checks.

A separate issue related to the free memory tracking is that since we
don't track memory use but rather memory availability, an instance that
is temporarily down changes Ganeti's understanding of the memory status of
the node. This can lead to problems such as:

.. digraph:: "free-mem-issue"

  node  [shape=box];
  inst1 [label="instance1"];
  inst2 [label="instance2"];

  node  [shape=note];
  nodeA [label="fmem=0"];
  nodeB [label="fmem=1"];
  nodeC [label="fmem=0"];

  node  [shape=ellipse, style=filled, fillcolor=green]

  {rank=same; inst1 inst2}

  stop    [label="crash!", fillcolor=orange];
  migrate [label="migrate/ok"];
  start   [style=filled, fillcolor=red, label="start/fail"];
  inst1   -> stop -> start;
  stop    -> migrate -> start [style=invis, weight=0];
  inst2   -> migrate;

  {rank=same; inst1 inst2 nodeA}
  {rank=same; stop nodeB}
  {rank=same; migrate nodeC}

  nodeA -> nodeB -> nodeC [style=invis, weight=1];

The behaviour here is wrong; the migration of *instance2* to the node in
question will succeed or fail depending on whether *instance1* is
running or not. And for *instance1*, it can lead to cases where, if it
crashes, it cannot restart anymore.

Finally, not a problem but rather an important missing feature is support
for memory over-subscription: both Xen and KVM have supported memory
ballooning, even automatic memory ballooning, for a while now. The
entire memory model is based on a fixed memory size for instances, and
if memory ballooning is enabled, it will “break” the HTools
algorithm. Even the fact that KVM instances do not use all memory from
the start creates problems (although not as severe, since usage will grow
and stabilise in the end).

Disks
~~~~~

Because we only track disk space currently, this means if we have a
cluster of ``N`` otherwise identical nodes but half of them have 10
drives of size ``X`` and the other half 2 drives of size ``5X``, HTools
will consider them exactly the same. However, in the case of mechanical
drives at least, the I/O performance will differ significantly based on
spindle count, and a “fair” load distribution should take this into
account (a similar comment can be made about processor/memory/network
speed).

Another problem related to the spindle count is the LVM allocation
algorithm. Currently, the algorithm always creates (or tries to create)
striped volumes, with the stripe count being hard-coded to the
``./configure`` parameter ``--with-lvm-stripecount``. This creates
problems like:

- when installing from a distribution package, all clusters will be
  either limited or overloaded due to this fixed value
- it is not possible to mix heterogeneous nodes (even in different node
  groups) and have optimal settings for all nodes
- the striping value applies both to LVM/DRBD data volumes (which are on
  the order of gigabytes to hundreds of gigabytes) and to DRBD metadata
  volumes (whose size is always fixed at 128MB); when striping such
  small volumes over many PVs, their size will increase needlessly (and
  this can confuse HTools' disk computation algorithm)

Moreover, space is currently allocated based on a ‘most free
space’ algorithm. This balances the free space usage on disks, but on
the other hand it tends to mix the data and metadata volumes of
different instances rather badly. For example, it cannot do the following:

- keep DRBD data and metadata volumes on the same drives, in order to
  reduce exposure to drive failure in a many-drives system
- keep DRBD data and metadata volumes on different drives, to reduce
  performance impact of metadata writes

Additionally, while Ganeti supports setting the volume group separately
for data and metadata volumes at instance creation, there are no defaults
for this setting.

Similar to the above stripe count problem (which is about insufficient
customisation of Ganeti's behaviour), we have limited
pass-through customisation of the various options of our storage
backends; while LVM has a system-wide configuration file that can be
used to tweak some of its behaviours, for DRBD we don't use the
:command:`drbdadm` tool, and instead we call :command:`drbdsetup`
directly, with a fixed/restricted set of options; so for example one
cannot tweak the buffer sizes.

Another current problem is that the support for shared storage in HTools
is still limited, but this problem is outside the scope of this design
document.

Locking
~~~~~~~

A further problem generated by the “current free” model is that during a
long operation which affects resource usage (e.g. disk replaces,
instance creations) we have to keep the respective objects locked
(sometimes even in exclusive mode), since we don't want any concurrent
modifications to the *free* values.

A classic example of the locking problem is the following:

.. digraph:: "iallocator-lock-issues"

  rankdir=TB;

  start [style=invis];
  node  [shape=box,width=2];
  job1  [label="add instance\niallocator run\nchoose A,B"];
  job1e [label="finish add"];
  job2  [label="add instance\niallocator run\nwait locks"];
  job2s [label="acquire locks\nchoose C,D"];
  job2e [label="finish add"];

  job1  -> job1e;
  job2  -> job2s -> job2e;
  edge [style=invis,weight=0];
  start -> {job1; job2}
  job1  -> job2;
  job2  -> job1e;
  job1e -> job2s [style=dotted,label="release locks"];

In the above example, the second IAllocator run will wait for locks for
nodes ``A`` and ``B``, even though in the end the second instance will
be placed on another set of nodes (``C`` and ``D``). This wait shouldn't
be needed, since right after the first IAllocator run has finished,
:command:`hail` knows the status of the cluster after the allocation,
and it could answer the question for the second run too; however, Ganeti
doesn't have such visibility into the cluster state and thus it is
forced to wait with the second job.

Similar examples can be made about replace disks (another long-running
opcode).

.. _label-policies:

Policies
~~~~~~~~

For most of the resources, we have metrics defined by policy: e.g. the
over-subscription ratio for CPUs, the amount of space to reserve,
etc. Furthermore, although Ganeti has no definitions such as
minimum/maximum instance size, a real deployment will need to have
them, especially in a fully-automated workflow where end-users can
request instances via an automated interface (that talks to the cluster
via RAPI, LUXI or command line). However, such an automated interface
will also need to take into account cluster capacity, and if the
:command:`hspace` tool is used for the capacity computation, it needs to
be told the maximum instance size; moreover, it has a built-in minimum
instance size which is not customisable.

It is clear that this situation leads to duplicate definitions of
resource policies, which makes it hard to change the respective policies
per-cluster (or globally), and furthermore it creates
inconsistencies if such policies are not enforced at the source (i.e. in
Ganeti).

Balancing algorithm
~~~~~~~~~~~~~~~~~~~

The balancing algorithm, as documented in the HTools ``README`` file,
tries to minimise the cluster score; this score is based on a set of
metrics that describe both exceptional conditions and how spread the
instances are across the nodes. In order to achieve this goal, it moves
the instances around, with a series of moves of various types:

- disk replaces (for DRBD-based instances)
- instance failover/migrations (for all types)

However, the algorithm only looks at the cluster score, and not at the
*“cost”* of the moves. In other words, the following can and will happen
on a cluster:

.. digraph:: "balancing-cost-issues"

  rankdir=LR;
  ranksep=1;

  start     [label="score α", shape=hexagon];

  node      [shape=box, width=2];
  replace1  [label="replace_disks 500G\nscore α-3ε\ncost 3"];
  replace2a [label="replace_disks 20G\nscore α-2ε\ncost 2"];
  migrate1  [label="migrate\nscore α-ε\ncost 1"];

  choose    [shape=ellipse,label="choose min(score)=α-3ε\ncost 3"];

  start -> {replace1; replace2a; migrate1} -> choose;

Even though a migration is much, much cheaper than a disk replace (in
terms of network and disk traffic on the cluster), if the disk replace
results in a score infinitesimally smaller, then it will be
chosen. Similarly, between two disk replaces, one moving e.g. ``500GiB``
and one moving ``20GiB``, the first one will be chosen if it results in
a score smaller than the second one. Furthermore, even if the resulting
scores are equal, the first computed solution will be kept, whichever it
is.

Fixing this algorithmic problem is doable, but currently Ganeti doesn't
export enough information about nodes to make an informed decision; in
the above example, if the ``500GiB`` move is between nodes having fast
I/O (both disks and network), it makes sense to execute it over a disk
replace of ``20GiB`` between nodes with slow I/O, so simply relating to
the properties of the move itself is not enough; we need more node
information for cost computation.
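
The selection behaviour described in this section can be reproduced with
a few lines of Python. This is only an illustration: the values chosen
for α and ε are arbitrary, and the cost-weighted variant (with its
``COST_WEIGHT`` constant) is a hypothetical alternative, not something
HTools implements:

.. code-block:: python

    from collections import namedtuple

    Move = namedtuple("Move", "name score cost")

    ALPHA, EPS = 1.0, 0.001  # arbitrary stand-ins for α and ε

    moves = [
        Move("replace_disks 500G", ALPHA - 3 * EPS, 3),
        Move("replace_disks 20G", ALPHA - 2 * EPS, 2),
        Move("migrate", ALPHA - EPS, 1),
    ]

    # Behaviour described above: only the resulting score matters
    score_only = min(moves, key=lambda m: m.score)

    # Hypothetical alternative: penalise each move by a weighted cost
    COST_WEIGHT = 0.01
    cost_aware = min(moves, key=lambda m: m.score + COST_WEIGHT * m.cost)

With these numbers the score-only choice is the 500G disk replace, while
the cost-weighted choice is the migration. As noted above, a realistic
cost would also have to depend on node properties, which Ganeti does not
currently export.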

Allocation algorithm
~~~~~~~~~~~~~~~~~~~~

.. note:: This design document will not address this limitation, but it
  is worth mentioning as it is directly related to the resource model.

The current allocation/capacity algorithm works as follows (per
node-group)::

    repeat:
        allocate instance without failing N+1

This simple algorithm, and its use of ``N+1`` criterion, has a built-in
limit of 1 machine failure in case of DRBD. This means the algorithm
guarantees that, if using DRBD storage, there are enough resources to
(re)start all affected instances in case of one machine failure. This
relates mostly to memory; there is no account for CPU over-subscription
(i.e. in case of failure, make sure we can failover while still not
going over CPU limits), or for any other resource.
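
The loop above can be sketched for a heavily simplified, memory-only
model with identical DRBD instances. Everything here is an assumption
made for illustration: the function names, the first-fit choice of node
pairs (the real tools choose placements by cluster score), and the use of
a single memory figure per node:

.. code-block:: python

    from itertools import permutations

    def n_plus_one_ok(free, reserved):
        """Every node must be able to hold the instances of its worst peer."""
        return all(max(reserved[node].values(), default=0) <= free[node]
                   for node in free)

    def capacity(node_mem, inst_mem):
        """Count how many instances of size inst_mem fit without failing N+1."""
        free = dict(node_mem)
        # reserved[sec][pri]: memory that node sec needs if node pri fails
        reserved = {node: {} for node in free}
        count = 0
        while True:
            for pri, sec in permutations(sorted(free), 2):
                # tentatively place the instance on (primary, secondary)
                free[pri] -= inst_mem
                reserved[sec][pri] = reserved[sec].get(pri, 0) + inst_mem
                if n_plus_one_ok(free, reserved):
                    count += 1
                    break
                # placement would break N+1 or overcommit: roll it back
                free[pri] += inst_mem
                reserved[sec][pri] -= inst_mem
            else:
                return count  # no node pair accepts another instance

    # Hypothetical node group: three nodes with 4 memory units each
    fitted = capacity({"a": 4, "b": 4, "c": 4}, inst_mem=1)

Note that the check only covers memory and a single node failure, which
is exactly the limitation discussed in this section.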

In case of shared storage, there's not even the memory guarantee, as the
N+1 protection doesn't work for shared storage.

If a given cluster administrator wants to survive up to two machine
failures, or wants to ensure CPU limits too for DRBD, there is no
possibility to configure this in HTools (neither in :command:`hail` nor
in :command:`hspace`). Current workarounds include, for example, deducting
a certain number of instances from the size computed by :command:`hspace`,
but this is a very crude method, and requires that instance creations
are limited before they reach Ganeti (otherwise :command:`hail` would
allocate until the cluster is full).

Proposed architecture
=====================


There are two main changes proposed:

- changing the resource model from a pure :term:`SoW` to a hybrid
  :term:`SoR`/:term:`SoW` one, where the :term:`SoR` component is
  heavily emphasised
- extending the resource model to cover additional properties,
  completing the “holes” in the current coverage

The second change is rather straightforward, but will add more
complexity in the modelling of the cluster. The first change, however,
represents a significant shift from the current model, which Ganeti has
had from its beginnings.

Lock-improved resource model
----------------------------

Hybrid SoR/SoW model
~~~~~~~~~~~~~~~~~~~~

The resources of a node can be characterised in two broad classes:

- mostly static resources
- dynamically changing resources

In the first category, we have things such as total core count, total
memory size, total disk size, number of network interfaces etc. In the
second category we have things such as free disk space, free memory, CPU
load, etc. Note that nowadays we no longer have fully-static
resources: features like CPU and memory hot-plug, online disk replace,
etc. mean that theoretically all resources can change (there are some
practical limitations, of course).

Even though the rate of change of the two resource types is wildly
different, right now Ganeti handles both the same. Given that the
interval of change of the semi-static ones is much bigger than most
Ganeti operations, even more than lengthy sequences of Ganeti jobs, it
makes sense to treat them separately.

The proposal is then to move the following resources into the
configuration and treat the configuration as the authoritative source
for them (a :term:`SoR` model):

- CPU resources:
    - total core count
    - node core usage (*new*)
- memory resources:
    - total memory size
    - node memory size
    - hypervisor overhead (*new*)
- disk resources:
    - total disk size
    - disk overhead (*new*)

Since these resources can nevertheless change at run-time, we will need
functionality to update the recorded values.

Pre-computing dynamic resource values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Recall that the resource model used by HTools models the clusters as
obeying the following equations:

  disk\ :sub:`free` = disk\ :sub:`total` - ∑ disk\ :sub:`instances`

  mem\ :sub:`free` = mem\ :sub:`total` - ∑ mem\ :sub:`instances` - mem\
  :sub:`node` - mem\ :sub:`overhead`

As this model worked fine for HTools, we can consider it valid and adopt
it in Ganeti. Furthermore, note that all values in the right-hand side
come now from the configuration:

- the per-instance usage values were already stored in the configuration
- the other values will be moved to the configuration per the previous
  section

This means that we can now compute the free values without having to
actually live-query the nodes, which brings a significant advantage.

There are a couple of caveats to this model though. First, as the
run-time state of the instance is no longer taken into consideration, it
means that we have to introduce a new *offline* state for an instance
(similar to the node one). In this state, the instance's runtime
resources (memory and VCPUs) are no longer reserved for it, and can be
reused by other instances. Static resources like disk and MAC addresses
are still reserved though. Transitioning into and out of this offline
state will be more involved than simply stopping/starting the instance
(e.g. de-offlining can fail due to missing resources). This complexity
is compensated by the increased consistency of what guarantees we have
in the stopped state (we always guarantee resource reservation), and the
potential for management tools to restrict which users can transition
into/out of this state separate from which users can stop/start the
instance.
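
A minimal sketch of the computation, using only configuration data and
honouring the *offline* state, could look as follows. The dictionary
layout and key names are hypothetical and do not reflect the actual
Ganeti configuration format:

.. code-block:: python

    def free_resources(node, instances):
        """Compute free disk and memory for a node from configuration data.

        node: dict with disk_total, mem_total, mem_node and mem_overhead.
        instances: dicts with disk, mem and an offline flag.
        """
        # disk stays reserved even for offline instances
        disk_used = sum(inst["disk"] for inst in instances)
        # memory is reserved unless the instance is offline
        mem_used = sum(inst["mem"] for inst in instances
                       if not inst["offline"])
        return {
            "disk_free": node["disk_total"] - disk_used,
            "mem_free": (node["mem_total"] - mem_used
                         - node["mem_node"] - node["mem_overhead"]),
        }

    # Hypothetical values (disk in GiB, memory in MiB)
    node_cfg = {"disk_total": 1000, "mem_total": 16384,
                "mem_node": 1024, "mem_overhead": 512}
    inst_cfg = [
        {"disk": 100, "mem": 4096, "offline": False},
        {"disk": 200, "mem": 2048, "offline": True},
    ]
    free = free_resources(node_cfg, inst_cfg)

A stopped (but not offline) instance would still count towards
``mem_used`` here, which is what makes the result independent of the
instance's run-time state.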

Separating per-node resource locks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many of the current node locks in Ganeti exist in order to guarantee
correct resource state computation, whereas others are designed to
guarantee reasonable run-time performance of nodes (e.g. by not
overloading the I/O subsystem). This is an unfortunate coupling, since
it means for example that the following two operations conflict in
practice even though they are orthogonal:

- replacing an instance's disk on a node
- computing node disk/memory free for an IAllocator run

This conflict significantly increases the lock contention on a big/busy
cluster and is at odds with the goal of increasing the cluster size.

The proposal is therefore to add a new level of locking that is only
used to prevent concurrent modification to the resource states (either
node properties or instance properties) and not for long-term
operations:

- instance creation needs to acquire and keep this lock until adding the
  instance to the configuration
- instance modification needs to acquire and keep this lock until
  updating the instance
- node property changes will need to acquire this lock for the
  modification

The new lock level will sit before the instance level (right after BGL)
and could either be single-valued (like the “Big Ganeti Lock”), in which
case we won't be able to modify two nodes at the same time, or per-node,
in which case the list of locks at this level needs to be synchronised
with the node lock level. To be determined.

Lock contention reduction
~~~~~~~~~~~~~~~~~~~~~~~~~

Based on the above, lock contention will be reduced as follows:
IAllocator calls will no longer need the ``LEVEL_NODE: ALL_SET`` lock,
only the resource lock (in exclusive mode). Hence allocating/computing
evacuation targets will no longer conflict for longer than the time to
compute the allocation solution.

The remaining long-running locks will be the DRBD replace-disks ones
(exclusive mode). These can also be removed, or changed into shared
locks, but that is a separate design change.

.. admonition:: FIXME

  Need to rework instance replace disks. I don't think we need exclusive
  locks for replacing disks: it is safe to stop/start the instance while
  it's doing a replace disks. Only modify would need exclusive, and only
  for transitioning into/out of offline state.

Instance memory model
---------------------

In order to support ballooning, the instance memory model needs to be
changed from a “memory size” one to a “min/max memory size”. This
interacts with the new static resource model, however, and thus we need
to declare a priori the expected oversubscription ratio on the cluster.

The new minimum memory size parameter will be similar to the current
memory size; the cluster will guarantee that in all circumstances, all
instances will have their minimum memory size available. The maximum
memory size will permit burst usage of more memory by instances, with
the restriction that the sum of maximum memory usage will not be more
than the free memory times the oversubscription ratio:

    ∑ memory\ :sub:`min` ≤ memory\ :sub:`available`

    ∑ memory\ :sub:`max` ≤ memory\ :sub:`free` * oversubscription_ratio

The hypervisor will have the possibility of adjusting the instance's
memory size dynamically between these two boundaries.

Note that the minimum memory is related to the available memory on the
node, whereas the maximum memory is related to the free memory. On
DRBD-enabled clusters, this will have the advantage of using the
reserved memory for N+1 failover for burst usage, instead of having it
completely idle.
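
The two inequalities translate directly into a check such as the one
below. This is a sketch only: the function name and the sample figures
are assumptions, and the memory values are taken as given inputs rather
than derived from a node model:

.. code-block:: python

    def memory_limits_ok(instances, mem_available, mem_free,
                         oversubscription_ratio):
        """Check both constraints for a list of (min, max) memory sizes."""
        total_min = sum(low for low, _high in instances)
        total_max = sum(high for _low, high in instances)
        return (total_min <= mem_available
                and total_max <= mem_free * oversubscription_ratio)

    # Hypothetical node (MiB): 2048 of the free memory is reserved for N+1
    specs = [(1024, 4096), (2048, 4096)]
    with_burst = memory_limits_ok(specs, mem_available=4096, mem_free=6144,
                                  oversubscription_ratio=1.5)
    no_burst = memory_limits_ok(specs, mem_available=4096, mem_free=6144,
                                oversubscription_ratio=1.0)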

.. admonition:: FIXME

  Need to document how Ganeti forces minimum size at runtime, overriding
  the hypervisor, in cases of failover/lack of resources.

New parameters
--------------

Unfortunately the design will add a significant number of new
parameters, and change the meaning of some of the current ones.

Instance size limits
~~~~~~~~~~~~~~~~~~~~

As described in :ref:`label-policies`, we currently lack a clear
definition of the supported instance sizes (minimum, maximum and
standard). As such, we will add the following structure to the cluster
parameters:

- ``min_ispec``, ``max_ispec``: minimum and maximum acceptable instance
  specs
- ``std_ispec``: standard instance size, which will be used for capacity
  computations and for default parameters on the instance creation
  request

Ganeti will by default reject non-standard instance sizes (lower than
``min_ispec`` or greater than ``max_ispec``), but as usual a
``--ignore-ipolicy`` option on the command line or in the RAPI request
will override these constraints. The ``std_ispec`` structure will be used
to fill in missing instance specifications on create.

Each of the ispec structures will be a dictionary, since the contents
can change over time. Initially, we will define the following variables
in these structures:

+---------------+----------------------------------+--------------+
|Name           |Description                       |Type          |
+===============+==================================+==============+
|mem_size       |Allowed memory size               |int           |
+---------------+----------------------------------+--------------+
|cpu_count      |Allowed vCPU count                |int           |
+---------------+----------------------------------+--------------+
|disk_count     |Allowed disk count                |int           |
+---------------+----------------------------------+--------------+
|disk_size      |Allowed disk size                 |int           |
+---------------+----------------------------------+--------------+
|nic_count      |Allowed NIC count                 |int           |
+---------------+----------------------------------+--------------+
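
As an illustration of how such dictionaries might be used, the sketch
below fills a partial request from the standard spec and lists the
variables that fall outside the allowed range. The helper names and all
numeric limits are hypothetical; only the variable names come from the
table above:

.. code-block:: python

    ISPEC_KEYS = ("mem_size", "cpu_count", "disk_count", "disk_size",
                  "nic_count")

    def fill_from_std(request, std_ispec):
        """Complete a partial request with the standard values."""
        return {key: request.get(key, std_ispec[key]) for key in ISPEC_KEYS}

    def ipolicy_violations(spec, min_ispec, max_ispec):
        """Return the variables outside the [min, max] range."""
        return [key for key in ISPEC_KEYS
                if not min_ispec[key] <= spec[key] <= max_ispec[key]]

    # Hypothetical cluster limits
    min_ispec = {"mem_size": 128, "cpu_count": 1, "disk_count": 1,
                 "disk_size": 1024, "nic_count": 1}
    max_ispec = {"mem_size": 32768, "cpu_count": 8, "disk_count": 16,
                 "disk_size": 1048576, "nic_count": 8}
    std_ispec = {"mem_size": 1024, "cpu_count": 1, "disk_count": 1,
                 "disk_size": 10240, "nic_count": 1}

    spec = fill_from_std({"mem_size": 65536, "cpu_count": 2}, std_ispec)
    problems = ipolicy_violations(spec, min_ispec, max_ispec)

A request with a non-empty ``problems`` list would be rejected unless
the override option is given.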

Inheritance
+++++++++++

In a single-group cluster, the above structure is sufficient. However,
on a multi-group cluster, it could be that the hardware specifications
differ across node groups, and thus the following problem appears: how
can Ganeti present unified specifications over RAPI?

Since the set of instance specs is only partially ordered (as opposed to
the sets of values of individual variables in the spec, which are totally
ordered), it follows that we can't present unified specs. As such, the
proposed approach is to allow the ``min_ispec`` and ``max_ispec`` to be
customised per node-group (and export them as a list of specifications),
and a single ``std_ispec`` at cluster level (exported as a single value).
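
The partial ordering can be seen with two hypothetical per-group maximum
specs, neither of which dominates the other. The comparison helper is an
assumption used only to make the point concrete:

.. code-block:: python

    def spec_le(a, b):
        """True if spec a is less than or equal to b in every variable."""
        return all(a[key] <= b[key] for key in a)

    # Hypothetical maxima: one group has more memory, the other more CPUs
    group1_max = {"mem_size": 32768, "cpu_count": 4}
    group2_max = {"mem_size": 8192, "cpu_count": 16}

    comparable = (spec_le(group1_max, group2_max)
                  or spec_le(group2_max, group1_max))

Since ``comparable`` is false here, no single spec can stand in for both
groups without either excluding valid instances or admitting invalid
ones, hence the list of per-group specifications.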


Allocation parameters
~~~~~~~~~~~~~~~~~~~~~

Beside the limits of min/max instance sizes, there are other parameters
related to capacity and allocation limits. These mostly address the
problems of over-allocation.

+-----------------+----------+---------------------------+----------+------+
| Name            |Level(s)  |Description                |Current   |Type  |
|                 |          |                           |value     |      |
+=================+==========+===========================+==========+======+
|vcpu_ratio       |cluster,  |Maximum ratio of virtual to|64 (only  |float |
|                 |node group|physical CPUs              |in htools)|      |
+-----------------+----------+---------------------------+----------+------+
|spindle_ratio    |cluster,  |Maximum ratio of instances |none      |float |
|                 |node group|to spindles; when the I/O  |          |      |
|                 |          |model doesn't map directly |          |      |
|                 |          |to spindles, another       |          |      |
|                 |          |measure of I/O should be   |          |      |
|                 |          |used instead               |          |      |
+-----------------+----------+---------------------------+----------+------+
|max_node_failures|cluster,  |Cap allocation/capacity so |1         |int   |
|                 |node group|that the cluster can       |(hardcoded|      |
|                 |          |survive this many node     |in htools)|      |
|                 |          |failures                   |          |      |
+-----------------+----------+---------------------------+----------+------+

Since these are used mostly internally (in htools), they will be
exported as-is from Ganeti, without explicit handling of node-groups
grouping.

Regarding ``spindle_ratio``, in this context spindles do not necessarily
have to mean actual mechanical hard drives; it's rather a measure of
I/O performance for internal storage.
 685
 686 Disk parameters
 687 ~~~~~~~~~~~~~~~
 688
 689 The proposed model for the new disk parameters is a simple free-form one
 690 based on dictionaries, indexed per disk template and parameter name.
 691 Only the disk template parameters are visible to the user, and those are
 692 internally translated to logical disk level parameters.
 693
 694 This is a simplification, because each parameter is applied to a whole
 695 nested structure and there is no way of fine-tuning each level's
 696 parameters, but it is good enough for the current parameter set. This
 697 model may need to be expanded, e.g. if support for three-node stacked
 698 DRBD setups is added to Ganeti.
 699
 700 At JSON level, since the object key has to be a string, the keys can be
 701 encoded via a separator (e.g. slash), or by having two dict levels.
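
The two encodings can be sketched as follows. This is only an
illustration: the separator choice, the helper names and the sample
values are assumptions, not a decided format.

```python
import json

SEP = "/"  # assumed separator; the text only suggests "e.g. slash"


def flatten(params):
    """Turn {template: {name: value}} into {"template/name": value}."""
    return {
        template + SEP + name: value
        for template, names in params.items()
        for name, value in names.items()
    }


def unflatten(flat):
    """Inverse of flatten(); each key is split at the first separator."""
    nested = {}
    for key, value in flat.items():
        template, name = key.split(SEP, 1)
        nested.setdefault(template, {})[name] = value
    return nested


# Illustrative values only.
nested = {
    "plain": {"stripes": 1},
    "drbd": {"metavg": "xenvg", "resync-rate": 1024},
}
flat = flatten(nested)
```

Both forms survive a JSON round trip, since all keys are strings; the
two-level form avoids having to reserve a separator character.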
 702
 703 When needed, the unit of measurement is expressed inside square
 704 brackets.
 705
 706 +--------+--------------+-------------------------+---------------------+------+
 707 |Disk    |Name          |Description              |Current status       |Type  |
 708 |template|              |                         |                     |      |
 709 +========+==============+=========================+=====================+======+
 710 |plain   |stripes       |How many stripes to use  |Configured at        |int   |
 711 |        |              |for newly created (plain)|./configure time, not|      |
 712 |        |              |logical volumes          |overridable at       |      |
 713 |        |              |                         |runtime              |      |
 714 +--------+--------------+-------------------------+---------------------+------+
 715 |drbd    |data-stripes  |How many stripes to use  |Same as for          |int   |
 716 |        |              |for data volumes         |plain/stripes        |      |
 717 +--------+--------------+-------------------------+---------------------+------+
 718 |drbd    |metavg        |Default volume group for |Same as the main     |string|
 719 |        |              |the metadata LVs         |volume group,        |      |
 720 |        |              |                         |overridable via      |      |
 721 |        |              |                         |'metavg' key         |      |
 722 +--------+--------------+-------------------------+---------------------+------+
 723 |drbd    |meta-stripes  |How many stripes to use  |Same as for lvm      |int   |
 724 |        |              |for meta volumes         |'stripes', suboptimal|      |
 725 |        |              |                         |as the meta LVs are  |      |
 726 |        |              |                         |small                |      |
 727 +--------+--------------+-------------------------+---------------------+------+
 728 |drbd    |disk-barriers |What kind of barriers to |Either all enabled or|string|
 729 |        |              |*disable* for disks;     |all disabled, per    |      |
 730 |        |              |either "n" or a string   |./configure time     |      |
 731 |        |              |containing a subset of   |option               |      |
 732 |        |              |"bfd"                    |                     |      |
 733 +--------+--------------+-------------------------+---------------------+------+
 734 |drbd    |meta-barriers |Whether to disable or not|Handled together with|bool  |
 735 |        |              |the barriers for the meta|disk-barriers        |      |
 736 |        |              |volume                   |                     |      |
 737 +--------+--------------+-------------------------+---------------------+------+
 738 |drbd    |resync-rate   |The (static) resync rate |Hardcoded in         |int   |
 739 |        |              |for drbd, when using the |constants.py, not    |      |
 740 |        |              |static syncer, in KiB/s  |changeable via Ganeti|      |
 741 +--------+--------------+-------------------------+---------------------+------+
 742 |drbd    |dynamic-resync|Whether to use the       |Not supported.       |bool  |
 743 |        |              |dynamic resync speed     |                     |      |
 744 |        |              |controller or not. If    |                     |      |
 745 |        |              |enabled, c-plan-ahead    |                     |      |
 746 |        |              |must be non-zero and all |                     |      |
 747 |        |              |the c-* parameters will  |                     |      |
 748 |        |              |be used by DRBD.         |                     |      |
 749 |        |              |Otherwise, the value of  |                     |      |
 750 |        |              |resync-rate will be used |                     |      |
 751 |        |              |as a static resync speed.|                     |      |
 752 +--------+--------------+-------------------------+---------------------+------+
 753 |drbd    |c-plan-ahead  |Agility factor of the    |Not supported.       |int   |
 754 |        |              |dynamic resync speed     |                     |      |
 755 |        |              |controller. (the higher, |                     |      |
 756 |        |              |the slower the algorithm |                     |      |
 757 |        |              |will adapt the resync    |                     |      |
 758 |        |              |speed). A value of 0     |                     |      |
 759 |        |              |(that is the default)    |                     |      |
 760 |        |              |disables the controller  |                     |      |
 761 |        |              |[ds]                     |                     |      |
 762 +--------+--------------+-------------------------+---------------------+------+
 763 |drbd    |c-fill-target |Maximum amount of        |Not supported.       |int   |
 764 |        |              |in-flight resync data    |                     |      |
 765 |        |              |for the dynamic resync   |                     |      |
 766 |        |              |speed controller         |                     |      |
 767 |        |              |[sectors]                |                     |      |
 768 +--------+--------------+-------------------------+---------------------+------+
 769 |drbd    |c-delay-target|Maximum estimated peer   |Not supported.       |int   |
 770 |        |              |response latency for the |                     |      |
 771 |        |              |dynamic resync speed     |                     |      |
 772 |        |              |controller [ds]          |                     |      |
 773 +--------+--------------+-------------------------+---------------------+------+
 774 |drbd    |c-max-rate    |Upper bound on resync    |Not supported.       |int   |
 775 |        |              |speed for the dynamic    |                     |      |
 776 |        |              |resync speed controller  |                     |      |
 777 |        |              |[KiB/s]                  |                     |      |
 778 +--------+--------------+-------------------------+---------------------+------+
 779 |drbd    |c-min-rate    |Minimum resync speed for |Not supported.       |int   |
 780 |        |              |the dynamic resync speed |                     |      |
 781 |        |              |controller [KiB/s]       |                     |      |
 782 +--------+--------------+-------------------------+---------------------+------+
 783 |drbd    |disk-custom   |Free-form string that    |Not supported        |string|
 784 |        |              |will be appended to the  |                     |      |
 785 |        |              |drbdsetup disk command   |                     |      |
 786 |        |              |line, for custom options |                     |      |
 787 |        |              |not supported by Ganeti  |                     |      |
 788 |        |              |itself                   |                     |      |
 789 +--------+--------------+-------------------------+---------------------+------+
 790 |drbd    |net-custom    |Free-form string for     |Not supported        |string|
 791 |        |              |custom net setup options |                     |      |
 792 +--------+--------------+-------------------------+---------------------+------+
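
The interaction between ``dynamic-resync``, ``c-plan-ahead`` and
``resync-rate`` described in the table can be sketched as below. The
function is a hypothetical illustration of the stated rules, not
Ganeti code.

```python
def resync_settings(params):
    """Pick the resync settings that would apply (hypothetical helper).

    Raises ValueError if dynamic-resync is enabled while c-plan-ahead
    is zero or missing.
    """
    if params.get("dynamic-resync", False):
        if params.get("c-plan-ahead", 0) == 0:
            raise ValueError("dynamic-resync needs a non-zero c-plan-ahead")
        # With the controller enabled, all the c-* parameters are used.
        return {k: v for k, v in params.items() if k.startswith("c-")}
    # Otherwise resync-rate is used as a static resync speed [KiB/s].
    return {"resync-rate": params["resync-rate"]}
```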
 793
 794 Currently Ganeti supports only DRBD 8.0.x, 8.2.x, 8.3.x.  It will refuse
 795 to work with DRBD 8.4 since the :command:`drbdsetup` syntax has changed
 796 significantly.
 797
 798 The barriers-related parameters have been introduced in different DRBD
 799 versions; please make sure that your version supports all the barrier
 800 parameters that you pass to Ganeti. Any version later than 8.3.0
 801 implements all of them.
 802
 803 The minimum DRBD version for using the dynamic resync speed controller
 804 is 8.3.9, since previous versions implement different parameters.
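
A minimal sketch of the version rules stated above, assuming the DRBD
version is available as a dotted string; the function names are
hypothetical.

```python
def parse_version(text):
    """Convert a dotted version string such as "8.3.11" to a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def drbd_support(version_text):
    """Summarise what the stated rules allow for a given DRBD version."""
    version = parse_version(version_text)
    # Only the 8.0.x, 8.2.x and 8.3.x series are supported; 8.4 is refused.
    supported = version[:2] in ((8, 0), (8, 2), (8, 3))
    # The dynamic resync speed controller needs at least 8.3.9.
    dynamic_resync = supported and version >= (8, 3, 9)
    return {"supported": supported, "dynamic_resync": dynamic_resync}
```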
 805
 806 A more detailed discussion of the dynamic resync speed controller
 807 parameters is outside the scope of the present document. Please refer to
 808 the ``drbdsetup`` man page
 809 (`8.3 <http://www.drbd.org/users-guide-8.3/re-drbdsetup.html>`_ and
 810 `8.4 <http://www.drbd.org/users-guide/re-drbdsetup.html>`_). An
 811 interesting discussion about them can also be found in a
 812 `drbd-user mailing list post
 813 <http://lists.linbit.com/pipermail/drbd-user/2011-August/016739.html>`_.
 814
 815 All the above parameters are at cluster and node group level; as in
 816 other parts of the code, the intention is that all nodes in a node group
 817 should be equal. It will be decided later which node group takes
 818 precedence for instances split over node groups.
 819
 820 .. admonition:: FIXME
 821
 822    Add details about when each parameter change takes effect (device
 823    creation vs. activation)
 824
 825 Node parameters
 826 ~~~~~~~~~~~~~~~
 827
 828 For the new memory model, we'll add the following parameters, in a
 829 dictionary indexed by the hypervisor name (node attribute
 830 ``hv_state``). The rationale is that, even though multi-hypervisor
 831 clusters are rare, they make sense sometimes, and thus we need to
 832 support multiple node states (one per hypervisor).
 833
 834 Since usually only one of the multiple hypervisors is the 'main' one
 835 (and the others used sparingly), capacity computation will still only
 836 use the first hypervisor, and not all of them. Thus we avoid possible
 837 inconsistencies.
 838
 839 +----------+-----------------------------------+---------------+-------+
 840 |Name      |Description                        |Current state  |Type   |
 841 |          |                                   |               |       |
 842 +==========+===================================+===============+=======+
 843 |mem_total |Total node memory, as discovered by|Queried at     |int    |
 844 |          |this hypervisor                    |runtime        |       |
 845 +----------+-----------------------------------+---------------+-------+
 846 |mem_node  |Memory used by, or reserved for,   |Queried at     |int    |
 847 |          |the node itself; note that some    |runtime        |       |
 848 |          |hypervisors can report this in an  |               |       |
 849 |          |authoritative way, others cannot   |               |       |
 850 +----------+-----------------------------------+---------------+-------+
 851 |mem_hv    |Memory used either by the          |Not used,      |int    |
 852 |          |hypervisor itself or lost due to   |htools computes|       |
 853 |          |instance allocation rounding;      |it internally  |       |
 854 |          |usually this cannot be precisely   |               |       |
 855 |          |computed, but only roughly         |               |       |
 856 |          |estimated                          |               |       |
 857 +----------+-----------------------------------+---------------+-------+
 858 |cpu_total |Total node cpu (core) count;       |Queried at     |int    |
 859 |          |usually this can be discovered     |runtime        |       |
 860 |          |automatically                      |               |       |
 861 |          |                                   |               |       |
 862 |          |                                   |               |       |
 863 |          |                                   |               |       |
 864 +----------+-----------------------------------+---------------+-------+
 865 |cpu_node  |Number of cores reserved for the   |Not used at all|int    |
 866 |          |node itself; this can either be    |               |       |
 867 |          |discovered or set manually. Only   |               |       |
 868 |          |used for estimating how many VCPUs |               |       |
 869 |          |are left for instances             |               |       |
 870 |          |                                   |               |       |
 871 +----------+-----------------------------------+---------------+-------+
 872
 873 Of the above parameters, only the ``_total`` ones are straightforward.
 874 The others sometimes have strange semantics:
 875
 876 - Xen can report ``mem_node``, if configured statically (as we
 877   recommend); but Linux-based hypervisors (KVM, chroot, LXC) do not, so
 878   the value needs to be configured statically for them
 879 - ``mem_hv``, representing unaccounted-for memory, is not directly
 880   computable; on Xen, it can be seen that on an N GB machine, with 1 GB
 881   for dom0 and N-2 GB for instances, there are just a few MB left, instead
 882   of a full 1 GB of RAM; however, the exact value varies with the total
 883   memory size (at least)
 884 - ``cpu_node`` only makes sense on Xen (currently), in the case when we
 885   restrict dom0; for Linux-based hypervisors, the node itself cannot be
 886   easily restricted, so it should be set as an estimate of how "heavy"
 887   the node loads will be
 888
 889 Since these two values cannot be auto-computed from the node, we need to
 890 be able to declare a default at cluster level (debatable how useful they
 891 are at node group level); the proposal is to do this via a cluster-level
 892 ``hv_state`` dict (per hypervisor).
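
One possible reading of this proposal is sketched below: cluster-level
``hv_state`` values act as defaults that node-level values override,
and the memory left for instances is what remains after ``mem_node``
and ``mem_hv`` are subtracted. The helper names, the subtraction
formula and the sample values (taken to be in MiB) are assumptions made
for illustration.

```python
def node_hv_state(cluster_hv_state, node_hv_state_dict, hypervisor):
    """Merge cluster-level defaults with node-level values (hypothetical)."""
    merged = dict(cluster_hv_state.get(hypervisor, {}))
    merged.update(node_hv_state_dict.get(hypervisor, {}))
    return merged


def instance_memory(state):
    """Memory assumed to be left for instances on the node."""
    return state["mem_total"] - state["mem_node"] - state["mem_hv"]


# Illustrative values only.
cluster = {"kvm": {"mem_node": 1024, "mem_hv": 0, "cpu_node": 1}}
node = {"kvm": {"mem_total": 16384, "cpu_total": 8}}
state = node_hv_state(cluster, node, "kvm")
```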
 893
 894 Besides the per-hypervisor attributes, we also have disk attributes,
 895 which are queried directly on the node (without hypervisor
 896 involvement). They are stored in a separate attribute (``disk_state``),
 897 which is indexed per storage type and name; currently this will be just
 898 ``LD_LV`` and the volume name as key.
 899
 900 +-------------+-------------------------+--------------------+--------+
 901 |Name         |Description              |Current state       |Type    |
 902 |             |                         |                    |        |
 903 +=============+=========================+====================+========+
 904 |disk_total   |Total disk size          |Queried at runtime  |int     |
 905 |             |                         |                    |        |
 906 +-------------+-------------------------+--------------------+--------+
 907 |disk_reserved|Reserved disk size; this |Not used in Ganeti; |int     |
 908 |             |is a lower limit on the  |htools has a        |        |
 909 |             |free space, if such a    |parameter for this  |        |
 910 |             |limit is desired         |                    |        |
 911 +-------------+-------------------------+--------------------+--------+
 912 |disk_overhead|Disk that is expected to |Not used in Ganeti; |int     |
 913 |             |be used by other volumes |htools detects this |        |
 914 |             |(set via                 |at runtime          |        |
 915 |             |``reserved_lvs``);       |                    |        |
 916 |             |usually should be zero   |                    |        |
 917 +-------------+-------------------------+--------------------+--------+
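
A ``disk_state`` value indexed per storage type and name might look
like the sketch below. The literal ``"LD_LV"`` key, the volume group
name, the sizes and the helper are illustrative assumptions; in
particular, treating allocatable space as the total minus reserved and
overhead space is one interpretation of the table, not a specified
formula.

```python
def allocatable_disk(entry):
    """Space assumed usable for instances in one storage entry."""
    free = entry["disk_total"] - entry["disk_reserved"] - entry["disk_overhead"]
    return max(0, free)


# Illustrative values only; "LD_LV" stands in for the storage type key.
disk_state = {
    "LD_LV": {
        "xenvg": {
            "disk_total": 102400,
            "disk_reserved": 1024,
            "disk_overhead": 0,
        },
    },
}
usable = allocatable_disk(disk_state["LD_LV"]["xenvg"])
```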
 918
 919
 920 Instance parameters
 921 ~~~~~~~~~~~~~~~~~~~
 922
 923 New instance parameters, needed especially for supporting the new memory
 924 model:
 925
 926 +--------------+----------------------------------+-----------------+------+
 927 |Name          |Description                       |Current status   |Type  |
 928 |              |                                  |                 |      |
 929 +==============+==================================+=================+======+
 930 |offline       |Whether the instance is in        |Not supported    |bool  |
 931 |              |“permanent” offline mode; this is |                 |      |
 932 |              |stronger than the “admin_down”    |                 |      |
 933 |              |state, and is similar to the node |                 |      |
 934 |              |offline attribute                 |                 |      |
 935 +--------------+----------------------------------+-----------------+------+
 936 |be/max_memory |The maximum memory the instance is|Not existent, but|int   |
 937 |              |allowed                           |virtually        |      |
 938 |              |                                  |identical to     |      |
 939 |              |                                  |memory           |      |
 940 +--------------+----------------------------------+-----------------+------+
 941
 942 HTools changes
 943 --------------
 944
 945 All the new parameters (node, instance, cluster, not so much disk) will
 946 need to be taken into account by HTools, both in balancing and in
 947 capacity computation.
 948
 949 Since Ganeti's cluster model is much enhanced, Ganeti can also
 950 export its own reserved/overhead variables, and as such HTools can make
 951 fewer “guesses” as to the difference in values.
 952
 953 .. admonition:: FIXME
 954
 955    Need to detail more the htools changes; the model is clear to me, but
 956    need to write it down.
 957
 958 .. vim: set textwidth=72 :
 959 .. Local Variables:
 960 .. mode: rst
 961 .. fill-column: 72
 962 .. End: