Revision d85f01e7

b/Makefile.am
@@ -306,6 +306,7 @@
 	doc/design-network.rst \
 	doc/design-chained-jobs.rst \
 	doc/design-ovf-support.rst \
+	doc/design-resource-model.rst \
 	doc/cluster-merge.rst \
 	doc/design-shared-storage.rst \
 	doc/design-node-state-cache.rst \
b/doc/design-draft.rst
@@ -12,6 +12,7 @@
    design-ovf-support.rst
    design-network.rst
    design-node-state-cache.rst
+   design-resource-model.rst
 
 .. vim: set textwidth=72 :
 .. Local Variables:
b/doc/design-resource-model.rst

========================
 Resource model changes
========================


Introduction
============

In order to manage virtual machines across the cluster, Ganeti needs to
understand the resources present on the nodes, the hardware and software
limitations of the nodes, and how much can be allocated safely on each
node. Some of these decisions are delegated to IAllocator plugins, for
easier site-level customisation.

Similarly, the HTools suite has an internal model that simulates the
hardware resource changes in response to Ganeti operations, in order to
provide both an iallocator plugin and the cluster balancing
functionality.

While currently the HTools model is much more advanced than Ganeti's,
neither one is flexible enough and both are heavily geared toward a
specific Xen model; they fail to work well with (e.g.) KVM or LXC, or
with Xen when :term:`tmem` is enabled. Furthermore, the set of metrics
contained in the models is limited to historic requirements and fails to
account for (e.g.) heterogeneity in the I/O performance of the nodes.

Current situation
=================

Ganeti
------

At this moment, Ganeti itself doesn't do any static modelling of the
cluster resources. It only does some runtime checks:

- when creating instances, for the (current) free disk space
- when starting instances, for the (current) free memory
- during cluster verify, for enough N+1 memory on the secondaries, based
  on the (current) free memory

Basically this model is a pure :term:`SoW` one, and it works well when
there are other instances/LVs on the nodes, as it allows Ganeti to deal
with ‘orphan’ resource usage, but on the other hand it has many issues,
described below.

HTools
------

Since HTools does a pure in-memory modelling of the cluster changes as
it executes the balancing or allocation steps, it had to introduce a
static (:term:`SoR`) cluster model.

The model is constructed based on the node properties received from
Ganeti (hence it is basically built on what Ganeti can export).

Disk
~~~~

For disk it consists of just the total (``tdsk``) and the free disk
space (``fdsk``); we don't directly track the used disk space. On top of
this, we compute and warn if the sum of disk sizes used by instances
does not match ``tdsk - fdsk``, but otherwise we do not track this
separately.

Memory
~~~~~~

For memory, the model is more complex and tracks some variables that
Ganeti itself doesn't compute. We start from the total (``tmem``), free
(``fmem``) and node memory (``nmem``) as supplied by Ganeti, and
additionally we track:

instance memory (``imem``)
    the total memory used by primary instances on the node, computed
    as the sum of instance memory

reserved memory (``rmem``)
    the memory reserved by peer nodes for N+1 redundancy; this memory is
    tracked per peer-node, and the maximum value out of the peer memory
    lists is the node's ``rmem``; when not using DRBD, this will be
    equal to zero

unaccounted memory (``xmem``)
    memory that cannot be accounted for via the Ganeti model; this is
    computed at startup as::

        tmem - imem - nmem - fmem

    and is presumed to remain constant irrespective of any instance
    moves

available memory (``amem``)
    this is simply ``fmem - rmem``, so unless we use DRBD, this will be
    equal to ``fmem``

``tmem``, ``nmem`` and ``xmem`` are presumed constant during the
instance moves, whereas the ``fmem``, ``imem``, ``rmem`` and ``amem``
values are updated according to the executed moves.
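
To make the bookkeeping concrete, the relations between these variables
can be sketched in Python as follows (an illustration only; the helper
itself is hypothetical and not part of the htools code base)::

    # Illustrative only: mirrors the htools per-node memory variables.
    def node_memory_model(tmem, fmem, nmem, imem_list, peer_rmem_list):
        imem = sum(imem_list)              # memory of primary instances
        rmem = max(peer_rmem_list or [0])  # largest peer N+1 reservation
        xmem = tmem - imem - nmem - fmem   # unaccounted memory
        amem = fmem - rmem                 # memory left for allocations
        return {"imem": imem, "rmem": rmem, "xmem": xmem, "amem": amem}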

CPU
~~~

The CPU model is different from the disk/memory models, since it's the
only one where:

#. we do oversubscribe physical CPUs
#. and there is no natural limit for the number of VCPUs we can allocate

We therefore track the total number of VCPUs used on the node and the
number of physical CPUs, and we cap the vcpu-to-cpu ratio in order to
make this somewhat more similar to the other resources, which are
limited.
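
For example, with the ratio cap in place, the check on a node reduces to
the following sketch (the ratio value itself is a policy choice)::

    # Sketch: with 16 physical CPUs and a ratio cap of 4.0, at most
    # 64 VCPUs can be allocated on the node.
    def can_allocate_vcpus(used_vcpus, new_vcpus, phys_cpus, vcpu_ratio):
        return used_vcpus + new_vcpus <= phys_cpus * vcpu_ratio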

Dynamic load
~~~~~~~~~~~~

There is also a model that deals with *dynamic load* values in
htools. As far as we know, it is not currently used with actual load
values, but it is active by default with unitary values for all
instances; it currently tracks these metrics:

- disk load
- memory load
- cpu load
- network load

Even though we do not assign real values to these load values, the fact
that we at least sum them means that the algorithm tries to equalise
these loads, and especially the network load, which is otherwise not
tracked at all. The practical result (due to a combination of these four
metrics) is that the number of secondaries will be balanced.

Limitations
-----------

There are unfortunately many limitations to the current model.

Memory
~~~~~~

The memory model doesn't work well in the case of KVM. For Xen, the
memory for the node (i.e. ``dom0``) can be static or dynamic; we don't
support the latter case, but for the former case, the static value is
configured on the Xen/kernel command line, and can be queried from Xen
itself. Therefore, Ganeti can query the hypervisor for the memory used
for the node; the same model was adopted for the chroot/KVM/LXC
hypervisors, but in these cases there's no natural value for the memory
used by the base OS/kernel, and we currently try to compute a value for
the node memory based on current consumption. This, being variable,
breaks the assumptions in both Ganeti and HTools.

This problem also shows for the free memory: if the free memory on the
node is not constant (Xen with :term:`tmem` auto-ballooning enabled), or
if the node and instance memory are pooled together (Linux-based
hypervisors like KVM and LXC), the current value of the free memory is
meaningless and cannot be used for instance checks.

A separate issue related to the free memory tracking is that since we
don't track memory use but rather memory availability, an instance that
is temporarily down changes Ganeti's understanding of the memory status
of the node. This can lead to problems such as:

.. digraph:: "free-mem-issue"

  node  [shape=box];
  inst1 [label="instance1"];
  inst2 [label="instance2"];

  node  [shape=note];
  nodeA [label="fmem=0"];
  nodeB [label="fmem=1"];
  nodeC [label="fmem=0"];

  node  [shape=ellipse, style=filled, fillcolor=green]

  {rank=same; inst1 inst2}

  stop    [label="crash!", fillcolor=orange];
  migrate [label="migrate/ok"];
  start   [style=filled, fillcolor=red, label="start/fail"];
  inst1   -> stop -> start;
  stop    -> migrate -> start [style=invis, weight=0];
  inst2   -> migrate;

  {rank=same; inst1 inst2 nodeA}
  {rank=same; stop nodeB}
  {rank=same; migrate nodeC}

  nodeA -> nodeB -> nodeC [style=invis, weight=1];

The behaviour here is wrong; the migration of *instance2* to the node in
question will succeed or fail depending on whether *instance1* is
running or not. And for *instance1*, it can lead to cases where, if it
crashes, it cannot be restarted anymore.

Finally, not a problem but rather an important missing feature is
support for memory over-subscription: both Xen and KVM have supported
memory ballooning (even automatic memory ballooning) for a while now.
The entire memory model is based on a fixed memory size for instances,
and if memory ballooning is enabled, it will “break” the HTools
algorithm. Even the fact that KVM instances do not use all their memory
from the start creates problems (although not as severe, since the
usage will grow and stabilise in the end).

Disks
~~~~~

Because we currently only track disk space, if we have a cluster of
``N`` otherwise identical nodes but half of them have 10 drives of size
``X`` and the other half 2 drives of size ``5X``, HTools will consider
them exactly the same. However, in the case of mechanical drives at
least, the I/O performance will differ significantly based on spindle
count, and a “fair” load distribution should take this into account (a
similar comment can be made about processor/memory/network speed).

Another problem related to the spindle count is the LVM allocation
algorithm. Currently, the algorithm always creates (or tries to create)
striped volumes, with the stripe count being hard-coded to the
``./configure`` parameter ``--with-lvm-stripecount``. This creates
problems like:

- when installing from a distribution package, all clusters will be
  either limited or overloaded due to this fixed value
- it is not possible to mix heterogeneous nodes (even in different node
  groups) and have optimal settings for all nodes
- the striping value applies both to LVM/DRBD data volumes (which are on
  the order of gigabytes to hundreds of gigabytes) and to DRBD metadata
  volumes (whose size is always fixed at 128MB); when striping such
  small volumes over many PVs, their size will increase needlessly (and
  this can confuse HTools' disk computation algorithm)

Moreover, the allocation currently allocates based on a ‘most free
space’ algorithm. This balances the free space usage on disks, but on
the other hand it tends to mix rather badly the data and metadata
volumes of different instances. For example, it cannot do the following:

- keep DRBD data and metadata volumes on the same drives, in order to
  reduce exposure to drive failure in a many-drives system
- keep DRBD data and metadata volumes on different drives, to reduce
  the performance impact of metadata writes

Additionally, while Ganeti supports setting the volume group separately
for data and metadata volumes at instance creation, there are no
defaults for this setting.

Similar to the above stripe count problem (which is about insufficient
customisation of Ganeti's behaviour), we have limited pass-through
customisation of the various options of our storage backends; while LVM
has a system-wide configuration file that can be used to tweak some of
its behaviours, for DRBD we don't use the :command:`drbdadmin` tool, and
instead we call :command:`drbdsetup` directly, with a fixed/restricted
set of options; so for example one cannot tweak the buffer sizes.

Another current problem is that the support for shared storage in HTools
is still limited, but this problem is outside the scope of this design
document.

Locking
~~~~~~~

A further problem generated by the “current free” model is that during a
long operation which affects resource usage (e.g. disk replaces,
instance creations) we have to keep the respective objects locked
(sometimes even in exclusive mode), since we don't want any concurrent
modifications to the *free* values.

A classic example of the locking problem is the following:

.. digraph:: "iallocator-lock-issues"

  rankdir=TB;

  start [style=invis];
  node  [shape=box,width=2];
  job1  [label="add instance\niallocator run\nchoose A,B"];
  job1e [label="finish add"];
  job2  [label="add instance\niallocator run\nwait locks"];
  job2s [label="acquire locks\nchoose C,D"];
  job2e [label="finish add"];

  job1  -> job1e;
  job2  -> job2s -> job2e;
  edge [style=invis,weight=0];
  start -> {job1; job2}
  job1  -> job2;
  job2  -> job1e;
  job1e -> job2s [style=dotted,label="release locks"];

In the above example, the second IAllocator run will wait for locks on
nodes ``A`` and ``B``, even though in the end the second instance will
be placed on another set of nodes (``C`` and ``D``). This wait shouldn't
be needed, since right after the first IAllocator run has finished,
:command:`hail` knows the status of the cluster after the allocation,
and it could answer the question for the second run too; however, Ganeti
doesn't have such visibility into the cluster state and thus it is
forced to wait with the second job.

Similar examples can be made about replace disks (another long-running
opcode).

.. _label-policies:

Policies
~~~~~~~~

For most of the resources, we have metrics defined by policy: e.g. the
over-subscription ratio for CPUs, the amount of space to reserve,
etc. Furthermore, although there are no definitions in Ganeti of such
things as minimum/maximum instance size, a real deployment will need to
have them, especially in a fully-automated workflow where end-users can
request instances via an automated interface (that talks to the cluster
via RAPI, LUXI or command line). However, such an automated interface
will need to also take into account cluster capacity, and if the
:command:`hspace` tool is used for the capacity computation, it needs to
be told the maximum instance size; it also has a built-in minimum
instance size which is not customisable.

It is clear that this situation leads to duplicate definitions of
resource policies, which makes it hard to easily change the respective
policies per-cluster (or globally), and furthermore it creates
inconsistencies if such policies are not enforced at the source (i.e. in
Ganeti).

Balancing algorithm
~~~~~~~~~~~~~~~~~~~

The balancing algorithm, as documented in the HTools ``README`` file,
tries to minimise the cluster score; this score is based on a set of
metrics that describe both exceptional conditions and how spread the
instances are across the nodes. In order to achieve this goal, it moves
the instances around, with a series of moves of various types:

- disk replaces (for DRBD-based instances)
- instance failover/migrations (for all types)

However, the algorithm only looks at the cluster score, and not at the
*“cost”* of the moves. In other words, the following can and will happen
on a cluster:

.. digraph:: "balancing-cost-issues"

  rankdir=LR;
  ranksep=1;

  start     [label="score α", shape=hexagon];

  node      [shape=box, width=2];
  replace1  [label="replace_disks 500G\nscore α-3ε\ncost 3"];
  replace2a [label="replace_disks 20G\nscore α-2ε\ncost 2"];
  migrate1  [label="migrate\nscore α-ε\ncost 1"];

  choose    [shape=ellipse,label="choose min(score)=α-3ε\ncost 3"];

  start -> {replace1; replace2a; migrate1} -> choose;

Even though a migration is much, much cheaper than a disk replace (in
terms of network and disk traffic on the cluster), if the disk replace
results in a score that is infinitesimally smaller, then it will be
chosen. Similarly, between two disk replaces, one moving e.g. ``500GiB``
and one moving ``20GiB``, the first one will be chosen if it results in
a score smaller than the second one. Furthermore, even if the resulting
scores are equal, the first computed solution will be kept, whichever it
is.

Fixing this algorithmic problem is doable, but currently Ganeti doesn't
export enough information about nodes to make an informed decision; in
the above example, if the ``500GiB`` move is between nodes having fast
I/O (both disks and network), it makes sense to execute it over a disk
replace of ``100GiB`` between nodes with slow I/O, so simply looking at
the properties of the move itself is not enough; we need more node
information for cost computation.
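
A minimal sketch of the issue (with made-up numbers): the choice is
based purely on the resulting score, so the cost column can never
influence the result::

    # Candidate moves as (name, resulting score, cost); data is made up.
    candidates = [
        ("replace_disks 500G", 0.4997, 3),
        ("replace_disks 20G",  0.4998, 2),
        ("migrate",            0.4999, 1),
    ]
    # Current behaviour: pick the minimal score and ignore the cost.
    best = min(candidates, key=lambda move: move[1])
    assert best[0] == "replace_disks 500G"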

Allocation algorithm
~~~~~~~~~~~~~~~~~~~~

.. note:: This design document will not address this limitation, but it
  is worth mentioning as it directly relates to the resource model.

The current allocation/capacity algorithm works as follows (per
node-group)::

    repeat:
        allocate instance without failing N+1

This simple algorithm, and its use of the ``N+1`` criterion, has a
built-in limit of one machine failure in the case of DRBD. This means
the algorithm guarantees that, if using DRBD storage, there are enough
resources to (re)start all affected instances in case of one machine
failure. This relates mostly to memory; there is no accounting for CPU
over-subscription (i.e. in case of failure, make sure we can failover
while still not going over CPU limits), or for any other resource.

In case of shared storage, there's not even the memory guarantee, as the
N+1 protection doesn't work for shared storage.

If a given cluster administrator wants to survive up to two machine
failures, or wants to enforce CPU limits too for DRBD, there is no way
to configure this in HTools (neither in :command:`hail` nor in
:command:`hspace`). Current workarounds employ, for example, deducting a
certain number of instances from the size computed by :command:`hspace`,
but this is a very crude method, and requires that instance creations
are limited before they reach Ganeti (otherwise :command:`hail` would
allocate until the cluster is full).
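
In pseudo-Python, the capacity computation amounts to the following
greedy loop (a sketch only; the helper functions are hypothetical, and
the single-failure assumption is built into the N+1 check)::

    def cluster_capacity(cluster, std_instance, allocate, passes_n_plus_1):
        """Count how many standard instances still fit on the cluster."""
        count = 0
        while True:
            candidate = allocate(cluster, std_instance)
            if candidate is None or not passes_n_plus_1(candidate):
                return count
            cluster = candidate
            count += 1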

Proposed architecture
=====================

There are two main changes proposed:

- changing the resource model from a pure :term:`SoW` to a hybrid
  :term:`SoR`/:term:`SoW` one, where the :term:`SoR` component is
  heavily emphasised
- extending the resource model to cover additional properties,
  completing the “holes” in the current coverage

The second change is rather straightforward, but will add more
complexity in the modelling of the cluster. The first change, however,
represents a significant shift from the current model, which Ganeti has
had from its beginnings.

Lock-improved resource model
----------------------------

Hybrid SoR/SoW model
~~~~~~~~~~~~~~~~~~~~

The resources of a node can be characterised in two broad classes:

- mostly static resources
- dynamically changing resources

In the first category, we have things such as total core count, total
memory size, total disk size, number of network interfaces etc. In the
second category we have things such as free disk space, free memory, CPU
load, etc. Note that nowadays we no longer have fully-static resources:
features like CPU and memory hot-plug, online disk replace, etc. mean
that theoretically all resources can change (there are some practical
limitations, of course).

Even though the rate of change of the two resource types is wildly
different, right now Ganeti handles both the same. Given that the
interval of change of the semi-static ones is much bigger than most
Ganeti operations, even more than lengthy sequences of Ganeti jobs, it
makes sense to treat them separately.

The proposal is then to move the following resources into the
configuration and treat the configuration as the authoritative source
for them (a :term:`SoR` model):

- CPU resources:
    - total core count
    - node core usage (*new*)
- memory resources:
    - total memory size
    - node memory size
    - hypervisor overhead (*new*)
- disk resources:
    - total disk size
    - disk overhead (*new*)

Since these resources can nevertheless change at run-time, we will need
functionality to update the recorded values.

Pre-computing dynamic resource values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Remember that the resource model used by HTools models the clusters as
obeying the following equations:

  disk\ :sub:`free` = disk\ :sub:`total` - ∑ disk\ :sub:`instances`

  mem\ :sub:`free` = mem\ :sub:`total` - ∑ mem\ :sub:`instances` - mem\
  :sub:`node` - mem\ :sub:`overhead`

As this model worked fine for HTools, we can consider it valid and adopt
it in Ganeti. Furthermore, note that all values on the right-hand side
now come from the configuration:

- the per-instance usage values were already stored in the configuration
- the other values will be moved to the configuration per the previous
  section

This means that we can now compute the free values without having to
actually live-query the nodes, which brings a significant advantage.
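
A sketch of this computation, using only values that would now live in
the configuration (the field names are illustrative, not the actual
configuration schema)::

    # All inputs come from the configuration (SoR), not from live queries.
    def free_resources(node_cfg, instance_cfgs):
        disk_instances = sum(i["disk_size"] for i in instance_cfgs)
        mem_instances = sum(i["memory"] for i in instance_cfgs)
        disk_free = node_cfg["disk_total"] - disk_instances
        mem_free = (node_cfg["mem_total"] - mem_instances
                    - node_cfg["mem_node"] - node_cfg["mem_overhead"])
        return disk_free, mem_free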

There are a couple of caveats to this model though. First, as the
run-time state of the instance is no longer taken into consideration, it
means that we have to introduce a new *offline* state for an instance
(similar to the node one). In this state, the instance's runtime
resources (memory and VCPUs) are no longer reserved for it, and can be
reused by other instances. Static resources like disk and MAC addresses
are still reserved though. Transitioning into and out of this offline
state will be more involved than simply stopping/starting the instance
(e.g. de-offlining can fail due to missing resources). This complexity
is compensated by the increased consistency of what guarantees we have
in the stopped state (we always guarantee resource reservation), and the
potential for management tools to restrict which users can transition
into/out of this state separately from which users can stop/start the
instance.

Separating per-node resource locks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many of the current node locks in Ganeti exist in order to guarantee
correct resource state computation, whereas others are designed to
guarantee reasonable run-time performance of nodes (e.g. by not
overloading the I/O subsystem). This is an unfortunate coupling, since
it means for example that the following two operations conflict in
practice even though they are orthogonal:

- replacing an instance's disk on a node
- computing node disk/memory free for an IAllocator run

This conflict increases significantly the lock contention on a big/busy
cluster and is at odds with the goal of increasing the cluster size.

The proposal is therefore to add a new level of locking that is only
used to prevent concurrent modification to the resource states (either
node properties or instance properties) and not for long-term
operations:

- instance creation needs to acquire and keep this lock until adding the
  instance to the configuration
- instance modification needs to acquire and keep this lock until
  updating the instance
- node property changes will need to acquire this lock for the
  modification

The new lock level will sit before the instance level (right after BGL)
and could either be single-valued (like the “Big Ganeti Lock”), in which
case we won't be able to modify two nodes at the same time, or per-node,
in which case the list of locks at this level needs to be synchronised
with the node lock level. To be determined.

Lock contention reduction
~~~~~~~~~~~~~~~~~~~~~~~~~

Based on the above, the locking contention will be reduced as follows:
IAllocator calls will no longer need the ``LEVEL_NODE: ALL_SET`` lock,
only the resource lock (in exclusive mode). Hence allocating/computing
evacuation targets will no longer conflict for longer than the time to
compute the allocation solution.

The remaining long-running locks will be the DRBD replace-disks ones
(exclusive mode). These can also be removed, or changed into shared
locks, but that is a separate design change.

.. admonition:: FIXME

  Need to rework instance console vs. instance replace disks. I don't
  think we need exclusive locks for console nor for replace disks: it is
  safe to stop/start the instance while it's doing a replace disks. Only
  modify would need an exclusive lock, and only for transitioning
  into/out of the offline state.

Instance memory model
---------------------

In order to support ballooning, the instance memory model needs to be
changed from a “memory size” one to a “min/max memory size” one. This
interacts with the new static resource model, however, and thus we need
to declare a priori the expected oversubscription ratio on the cluster.

The new minimum memory size parameter will be similar to the current
memory size; the cluster will guarantee that in all circumstances, all
instances will have available their minimum memory size. The maximum
memory size will permit burst usage of more memory by instances, with
the restriction that the sum of maximum memory usage will not be more
than the free memory times the oversubscription factor:

    ∑ memory\ :sub:`min` ≤ memory\ :sub:`available`

    ∑ memory\ :sub:`max` ≤ memory\ :sub:`free` * oversubscription_ratio

The hypervisor will have the possibility of adjusting the instance's
memory size dynamically between these two boundaries.

Note that the minimum memory is related to the available memory on the
node, whereas the maximum memory is related to the free memory. On
DRBD-enabled clusters, this will have the advantage of using the
reserved memory for N+1 failover for burst usage, instead of having it
completely idle.
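
A sketch of the two admission checks on a node (illustrative only; the
oversubscription ratio is the cluster-level policy value mentioned
above)::

    def memory_constraints_ok(instances, mem_available, mem_free, ratio):
        # Guaranteed part: the minimum sizes must fit in available memory.
        total_min = sum(i["mem_min"] for i in instances)
        # Burst part: the maximum sizes may oversubscribe free memory.
        total_max = sum(i["mem_max"] for i in instances)
        return (total_min <= mem_available and
                total_max <= mem_free * ratio)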

.. admonition:: FIXME

  Need to document how Ganeti forces the minimum size at runtime,
  overriding the hypervisor, in cases of failover/lack of resources.

New parameters
--------------

Unfortunately the design will add a significant number of new
parameters, and change the meaning of some of the current ones.

Instance size limits
~~~~~~~~~~~~~~~~~~~~

As described in :ref:`label-policies`, we currently lack a clear
definition of the supported instance sizes (minimum, maximum and
standard). As such, we will add the following structure to the cluster
parameters:

- ``min_ispec``, ``max_ispec``: minimum and maximum acceptable instance
  specs
- ``std_ispec``: standard instance size, which will be used for capacity
  computations and for default parameters on the instance creation
  request

Ganeti will by default reject non-standard instance sizes (lower than
``min_ispec`` or greater than ``max_ispec``), but as usual a ``--force``
option on the command line or in the RAPI request will override these
constraints. The ``std_ispec`` structure will be used to fill in missing
instance specifications on create.

Each of the ispec structures will be a dictionary, since the contents
can change over time. Initially, we will define the following variables
in these structures:

+---------------+----------------------------------+--------------+
|Name           |Description                       |Type          |
+===============+==================================+==============+
|mem_min        |Minimum memory size allowed       |int           |
+---------------+----------------------------------+--------------+
|mem_max        |Maximum allowed memory size       |int           |
+---------------+----------------------------------+--------------+
|cpu_count      |Allowed vCPU count                |int           |
+---------------+----------------------------------+--------------+
|disk_count     |Allowed disk count                |int           |
+---------------+----------------------------------+--------------+
|disk_size      |Allowed disk size                 |int           |
+---------------+----------------------------------+--------------+
|nic_count      |Allowed NIC count                 |int           |
+---------------+----------------------------------+--------------+
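
As an illustration, a hypothetical set of cluster-level ispec values and
the corresponding admission check could look like this (sizes in MiB,
values made up)::

    min_ispec = {"mem_min": 128, "mem_max": 128, "cpu_count": 1,
                 "disk_count": 1, "disk_size": 1024, "nic_count": 1}
    max_ispec = {"mem_min": 32768, "mem_max": 65536, "cpu_count": 16,
                 "disk_count": 8, "disk_size": 1024 * 1024, "nic_count": 8}

    def within_limits(ispec):
        # Sizes outside [min_ispec, max_ispec] are rejected unless forced.
        return all(min_ispec[k] <= ispec[k] <= max_ispec[k]
                   for k in min_ispec)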

Inheritance
+++++++++++

In a single-group cluster, the above structure is sufficient. However,
in a multi-group cluster, it could be that the hardware specifications
differ across node groups, and thus the following problem appears: how
can Ganeti present unified specifications over RAPI?

Since the set of instance specs is only partially ordered (as opposed to
the sets of values of the individual variables in the spec, which are
totally ordered), it follows that we can't present unified specs. As
such, the proposed approach is to allow the ``min_ispec`` and
``max_ispec`` to be customised per node-group (and export them as a list
of specifications), and a single ``std_ispec`` at cluster level
(exported as a single value).

Allocation parameters
~~~~~~~~~~~~~~~~~~~~~

Besides the min/max instance size limits, there are other parameters
related to capacity and allocation limits. These mostly relate to the
problem of over-allocation.

+-----------------+----------+---------------------------+----------+------+
| Name            |Level(s)  |Description                |Current   |Type  |
|                 |          |                           |value     |      |
+=================+==========+===========================+==========+======+
|vcpu_ratio       |cluster,  |Maximum ratio of virtual to|64 (only  |float |
|                 |node group|physical CPUs              |in htools)|      |
+-----------------+----------+---------------------------+----------+------+
|spindle_ratio    |cluster,  |Maximum ratio of instances |none      |float |
|                 |node group|to spindles; when the I/O  |          |      |
|                 |          |model doesn't map directly |          |      |
|                 |          |to spindles, another       |          |      |
|                 |          |measure of I/O should be   |          |      |
|                 |          |used instead               |          |      |
+-----------------+----------+---------------------------+----------+------+
|max_node_failures|cluster,  |Cap allocation/capacity so |1         |int   |
|                 |node group|that the cluster can       |(hardcoded|      |
|                 |          |survive this many node     |in htools)|      |
|                 |          |failures                   |          |      |
+-----------------+----------+---------------------------+----------+------+

Since these are used mostly internally (in htools), they will be
exported as-is from Ganeti, without explicit handling of node-group
grouping.

Regarding ``spindle_ratio``, in this context spindles do not necessarily
have to mean actual mechanical hard drives; it's rather a measure of
I/O performance for internal storage.

Disk parameters
~~~~~~~~~~~~~~~

The proposed model for the new disk parameters is a simple free-form one
based on dictionaries, indexed per disk level (template or logical disk)
and type (which depends on the level). At the JSON level, since the
object key has to be a string, we can encode the keys via a separator
(e.g. a slash), or by having two dict levels.
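
For example, the same hypothetical DRBD parameters could be encoded in
JSON either with separator-based keys or with two dictionary levels
(values made up)::

    # Variant 1: separator-encoded keys
    {"dt/drbd": {"metavg": "xenvg", "metastripes": 1}}

    # Variant 2: two dictionary levels
    {"dt": {"drbd": {"metavg": "xenvg", "metastripes": 1}}}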

+--------+-------------+-------------------------+---------------------+------+
|Disk    |Name         |Description              |Current status       |Type  |
|template|             |                         |                     |      |
+========+=============+=========================+=====================+======+
|dt/plain|stripes      |How many stripes to use  |Configured at        |int   |
|        |             |for newly created (plain)|./configure time, not|      |
|        |             |logical volumes          |overridable at       |      |
|        |             |                         |runtime              |      |
+--------+-------------+-------------------------+---------------------+------+
|dt/drbd |stripes      |How many stripes to use  |Same as for lvm      |int   |
|        |             |for data volumes         |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|dt/drbd |metavg       |Default volume group for |Same as the main     |string|
|        |             |the metadata LVs         |volume group,        |      |
|        |             |                         |overridable via      |      |
|        |             |                         |'metavg' key         |      |
+--------+-------------+-------------------------+---------------------+------+
|dt/drbd |metastripes  |How many stripes to use  |Same as for lvm      |int   |
|        |             |for meta volumes         |'stripes', suboptimal|      |
|        |             |                         |as the meta LVs are  |      |
|        |             |                         |small                |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|disk_barriers|What kind of barriers to |Either all enabled or|string|
|        |             |*disable* for disks;     |all disabled, per    |      |
|        |             |either "n" or a string   |./configure time     |      |
|        |             |containing a subset of   |option               |      |
|        |             |"bfd"                    |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|meta_barriers|Whether barriers are     |Handled together with|bool  |
|        |             |enabled or not for the   |disk_barriers        |      |
|        |             |meta volume              |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|resync_rate  |The (static) resync rate |Hardcoded in         |int   |
|        |             |for drbd, when using the |constants.py, not    |      |
|        |             |static syncer, in MiB/s  |changeable via Ganeti|      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|disk_custom  |Free-form string that    |Not supported        |string|
|        |             |will be appended to the  |                     |      |
|        |             |drbdsetup disk command   |                     |      |
|        |             |line, for custom options |                     |      |
|        |             |not supported by Ganeti  |                     |      |
|        |             |itself                   |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|net_custom   |Free-form string for     |Not supported        |string|
|        |             |custom net setup options |                     |      |
+--------+-------------+-------------------------+---------------------+------+

Note that the DRBD8 parameters will change once we support DRBD 8.4,
which has changed its syntax significantly; new syncer modes will be
added for that release.

All the above parameters are at cluster and node group level; as in
other parts of the code, the intention is that all nodes in a node group
should be equal.

Node parameters
~~~~~~~~~~~~~~~

For the new memory model, we'll add the following parameters, in a
dictionary indexed by the hypervisor name (node attribute
``hv_state``). The rationale is that, even though multi-hypervisor
clusters are rare, they make sense sometimes, and thus we need to
support multiple node states (one per hypervisor).

Since usually only one of the multiple hypervisors is the 'main' one
(and the others are used sparingly), capacity computation will still
only use the first hypervisor, and not all of them. Thus we avoid
possible inconsistencies.

+----------+-----------------------------------+---------------+-------+
|Name      |Description                        |Current state  |Type   |
+==========+===================================+===============+=======+
|mem_total |Total node memory, as discovered by|Queried at     |int    |
|          |this hypervisor                    |runtime        |       |
+----------+-----------------------------------+---------------+-------+
|mem_node  |Memory used by, or reserved for,   |Queried at     |int    |
|          |the node itself; note that some    |runtime        |       |
|          |hypervisors can report this in an  |               |       |
|          |authoritative way, others not      |               |       |
+----------+-----------------------------------+---------------+-------+
|mem_hv    |Memory used either by the          |Not used,      |int    |
|          |hypervisor itself or lost due to   |htools computes|       |
|          |instance allocation rounding;      |it internally  |       |
|          |usually this cannot be precisely   |               |       |
|          |computed, but only roughly         |               |       |
|          |estimated                          |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_total |Total node cpu (core) count;       |Queried at     |int    |
|          |usually this can be discovered     |runtime        |       |
|          |automatically                      |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_node  |Number of cores reserved for the   |Not used at all|int    |
|          |node itself; this can either be    |               |       |
|          |discovered or set manually. Only   |               |       |
|          |used for estimating how many VCPUs |               |       |
|          |are left for instances             |               |       |
+----------+-----------------------------------+---------------+-------+

Of the above parameters, only the ``_total`` ones are straightforward.
The others sometimes have strange semantics:

- Xen can report ``mem_node``, if configured statically (as we
  recommend); but Linux-based hypervisors (KVM, chroot, LXC) do not, so
  this value needs to be configured statically for them
- ``mem_hv``, representing unaccounted-for memory, is not directly
  computable; on Xen, it can be seen that on an N GB machine, with 1 GB
  for dom0 and N-2 GB for instances, there are just a few MB left,
  instead of a full 1 GB of RAM; however, the exact value varies with
  the total memory size (at least)
- ``cpu_node`` only makes sense on Xen (currently), in the case when we
  restrict dom0; for Linux-based hypervisors, the node itself cannot be
  easily restricted, so it should be set as an estimate of how "heavy"
  the node loads will be

Since these two values cannot be auto-computed from the node, we need to
be able to declare a default at cluster level (it is debatable how
useful they are at node group level); the proposal is to do this via a
cluster-level ``hv_state`` dict (per hypervisor).
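
A possible shape for the per-node ``hv_state`` attribute (the values are
made up; only the structure matters)::

    hv_state = {
        "xen-pvm": {"mem_total": 16384, "mem_node": 1024, "mem_hv": 128,
                    "cpu_total": 8, "cpu_node": 1},
        "kvm":     {"mem_total": 16384, "mem_node": 2048, "mem_hv": 0,
                    "cpu_total": 8, "cpu_node": 0},
    }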

Besides the per-hypervisor attributes, we also have disk attributes,
which are queried directly on the node (without hypervisor
involvement). They are stored in a separate attribute (``disk_state``),
which is indexed per storage type and name; currently this will be just
``LD_LV`` and the volume name as key.

+-------------+-------------------------+--------------------+--------+
|Name         |Description              |Current state       |Type    |
+=============+=========================+====================+========+
|disk_total   |Total disk size          |Queried at runtime  |int     |
+-------------+-------------------------+--------------------+--------+
|disk_reserved|Reserved disk size; this |None used in Ganeti;|int     |
|             |is a lower limit on the  |htools has a        |        |
|             |free space, if such a    |parameter for this  |        |
|             |limit is desired         |                    |        |
+-------------+-------------------------+--------------------+--------+
|disk_overhead|Disk that is expected to |None used in Ganeti;|int     |
|             |be used by other volumes |htools detects this |        |
|             |(set via                 |at runtime          |        |
|             |``reserved_lvs``);       |                    |        |
|             |usually should be zero   |                    |        |
+-------------+-------------------------+--------------------+--------+
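
Similarly, a possible shape for ``disk_state``, indexed by storage type
and name (values made up)::

    disk_state = {
        "LD_LV": {"xenvg": {"disk_total": 2048000, "disk_reserved": 10240,
                            "disk_overhead": 0}},
    }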

Instance parameters
~~~~~~~~~~~~~~~~~~~

New instance parameters, needed especially for supporting the new memory
model:

+--------------+----------------------------------+-----------------+------+
|Name          |Description                       |Current status   |Type  |
+==============+==================================+=================+======+
|offline       |Whether the instance is in        |Not supported    |bool  |
|              |“permanent” offline mode; this is |                 |      |
|              |stronger than the “admin_down”    |                 |      |
|              |state, and is similar to the node |                 |      |
|              |offline attribute                 |                 |      |
+--------------+----------------------------------+-----------------+------+
|be/max_memory |The maximum memory the instance is|Not existent, but|int   |
|              |allowed                           |virtually        |      |
|              |                                  |identical to     |      |
|              |                                  |memory           |      |
+--------------+----------------------------------+-----------------+------+

HTools changes
--------------

All the new parameters (node, instance, cluster, not so much disk) will
need to be taken into account by HTools, both in balancing and in
capacity computation.

Since Ganeti's cluster model is much enhanced, Ganeti can also export
its own reserved/overhead variables, and as such HTools can make fewer
“guesses” as to the difference in values.

.. admonition:: FIXME

   Need to detail more the htools changes; the model is clear to me, but
   need to write it down.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End:
