========================
Resource model changes
========================


Introduction
============

In order to manage virtual machines across the cluster, Ganeti needs to
understand the resources present on the nodes, the hardware and software
limitations of the nodes, and how much can be allocated safely on each
node. Some of these decisions are delegated to IAllocator plugins, for
easier site-level customisation.

Similarly, the HTools suite has an internal model that simulates the
hardware resource changes in response to Ganeti operations, in order to
provide both an iallocator plugin and a cluster balancer.

While currently the HTools model is much more advanced than Ganeti's,
neither one is flexible enough and both are heavily geared toward a
specific Xen model; they fail to work well with (e.g.) KVM or LXC, or
with Xen when :term:`tmem` is enabled. Furthermore, the set of metrics
contained in the models is limited to historic requirements and fails to
account for (e.g.) heterogeneity in the I/O performance of the nodes.

Current situation
=================

Ganeti
------

At this moment, Ganeti itself doesn't do any static modelling of the
cluster resources. It only does some runtime checks:

- when creating instances, for the (current) free disk space
- when starting instances, for the (current) free memory
- during cluster verify, for enough N+1 memory on the secondaries, based
  on the (current) free memory

Basically this model is a pure :term:`SoW` one, and it works well when
there are other instances/LVs on the nodes, as it allows Ganeti to deal
with ‘orphan’ resource usage, but on the other hand it has many issues,
described below.

HTools
------

Since HTools does a pure in-memory modelling of the cluster changes as
it executes the balancing or allocation steps, it had to introduce a
static (:term:`SoR`) cluster model.

The model is constructed from the node properties received from Ganeti
(hence it is basically limited to what Ganeti can export).

Disk
~~~~

For disk it consists of just the total (``tdsk``) and the free disk
space (``fdsk``); we don't directly track the used disk space. On top of
this, we compute and warn if the sum of disk sizes used by instances
does not match ``tdsk - fdsk``, but otherwise we do not track this
separately.

Memory
~~~~~~

For memory, the model is more complex and tracks some variables that
Ganeti itself doesn't compute. We start from the total (``tmem``), free
(``fmem``) and node memory (``nmem``) as supplied by Ganeti, and
additionally we track:

instance memory (``imem``)
  the total memory used by primary instances on the node, computed
  as the sum of instance memory

reserved memory (``rmem``)
  the memory reserved by peer nodes for N+1 redundancy; this memory is
  tracked per peer-node, and the maximum value out of the peer memory
  lists is the node's ``rmem``; when not using DRBD, this will be
  equal to zero

unaccounted memory (``xmem``)
  memory that cannot be accounted for via the Ganeti model; this is
  computed at startup as::

    tmem - imem - nmem - fmem

  and is presumed to remain constant irrespective of any instance
  moves

available memory (``amem``)
  this is simply ``fmem - rmem``, so unless we use DRBD, this will be
  equal to ``fmem``

``tmem``, ``nmem`` and ``xmem`` are presumed constant during the
instance moves, whereas the ``fmem``, ``imem``, ``rmem`` and ``amem``
values are updated according to the executed moves.
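
For illustration, a minimal Python sketch of these derivations (the
function and argument names are hypothetical; the actual htools
implementation is written in Haskell)::

  def derive_memory_model(tmem, fmem, nmem, primary_inst_mem, peer_mem):
      # imem: sum of the memory of this node's primary instances
      imem = sum(primary_inst_mem)
      # rmem: worst case over the per-peer-node reserved memory list
      rmem = max(peer_mem) if peer_mem else 0
      # xmem: memory unaccounted for, presumed constant across moves
      xmem = tmem - imem - nmem - fmem
      # amem: what is actually available for new allocations
      amem = fmem - rmem
      return imem, rmem, xmem, amem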

CPU
~~~

The CPU model is different from the disk/memory models, since it's the
only one where:

#. we do oversubscribe physical CPUs
#. and there is no natural limit for the number of VCPUs we can allocate

We therefore track the total number of VCPUs used on the node and the
number of physical CPUs, and we cap the vcpu-to-cpu ratio in order to
make this somewhat more similar to the other resources which are
limited.
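
The resulting admission check is simple; here is a hedged sketch (the
ratio corresponds to the ``vcpu_ratio`` policy parameter discussed
later, and all names are illustrative)::

  def can_add_vcpus(used_vcpus, new_vcpus, physical_cpus, vcpu_ratio):
      # Oversubscription is allowed, but only up to the configured
      # vcpu-to-cpu ratio, which acts as an artificial upper limit.
      return used_vcpus + new_vcpus <= physical_cpus * vcpu_ratio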

Dynamic load
~~~~~~~~~~~~

There is also a model that deals with *dynamic load* values in
htools. As far as we know, it is not currently used with actual load
values, but it is active by default with unitary values for all
instances; it currently tracks these metrics:

- disk load
- memory load
- cpu load
- network load

Even though we do not assign real values to these loads, the fact that
we at least sum them means that the algorithm tries to equalise these
loads, and especially the network load, which is otherwise not tracked
at all. The practical result (due to a combination of these four
metrics) is that the number of secondaries will be balanced.
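
To see why summing even unitary values has a balancing effect, consider
a hedged sketch of a spread metric over the per-node load sums (the
real htools cluster score combines several such statistics)::

  import statistics

  def load_spread(per_node_loads):
      # per_node_loads: one summed load value per node. Minimising the
      # standard deviation pushes the sums (and, with unitary values,
      # the instance and secondary counts) towards equality.
      return statistics.pstdev(per_node_loads)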

Limitations
-----------

There are unfortunately many limitations to the current model.

Memory
~~~~~~

The memory model doesn't work well in the case of KVM. For Xen, the
memory for the node (i.e. ``dom0``) can be static or dynamic; we don't
support the latter case, but for the former case, the static value is
configured on the Xen/kernel command line and can be queried from Xen
itself. Therefore, Ganeti can query the hypervisor for the memory used
for the node; the same model was adopted for the chroot/KVM/LXC
hypervisors, but in these cases there's no natural value for the memory
used by the base OS/kernel, and we currently try to compute a value for
the node memory based on current consumption. This, being variable,
breaks the assumptions in both Ganeti and HTools.

This problem also shows for the free memory: if the free memory on the
node is not constant (Xen with :term:`tmem` auto-ballooning enabled), or
if the node and instance memory are pooled together (Linux-based
hypervisors like KVM and LXC), the current value of the free memory is
meaningless and cannot be used for instance checks.

A separate issue related to the free memory tracking is that since we
don't track memory use but rather memory availability, an instance that
is temporarily down changes Ganeti's understanding of the memory status
of the node. This can lead to problems such as:

.. digraph:: "free-mem-issue"

  node [shape=box];
  inst1 [label="instance1"];
  inst2 [label="instance2"];

  node [shape=note];
  nodeA [label="fmem=0"];
  nodeB [label="fmem=1"];
  nodeC [label="fmem=0"];

  node [shape=ellipse, style=filled, fillcolor=green]

  {rank=same; inst1 inst2}

  stop [label="crash!", fillcolor=orange];
  migrate [label="migrate/ok"];
  start [style=filled, fillcolor=red, label="start/fail"];
  inst1 -> stop -> start;
  stop -> migrate -> start [style=invis, weight=0];
  inst2 -> migrate;

  {rank=same; inst1 inst2 nodeA}
  {rank=same; stop nodeB}
  {rank=same; migrate nodeC}

  nodeA -> nodeB -> nodeC [style=invis, weight=1];

The behaviour here is wrong; the migration of *instance2* to the node in
question will succeed or fail depending on whether *instance1* is
running or not. And for *instance1*, it can lead to cases where, if it
crashes, it cannot be restarted anymore.

Finally, not a problem but rather an important missing feature is
support for memory over-subscription: both Xen and KVM have supported
memory ballooning, even automatic memory ballooning, for a while now.
The entire memory model is based on a fixed memory size for instances,
and if memory ballooning is enabled, it will “break” the HTools
algorithm. Even the fact that KVM instances do not use all their memory
from the start creates problems (although less severe ones, since usage
will grow and stabilise in the end).

Disks
~~~~~

Because we currently only track disk space, if we have a cluster of
``N`` otherwise identical nodes but half of them have 10 drives of size
``X`` and the other half 2 drives of size ``5X``, HTools will consider
them exactly the same. However, in the case of mechanical drives at
least, the I/O performance will differ significantly based on spindle
count, and a “fair” load distribution should take this into account (a
similar comment can be made about processor/memory/network speed).

Another problem related to the spindle count is the LVM allocation
algorithm. Currently, the algorithm always creates (or tries to create)
striped volumes, with the stripe count being hard-coded to the
``./configure`` parameter ``--with-lvm-stripecount``. This creates
problems like:

- when installing from a distribution package, all clusters will be
  either limited or overloaded due to this fixed value
- it is not possible to mix heterogeneous nodes (even in different node
  groups) and have optimal settings for all nodes
- the striping value applies both to LVM/DRBD data volumes (which are on
  the order of gigabytes to hundreds of gigabytes) and to DRBD metadata
  volumes (whose size is always fixed at 128MB); when striping such
  small volumes over many PVs, their size will increase needlessly (and
  this can confuse HTools' disk computation algorithm)

Moreover, allocation is currently based on a ‘most free space’
algorithm. This balances the free space usage on disks, but on the
other hand it tends to mix rather badly the data and metadata volumes
of different instances. For example, it cannot do the following:

- keep DRBD data and metadata volumes on the same drives, in order to
  reduce exposure to drive failure in a many-drives system
- keep DRBD data and metadata volumes on different drives, to reduce
  the performance impact of metadata writes

Additionally, while Ganeti supports setting the volume group separately
for data and metadata volumes at instance creation, there are no
defaults for this setting.

Similar to the above stripe count problem (which is about insufficient
customisation of Ganeti's behaviour), we have limited pass-through
customisation of the various options of our storage backends; while LVM
has a system-wide configuration file that can be used to tweak some of
its behaviours, for DRBD we don't use the :command:`drbdadm` tool, and
instead we call :command:`drbdsetup` directly, with a fixed/restricted
set of options; so for example one cannot tweak the buffer sizes.

Another current problem is that the support for shared storage in HTools
is still limited, but this problem is outside of this design document.

Locking
~~~~~~~

A further problem generated by the “current free” model is that during a
long operation which affects resource usage (e.g. disk replaces,
instance creations) we have to keep the respective objects locked
(sometimes even in exclusive mode), since we don't want any concurrent
modifications to the *free* values.

A classic example of the locking problem is the following:

.. digraph:: "iallocator-lock-issues"

  rankdir=TB;

  start [style=invis];
  node [shape=box,width=2];
  job1 [label="add instance\niallocator run\nchoose A,B"];
  job1e [label="finish add"];
  job2 [label="add instance\niallocator run\nwait locks"];
  job2s [label="acquire locks\nchoose C,D"];
  job2e [label="finish add"];

  job1 -> job1e;
  job2 -> job2s -> job2e;
  edge [style=invis,weight=0];
  start -> {job1; job2}
  job1 -> job2;
  job2 -> job1e;
  job1e -> job2s [style=dotted,label="release locks"];

In the above example, the second IAllocator run will wait for locks for
nodes ``A`` and ``B``, even though in the end the second instance will
be placed on another set of nodes (``C`` and ``D``). This wait shouldn't
be needed, since right after the first IAllocator run has finished,
:command:`hail` knows the status of the cluster after the allocation,
and it could answer the question for the second run too; however, Ganeti
doesn't have such visibility into the cluster state and thus it is
forced to wait with the second job.

Similar examples can be made about replace disks (another long-running
opcode).

.. _label-policies:

Policies
~~~~~~~~

For most of the resources, we have metrics defined by policy: e.g. the
over-subscription ratio for CPUs, the amount of space to reserve,
etc. Furthermore, although Ganeti has no definitions for things such as
minimum/maximum instance size, a real deployment will need to have them,
especially in a fully-automated workflow where end-users can request
instances via an automated interface (that talks to the cluster via
RAPI, LUXI or command line). However, such an automated interface will
need to also take into account cluster capacity, and if the
:command:`hspace` tool is used for the capacity computation, it needs to
be told the maximum instance size; moreover, it has a built-in minimum
instance size which is not customisable.

It is clear that this situation leads to duplicate definitions of
resource policies, which makes it hard to easily change the respective
policies per-cluster (or globally), and furthermore it creates
inconsistencies if such policies are not enforced at the source (i.e. in
Ganeti).

Balancing algorithm
~~~~~~~~~~~~~~~~~~~

The balancing algorithm, as documented in the HTools ``README`` file,
tries to minimise the cluster score; this score is based on a set of
metrics that describe both exceptional conditions and how spread the
instances are across the nodes. In order to achieve this goal, it moves
the instances around, with a series of moves of various types:

- disk replaces (for DRBD-based instances)
- instance failover/migrations (for all types)

However, the algorithm only looks at the cluster score, and not at the
*“cost”* of the moves. In other words, the following can and will happen
on a cluster:

.. digraph:: "balancing-cost-issues"

  rankdir=LR;
  ranksep=1;

  start [label="score α", shape=hexagon];

  node [shape=box, width=2];
  replace1 [label="replace_disks 500G\nscore α-3ε\ncost 3"];
  replace2a [label="replace_disks 20G\nscore α-2ε\ncost 2"];
  migrate1 [label="migrate\nscore α-ε\ncost 1"];

  choose [shape=ellipse,label="choose min(score)=α-3ε\ncost 3"];

  start -> {replace1; replace2a; migrate1} -> choose;

Even though a migration is much, much cheaper than a disk replace (in
terms of network and disk traffic on the cluster), if the disk replace
results in a score infinitesimally smaller, then it will be
chosen. Similarly, between two disk replaces, one moving e.g. ``500GiB``
and one moving ``20GiB``, the first one will be chosen if it results in
a score smaller than the second one. Furthermore, even if the resulting
scores are equal, the first computed solution will be kept, whichever it
is.

Fixing this algorithmic problem is doable, but currently Ganeti doesn't
export enough information about nodes to make an informed decision; in
the above example, if the ``500GiB`` move is between nodes with fast
I/O (both disks and network), it may make sense to prefer it over a
``20GiB`` disk replace between nodes with slow I/O, so simply relating
to the properties of the move itself is not enough; we need more node
information for cost computation.
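
One possible shape for a cost-aware selection rule, as a purely
hypothetical sketch (how to weigh cost against score is exactly what
the missing node information would have to inform)::

  def pick_move(candidates):
      # candidates: list of (score_after, cost) tuples, one per move.
      # Minimise the score as before, but break ties between equally
      # scored solutions by preferring the cheaper move.
      return min(candidates, key=lambda m: (m[0], m[1]))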

Allocation algorithm
~~~~~~~~~~~~~~~~~~~~

.. note:: This design document will not address this limitation, but it
   is worth mentioning as it is directly related to the resource model.

The current allocation/capacity algorithm works as follows (per
node-group)::

  repeat:
      allocate instance without failing N+1

This simple algorithm, and its use of the ``N+1`` criterion, has a
built-in limit of one machine failure in the case of DRBD. This means
the algorithm guarantees that, if using DRBD storage, there are enough
resources to (re)start all affected instances in case of one machine
failure. This relates mostly to memory; there is no accounting for CPU
over-subscription (i.e. in case of failure, making sure we can fail
over while still not going over CPU limits), or for any other resource.

In case of shared storage, there's not even the memory guarantee, as
the N+1 protection doesn't work for shared storage.

If a given cluster administrator wants to survive up to two machine
failures, or wants to also ensure CPU limits for DRBD, there is no way
to configure this in HTools (neither in :command:`hail` nor in
:command:`hspace`). Current workarounds employ, for example, deducting
a certain number of instances from the size computed by
:command:`hspace`, but this is a very crude method, and requires that
instance creations are limited before Ganeti (otherwise
:command:`hail` would allocate until the cluster is full).
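
To make the built-in limit concrete, here is a hedged Python sketch of
the capacity computation (illustrative names; the real implementation
lives in htools)::

  def compute_capacity(cluster, std_instance):
      # Greedily place standard-sized instances until the first
      # failure; N+1 memory redundancy is the only safety criterion.
      count = 0
      while cluster.try_allocate(std_instance, check_n_plus_1=True):
          count += 1
      return count

Supporting two node failures, or CPU limits during failover, would mean
adding further criteria inside this loop, none of which is currently
configurable.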

Proposed architecture
=====================


There are two main changes proposed:

- changing the resource model from a pure :term:`SoW` to a hybrid
  :term:`SoR`/:term:`SoW` one, where the :term:`SoR` component is
  heavily emphasised
- extending the resource model to cover additional properties,
  completing the “holes” in the current coverage

The second change is rather straightforward, but will add more
complexity in the modelling of the cluster. The first change, however,
represents a significant shift from the current model, which Ganeti has
had since its beginnings.

Lock-improved resource model
----------------------------

Hybrid SoR/SoW model
~~~~~~~~~~~~~~~~~~~~

The resources of a node can be characterised in two broad classes:

- mostly static resources
- dynamically changing resources

In the first category, we have things such as total core count, total
memory size, total disk size, number of network interfaces etc. In the
second category we have things such as free disk space, free memory, CPU
load, etc. Note that nowadays we don't have (anymore) fully-static
resources: features like CPU and memory hot-plug, online disk replace,
etc. mean that theoretically all resources can change (there are some
practical limitations, of course).

Even though the rate of change of the two resource types is wildly
different, right now Ganeti handles both in the same way. Given that the
interval of change of the semi-static ones is much bigger than most
Ganeti operations, even more than lengthy sequences of Ganeti jobs, it
makes sense to treat them separately.

The proposal is then to move the following resources into the
configuration and treat the configuration as the authoritative source
for them (a :term:`SoR` model):

- CPU resources:

  - total core count
  - node core usage (*new*)

- memory resources:

  - total memory size
  - node memory size
  - hypervisor overhead (*new*)

- disk resources:

  - total disk size
  - disk overhead (*new*)

Since these resources can nevertheless change at run-time, we will need
functionality to update the recorded values.

Pre-computing dynamic resource values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Remember that the resource model used by HTools models the cluster as
obeying the following equations:

disk\ :sub:`free` = disk\ :sub:`total` - ∑ disk\ :sub:`instances`

mem\ :sub:`free` = mem\ :sub:`total` - ∑ mem\ :sub:`instances` - mem\
:sub:`node` - mem\ :sub:`overhead`

As this model worked fine for HTools, we can consider it valid and adopt
it in Ganeti. Furthermore, note that all values on the right-hand side
now come from the configuration:

- the per-instance usage values were already stored in the configuration
- the other values are moved to the configuration per the previous
  section

This means that we can now compute the free values without having to
actually live-query the nodes, which brings a significant advantage.
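
A hedged sketch of the resulting computation, using only configuration
data (the attribute names are illustrative)::

  def free_resources(node_cfg, instances):
      # Everything on the right-hand side comes from the configuration;
      # no RPC to the node is needed.
      disk_free = node_cfg.disk_total - sum(i.disk_size for i in instances)
      mem_free = (node_cfg.mem_total
                  - sum(i.mem_size for i in instances)
                  - node_cfg.mem_node
                  - node_cfg.mem_overhead)
      return disk_free, mem_free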

There are a couple of caveats to this model though. First, as the
run-time state of the instance is no longer taken into consideration, it
means that we have to introduce a new *offline* state for an instance
(similar to the node one). In this state, the instance's runtime
resources (memory and VCPUs) are no longer reserved for it, and can be
reused by other instances. Static resources like disk and MAC addresses
are still reserved though. Transitioning into and out of this reserved
state will be more involved than simply stopping/starting the instance
(e.g. de-offlining can fail due to missing resources). This complexity
is compensated by the increased consistency of what guarantees we have
in the stopped state (we always guarantee resource reservation), and the
potential for management tools to restrict which users can transition
into/out of this state separately from which users can stop/start the
instance.

Separating per-node resource locks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many of the current node locks in Ganeti exist in order to guarantee
correct resource state computation, whereas others are designed to
guarantee reasonable run-time performance of nodes (e.g. by not
overloading the I/O subsystem). This is an unfortunate coupling, since
it means for example that the following two operations conflict in
practice even though they are orthogonal:

- replacing an instance's disks on a node
- computing node disk/memory free for an IAllocator run

This conflict significantly increases the lock contention on a big/busy
cluster and is at odds with the goal of increasing the cluster size.

The proposal is therefore to add a new level of locking that is only
used to prevent concurrent modification to the resource states (either
node properties or instance properties) and not for long-term
operations:

- instance creation needs to acquire and keep this lock until adding the
  instance to the configuration
- instance modification needs to acquire and keep this lock until
  updating the instance
- node property changes will need to acquire this lock for the
  modification

The new lock level will sit before the instance level (right after BGL)
and could either be single-valued (like the “Big Ganeti Lock”), in which
case we won't be able to modify two nodes at the same time, or per-node,
in which case the list of locks at this level needs to be synchronised
with the node lock level. To be determined.

Lock contention reduction
~~~~~~~~~~~~~~~~~~~~~~~~~

Based on the above, the locking contention will be reduced as follows:
IAllocator calls will no longer need the ``LEVEL_NODE: ALL_SET`` lock,
only the resource lock (in exclusive mode). Hence allocating/computing
evacuation targets will no longer conflict for longer than the time to
compute the allocation solution.

The remaining long-running locks will be the DRBD replace-disks ones
(exclusive mode). These can also be removed, or changed into shared
locks, but that is a separate design change.

.. admonition:: FIXME

   Need to rework instance replace disks. I don't think we need exclusive
   locks for replacing disks: it is safe to stop/start the instance while
   it's doing a replace disks. Only modify would need exclusive, and only
   for transitioning into/out of offline state.

Instance memory model
---------------------

In order to support ballooning, the instance memory model needs to be
changed from a “memory size” one to a “min/max memory size” one. This
interacts with the new static resource model, however, and thus we need
to declare a priori the expected oversubscription ratio on the cluster.

The new minimum memory size parameter will be similar to the current
memory size; the cluster will guarantee that in all circumstances, all
instances will have available their minimum memory size. The maximum
memory size will permit burst usage of more memory by instances, with
the restriction that the sum of maximum memory usage will not be more
than the free memory times the oversubscription factor:

∑ memory\ :sub:`min` ≤ memory\ :sub:`available`

∑ memory\ :sub:`max` ≤ memory\ :sub:`free` * oversubscription_ratio

The hypervisor will have the possibility of adjusting the instance's
memory size dynamically between these two boundaries.
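
A hedged sketch of the resulting admission check (names illustrative)::

  def memory_constraints_ok(instances, mem_available, mem_free,
                            oversubscription_ratio):
      # All instances must always fit at their minimum memory size...
      min_ok = sum(i.min_mem for i in instances) <= mem_available
      # ...while burst (maximum) usage may oversubscribe free memory.
      max_ok = (sum(i.max_mem for i in instances)
                <= mem_free * oversubscription_ratio)
      return min_ok and max_ok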

Note that the minimum memory is related to the available memory on the
node, whereas the maximum memory is related to the free memory. On
DRBD-enabled clusters, this will have the advantage of using the
reserved memory for N+1 failover for burst usage, instead of having it
completely idle.

.. admonition:: FIXME

   Need to document how Ganeti forces minimum size at runtime, overriding
   the hypervisor, in cases of failover/lack of resources.

New parameters
--------------

Unfortunately the design will add a significant number of new
parameters, and change the meaning of some of the current ones.

Instance size limits
~~~~~~~~~~~~~~~~~~~~

As described in :ref:`label-policies`, we currently lack a clear
definition of the supported instance sizes (minimum, maximum and
standard). As such, we will add the following structure to the cluster
parameters:

- ``min_ispec``, ``max_ispec``: minimum and maximum acceptable instance
  specs
- ``std_ispec``: standard instance size, which will be used for capacity
  computations and for default parameters on the instance creation
  request

Ganeti will by default reject non-standard instance sizes (lower than
``min_ispec`` or greater than ``max_ispec``), but as usual a
``--ignore-ipolicy`` option on the command line or in the RAPI request
will override these constraints. The ``std_ispec`` structure will be
used to fill in missing instance specifications on create.

Each of the ispec structures will be a dictionary, since the contents
can change over time. Initially, we will define the following variables
in these structures:

+---------------+----------------------------------+--------------+
|Name           |Description                       |Type          |
+===============+==================================+==============+
|mem_size       |Allowed memory size               |int           |
+---------------+----------------------------------+--------------+
|cpu_count      |Allowed vCPU count                |int           |
+---------------+----------------------------------+--------------+
|disk_count     |Allowed disk count                |int           |
+---------------+----------------------------------+--------------+
|disk_size      |Allowed disk size                 |int           |
+---------------+----------------------------------+--------------+
|nic_count      |Allowed NIC count                 |int           |
+---------------+----------------------------------+--------------+
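
For illustration, such a policy structure could look as follows in JSON
(all values are examples only, not proposed defaults)::

  {
    "min_ispec": {"mem_size": 128, "cpu_count": 1, "disk_count": 1,
                  "disk_size": 1024, "nic_count": 1},
    "max_ispec": {"mem_size": 32768, "cpu_count": 8, "disk_count": 8,
                  "disk_size": 1048576, "nic_count": 8},
    "std_ispec": {"mem_size": 2048, "cpu_count": 1, "disk_count": 1,
                  "disk_size": 10240, "nic_count": 1}
  }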

Inheritance
+++++++++++

In a single-group cluster, the above structure is sufficient. However,
on a multi-group cluster, it could be that the hardware specifications
differ across node groups, and thus the following problem appears: how
can Ganeti present unified specifications over RAPI?

Since the set of instance specs is only partially ordered (as opposed to
the sets of values of the individual variables in the spec, which are
totally ordered), it follows that we can't present unified specs. As
such, the proposed approach is to allow the ``min_ispec`` and
``max_ispec`` to be customised per node-group (and export them as a list
of specifications), and a single ``std_ispec`` at cluster level
(exported as a single value).


Allocation parameters
~~~~~~~~~~~~~~~~~~~~~

Beside the limits of min/max instance sizes, there are other parameters
related to capacity and allocation limits. These mostly relate to the
problems of over-allocation.

+-----------------+----------+---------------------------+----------+------+
| Name            |Level(s)  |Description                |Current   |Type  |
|                 |          |                           |value     |      |
+=================+==========+===========================+==========+======+
|vcpu_ratio       |cluster,  |Maximum ratio of virtual to|64 (only  |float |
|                 |node group|physical CPUs              |in htools)|      |
+-----------------+----------+---------------------------+----------+------+
|spindle_ratio    |cluster,  |Maximum ratio of instances |none      |float |
|                 |node group|to spindles; when the I/O  |          |      |
|                 |          |model doesn't map directly |          |      |
|                 |          |to spindles, another       |          |      |
|                 |          |measure of I/O should be   |          |      |
|                 |          |used instead               |          |      |
+-----------------+----------+---------------------------+----------+------+
|max_node_failures|cluster,  |Cap allocation/capacity so |1         |int   |
|                 |node group|that the cluster can       |(hardcoded|      |
|                 |          |survive this many node     |in htools)|      |
|                 |          |failures                   |          |      |
+-----------------+----------+---------------------------+----------+------+

Since these are used mostly internally (in htools), they will be
exported as-is from Ganeti, without explicit handling of node-group
grouping.

Regarding ``spindle_ratio``, in this context spindles do not necessarily
have to mean actual mechanical hard drives; it's rather a measure of
I/O performance for internal storage.

Disk parameters
~~~~~~~~~~~~~~~

The proposed model for the new disk parameters is a simple free-form one
based on dictionaries, indexed per disk template and parameter name.
Only the disk template parameters are visible to the user, and those are
internally translated to logical disk level parameters.

This is a simplification, because each parameter is applied to a whole
nested structure and there is no way of fine-tuning each level's
parameters, but it is good enough for the current parameter set. This
model could need to be expanded, e.g., if support for three-node stacked
DRBD setups is added to Ganeti.

At the JSON level, since the object key has to be a string, the keys can
be encoded via a separator (e.g. slash), or by having two dict levels.
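
As an illustration of the two encodings, the ``drbd`` template's
``metavg`` parameter could be represented either way (a sketch, not a
fixed wire format)::

  {"drbd/metavg": "xenvg"}

  {"drbd": {"metavg": "xenvg"}}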

When needed, the unit of measurement is expressed inside square
brackets.

+--------+--------------+-------------------------+---------------------+------+
|Disk    |Name          |Description              |Current status       |Type  |
|template|              |                         |                     |      |
+========+==============+=========================+=====================+======+
|plain   |stripes       |How many stripes to use  |Configured at        |int   |
|        |              |for newly created (plain)|./configure time, not|      |
|        |              |logical volumes          |overridable at       |      |
|        |              |                         |runtime              |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |data-stripes  |How many stripes to use  |Same as for          |int   |
|        |              |for data volumes         |plain/stripes        |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |metavg        |Default volume group for |Same as the main     |string|
|        |              |the metadata LVs         |volume group,        |      |
|        |              |                         |overridable via      |      |
|        |              |                         |'metavg' key         |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |meta-stripes  |How many stripes to use  |Same as for lvm      |int   |
|        |              |for meta volumes         |'stripes', suboptimal|      |
|        |              |                         |as the meta LVs are  |      |
|        |              |                         |small                |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |disk-barriers |What kind of barriers to |Either all enabled or|string|
|        |              |*disable* for disks;     |all disabled, per    |      |
|        |              |either "n" or a string   |./configure time     |      |
|        |              |containing a subset of   |option               |      |
|        |              |"bfd"                    |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |meta-barriers |Whether to disable or not|Handled together with|bool  |
|        |              |the barriers for the meta|disk-barriers        |      |
|        |              |volume                   |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |resync-rate   |The (static) resync rate |Hardcoded in         |int   |
|        |              |for drbd, when using the |constants.py, not    |      |
|        |              |static syncer, in KiB/s  |changeable via Ganeti|      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |dynamic-resync|Whether to use the       |Not supported.       |bool  |
|        |              |dynamic resync speed     |                     |      |
|        |              |controller or not. If    |                     |      |
|        |              |enabled, c-plan-ahead    |                     |      |
|        |              |must be non-zero and all |                     |      |
|        |              |the c-* parameters will  |                     |      |
|        |              |be used by DRBD.         |                     |      |
|        |              |Otherwise, the value of  |                     |      |
|        |              |resync-rate will be used |                     |      |
|        |              |as a static resync speed.|                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |c-plan-ahead  |Agility factor of the    |Not supported.       |int   |
|        |              |dynamic resync speed     |                     |      |
|        |              |controller. (the higher, |                     |      |
|        |              |the slower the algorithm |                     |      |
|        |              |will adapt the resync    |                     |      |
|        |              |speed). A value of 0     |                     |      |
|        |              |(that is the default)    |                     |      |
|        |              |disables the controller  |                     |      |
|        |              |[ds]                     |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |c-fill-target |Maximum amount of        |Not supported.       |int   |
|        |              |in-flight resync data    |                     |      |
|        |              |for the dynamic resync   |                     |      |
|        |              |speed controller         |                     |      |
|        |              |[sectors]                |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |c-delay-target|Maximum estimated peer   |Not supported.       |int   |
|        |              |response latency for the |                     |      |
|        |              |dynamic resync speed     |                     |      |
|        |              |controller [ds]          |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |c-max-rate    |Upper bound on resync    |Not supported.       |int   |
|        |              |speed for the dynamic    |                     |      |
|        |              |resync speed controller  |                     |      |
|        |              |[KiB/s]                  |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |c-min-rate    |Minimum resync speed for |Not supported.       |int   |
|        |              |the dynamic resync speed |                     |      |
|        |              |controller [KiB/s]       |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |disk-custom   |Free-form string that    |Not supported        |string|
|        |              |will be appended to the  |                     |      |
|        |              |drbdsetup disk command   |                     |      |
|        |              |line, for custom options |                     |      |
|        |              |not supported by Ganeti  |                     |      |
|        |              |itself                   |                     |      |
+--------+--------------+-------------------------+---------------------+------+
|drbd    |net-custom    |Free-form string for     |Not supported        |string|
|        |              |custom net setup options |                     |      |
+--------+--------------+-------------------------+---------------------+------+

Currently Ganeti supports only DRBD 8.0.x, 8.2.x and 8.3.x. It will
refuse to work with DRBD 8.4 since the :command:`drbdsetup` syntax has
changed significantly.

The barriers-related parameters have been introduced in different DRBD
versions; please make sure that your version supports all the barrier
parameters that you pass to Ganeti. Any version later than 8.3.0
implements all of them.

The minimum DRBD version for using the dynamic resync speed controller
is 8.3.9, since previous versions implement different parameters.

A more detailed discussion of the dynamic resync speed controller
parameters is outside the scope of the present document. Please refer to
the ``drbdsetup`` man page
(`8.3 <http://www.drbd.org/users-guide-8.3/re-drbdsetup.html>`_ and
`8.4 <http://www.drbd.org/users-guide/re-drbdsetup.html>`_). An
interesting discussion about them can also be found in a
`drbd-user mailing list post
<http://lists.linbit.com/pipermail/drbd-user/2011-August/016739.html>`_.

All the above parameters are at cluster and node group level; as in
other parts of the code, the intention is that all nodes in a node group
should be equal. It will later be decided to which node group to give
precedence in case of instances split over node groups.

.. admonition:: FIXME

   Add details about when each parameter change takes effect (device
   creation vs. activation)

Node parameters
~~~~~~~~~~~~~~~

For the new memory model, we'll add the following parameters, in a
dictionary indexed by the hypervisor name (node attribute
``hv_state``). The rationale is that, even though multi-hypervisor
clusters are rare, they make sense sometimes, and thus we need to
support multiple node states (one per hypervisor).

Since usually only one of the multiple hypervisors is the ‘main’ one
(and the others used sparingly), capacity computation will still only
use the first hypervisor, and not all of them. Thus we avoid possible
inconsistencies.

+----------+-----------------------------------+---------------+-------+
|Name      |Description                        |Current state  |Type   |
+==========+===================================+===============+=======+
|mem_total |Total node memory, as discovered by|Queried at     |int    |
|          |this hypervisor                    |runtime        |       |
+----------+-----------------------------------+---------------+-------+
|mem_node  |Memory used by, or reserved for,   |Queried at     |int    |
|          |the node itself; note that some    |runtime        |       |
|          |hypervisors can report this in an  |               |       |
|          |authoritative way, others not      |               |       |
+----------+-----------------------------------+---------------+-------+
|mem_hv    |Memory used either by the          |Not used,      |int    |
|          |hypervisor itself or lost due to   |htools computes|       |
|          |instance allocation rounding;      |it internally  |       |
|          |usually this cannot be precisely   |               |       |
|          |computed, but only roughly         |               |       |
|          |estimated                          |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_total |Total node cpu (core) count;       |Queried at     |int    |
|          |usually this can be discovered     |runtime        |       |
|          |automatically                      |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_node  |Number of cores reserved for the   |Not used at all|int    |
|          |node itself; this can either be    |               |       |
|          |discovered or set manually. Only   |               |       |
|          |used for estimating how many VCPUs |               |       |
|          |are left for instances             |               |       |
+----------+-----------------------------------+---------------+-------+

Of the above parameters, only the ``_total`` ones are straightforward.
The others have sometimes strange semantics:

- Xen can report ``mem_node``, if configured statically (as we
  recommend); but Linux-based hypervisors (KVM, chroot, LXC) do not, and
  this needs to be configured statically for these values
- ``mem_hv``, representing unaccounted for memory, is not directly
  computable; on Xen, it can be seen that on an N GB machine, with 1 GB
  for dom0 and N-2 GB for instances, there's just a few MB left, instead
  of a full 1 GB of RAM; however, the exact value varies with the total
  memory size (at least)
- ``cpu_node`` only makes sense on Xen (currently), in the case when we
  restrict dom0; for Linux-based hypervisors, the node itself cannot be
  easily restricted, so it should be set as an estimate of how “heavy”
  the node loads will be

Since these two values cannot be auto-computed from the node, we need to
be able to declare a default at cluster level (debatable how useful they
are at node group level); the proposal is to do this via a cluster-level
``hv_state`` dict (per hypervisor).
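
A hedged example of such a cluster-level default (the hypervisor names
are real Ganeti ones, but the values are purely illustrative)::

  "hv_state": {
    "xen-pvm": {"mem_node": 1024, "cpu_node": 1},
    "kvm":     {"mem_node": 512,  "cpu_node": 1}
  }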

Beside the per-hypervisor attributes, we also have disk attributes,
which are queried directly on the node (without hypervisor
involvement). They are stored in a separate attribute (``disk_state``),
which is indexed per storage type and name; currently this will be just
``LD_LV`` and the volume name as key.

+-------------+-------------------------+--------------------+--------+
|Name         |Description              |Current state       |Type    |
+=============+=========================+====================+========+
|disk_total   |Total disk size          |Queried at runtime  |int     |
+-------------+-------------------------+--------------------+--------+
|disk_reserved|Reserved disk size; this |None used in Ganeti;|int     |
|             |is a lower limit on the  |htools has a        |        |
|             |free space, if such a    |parameter for this  |        |
|             |limit is desired         |                    |        |
+-------------+-------------------------+--------------------+--------+
|disk_overhead|Disk that is expected to |None used in Ganeti;|int     |
|             |be used by other volumes |htools detects this |        |
|             |(set via                 |at runtime          |        |
|             |``reserved_lvs``);       |                    |        |
|             |usually should be zero   |                    |        |
+-------------+-------------------------+--------------------+--------+
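
An illustrative sketch of this attribute (the volume group name and the
sizes are placeholders)::

  "disk_state": {
    "LD_LV": {
      "xenvg": {"disk_total": 1048576, "disk_reserved": 10240,
                "disk_overhead": 0}
    }
  }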


Instance parameters
~~~~~~~~~~~~~~~~~~~

New instance parameters, needed especially for supporting the new memory
model:

+--------------+----------------------------------+-----------------+------+
|Name          |Description                       |Current status   |Type  |
+==============+==================================+=================+======+
|offline       |Whether the instance is in        |Not supported    |bool  |
|              |“permanent” offline mode; this is |                 |      |
|              |stronger than the “admin_down”    |                 |      |
|              |state, and is similar to the node |                 |      |
|              |offline attribute                 |                 |      |
+--------------+----------------------------------+-----------------+------+
|be/max_memory |The maximum memory the instance is|Non-existent, but|int   |
|              |allowed to use                    |virtually        |      |
|              |                                  |identical to     |      |
|              |                                  |memory           |      |
+--------------+----------------------------------+-----------------+------+

HTools changes
--------------

All the new parameters (node, instance, cluster, not so much disk) will
need to be taken into account by HTools, both in balancing and in
capacity computation.

Since Ganeti's cluster model is thus much enhanced, Ganeti can also
export its own reserved/overhead variables, and as such HTools can make
fewer “guesses” as to the difference in values.

.. admonition:: FIXME

   Need to detail more the htools changes; the model is clear to me, but
   need to write it down.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: