========================
Resource model changes
========================


Introduction
============

In order to manage virtual machines across the cluster, Ganeti needs to
understand the resources present on the nodes, the hardware and software
limitations of the nodes, and how much can be allocated safely on each
node. Some of these decisions are delegated to IAllocator plugins, for
easier site-level customisation.

Similarly, the HTools suite has an internal model that simulates the
hardware resource changes in response to Ganeti operations, in order to
both provide an iallocator plugin and to balance the cluster.

While currently the HTools model is much more advanced than Ganeti's,
neither one is flexible enough and both are heavily geared toward a
specific Xen model; they fail to work well with (e.g.) KVM or LXC, or
with Xen when :term:`tmem` is enabled. Furthermore, the set of metrics
contained in the models is limited to historic requirements and fails to
account for (e.g.) heterogeneity in the I/O performance of the nodes.

Current situation
=================

Ganeti
------

At this moment, Ganeti itself doesn't do any static modelling of the
cluster resources. It only does some runtime checks:

- when creating instances, for the (current) free disk space
- when starting instances, for the (current) free memory
- during cluster verify, for enough N+1 memory on the secondaries, based
  on the (current) free memory

Basically this model is a pure :term:`SoW` one, and it works well when
there are other instances/LVs on the nodes, as it allows Ganeti to deal
with ‘orphan’ resource usage, but on the other hand it has many issues,
described below.

HTools
------

Since HTools does a pure in-memory modelling of the cluster changes as
it executes the balancing or allocation steps, it had to introduce a
static (:term:`SoR`) cluster model.

The model is constructed based on the node properties received from
Ganeti (hence it is basically built on what Ganeti can export).

Disk
~~~~

For disk it consists of just the total (``tdsk``) and the free disk
space (``fdsk``); we don't directly track the used disk space. On top of
this, we compute and warn if the sum of disk sizes used by instances
does not match ``tdsk - fdsk``, but otherwise we do not track this
separately.

Memory
~~~~~~

For memory, the model is more complex and tracks some variables that
Ganeti itself doesn't compute. We start from the total (``tmem``), free
(``fmem``) and node memory (``nmem``) as supplied by Ganeti, and
additionally we track:

instance memory (``imem``)
  the total memory used by primary instances on the node, computed
  as the sum of instance memory

reserved memory (``rmem``)
  the memory reserved by peer nodes for N+1 redundancy; this memory is
  tracked per peer-node, and the maximum value out of the peer memory
  lists is the node's ``rmem``; when not using DRBD, this will be
  equal to zero

unaccounted memory (``xmem``)
  memory that cannot be accounted for via the Ganeti model; this is
  computed at startup as::

    tmem - imem - nmem - fmem

  and is presumed to remain constant irrespective of any instance
  moves

available memory (``amem``)
  this is simply ``fmem - rmem``, so unless we use DRBD, this will be
  equal to ``fmem``

``tmem``, ``nmem`` and ``xmem`` are presumed constant during the
instance moves, whereas the ``fmem``, ``imem``, ``rmem`` and ``amem``
values are updated according to the executed moves.
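
As an illustration, these variables and their invariants can be sketched
as follows (Python; the class and field names are hypothetical, not
HTools' or Ganeti's actual code)::

  class NodeMemModel(object):
      """Per-node memory model mirroring the htools variables."""

      def __init__(self, tmem, fmem, nmem, inst_mems, peer_mems):
          self.tmem = tmem  # total memory, presumed constant
          self.fmem = fmem  # free memory, as reported by Ganeti
          self.nmem = nmem  # node (e.g. dom0) memory, presumed constant
          # imem: sum of the memory of primary instances on this node
          self.imem = sum(inst_mems)
          # rmem: worst-case N+1 reservation over all DRBD peers;
          # zero when DRBD is not used
          self.rmem = max(peer_mems) if peer_mems else 0
          # xmem: memory unaccounted for by the model, computed once
          # at startup and presumed constant across instance moves
          self.xmem = tmem - self.imem - nmem - fmem

      def amem(self):
          # memory actually available for new allocations
          return self.fmem - self.rmem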

CPU
~~~

The CPU model is different from the disk/memory models, since it's the
only one where:

#. we do oversubscribe physical CPUs
#. and there is no natural limit for the number of VCPUs we can allocate

We therefore track the total number of VCPUs used on the node and the
number of physical CPUs, and we cap the vcpu-to-cpu ratio in order to
make this somewhat more similar to the other resources, which are
limited.
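
A sketch of the resulting check (illustrative only; the function and
parameter names are invented)::

  def can_add_vcpus(used_vcpus, new_vcpus, phys_cpus, vcpu_ratio):
      # CPUs are oversubscribed, so the only limit is the policy ratio
      return used_vcpus + new_vcpus <= phys_cpus * vcpu_ratio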

Dynamic load
~~~~~~~~~~~~

There is also a model that deals with *dynamic load* values in
htools. As far as we know, it is not currently used with actual load
values, but it is active by default with unitary values for all
instances; it currently tracks these metrics:

- disk load
- memory load
- cpu load
- network load

Even though we do not assign real values to these loads, the fact that
we at least sum them means that the algorithm tries to equalise them,
and especially the network load, which is otherwise not tracked at
all. The practical result (due to a combination of these four metrics)
is that the number of secondaries will be balanced.

Limitations
-----------

There are unfortunately many limitations to the current model.

Memory
~~~~~~

The memory model doesn't work well in the case of KVM. For Xen, the
memory for the node (i.e. ``dom0``) can be static or dynamic; we don't
support the latter case, but for the former case, the static value is
configured on the Xen/kernel command line and can be queried from Xen
itself. Therefore, Ganeti can query the hypervisor for the memory used
for the node; the same model was adopted for the chroot/KVM/LXC
hypervisors, but in these cases there's no natural value for the memory
used by the base OS/kernel, and we currently try to compute a value for
the node memory based on current consumption. This, being variable,
breaks the assumptions in both Ganeti and HTools.

This problem also shows for the free memory: if the free memory on the
node is not constant (Xen with :term:`tmem` auto-ballooning enabled), or
if the node and instance memory are pooled together (Linux-based
hypervisors like KVM and LXC), the current value of the free memory is
meaningless and cannot be used for instance checks.

A separate issue related to the free memory tracking is that since we
don't track memory use but rather memory availability, an instance that
is temporarily down changes Ganeti's understanding of the memory status
of the node. This can lead to problems such as:

.. digraph:: "free-mem-issue"

  node [shape=box];
  inst1 [label="instance1"];
  inst2 [label="instance2"];

  node [shape=note];
  nodeA [label="fmem=0"];
  nodeB [label="fmem=1"];
  nodeC [label="fmem=0"];

  node [shape=ellipse, style=filled, fillcolor=green]

  {rank=same; inst1 inst2}

  stop [label="crash!", fillcolor=orange];
  migrate [label="migrate/ok"];
  start [style=filled, fillcolor=red, label="start/fail"];
  inst1 -> stop -> start;
  stop -> migrate -> start [style=invis, weight=0];
  inst2 -> migrate;

  {rank=same; inst1 inst2 nodeA}
  {rank=same; stop nodeB}
  {rank=same; migrate nodeC}

  nodeA -> nodeB -> nodeC [style=invis, weight=1];

The behaviour here is wrong; the migration of *instance2* to the node in
question will succeed or fail depending on whether *instance1* is
running or not. And for *instance1*, this can lead to cases where, if it
crashes, it cannot be restarted anymore.

Finally, not a problem but rather an important missing feature is
support for memory over-subscription: both Xen and KVM have supported
memory ballooning, even automatic memory ballooning, for a while now.
The entire memory model is based on a fixed memory size for instances,
and if memory ballooning is enabled, it will “break” the HTools
algorithm. Even the fact that KVM instances do not use all their memory
from the start creates problems (although not as severe, since usage
will grow and stabilise in the end).

Disks
~~~~~

Because we currently only track disk space, if we have a cluster of
``N`` otherwise identical nodes, where half of them have 10 drives of
size ``X`` and the other half 2 drives of size ``5X``, HTools will
consider them exactly the same. However, in the case of mechanical
drives at least, the I/O performance will differ significantly based on
spindle count, and a “fair” load distribution should take this into
account (a similar comment can be made about processor/memory/network
speed).

Another problem related to the spindle count is the LVM allocation
algorithm. Currently, the algorithm always creates (or tries to create)
striped volumes, with the stripe count being hard-coded to the
``./configure`` parameter ``--with-lvm-stripecount``. This creates
problems like:

- when installing from a distribution package, all clusters will be
  either limited or overloaded due to this fixed value
- it is not possible to mix heterogeneous nodes (even in different node
  groups) and have optimal settings for all nodes
- the striping value applies both to LVM/DRBD data volumes (which are on
  the order of gigabytes to hundreds of gigabytes) and to DRBD metadata
  volumes (whose size is always fixed at 128MB); when striping such
  small volumes over many PVs, their size will increase needlessly (and
  this can confuse HTools' disk computation algorithm)

Moreover, the allocation currently works on a ‘most free space’
algorithm. This balances the free space usage on disks, but on the other
hand it tends to mix rather badly the data and metadata volumes of
different instances. For example, it cannot do the following:

- keep DRBD data and metadata volumes on the same drives, in order to
  reduce exposure to drive failure in a many-drives system
- keep DRBD data and metadata volumes on different drives, to reduce
  the performance impact of metadata writes

Additionally, while Ganeti supports setting the volume group separately
for data and metadata volumes at instance creation, there are no
defaults for this setting.

Similar to the above stripe count problem (which is about insufficient
customisation of Ganeti's behaviour), we have limited pass-through
customisation of the various options of our storage backends; while LVM
has a system-wide configuration file that can be used to tweak some of
its behaviours, for DRBD we don't use the :command:`drbdadm` tool, and
instead we call :command:`drbdsetup` directly, with a fixed/restricted
set of options; so for example one cannot tweak the buffer sizes.

Another current problem is that the support for shared storage in HTools
is still limited, but this problem is outside the scope of this design
document.

Locking
~~~~~~~

A further problem generated by the “current free” model is that during a
long operation which affects resource usage (e.g. disk replaces,
instance creations) we have to keep the respective objects locked
(sometimes even in exclusive mode), since we don't want any concurrent
modifications to the *free* values.

A classic example of the locking problem is the following:

.. digraph:: "iallocator-lock-issues"

  rankdir=TB;

  start [style=invis];
  node [shape=box,width=2];
  job1 [label="add instance\niallocator run\nchoose A,B"];
  job1e [label="finish add"];
  job2 [label="add instance\niallocator run\nwait locks"];
  job2s [label="acquire locks\nchoose C,D"];
  job2e [label="finish add"];

  job1 -> job1e;
  job2 -> job2s -> job2e;
  edge [style=invis,weight=0];
  start -> {job1; job2}
  job1 -> job2;
  job2 -> job1e;
  job1e -> job2s [style=dotted,label="release locks"];

In the above example, the second IAllocator run will wait for locks on
nodes ``A`` and ``B``, even though in the end the second instance will
be placed on another set of nodes (``C`` and ``D``). This wait shouldn't
be needed, since right after the first IAllocator run has finished,
:command:`hail` knows the status of the cluster after the allocation,
and it could answer the question for the second run too; however, Ganeti
doesn't have such visibility into the cluster state and thus it is
forced to wait with the second job.

Similar examples can be made about replace disks (another long-running
opcode).

.. _label-policies:

Policies
~~~~~~~~

For most of the resources, we have metrics defined by policy: e.g. the
over-subscription ratio for CPUs, the amount of space to reserve,
etc. Furthermore, although Ganeti has no definitions for such things as
minimum/maximum instance size, a real deployment will need to have them,
especially in a fully-automated workflow where end-users can request
instances via an automated interface (that talks to the cluster via
RAPI, LUXI or command line). However, such an automated interface will
also need to take into account cluster capacity, and if the
:command:`hspace` tool is used for the capacity computation, it needs to
be told the maximum instance size; moreover, it has a built-in minimum
instance size which is not customisable.

It is clear that this situation leads to duplicate definitions of
resource policies, which makes it hard to change the respective policies
per-cluster (or globally), and furthermore it creates inconsistencies if
such policies are not enforced at the source (i.e. in Ganeti).

Balancing algorithm
~~~~~~~~~~~~~~~~~~~

The balancing algorithm, as documented in the HTools ``README`` file,
tries to minimise the cluster score; this score is based on a set of
metrics that describe both exceptional conditions and how spread the
instances are across the nodes. In order to achieve this goal, it moves
the instances around, with a series of moves of various types:

- disk replaces (for DRBD-based instances)
- instance failover/migrations (for all types)

However, the algorithm only looks at the cluster score, and not at the
*“cost”* of the moves. In other words, the following can and will happen
on a cluster:

.. digraph:: "balancing-cost-issues"

  rankdir=LR;
  ranksep=1;

  start [label="score α", shape=hexagon];

  node [shape=box, width=2];
  replace1 [label="replace_disks 500G\nscore α-3ε\ncost 3"];
  replace2a [label="replace_disks 20G\nscore α-2ε\ncost 2"];
  migrate1 [label="migrate\nscore α-ε\ncost 1"];

  choose [shape=ellipse,label="choose min(score)=α-3ε\ncost 3"];

  start -> {replace1; replace2a; migrate1} -> choose;

Even though a migration is much, much cheaper than a disk replace (in
terms of network and disk traffic on the cluster), if the disk replace
results in a score infinitesimally smaller, then it will be
chosen. Similarly, between two disk replaces, one moving e.g. ``500GiB``
and one moving ``20GiB``, the first one will be chosen if it results in
a score smaller than the second one. Furthermore, even if the resulting
scores are equal, the first computed solution will be kept, whichever it
is.

Fixing this algorithmic problem is doable, but currently Ganeti doesn't
export enough information about nodes to make an informed decision; in
the above example, if the ``500GiB`` move is between nodes having fast
I/O (both disks and network), it makes sense to execute it over a disk
replace of ``100GiB`` between nodes with slow I/O, so simply relating to
the properties of the move itself is not enough; we need more node
information for cost computation.
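
Purely as an illustration of the direction (this is not the htools
algorithm, and the names are invented), a cost-aware selection could
normalise the score gain by an estimated move cost::

  def pick_move(moves):
      """Choose the move with the best score gain per unit of cost.

      Each move is a (score_gain, cost) pair; the cost would be derived
      from the amount of data moved and the nodes' I/O and network
      speeds -- exactly the information Ganeti would need to export.
      """
      return max(moves, key=lambda m: m[0] / m[1])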

Allocation algorithm
~~~~~~~~~~~~~~~~~~~~

.. note:: This design document will not address this limitation, but it
   is worth mentioning as it is directly related to the resource model.

The current allocation/capacity algorithm works as follows (per
node-group)::

  repeat:
      allocate instance without failing N+1

This simple algorithm, and its use of the ``N+1`` criterion, has a
built-in limit of one machine failure in the case of DRBD. This means
the algorithm guarantees that, if using DRBD storage, there are enough
resources to (re)start all affected instances in case of one machine
failure. This relates mostly to memory; there is no accounting for CPU
over-subscription (i.e. in case of failure, making sure we can fail over
while still not going over CPU limits), or for any other resource.

In case of shared storage, there's not even the memory guarantee, as the
N+1 protection doesn't work for shared storage.

If a given cluster administrator wants to survive up to two machine
failures, or wants to enforce CPU limits too for DRBD, there is no
possibility to configure this in HTools (neither in :command:`hail` nor
in :command:`hspace`). Current workarounds employ, for example,
deducting a certain number of instances from the size computed by
:command:`hspace`, but this is a very crude method, and requires that
instance creations are limited before Ganeti (otherwise :command:`hail`
would allocate until the cluster is full).

Proposed architecture
=====================

There are two main changes proposed:

- changing the resource model from a pure :term:`SoW` to a hybrid
  :term:`SoR`/:term:`SoW` one, where the :term:`SoR` component is
  heavily emphasised
- extending the resource model to cover additional properties,
  completing the “holes” in the current coverage

The second change is rather straightforward, but will add more
complexity in the modelling of the cluster. The first change, however,
represents a significant shift from the current model, which Ganeti has
had since its beginnings.

Lock-improved resource model
----------------------------

Hybrid SoR/SoW model
~~~~~~~~~~~~~~~~~~~~

The resources of a node can be characterised in two broad classes:

- mostly static resources
- dynamically changing resources

In the first category, we have things such as total core count, total
memory size, total disk size, number of network interfaces etc. In the
second category we have things such as free disk space, free memory, CPU
load, etc. Note that nowadays we don't have (anymore) fully-static
resources: features like CPU and memory hot-plug, online disk replace,
etc. mean that theoretically all resources can change (there are some
practical limitations, of course).

Even though the rate of change of the two resource types is wildly
different, right now Ganeti handles both the same. Given that the
interval of change of the semi-static ones is much bigger than most
Ganeti operations, even more than lengthy sequences of Ganeti jobs, it
makes sense to treat them separately.

The proposal is then to move the following resources into the
configuration and treat the configuration as the authoritative source
for them (a :term:`SoR` model):

- CPU resources:

  - total core count
  - node core usage (*new*)

- memory resources:

  - total memory size
  - node memory size
  - hypervisor overhead (*new*)

- disk resources:

  - total disk size
  - disk overhead (*new*)

Since these resources can still change at run-time, we will need
functionality to update the recorded values.

Pre-computing dynamic resource values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Remember that the resource model used by HTools models the clusters as
obeying the following equations:

  disk\ :sub:`free` = disk\ :sub:`total` - ∑ disk\ :sub:`instances`

  mem\ :sub:`free` = mem\ :sub:`total` - ∑ mem\ :sub:`instances` - mem\
  :sub:`node` - mem\ :sub:`overhead`

As this model worked fine for HTools, we can consider it valid and adopt
it in Ganeti. Furthermore, note that all values in the right-hand side
now come from the configuration:

- the per-instance usage values were already stored in the configuration
- the other values are moved to the configuration per the previous
  section

This means that we can now compute the free values without having to
actually live-query the nodes, which brings a significant advantage.
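
A minimal sketch of such a computation (Python; the field names are
illustrative, not Ganeti's actual configuration schema)::

  def free_resources(node, instances):
      """Derive the free disk/memory values purely from the config."""
      disk_free = node["disk_total"] - sum(i["disk_size"]
                                           for i in instances)
      mem_free = (node["mem_total"]
                  - sum(i["mem_size"] for i in instances)
                  - node["mem_node"]       # reserved for the node itself
                  - node["mem_overhead"])  # hypervisor overhead
      return disk_free, mem_free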

There are a couple of caveats to this model though. First, as the
run-time state of the instance is no longer taken into consideration, it
means that we have to introduce a new *offline* state for an instance
(similar to the node one). In this state, the instance's runtime
resources (memory and VCPUs) are no longer reserved for it, and can be
reused by other instances. Static resources like disk and MAC addresses
are still reserved though. Transitioning into and out of this reserved
state will be more involved than simply stopping/starting the instance
(e.g. de-offlining can fail due to missing resources). This complexity
is compensated by the increased consistency of what guarantees we have
in the stopped state (we always guarantee resource reservation), and the
potential for management tools to restrict which users can transition
into/out of this state separately from which users can stop/start the
instance.

Separating per-node resource locks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many of the current node locks in Ganeti exist in order to guarantee
correct resource state computation, whereas others are designed to
guarantee reasonable run-time performance of nodes (e.g. by not
overloading the I/O subsystem). This is an unfortunate coupling, since
it means for example that the following two operations conflict in
practice even though they are orthogonal:

- replacing an instance's disk on a node
- computing node disk/memory free for an IAllocator run

This conflict increases significantly the lock contention on a big/busy
cluster and is at odds with the goal of increasing the cluster size.

The proposal is therefore to add a new level of locking that is only
used to prevent concurrent modification to the resource states (either
node properties or instance properties) and not for long-term
operations:

- instance creation needs to acquire and keep this lock until adding the
  instance to the configuration
- instance modification needs to acquire and keep this lock until
  updating the instance
- node property changes will need to acquire this lock for the
  modification

The new lock level will sit before the instance level (right after BGL)
and could either be single-valued (like the “Big Ganeti Lock”), in which
case we won't be able to modify two nodes at the same time, or per-node,
in which case the list of locks at this level needs to be synchronised
with the node lock level. To be determined.
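
As a sketch only (the constant names are invented, and whether the level
is single-valued or per-node is still to be determined), the new level
would slot into the lock-level ordering roughly like this::

  # Hypothetical lock-level ordering; lower levels are acquired first.
  LEVEL_CLUSTER   = 0  # the "Big Ganeti Lock"
  LEVEL_NODE_RES  = 1  # new: resource state, right after the BGL
  LEVEL_INSTANCE  = 2
  LEVEL_NODEGROUP = 3
  LEVEL_NODE      = 4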

Lock contention reduction
~~~~~~~~~~~~~~~~~~~~~~~~~

Based on the above, the locking contention will be reduced as follows:
IAllocator calls will no longer need the ``LEVEL_NODE: ALL_SET`` lock,
only the resource lock (in exclusive mode). Hence allocating/computing
evacuation targets will no longer conflict for longer than the time to
compute the allocation solution.

The remaining long-running locks will be the DRBD replace-disks ones
(exclusive mode). These can also be removed, or changed into shared
locks, but that is a separate design change.

.. admonition:: FIXME

   Need to rework instance console vs. instance replace disks. I don't
   think we need exclusive locks for console and neither for replace
   disks: it is safe to stop/start the instance while it's doing a
   replace disks. Only modify would need exclusive, and only for
   transitioning into/out of offline state.

Instance memory model
---------------------

In order to support ballooning, the instance memory model needs to be
changed from a “memory size” one to a “min/max memory size” one. This
interacts with the new static resource model, however, and thus we need
to declare a-priori the expected oversubscription ratio on the cluster.

The new minimum memory size parameter will be similar to the current
memory size; the cluster will guarantee that in all circumstances, all
instances will have available their minimum memory size. The maximum
memory size will permit burst usage of more memory by instances, with
the restriction that the sum of maximum memory usage will not be more
than the free memory times the oversubscription factor:

  ∑ memory\ :sub:`min` ≤ memory\ :sub:`available`

  ∑ memory\ :sub:`max` ≤ memory\ :sub:`free` * oversubscription_ratio

The hypervisor will have the possibility of adjusting the instance's
memory size dynamically between these two boundaries.
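
A sketch of the two admission checks (illustrative Python; the names are
invented)::

  def memory_constraints_ok(instances, mem_available, mem_free,
                            oversubscription_ratio):
      # Minimum sizes must always fit into the available memory...
      mins_ok = sum(i["mem_min"] for i in instances) <= mem_available
      # ...while maximum (burst) sizes may oversubscribe the free memory
      maxs_ok = (sum(i["mem_max"] for i in instances)
                 <= mem_free * oversubscription_ratio)
      return mins_ok and maxs_ok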

Note that the minimum memory is related to the available memory on the
node, whereas the maximum memory is related to the free memory. On
DRBD-enabled clusters, this will have the advantage of using the
reserved memory for N+1 failover for burst usage, instead of having it
completely idle.

.. admonition:: FIXME

   Need to document how Ganeti forces minimum size at runtime,
   overriding the hypervisor, in cases of failover/lack of resources.

New parameters
--------------

Unfortunately the design will add a significant number of new
parameters, and change the meaning of some of the current ones.

Instance size limits
~~~~~~~~~~~~~~~~~~~~

As described in :ref:`label-policies`, we currently lack a clear
definition of the supported instance sizes (minimum, maximum and
standard). As such, we will add the following structure to the cluster
parameters:

- ``min_ispec``, ``max_ispec``: minimum and maximum acceptable instance
  specs
- ``std_ispec``: standard instance size, which will be used for capacity
  computations and for default parameters on the instance creation
  request

Ganeti will by default reject non-standard instance sizes (lower than
``min_ispec`` or greater than ``max_ispec``), but as usual a ``--force``
option on the command line or in the RAPI request will override these
constraints. The ``std_ispec`` structure will be used to fill in missing
instance specifications on create.

Each of the ispec structures will be a dictionary, since the contents
can change over time. Initially, we will define the following variables
in these structures:

+---------------+----------------------------------+--------------+
|Name           |Description                       |Type          |
+===============+==================================+==============+
|mem_min        |Minimum memory size allowed       |int           |
+---------------+----------------------------------+--------------+
|mem_max        |Maximum allowed memory size       |int           |
+---------------+----------------------------------+--------------+
|cpu_count      |Allowed vCPU count                |int           |
+---------------+----------------------------------+--------------+
|disk_count     |Allowed disk count                |int           |
+---------------+----------------------------------+--------------+
|disk_size      |Allowed disk size                 |int           |
+---------------+----------------------------------+--------------+
|nic_count      |Allowed NIC count                 |int           |
+---------------+----------------------------------+--------------+
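
For illustration only, a ``std_ispec`` could then look like the
following (the values are arbitrary examples, not proposed defaults;
sizes in MiB)::

  {
    "mem_min": 128,
    "mem_max": 1024,
    "cpu_count": 1,
    "disk_count": 1,
    "disk_size": 10240,
    "nic_count": 1
  }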

Inheritance
+++++++++++

In a single-group cluster, the above structure is sufficient. However,
on a multi-group cluster, it could be that the hardware specifications
differ across node groups, and thus the following problem appears: how
can Ganeti present unified specifications over RAPI?

Since the set of instance specs is only partially ordered (as opposed to
the sets of values of individual variables in the spec, which are
totally ordered), it follows that we can't present unified specs. As
such, the proposed approach is to allow the ``min_ispec`` and
``max_ispec`` to be customised per node-group (and export them as a list
of specifications), and a single ``std_ispec`` at cluster level
(exported as a single value).

Allocation parameters
~~~~~~~~~~~~~~~~~~~~~

Besides the limits of min/max instance sizes, there are other parameters
related to capacity and allocation limits. These are mostly related to
the problem of over-allocation.

+-----------------+----------+---------------------------+----------+------+
| Name            |Level(s)  |Description                |Current   |Type  |
|                 |          |                           |value     |      |
+=================+==========+===========================+==========+======+
|vcpu_ratio       |cluster,  |Maximum ratio of virtual to|64 (only  |float |
|                 |node group|physical CPUs              |in htools)|      |
+-----------------+----------+---------------------------+----------+------+
|spindle_ratio    |cluster,  |Maximum ratio of instances |none      |float |
|                 |node group|to spindles; when the I/O  |          |      |
|                 |          |model doesn't map directly |          |      |
|                 |          |to spindles, another       |          |      |
|                 |          |measure of I/O should be   |          |      |
|                 |          |used instead               |          |      |
+-----------------+----------+---------------------------+----------+------+
|max_node_failures|cluster,  |Cap allocation/capacity so |1         |int   |
|                 |node group|that the cluster can       |(hardcoded|      |
|                 |          |survive this many node     |in htools)|      |
|                 |          |failures                   |          |      |
+-----------------+----------+---------------------------+----------+------+

Since these are used mostly internally (in htools), they will be
exported as-is from Ganeti, without explicit handling of the node-group
grouping.

Regarding ``spindle_ratio``, in this context spindles do not necessarily
have to mean actual mechanical hard drives; it's rather a measure of I/O
performance for internal storage.

Disk parameters
~~~~~~~~~~~~~~~

The proposed model for the new disk parameters is a simple free-form one
based on dictionaries, indexed per disk level (template or logical disk)
and type (which depends on the level). At JSON level, since the object
key has to be a string, we can encode the keys via a separator
(e.g. slash), or by having two dict levels.

+--------+-------------+-------------------------+---------------------+------+
|Disk    |Name         |Description              |Current status       |Type  |
|template|             |                         |                     |      |
+========+=============+=========================+=====================+======+
|dt/plain|stripes      |How many stripes to use  |Configured at        |int   |
|        |             |for newly created (plain)|./configure time, not|      |
|        |             |logical volumes          |overridable at       |      |
|        |             |                         |runtime              |      |
+--------+-------------+-------------------------+---------------------+------+
|dt/drbd |stripes      |How many stripes to use  |Same as for lvm      |int   |
|        |             |for data volumes         |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|dt/drbd |metavg       |Default volume group for |Same as the main     |string|
|        |             |the metadata LVs         |volume group,        |      |
|        |             |                         |overridable via      |      |
|        |             |                         |'metavg' key         |      |
+--------+-------------+-------------------------+---------------------+------+
|dt/drbd |metastripes  |How many stripes to use  |Same as for lvm      |int   |
|        |             |for meta volumes         |'stripes', suboptimal|      |
|        |             |                         |as the meta LVs are  |      |
|        |             |                         |small                |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|disk_barriers|What kind of barriers to |Either all enabled or|string|
|        |             |*disable* for disks;     |all disabled, per    |      |
|        |             |either "n" or a string   |./configure time     |      |
|        |             |containing a subset of   |option               |      |
|        |             |"bfd"                    |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|meta_barriers|Whether barriers are     |Handled together with|bool  |
|        |             |enabled or not for the   |disk_barriers        |      |
|        |             |meta volume              |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|resync_rate  |The (static) resync rate |Hardcoded in         |int   |
|        |             |for drbd, when using the |constants.py, not    |      |
|        |             |static syncer, in MiB/s  |changeable via Ganeti|      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|disk_custom  |Free-form string that    |Not supported        |string|
|        |             |will be appended to the  |                     |      |
|        |             |drbdsetup disk command   |                     |      |
|        |             |line, for custom options |                     |      |
|        |             |not supported by Ganeti  |                     |      |
|        |             |itself                   |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|ld/drbd8|net_custom   |Free-form string for     |Not supported        |string|
|        |             |custom net setup options |                     |      |
+--------+-------------+-------------------------+---------------------+------+
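
For illustration, the two possible JSON encodings mentioned above could
look as follows (hypothetical values)::

  {"dt/drbd": {"metavg": "xenvg", "stripes": 2}}

or, with two dict levels::

  {"dt": {"drbd": {"metavg": "xenvg", "stripes": 2}}}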

Note that the DRBD8 parameters will change once we support DRBD 8.4,
which has changed the syntax significantly; new syncer modes will be
added for that release.

All the above parameters are at cluster and node group level; as in
other parts of the code, the intention is that all nodes in a node group
should be equal.

Node parameters
~~~~~~~~~~~~~~~

For the new memory model, we'll add the following parameters, in a
dictionary indexed by the hypervisor name (node attribute
``hv_state``). The rationale is that, even though multi-hypervisor
clusters are rare, they make sense sometimes, and thus we need to
support multiple node states (one per hypervisor).

Since usually only one of the multiple hypervisors is the 'main' one
(and the others are used sparingly), capacity computation will still
only use the first hypervisor, and not all of them. Thus we avoid
possible inconsistencies.

+----------+-----------------------------------+---------------+-------+
|Name      |Description                        |Current state  |Type   |
+==========+===================================+===============+=======+
|mem_total |Total node memory, as discovered by|Queried at     |int    |
|          |this hypervisor                    |runtime        |       |
+----------+-----------------------------------+---------------+-------+
|mem_node  |Memory used by, or reserved for,   |Queried at     |int    |
|          |the node itself; note that some    |runtime        |       |
|          |hypervisors can report this in an  |               |       |
|          |authoritative way, others not      |               |       |
+----------+-----------------------------------+---------------+-------+
|mem_hv    |Memory used either by the          |Not used,      |int    |
|          |hypervisor itself or lost due to   |htools computes|       |
|          |instance allocation rounding;      |it internally  |       |
|          |usually this cannot be precisely   |               |       |
|          |computed, but only roughly         |               |       |
|          |estimated                          |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_total |Total node cpu (core) count;       |Queried at     |int    |
|          |usually this can be discovered     |runtime        |       |
|          |automatically                      |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_node  |Number of cores reserved for the   |Not used at all|int    |
|          |node itself; this can either be    |               |       |
|          |discovered or set manually. Only   |               |       |
|          |used for estimating how many VCPUs |               |       |
|          |are left for instances             |               |       |
+----------+-----------------------------------+---------------+-------+

Of the above parameters, only the ``_total`` ones are straightforward.
The others have sometimes strange semantics:

- Xen can report ``mem_node``, if configured statically (as we
  recommend); but Linux-based hypervisors (KVM, chroot, LXC) do not, and
  this needs to be configured statically for these values
- ``mem_hv``, representing unaccounted-for memory, is not directly
  computable; on Xen, it can be seen that on an N GB machine, with 1 GB
  for dom0 and N-2 GB for instances, there's just a few MB left, instead
  of a full 1 GB of RAM; however, the exact value varies with the total
  memory size (at least)
- ``cpu_node`` only makes sense on Xen (currently), in the case when we
  restrict dom0; for Linux-based hypervisors, the node itself cannot be
  easily restricted, so it should be set as an estimate of how "heavy"
  the node loads will be

Since these two values cannot be auto-computed from the node, we need to
be able to declare a default at cluster level (it is debatable how
useful they are at node group level); the proposal is to do this via a
cluster-level ``hv_state`` dict (per hypervisor).
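
As an illustration, such a cluster-level default could look like this
(hypothetical values; memory sizes in MiB)::

  "hv_state": {
    "xen-pvm": {
      "mem_total": 16384,
      "mem_node": 1024,
      "mem_hv": 256,
      "cpu_total": 8,
      "cpu_node": 1
    }
  }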

Beside the per-hypervisor attributes, we also have disk attributes,
which are queried directly on the node (without hypervisor
involvement). They are stored in a separate attribute (``disk_state``),
which is indexed per storage type and name; currently this will be just
``LD_LV`` and the volume name as key.

+-------------+-------------------------+--------------------+--------+
|Name         |Description              |Current state       |Type    |
+=============+=========================+====================+========+
|disk_total   |Total disk size          |Queried at runtime  |int     |
+-------------+-------------------------+--------------------+--------+
|disk_reserved|Reserved disk size; this |None used in Ganeti;|int     |
|             |is a lower limit on the  |htools has a        |        |
|             |free space, if such a    |parameter for this  |        |
|             |limit is desired         |                    |        |
+-------------+-------------------------+--------------------+--------+
|disk_overhead|Disk that is expected to |None used in Ganeti;|int     |
|             |be used by other volumes |htools detects this |        |
|             |(set via                 |at runtime          |        |
|             |``reserved_lvs``);       |                    |        |
|             |usually should be zero   |                    |        |
+-------------+-------------------------+--------------------+--------+
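
Illustratively, the disk state could then be stored as (hypothetical
values; sizes in MiB)::

  "disk_state": {
    "LD_LV": {
      "xenvg": {
        "disk_total": 1048576,
        "disk_reserved": 10240,
        "disk_overhead": 0
      }
    }
  }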

Instance parameters
~~~~~~~~~~~~~~~~~~~

New instance parameters, needed especially for supporting the new memory
model:

+--------------+----------------------------------+-----------------+------+
|Name          |Description                       |Current status   |Type  |
+==============+==================================+=================+======+
|offline       |Whether the instance is in        |Not supported    |bool  |
|              |“permanent” offline mode; this is |                 |      |
|              |stronger than the “admin_down”    |                 |      |
|              |state, and is similar to the node |                 |      |
|              |offline attribute                 |                 |      |
+--------------+----------------------------------+-----------------+------+
|be/max_memory |The maximum memory the instance is|Not existent, but|int   |
|              |allowed to use                    |virtually        |      |
|              |                                  |identical to     |      |
|              |                                  |memory           |      |
+--------------+----------------------------------+-----------------+------+

HTools changes
--------------

All the new parameters (node, instance, cluster, not so much disk) will
need to be taken into account by HTools, both in balancing and in
capacity computation.

Since Ganeti's cluster model is much enhanced, Ganeti can also export
its own reserved/overhead variables, and as such HTools can make fewer
“guesses” as to the difference in values.

.. admonition:: FIXME

   Need to detail more the htools changes; the model is clear to me, but
   need to write it down.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: