========================
 Resource model changes
========================


Introduction
============

In order to manage virtual machines across the cluster, Ganeti needs to
understand the resources present on the nodes, the hardware and software
limitations of the nodes, and how much can be allocated safely on each
node. Some of these decisions are delegated to IAllocator plugins, for
easier site-level customisation.

Similarly, the HTools suite has an internal model that simulates the
hardware resource changes in response to Ganeti operations, in order
both to provide an iallocator plugin and to balance the cluster.

While currently the HTools model is much more advanced than Ganeti's,
neither one is flexible enough and both are heavily geared toward a
specific Xen model; they fail to work well with (e.g.) KVM or LXC, or
with Xen when :term:`tmem` is enabled. Furthermore, the set of metrics
contained in the models is limited to historic requirements and fails to
account for (e.g.) heterogeneity in the I/O performance of the nodes.

Current situation
=================

Ganeti
------

At this moment, Ganeti itself doesn't do any static modelling of the
cluster resources. It only does some runtime checks:

- when creating instances, for the (current) free disk space
- when starting instances, for the (current) free memory
- during cluster verify, for enough N+1 memory on the secondaries, based
  on the (current) free memory

Basically this model is a pure :term:`SoW` one, and it works well when
there are other instances/LVs on the nodes, as it allows Ganeti to deal
with ‘orphan’ resource usage, but on the other hand it has many issues,
described below.

HTools
------

Since HTools does a pure in-memory modelling of the cluster changes as
it executes the balancing or allocation steps, it had to introduce a
static (:term:`SoR`) cluster model.

The model is constructed based on the received node properties from
Ganeti (hence it basically is constructed on what Ganeti can export).

Disk
~~~~

For disk it consists of just the total (``tdsk``) and the free disk
space (``fdsk``); we don't directly track the used disk space. On top of
this, we compute and warn if the sum of disk sizes used by instances
does not match ``tdsk - fdsk``, but otherwise we do not track this
separately.

Memory
~~~~~~

For memory, the model is more complex and tracks some variables that
Ganeti itself doesn't compute. We start from the total (``tmem``), free
(``fmem``) and node memory (``nmem``) as supplied by Ganeti, and
additionally we track:

instance memory (``imem``)
    the total memory used by primary instances on the node, computed
    as the sum of instance memory

reserved memory (``rmem``)
    the memory reserved by peer nodes for N+1 redundancy; this memory is
    tracked per peer-node, and the maximum value out of the peer memory
    lists is the node's ``rmem``; when not using DRBD, this will be
    equal to zero

unaccounted memory (``xmem``)
    memory that cannot be accounted for via the Ganeti model; this is
    computed at startup as::

        tmem - imem - nmem - fmem

    and is presumed to remain constant irrespective of any instance
    moves

available memory (``amem``)
    this is simply ``fmem - rmem``, so unless we use DRBD, this will be
    equal to ``fmem``

``tmem``, ``nmem`` and ``xmem`` are presumed constant during the
instance moves, whereas the ``fmem``, ``imem``, ``rmem`` and ``amem``
values are updated according to the executed moves.
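
As a rough illustration of the bookkeeping described above, the
following Python sketch derives ``xmem``, ``rmem`` and ``amem`` from the
values supplied by Ganeti. The class name ``NodeMemory``, the ``peers``
mapping and the ``add_primary`` helper are assumptions made for this
example only; they are not the actual HTools implementation. All sizes
are in MiB.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NodeMemory:
    """Hypothetical per-node memory record; field names follow the text."""

    tmem: int  # total memory, as reported by Ganeti
    nmem: int  # memory used by the node itself (e.g. dom0)
    fmem: int  # free memory, as reported by Ganeti
    imem: int = 0  # sum of the memory of primary instances
    # peer node name -> memory needed on this node if that peer fails
    peers: Dict[str, int] = field(default_factory=dict)
    xmem: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Computed once at startup and presumed constant afterwards.
        self.xmem = self.tmem - self.imem - self.nmem - self.fmem

    @property
    def rmem(self) -> int:
        # Maximum over the per-peer values; zero when there are no peers
        # (i.e. when DRBD is not used).
        return max(self.peers.values(), default=0)

    @property
    def amem(self) -> int:
        return self.fmem - self.rmem

    def add_primary(self, size: int) -> None:
        # A simulated move onto this node changes fmem and imem only;
        # tmem, nmem and xmem stay as they were.
        self.imem += size
        self.fmem -= size


node = NodeMemory(tmem=16384, nmem=1024, fmem=7000, imem=8192,
                  peers={"node2": 2048, "node3": 1024})
node.add_primary(1024)
print(node.xmem, node.rmem, node.amem)
```

Because ``rmem`` and ``amem`` are derived properties here, they follow
``fmem`` and the peer lists automatically, which mirrors the statement
that only ``tmem``, ``nmem`` and ``xmem`` are treated as constants.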

CPU
~~~

The CPU model is different from the disk/memory models, since it's the
only one where:

#. we do oversubscribe physical CPUs
#. and there is no natural limit for the number of VCPUs we can allocate

We therefore track the total number of VCPUs used on the node and the
number of physical CPUs, and we cap the vcpu-to-cpu ratio in order to
make this somewhat more similar to the other resources which are
limited.
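
The ratio cap can be sketched in a few lines of Python. The function
names and the ``max_ratio`` value used below are hypothetical and only
serve to show how a cap turns an otherwise unlimited resource into a
limited one.

```python
def vcpu_ratio(vcpus_used: int, physical_cpus: int) -> float:
    """Return the vcpu-to-cpu ratio of a node."""
    return vcpus_used / physical_cpus


def can_add_vcpus(vcpus_used: int, new_vcpus: int,
                  physical_cpus: int, max_ratio: float) -> bool:
    """Check whether adding new_vcpus keeps the node within the cap."""
    return vcpu_ratio(vcpus_used + new_vcpus, physical_cpus) <= max_ratio


# A node with 8 physical CPUs, 28 VCPUs in use and an assumed cap of 4.0.
print(can_add_vcpus(28, 4, 8, 4.0))
print(can_add_vcpus(28, 5, 8, 4.0))
```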

Dynamic load
~~~~~~~~~~~~

There is also a model that deals with *dynamic load* values in
htools. As far as we know, it is not currently used with actual load
values, but it is active by default with unitary values for all
instances; it currently tracks these metrics:

- disk load
- memory load
- cpu load
- network load

Even though we do not assign real values to these load values, the fact
that we at least sum them means that the algorithm tries to equalise
these loads, and especially the network load, which is otherwise not
tracked at all. The practical result (due to a combination of these four
metrics) is that the number of secondaries will be balanced.

Limitations
-----------


There are unfortunately many limitations to the current model.

Memory
~~~~~~

The memory model doesn't work well in case of KVM. For Xen, the memory
for the node (i.e. ``dom0``) can be static or dynamic; we don't support
the latter case, but for the former case, the static value is configured
on the Xen/kernel command line, and can be queried from Xen
itself. Therefore, Ganeti can query the hypervisor for the memory used
for the node; the same model was adopted for the chroot/KVM/LXC
hypervisors, but in these cases there's no natural value for the memory
used by the base OS/kernel, and we currently try to compute a value for
the node memory based on current consumption. This, being variable,
breaks the assumptions in both Ganeti and HTools.

This problem also shows for the free memory: if the free memory on the
node is not constant (Xen with :term:`tmem` auto-ballooning enabled), or
if the node and instance memory are pooled together (Linux-based
hypervisors like KVM and LXC), the current value of the free memory is
meaningless and cannot be used for instance checks.

A separate issue related to the free memory tracking is that since we
don't track memory use but rather memory availability, an instance that
is temporarily down changes Ganeti's understanding of the memory status
of the node. This can lead to problems such as:

.. digraph:: "free-mem-issue"

  node  [shape=box];
  inst1 [label="instance1"];
  inst2 [label="instance2"];

  node  [shape=note];
  nodeA [label="fmem=0"];
  nodeB [label="fmem=1"];
  nodeC [label="fmem=0"];

  node  [shape=ellipse, style=filled, fillcolor=green]

  {rank=same; inst1 inst2}

  stop    [label="crash!", fillcolor=orange];
  migrate [label="migrate/ok"];
  start   [style=filled, fillcolor=red, label="start/fail"];
  inst1   -> stop -> start;
  stop    -> migrate -> start [style=invis, weight=0];
  inst2   -> migrate;

  {rank=same; inst1 inst2 nodeA}
  {rank=same; stop nodeB}
  {rank=same; migrate nodeC}

  nodeA -> nodeB -> nodeC [style=invis, weight=1];

The behaviour here is wrong; the migration of *instance2* to the node in
question will succeed or fail depending on whether *instance1* is
running or not. And for *instance1*, it can lead to cases where, if it
crashes, it cannot restart anymore.
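
The sequence in the diagram can be replayed with a toy model that, like
the current checks, looks only at the memory that is free right now. The
``Node`` class and its methods are made up for this illustration, and
memory is counted in abstract units to match the ``fmem`` labels above.

```python
class Node:
    """Toy node that only knows its current free memory (pure SoW)."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.running = {}  # instance name -> memory in use

    @property
    def fmem(self) -> int:
        return self.total - sum(self.running.values())

    def try_start(self, name: str, mem: int) -> bool:
        # The only check is against the memory that is free at this moment.
        if mem > self.fmem:
            return False
        self.running[name] = mem
        return True

    def crash(self, name: str) -> None:
        del self.running[name]


node = Node(total=1)
node.try_start("instance1", 1)               # fmem drops to 0
blocked = node.try_start("instance2", 1)     # migration is refused
node.crash("instance1")                      # fmem is back to 1
migrated = node.try_start("instance2", 1)    # the same migration succeeds
restarted = node.try_start("instance1", 1)   # instance1 cannot come back
print(blocked, migrated, restarted)
```

The outcome of the *instance2* migration depends purely on the timing of
the crash, not on anything recorded about the node.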

Finally, not a problem but rather an important missing feature is
support for memory over-subscription: both Xen and KVM have supported
memory ballooning, even automatic memory ballooning, for a while now.
The entire memory model is based on a fixed memory size for instances,
and if memory ballooning is enabled, it will “break” the HTools
algorithm. Even the fact that KVM instances do not use all memory from
the start creates problems (although not as severe, since it will grow
and stabilise in the end).

Disks
~~~~~

Because we only track disk space currently, this means if we have a
cluster of ``N`` otherwise identical nodes but half of them have 10
drives of size ``X`` and the other half 2 drives of size ``5X``, HTools
will consider them exactly the same. However, in the case of mechanical
drives at least, the I/O performance will differ significantly based on
spindle count, and a “fair” load distribution should take this into
account (a similar comment can be made about processor/memory/network
speed).

Another problem related to the spindle count is the LVM allocation
algorithm. Currently, the algorithm always creates (or tries to create)
striped volumes, with the stripe count being hard-coded to the
``./configure`` parameter ``--with-lvm-stripecount``. This creates
problems like:

- when installing from a distribution package, all clusters will be
  either limited or overloaded due to this fixed value
- it is not possible to mix heterogeneous nodes (even in different node
  groups) and have optimal settings for all nodes
- the striping value applies both to LVM/DRBD data volumes (which are on
  the order of gigabytes to hundreds of gigabytes) and to DRBD metadata
  volumes (whose size is always fixed at 128MB); when striping such
  small volumes over many PVs, their size will increase needlessly (and
  this can confuse HTools' disk computation algorithm)
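
The size increase mentioned in the last item can be estimated with a
small calculation. This sketch assumes that LVM rounds a striped volume
up to a whole number of extents per stripe and that the volume group
uses a 4 MiB extent size; both are assumptions of the example, and the
function name is hypothetical.

```python
import math


def striped_lv_size_mib(requested_mib: int, stripes: int,
                        extent_mib: int = 4) -> int:
    """Estimate the allocated size of a striped volume, in MiB."""
    # Number of extents needed for the requested size.
    extents = math.ceil(requested_mib / extent_mib)
    # Round up so that every stripe gets the same number of extents.
    rounded = math.ceil(extents / stripes) * stripes
    return rounded * extent_mib


for stripes in (1, 2, 5, 12):
    print(stripes, striped_lv_size_mib(128, stripes))
```

Under these assumptions a 128 MiB metadata volume stays at 128 MiB with
one or two stripes, but grows once the stripe count no longer divides
its extent count evenly.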

Moreover, the allocation currently allocates based on a ‘most free
space’ algorithm. This balances the free space usage on disks, but on
the other hand it tends to mix rather badly the data and metadata
volumes of different instances. For example, it cannot do the following:

- keep DRBD data and metadata volumes on the same drives, in order to
  reduce exposure to drive failure in a many-drives system
- keep DRBD data and metadata volumes on different drives, to reduce
  performance impact of metadata writes

Additionally, while Ganeti supports setting the volume separately for
data and metadata volumes at instance creation, there are no defaults
for this setting.

Similar to the above stripe count problem (which is about insufficient
customisation of Ganeti's behaviour), we have limited
pass-through customisation of the various options of our storage
backends; while LVM has a system-wide configuration file that can be
used to tweak some of its behaviours, for DRBD we don't use the
:command:`drbdadm` tool, and instead we call :command:`drbdsetup`
directly, with a fixed/restricted set of options; so for example one
cannot tweak the buffer sizes.

Another current problem is that the support for shared storage in HTools
is still limited, but this problem is outside of this design document.

Locking
~~~~~~~

A further problem generated by the “current free” model is that during a
long operation which affects resource usage (e.g. disk replaces,
instance creations) we have to keep the respective objects locked
(sometimes even in exclusive mode), since we don't want any concurrent
modifications to the *free* values.

A classic example of the locking problem is the following:

.. digraph:: "iallocator-lock-issues"

  rankdir=TB;

  start [style=invis];
  node  [shape=box,width=2];
  job1  [label="add instance\niallocator run\nchoose A,B"];
  job1e [label="finish add"];
  job2  [label="add instance\niallocator run\nwait locks"];
  job2s [label="acquire locks\nchoose C,D"];
  job2e [label="finish add"];

  job1  -> job1e;
  job2  -> job2s -> job2e;
  edge [style=invis,weight=0];
  start -> {job1; job2}
  job1  -> job2;
  job2  -> job1e;
  job1e -> job2s [style=dotted,label="release locks"];

In the above example, the second IAllocator run will wait for locks for
nodes ``A`` and ``B``, even though in the end the second instance will
be placed on another set of nodes (``C`` and ``D``). This wait shouldn't
be needed, since right after the first IAllocator run has finished,
:command:`hail` knows the status of the cluster after the allocation,
and it could answer the question for the second run too; however, Ganeti
doesn't have such visibility into the cluster state and thus it is
forced to wait with the second job.

Similar examples can be made about replace disks (another long-running
opcode).

.. _label-policies:

Policies
~~~~~~~~

For most of the resources, we have metrics defined by policy: e.g. the
over-subscription ratio for CPUs, the amount of space to reserve,
etc. Furthermore, although Ganeti has no definitions such as
minimum/maximum instance size, a real deployment will need to have
them, especially in a fully-automated workflow where end-users can
request instances via an automated interface (that talks to the cluster
via RAPI, LUXI or command line). However, such an automated interface
will need to also take into account cluster capacity, and if the
:command:`hspace` tool is used for the capacity computation, it needs to
be told the maximum instance size; moreover, it has a built-in minimum
instance size which is not customisable.

It is clear that this situation leads to duplicate definitions of
resource policies, which makes it hard to change the respective policies
per-cluster (or globally), and furthermore it creates
inconsistencies if such policies are not enforced at the source (i.e. in
Ganeti).

Balancing algorithm
~~~~~~~~~~~~~~~~~~~

The balancing algorithm, as documented in the HTools ``README`` file,
tries to minimise the cluster score; this score is based on a set of
metrics that describe both exceptional conditions and how evenly the
instances are spread across the nodes. In order to achieve this goal, it
moves the instances around, with a series of moves of various types:

- disk replaces (for DRBD-based instances)
- instance failover/migrations (for all types)

However, the algorithm only looks at the cluster score, and not at the
*“cost”* of the moves. In other words, the following can and will happen
on a cluster:

.. digraph:: "balancing-cost-issues"

  rankdir=LR;
  ranksep=1;

  start     [label="score α", shape=hexagon];

  node      [shape=box, width=2];
  replace1  [label="replace_disks 500G\nscore α-3ε\ncost 3"];
  replace2a [label="replace_disks 20G\nscore α-2ε\ncost 2"];
  migrate1  [label="migrate\nscore α-ε\ncost 1"];

  choose    [shape=ellipse,label="choose min(score)=α-3ε\ncost 3"];

  start -> {replace1; replace2a; migrate1} -> choose;

Even though a migration is much, much cheaper than a disk replace (in
terms of network and disk traffic on the cluster), if the disk replace
results in a score infinitesimally smaller, then it will be
chosen. Similarly, between two disk replaces, one moving e.g. ``500GiB``
and one moving ``20GiB``, the first one will be chosen if it results in
a score smaller than the second one. Furthermore, even if the resulting
scores are equal, the first computed solution will be kept, whichever it
is.

Fixing this algorithmic problem is doable, but currently Ganeti doesn't
export enough information about nodes to make an informed decision; in
the above example, if the ``500GiB`` move is between nodes having fast
I/O (both disks and network), it makes sense to execute it over a disk
replace of ``100GiB`` between nodes with slow I/O, so simply relating to
the properties of the move itself is not enough; we need more node
information for cost computation.
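
The selection shown in the diagram can be written down in a few lines of
Python. Scores are kept as integers in units of ε so that the comparison
is exact. The ``Move`` type, the value of ``ALPHA`` and the
``choose_cost_aware`` function with its ``penalty`` parameter are
hypothetical; the last one is only one possible way of weighing cost and
is not something this document proposes.

```python
from typing import List, NamedTuple


class Move(NamedTuple):
    name: str
    score: int  # resulting cluster score, in units of ε
    cost: int   # relative cost of executing the move


ALPHA = 1000  # hypothetical starting score α, in units of ε

moves: List[Move] = [
    Move("migrate", ALPHA - 1, 1),
    Move("replace_disks 20G", ALPHA - 2, 2),
    Move("replace_disks 500G", ALPHA - 3, 3),
]


def choose_by_score(candidates: List[Move]) -> Move:
    # Behaviour described in the text: only the score matters, and on
    # equal scores min() keeps the first candidate it encountered.
    return min(candidates, key=lambda m: m.score)


def choose_cost_aware(candidates: List[Move], penalty: int) -> Move:
    # Hypothetical alternative: charge `penalty` score units per cost unit.
    return min(candidates, key=lambda m: m.score + penalty * m.cost)


print(choose_by_score(moves).name)
print(choose_cost_aware(moves, penalty=2).name)
```

With the score-only rule the most expensive move wins on a difference of
a single ε; any cost-aware rule needs a meaningful ``cost`` value, which
is exactly the node information that is currently missing.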

Allocation algorithm
~~~~~~~~~~~~~~~~~~~~

.. note:: This design document will not address this limitation, but it
  is worth mentioning as it is directly related to the resource model.

The current allocation/capacity algorithm works as follows (per
node-group)::

    repeat:
        allocate instance without failing N+1

This simple algorithm, and its use of the ``N+1`` criterion, has a
built-in limit of 1 machine failure in case of DRBD. This means the
algorithm guarantees that, if using DRBD storage, there are enough
resources to (re)start all affected instances in case of one machine
failure. This relates mostly to memory; there is no account for CPU
over-subscription (i.e. in case of failure, make sure we can failover
while still not going over CPU limits), or for any other resource.
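
A minimal sketch of such a memory-only ``N+1`` check, assuming DRBD
instances described as ``(primary, secondary, memory)`` tuples, could
look as follows. The function name and the data layout are illustrative
assumptions, not the HTools code.

```python
from typing import Dict, List, Tuple

Instance = Tuple[str, str, int]  # (primary node, secondary node, memory)


def passes_n_plus_one(free_mem: Dict[str, int],
                      instances: List[Instance]) -> bool:
    """Check that every node can absorb the failure of any single peer."""
    # secondary node -> primary node -> memory to start there on failover
    needed: Dict[str, Dict[str, int]] = {}
    for primary, secondary, mem in instances:
        per_primary = needed.setdefault(secondary, {})
        per_primary[primary] = per_primary.get(primary, 0) + mem
    # Only one machine fails at a time, so the worst single peer counts.
    return all(max(per_primary.values()) <= free_mem[secondary]
               for secondary, per_primary in needed.items())


free = {"A": 4096, "B": 2048, "C": 4096}
layout = [("A", "B", 1024), ("A", "B", 1024), ("C", "B", 2048)]
print(passes_n_plus_one(free, layout))
print(passes_n_plus_one(free, layout + [("A", "B", 512)]))
```

Note that the check takes the maximum over the peers rather than their
sum, which is why it covers exactly one machine failure, and that it
says nothing about CPU or any other resource.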

In case of shared storage, there's not even the memory guarantee, as the
N+1 protection doesn't work for shared storage.

If a given cluster administrator wants to survive up to two machine
failures, or wants to ensure CPU limits too for DRBD, there is no
possibility to configure this in HTools (neither in :command:`hail` nor
in :command:`hspace`). Current workarounds include, for example,
deducting a certain number of instances from the size computed by
:command:`hspace`, but this is a very crude method, and requires that
instance creations are limited before Ganeti (otherwise :command:`hail`
would allocate until the cluster is full).

Proposed architecture
=====================


There are two main changes proposed:

- changing the resource model from a pure :term:`SoW` to a hybrid
  :term:`SoR`/:term:`SoW` one, where the :term:`SoR` component is
  heavily emphasised
- extending the resource model to cover additional properties,
  completing the “holes” in the current coverage

The second change is rather straightforward, but will add more
complexity in the modelling of the cluster. The first change, however,
represents a significant shift from the current model, which Ganeti has
had from its beginnings.

Lock-improved resource model
----------------------------

Hybrid SoR/SoW model
~~~~~~~~~~~~~~~~~~~~

The resources of a node can be characterised in two broad classes:

- mostly static resources
- dynamically changing resources

In the first category, we have things such as total core count, total
memory size, total disk size, number of network interfaces etc. In the
second category we have things such as free disk space, free memory, CPU
load, etc. Note that nowadays we don't have (anymore) fully-static
resources: features like CPU and memory hot-plug, online disk replace,
etc. mean that theoretically all resources can change (there are some
practical limitations, of course).

Even though the rate of change of the two resource types is wildly
different, right now Ganeti handles both the same. Given that the
interval of change of the semi-static ones is much bigger than most
Ganeti operations, even more than lengthy sequences of Ganeti jobs, it
makes sense to treat them separately.

The proposal is then to move the following resources into the
configuration and treat the configuration as the authoritative source
for them (a :term:`SoR` model):

- CPU resources:
    - total core count
    - node core usage (*new*)
- memory resources:
    - total memory size
    - node memory size
    - hypervisor overhead (*new*)
- disk resources:
    - total disk size
    - disk overhead (*new*)

Since these resources can nevertheless change at run-time, we will need
functionality to update the recorded values.

Pre-computing dynamic resource values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Remember that the resource model used by HTools models the clusters as
obeying the following equations:

  disk\ :sub:`free` = disk\ :sub:`total` - ∑ disk\ :sub:`instances`

  mem\ :sub:`free` = mem\ :sub:`total` - ∑ mem\ :sub:`instances` - mem\
  :sub:`node` - mem\ :sub:`overhead`

As this model worked fine for HTools, we can consider it valid and adopt
it in Ganeti. Furthermore, note that all values in the right-hand side
come now from the configuration:

- the per-instance usage values were already stored in the configuration
- the other values will be moved to the configuration per the previous
  section

This means that we can now compute the free values without having to
actually live-query the nodes, which brings a significant advantage.
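
The two equations translate directly into code that reads nothing but
configuration data. The dictionary keys and the sample values below are
invented for the example and do not reflect the actual layout of the
Ganeti configuration; all sizes are in MiB.

```python
from typing import Dict, List

# Hypothetical configuration excerpt for one node and its instances.
node_cfg: Dict[str, int] = {
    "disk_total": 1000000,
    "mem_total": 65536,
    "mem_node": 2048,
    "mem_overhead": 512,
}
instance_cfg: List[Dict[str, int]] = [
    {"disk": 20480, "mem": 4096},
    {"disk": 51200, "mem": 8192},
]


def disk_free(node: Dict[str, int], instances: List[Dict[str, int]]) -> int:
    # disk_free = disk_total - sum(disk_instances)
    return node["disk_total"] - sum(inst["disk"] for inst in instances)


def mem_free(node: Dict[str, int], instances: List[Dict[str, int]]) -> int:
    # mem_free = mem_total - sum(mem_instances) - mem_node - mem_overhead
    return (node["mem_total"] - sum(inst["mem"] for inst in instances)
            - node["mem_node"] - node["mem_overhead"])


print(disk_free(node_cfg, instance_cfg), mem_free(node_cfg, instance_cfg))
```

No node is contacted at any point, so the result does not depend on
which instances happen to be running when the computation is made.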

There are a couple of caveats to this model though. First, as the
run-time state of the instance is no longer taken into consideration, it
means that we have to introduce a new *offline* state for an instance
(similar to the node one). In this state, the instance's runtime
resources (memory and VCPUs) are no longer reserved for it, and can be
reused by other instances. Static resources like disk and MAC addresses
are still reserved though. Transitioning into and out of this offline
state will be more involved than simply stopping/starting the instance
(e.g. de-offlining can fail due to missing resources). This complexity
is compensated by the increased consistency of what guarantees we have
in the stopped state (we always guarantee resource reservation), and the
potential for management tools to restrict which users can transition
into/out of this state separate from which users can stop/start the
instance.
503 d85f01e7 Iustin Pop
Separating per-node resource locks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many of the current node locks in Ganeti exist in order to guarantee
correct resource state computation, whereas others are designed to
guarantee reasonable run-time performance of nodes (e.g. by not
overloading the I/O subsystem). This is an unfortunate coupling, since
it means for example that the following two operations conflict in
practice even though they are orthogonal:

- replacing an instance's disk on a node
- computing node disk/memory free for an IAllocator run

This conflict significantly increases the lock contention on a big/busy
cluster and is at odds with the goal of increasing the cluster size.

The proposal is therefore to add a new level of locking that is only
used to prevent concurrent modification to the resource states (either
node properties or instance properties) and not for long-term
operations:

- instance creation needs to acquire and keep this lock until adding the
  instance to the configuration
- instance modification needs to acquire and keep this lock until
  updating the instance
- node property changes will need to acquire this lock for the
  modification

The new lock level will sit before the instance level (right after BGL)
and could either be single-valued (like the “Big Ganeti Lock”), in which
case we won't be able to modify two nodes at the same time, or per-node,
in which case the list of locks at this level needs to be synchronised
with the node lock level. To be determined.

Lock contention reduction
~~~~~~~~~~~~~~~~~~~~~~~~~

Based on the above, the locking contention will be reduced as follows:
IAllocator calls will no longer need the ``LEVEL_NODE: ALL_SET`` lock,
only the resource lock (in exclusive mode). Hence allocating/computing
evacuation targets will no longer conflict for longer than the time to
compute the allocation solution.

The remaining long-running locks will be the DRBD replace-disks ones
(exclusive mode). These can also be removed, or changed into shared
locks, but that is a separate design change.

.. admonition:: FIXME

  Need to rework instance replace disks. I don't think we need exclusive
  locks for replacing disks: it is safe to stop/start the instance while
  it's doing a replace disks. Only modify would need exclusive, and only
  for transitioning into/out of offline state.

Instance memory model
---------------------

In order to support ballooning, the instance memory model needs to be
changed from a “memory size” one to a “min/max memory size”. This
interacts with the new static resource model, however, and thus we need
to declare a-priori the expected oversubscription ratio on the cluster.

The new minimum memory size parameter will be similar to the current
memory size; the cluster will guarantee that in all circumstances, all
instances will have available their minimum memory size. The maximum
memory size will permit burst usage of more memory by instances, with
the restriction that the sum of maximum memory usage will not be more
than the free memory times the oversubscription factor:

    ∑ memory\ :sub:`min` ≤ memory\ :sub:`available`

    ∑ memory\ :sub:`max` ≤ memory\ :sub:`free` * oversubscription_ratio

The hypervisor will have the possibility of adjusting the instance's
memory size dynamically between these two boundaries.

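The two inequalities can be checked mechanically. The sketch below is
only an illustration: the function name and the representation of
instances as ``(min, max)`` pairs are assumptions, and the available and
free memory figures are taken as inputs rather than derived from the
node state.

```python
def memory_fits(instances, mem_available, mem_free, oversubscription_ratio):
    """Check both memory constraints for the instances on one node.

    ``instances`` is an iterable of (min_memory, max_memory) pairs; all
    sizes use the same unit (e.g. MiB).
    """
    instances = list(instances)
    total_min = sum(lo for lo, _ in instances)
    total_max = sum(hi for _, hi in instances)
    return (total_min <= mem_available
            and total_max <= mem_free * oversubscription_ratio)


node_instances = [(512, 1024), (1024, 4096)]
# Minimums sum to 1536, maximums to 5120.
print(memory_fits(node_instances, 2048, 4096, 1.5))  # limit is 6144
print(memory_fits(node_instances, 2048, 4096, 1.0))  # limit is 4096
```

With a ratio of 1.5 both constraints hold; with a ratio of 1.0 the sum
of the maximum sizes exceeds the limit, so the same set of instances
would be rejected.
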
Note that the minimum memory is related to the available memory on the
node, whereas the maximum memory is related to the free memory. On
DRBD-enabled clusters, this will have the advantage of using the
reserved memory for N+1 failover for burst usage, instead of having it
completely idle.

.. admonition:: FIXME

  Need to document how Ganeti forces minimum size at runtime, overriding
  the hypervisor, in cases of failover/lack of resources.

New parameters
--------------

Unfortunately the design will add a significant number of new
parameters, and change the meaning of some of the current ones.

Instance size limits
~~~~~~~~~~~~~~~~~~~~

As described in :ref:`label-policies`, we currently lack a clear
definition of the supported instance sizes (minimum, maximum and
standard). As such, we will add the following structure to the cluster
parameters:

- ``min_ispec``, ``max_ispec``: minimum and maximum acceptable instance
  specs
- ``std_ispec``: standard instance size, which will be used for capacity
  computations and for default parameters on the instance creation
  request

Ganeti will by default reject non-standard instance sizes (lower than
``min_ispec`` or greater than ``max_ispec``), but as usual a ``--force``
option on the command line or in the RAPI request will override these
constraints. The ``std_ispec`` structure will be used to fill in missing
instance specifications on create.

Each of the ispec structures will be a dictionary, since the contents
can change over time. Initially, we will define the following variables
in these structures:

+---------------+----------------------------------+--------------+
|Name           |Description                       |Type          |
+===============+==================================+==============+
|mem_min        |Minimum memory size allowed       |int           |
+---------------+----------------------------------+--------------+
|mem_max        |Maximum allowed memory size       |int           |
+---------------+----------------------------------+--------------+
|cpu_count      |Allowed vCPU count                |int           |
+---------------+----------------------------------+--------------+
|disk_count     |Allowed disk count                |int           |
+---------------+----------------------------------+--------------+
|disk_size      |Allowed disk size                 |int           |
+---------------+----------------------------------+--------------+
|nic_count      |Allowed NIC count                 |int           |
+---------------+----------------------------------+--------------+

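A possible reading of these rules, as a hedged Python sketch: missing
values are filled from ``std_ispec``, and the result is accepted only if
every variable lies within the corresponding ``min_ispec`` and
``max_ispec`` values, unless the request is forced. The function names
and the per-variable bounds check are assumptions made for this
illustration, not the actual Ganeti code.

```python
ISPEC_KEYS = ("mem_min", "mem_max", "cpu_count",
              "disk_count", "disk_size", "nic_count")


def fill_spec(requested, std_ispec):
    """Fill in values missing from the request using std_ispec."""
    spec = dict(std_ispec)
    spec.update(requested)
    return spec


def spec_violations(spec, min_ispec, max_ispec):
    """Return the names of the variables outside the [min, max] bounds."""
    return [key for key in ISPEC_KEYS
            if not min_ispec[key] <= spec[key] <= max_ispec[key]]


def check_spec(requested, min_ispec, std_ispec, max_ispec, force=False):
    """Return the completed spec, or raise if it is out of bounds."""
    spec = fill_spec(requested, std_ispec)
    bad = spec_violations(spec, min_ispec, max_ispec)
    if bad and not force:
        raise ValueError("instance spec out of bounds: " + ", ".join(bad))
    return spec


# Illustrative values only (memory and disk sizes in MiB).
MIN = dict(mem_min=128, mem_max=128, cpu_count=1,
           disk_count=1, disk_size=1024, nic_count=1)
STD = dict(mem_min=512, mem_max=1024, cpu_count=1,
           disk_count=1, disk_size=10240, nic_count=1)
MAX = dict(mem_min=32768, mem_max=32768, cpu_count=8,
           disk_count=16, disk_size=1048576, nic_count=8)

print(check_spec({"cpu_count": 2}, MIN, STD, MAX))
```

A request for 64 VCPUs would raise ``ValueError`` with these limits,
while the same request with ``force=True`` would be returned unchanged
apart from the filled-in defaults.
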
Inheritance
+++++++++++

In a single-group cluster, the above structure is sufficient. However,
on a multi-group cluster, it could be that the hardware specifications
differ across node groups, and thus the following problem appears: how
can Ganeti present unified specifications over RAPI?

Since the set of instance specs is only partially ordered (as opposed to
the sets of values of individual variables in the spec, which are
totally ordered), it follows that we can't present unified specs. As
such, the proposed approach is to allow the ``min_ispec`` and
``max_ispec`` to be customised per node-group (and export them as a list
of specifications), and a single ``std_ispec`` at cluster level
(exported as a single value).

Allocation parameters
~~~~~~~~~~~~~~~~~~~~~

Besides the limits of min/max instance sizes, there are other parameters
related to capacity and allocation limits. These mostly address the
problems caused by over-allocation.

+-----------------+----------+---------------------------+----------+------+
| Name            |Level(s)  |Description                |Current   |Type  |
|                 |          |                           |value     |      |
+=================+==========+===========================+==========+======+
|vcpu_ratio       |cluster,  |Maximum ratio of virtual to|64 (only  |float |
|                 |node group|physical CPUs              |in htools)|      |
+-----------------+----------+---------------------------+----------+------+
|spindle_ratio    |cluster,  |Maximum ratio of instances |none      |float |
|                 |node group|to spindles; when the I/O  |          |      |
|                 |          |model doesn't map directly |          |      |
|                 |          |to spindles, another       |          |      |
|                 |          |measure of I/O should be   |          |      |
|                 |          |used instead               |          |      |
+-----------------+----------+---------------------------+----------+------+
|max_node_failures|cluster,  |Cap allocation/capacity so |1         |int   |
|                 |node group|that the cluster can       |(hardcoded|      |
|                 |          |survive this many node     |in htools)|      |
|                 |          |failures                   |          |      |
+-----------------+----------+---------------------------+----------+------+

Since these are used mostly internally (in htools), they will be
exported as-is from Ganeti, without explicit handling of node-groups
grouping.

Regarding ``spindle_ratio``, in this context spindles do not necessarily
have to mean actual mechanical hard drives; it's rather a measure of
I/O performance for internal storage.

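As an illustration of how the two ratios act as caps, the following
sketch checks a node's load against them. The function and argument
names are hypothetical, and ``max_node_failures`` is left out because
its handling depends on the redundancy model.

```python
def within_ratios(vcpus, physical_cpus, instances, spindles,
                  vcpu_ratio, spindle_ratio):
    """Check the virtual/physical CPU and instance/spindle ratios."""
    return (vcpus <= physical_cpus * vcpu_ratio
            and instances <= spindles * spindle_ratio)


# 40 VCPUs on 8 cores (ratio 5) and 12 instances on 4 spindles (ratio 3).
print(within_ratios(40, 8, 12, 4, vcpu_ratio=64.0, spindle_ratio=4.0))
print(within_ratios(40, 8, 12, 4, vcpu_ratio=4.0, spindle_ratio=4.0))
```

The first call stays within both limits; the second fails because a
``vcpu_ratio`` of 4 allows at most 32 VCPUs on 8 cores.
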
Disk parameters
~~~~~~~~~~~~~~~

The proposed model for the new disk parameters is a simple free-form one
based on dictionaries, indexed per disk template and parameter name.
Only the disk template parameters are visible to the user, and those are
internally translated to logical disk level parameters.

This is a simplification, because each parameter is applied to a whole
nested structure and there is no way of fine-tuning each level's
parameters, but it is good enough for the current parameter set. This
model might need to be expanded, e.g., if support for three-node stacked
DRBD setups is added to Ganeti.

At JSON level, since the object key has to be a string, the keys can be
encoded via a separator (e.g. slash), or by having two dict levels.

+--------+-------------+-------------------------+---------------------+------+
|Disk    |Name         |Description              |Current status       |Type  |
|template|             |                         |                     |      |
+========+=============+=========================+=====================+======+
|plain   |stripes      |How many stripes to use  |Configured at        |int   |
|        |             |for newly created (plain)|./configure time, not|      |
|        |             |logical volumes          |overridable at       |      |
|        |             |                         |runtime              |      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |stripes      |How many stripes to use  |Same as for plain    |int   |
|        |             |for data volumes         |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |metavg       |Default volume group for |Same as the main     |string|
|        |             |the metadata LVs         |volume group,        |      |
|        |             |                         |overridable via      |      |
|        |             |                         |'metavg' key         |      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |metastripes  |How many stripes to use  |Same as for lvm      |int   |
|        |             |for meta volumes         |'stripes', suboptimal|      |
|        |             |                         |as the meta LVs are  |      |
|        |             |                         |small                |      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |disk_barriers|What kind of barriers to |Either all enabled or|string|
|        |             |*disable* for disks;     |all disabled, per    |      |
|        |             |either "n" or a string   |./configure time     |      |
|        |             |containing a subset of   |option               |      |
|        |             |"bfd"                    |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |meta_barriers|Whether barriers are     |Handled together with|bool  |
|        |             |enabled or not for the   |disk_barriers        |      |
|        |             |meta volume              |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |resync_rate  |The (static) resync rate |Hardcoded in         |int   |
|        |             |for drbd, when using the |constants.py, not    |      |
|        |             |static syncer, in MiB/s  |changeable via Ganeti|      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |disk_custom  |Free-form string that    |Not supported        |string|
|        |             |will be appended to the  |                     |      |
|        |             |drbdsetup disk command   |                     |      |
|        |             |line, for custom options |                     |      |
|        |             |not supported by Ganeti  |                     |      |
|        |             |itself                   |                     |      |
+--------+-------------+-------------------------+---------------------+------+
|drbd    |net_custom   |Free-form string for     |Not supported        |string|
|        |             |custom net setup options |                     |      |
+--------+-------------+-------------------------+---------------------+------+

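The two JSON encodings mentioned before the table (separator-joined keys
versus two dict levels) carry the same information and can be converted
into each other. The sketch below uses parameter names from the table;
the values, the ``xenvg`` volume group name and the helper functions are
illustrative assumptions only.

```python
import json

# Two dict levels: disk template, then parameter name.
nested = {
    "plain": {"stripes": 1},
    "drbd": {"stripes": 1, "metavg": "xenvg", "resync_rate": 1024},
}


def flatten(params, sep="/"):
    """Encode (template, name) as a single 'template/name' string key."""
    return {"%s%s%s" % (template, sep, name): value
            for template, names in params.items()
            for name, value in names.items()}


def unflatten(flat, sep="/"):
    """Rebuild the two-level dict from separator-joined keys."""
    result = {}
    for key, value in flat.items():
        template, name = key.split(sep, 1)
        result.setdefault(template, {})[name] = value
    return result


flat = flatten(nested)
print(json.dumps(flat, indent=2, sort_keys=True))
```

Both forms survive a JSON round trip unchanged, so the choice between
them is mostly a matter of how convenient the structure is for RAPI
clients.
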
Note that the DRBD parameters might change once Ganeti supports DRBD 8.4, in
which the :command:`drbdsetup` syntax has changed significantly.
Moreover, new parameters for the dynamic synchronization algorithm will
be added for DRBD versions >= 8.3.9.

All the above parameters are at cluster and node group level; as in
other parts of the code, the intention is that all nodes in a node group
should be equal. It will later be decided to which node group to give
precedence in case of instances split over node groups.

.. admonition:: FIXME

   Add details about when each parameter change takes effect (device
   creation vs. activation)

Node parameters
~~~~~~~~~~~~~~~

For the new memory model, we'll add the following parameters, in a
dictionary indexed by the hypervisor name (node attribute
``hv_state``). The rationale is that, even though multi-hypervisor
clusters are rare, they make sense sometimes, and thus we need to
support multiple node states (one per hypervisor).

Since usually only one of the multiple hypervisors is the 'main' one
(and the others used sparingly), capacity computation will still only
use the first hypervisor, and not all of them. Thus we avoid possible
inconsistencies.

+----------+-----------------------------------+---------------+-------+
|Name      |Description                        |Current state  |Type   |
|          |                                   |               |       |
+==========+===================================+===============+=======+
|mem_total |Total node memory, as discovered by|Queried at     |int    |
|          |this hypervisor                    |runtime        |       |
+----------+-----------------------------------+---------------+-------+
|mem_node  |Memory used by, or reserved for,   |Queried at     |int    |
|          |the node itself; note that some    |runtime        |       |
|          |hypervisors can report this in an  |               |       |
|          |authoritative way, others not      |               |       |
+----------+-----------------------------------+---------------+-------+
|mem_hv    |Memory used either by the          |Not used,      |int    |
|          |hypervisor itself or lost due to   |htools computes|       |
|          |instance allocation rounding;      |it internally  |       |
|          |usually this cannot be precisely   |               |       |
|          |computed, but only roughly         |               |       |
|          |estimated                          |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_total |Total node cpu (core) count;       |Queried at     |int    |
|          |usually this can be discovered     |runtime        |       |
|          |automatically                      |               |       |
|          |                                   |               |       |
|          |                                   |               |       |
|          |                                   |               |       |
+----------+-----------------------------------+---------------+-------+
|cpu_node  |Number of cores reserved for the   |Not used at all|int    |
|          |node itself; this can either be    |               |       |
|          |discovered or set manually. Only   |               |       |
|          |used for estimating how many VCPUs |               |       |
|          |are left for instances             |               |       |
|          |                                   |               |       |
+----------+-----------------------------------+---------------+-------+

Of the above parameters, only the ``_total`` ones are straightforward.
The others sometimes have strange semantics:

- Xen can report ``mem_node``, if configured statically (as we
  recommend); but Linux-based hypervisors (KVM, chroot, LXC) do not, and
  this needs to be configured statically for these hypervisors
- ``mem_hv``, representing unaccounted for memory, is not directly
  computable; on Xen, it can be seen that on an N GB machine, with 1 GB
  for dom0 and N-2 GB for instances, there's just a few MB left, instead
  of a full 1 GB of RAM; however, the exact value varies with the total
  memory size (at least)
- ``cpu_node`` only makes sense on Xen (currently), in the case when we
  restrict dom0; for Linux-based hypervisors, the node itself cannot be
  easily restricted, so it should be set as an estimate of how "heavy"
  the node loads will be

Since these two values cannot be auto-computed from the node, we need to
be able to declare a default at cluster level (debatable how useful they
are at node group level); the proposal is to do this via a cluster-level
``hv_state`` dict (per hypervisor).

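One way to picture the cluster-level defaults is as a per-hypervisor
dictionary that node-level values override. The sketch below is an
assumption about how such a lookup could work, not a description of the
final implementation; the function name and the example values are
hypothetical.

```python
def effective_hv_state(cluster_hv_state, node_hv_state, hypervisor):
    """Merge cluster defaults with the node's values for one hypervisor."""
    state = dict(cluster_hv_state.get(hypervisor, {}))
    # Values known for the node (discovered or set manually) win.
    state.update(node_hv_state.get(hypervisor, {}))
    return state


# Cluster-wide defaults for the values that cannot be auto-computed.
cluster_hv_state = {"kvm": {"mem_node": 1024, "mem_hv": 0, "cpu_node": 1}}
# Node-level state: discovered totals plus one manual override.
node_hv_state = {"kvm": {"mem_total": 32768, "cpu_total": 16, "cpu_node": 2}}

print(effective_hv_state(cluster_hv_state, node_hv_state, "kvm"))
```

In this example ``mem_node`` and ``mem_hv`` fall back to the cluster
defaults, while ``cpu_node`` uses the value set on the node.
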
Besides the per-hypervisor attributes, we also have disk attributes,
which are queried directly on the node (without hypervisor
involvement). They are stored in a separate attribute (``disk_state``),
which is indexed per storage type and name; currently this will be just
``LD_LV`` and the volume name as key.

+-------------+-------------------------+--------------------+--------+
|Name         |Description              |Current state       |Type    |
|             |                         |                    |        |
+=============+=========================+====================+========+
|disk_total   |Total disk size          |Queried at runtime  |int     |
|             |                         |                    |        |
+-------------+-------------------------+--------------------+--------+
|disk_reserved|Reserved disk size; this |None used in Ganeti;|int     |
|             |is a lower limit on the  |htools has a        |        |
|             |free space, if such a    |parameter for this  |        |
|             |limit is desired         |                    |        |
+-------------+-------------------------+--------------------+--------+
|disk_overhead|Disk that is expected to |None used in Ganeti;|int     |
|             |be used by other volumes |htools detects this |        |
|             |(set via                 |at runtime          |        |
|             |``reserved_lvs``);       |                    |        |
|             |usually should be zero   |                    |        |
+-------------+-------------------------+--------------------+--------+

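A rough sketch of how these three values could enter an allocation
check, assuming that ``disk_overhead`` is subtracted from the total and
that ``disk_reserved`` must remain free after the allocation. The
function name, the ``xenvg`` key and all sizes are made up for the
example.

```python
def can_allocate(vg_state, used, requested):
    """Check whether `requested` MiB fit, keeping disk_reserved free."""
    usable = vg_state["disk_total"] - vg_state["disk_overhead"]
    free_after = usable - used - requested
    return free_after >= vg_state["disk_reserved"]


# disk_state is indexed per storage type, then per name.
disk_state = {
    "LD_LV": {
        "xenvg": {"disk_total": 102400,
                  "disk_reserved": 10240,
                  "disk_overhead": 0},
    },
}

vg = disk_state["LD_LV"]["xenvg"]
print(can_allocate(vg, used=61440, requested=20480))
print(can_allocate(vg, used=61440, requested=40960))
```

The first request leaves 20480 MiB free, above the reserved limit; the
second would use up the whole volume group and is refused.
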
Instance parameters
~~~~~~~~~~~~~~~~~~~

New instance parameters, needed especially for supporting the new memory
model:

+--------------+----------------------------------+-----------------+------+
|Name          |Description                       |Current status   |Type  |
|              |                                  |                 |      |
+==============+==================================+=================+======+
|offline       |Whether the instance is in        |Not supported    |bool  |
|              |“permanent” offline mode; this is |                 |      |
|              |stronger than the “admin_down”    |                 |      |
|              |state, and is similar to the node |                 |      |
|              |offline attribute                 |                 |      |
+--------------+----------------------------------+-----------------+------+
|be/max_memory |The maximum memory the instance is|Not existent, but|int   |
|              |allowed                           |virtually        |      |
|              |                                  |identical to     |      |
|              |                                  |memory           |      |
+--------------+----------------------------------+-----------------+------+

HTools changes
--------------

All the new parameters (node, instance, cluster, not so much disk) will
need to be taken into account by HTools, both in balancing and in
capacity computation.

Since Ganeti's cluster model is much enhanced, Ganeti can also
export its own reserved/overhead variables, and as such HTools can make
fewer “guesses” as to the difference in values.

.. admonition:: FIXME

   Need to detail more the htools changes; the model is clear to me, but
   need to write it down.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: