==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do completely "best effort"
sharing, it's considerably harder to reserve resources exclusively for
a particular instance. In particular, this has to be done manually for
CPUs and disk; it is implemented for RAM under Xen, but not under KVM;
and there's no provision for network-level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. While some sharing will need to
happen anyway (e.g. for operations that use the host domain, or use
resources, like buses, which are unique or very scarce on host
systems), we'll strive to keep contention to a minimum, but won't try
to avoid all possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a new node parameter. When it's enabled, Ganeti
will allocate entire disks to instances. Though it's possible to think
of ways of doing something similar for other storage back-ends, this
design targets only ``plain`` and ``drbd``. The name is generic enough
in case the feature is extended to other back-ends. The flag value
should be homogeneous within a node group; ``cluster-verify`` will
report any violation of this condition.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is
added, and as part of ``cluster-verify``.
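
As an illustration, a minimal sketch of the uniformity check described
above, assuming the PV sizes are already known (the function and
parameter names are hypothetical, not Ganeti's actual API):

.. code-block:: python

  def pvs_homogeneous(pv_sizes, margin=0.01):
      """Check that all PV sizes are within ``margin`` of each other.

      ``pv_sizes`` is a list of PV sizes (e.g. in mebibytes); ``margin``
      is the allowed relative difference (1% by default).
      """
      if not pv_sizes:
          return True
      smallest, largest = min(pv_sizes), max(pv_sizes)
      return (largest - smallest) <= margin * smallest

  # A 1% margin accepts small variations between disk models:
  assert pvs_homogeneous([953869, 953870, 954000])
  assert not pvs_homogeneous([953869, 1907739])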

When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs for further disk creations. The
underlying LV will be striped, when striping is allowed by the current
configuration. Ganeti will continue to track only the LVs, and query
the LVM layer to figure out which PVs are available and how much space
is free. Yet, creation, disk growing, and free-space reporting will
ignore any partially allocated PVs, so that PVs won't be shared between
instance disks.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.
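
As a sketch of this calculation (the names and the way the reserved
fraction is expressed are illustrative; the actual value will be a
parameter):

.. code-block:: python

  import math

  def pvs_needed(disk_size, pv_size, reserved_fraction=0.02):
      """Minimum number of whole PVs needed to hold ``disk_size``.

      ``reserved_fraction`` models the 2% of PV space that is always
      subtracted for DRBD compatibility and disk variability.
      """
      usable = pv_size * (1.0 - reserved_fraction)
      return int(math.ceil(float(disk_size) / usable))

  # A 500 GiB disk on ~480 GiB PVs needs two spindles:
  print(pvs_needed(disk_size=500 * 1024, pv_size=480 * 1024))  # -> 2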

The obvious target for this option is the plain disk template, which
doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under PVs, but this
is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than what
is strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.

When ``exclusive_storage`` is not enabled, spindles are not used in
free-space calculation, in allocation algorithms, or in policies. When
it's enabled, ``hspace``, ``hbal``, and allocators will use spindles
instead of disk size for their computations. For each node, the number
of all the spindles in every LVM group is recorded, and different LVM
groups are accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. When
``exclusive_storage`` is enabled, these parameters as currently defined
won't make any sense, so their meaning will be changed in this way:

- ``spindle_use`` refers to the resource, hence to the actual spindles
  (PVs in LVM), used by an instance. The values specified in the
  instance policy specifications are compared to the run-time number of
  spindles used by an instance. The ``spindle_use`` back-end parameter
  will be ignored.
- ``spindle_ratio`` in instance policies and ``spindle_count`` in node
  parameters are ignored, as the exclusive assignment of PVs already
  implies a value of 1.0 for the first, and the second is replaced by
  the actual number of spindles.

When ``exclusive_storage`` is disabled, the existing spindle parameters
behave as before.
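
To illustrate the changed meaning of ``spindle_use``, here is a minimal
sketch of how the run-time spindle count of an instance could be
checked against the instance policy limits (names and data layout are
illustrative, not Ganeti's internal representation):

.. code-block:: python

  def spindle_use_ok(instance_disk_pvs, ipolicy_min, ipolicy_max):
      """With ``exclusive_storage``, ``spindle_use`` is the number of
      PVs actually assigned to the instance's disks, compared against
      the limits in the instance policy.
      """
      used = sum(len(pvs) for pvs in instance_disk_pvs)
      return ipolicy_min <= used <= ipolicy_max

  # Two disks backed by one and two PVs use three spindles in total:
  print(spindle_use_ok([["pv1"], ["pv2", "pv3"]],
                       ipolicy_min=1, ipolicy_max=4))  # -> True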

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to a
subset of the CPUs, leaving the other ones to instances and KVM
processes. For KVM, the number of CPUs for the host system should also
be a hypervisor parameter (set at the node group level).
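
As a rough sketch of how the CPU capacity usable by instances could be
derived (the host CPU reservation is the Domain-0 VCPU count on Xen or
the proposed hypervisor parameter on KVM; names are illustrative):

.. code-block:: python

  def instance_cpu_capacity(node_cpus, host_cpus, vcpu_ratio=1.0):
      """CPUs left for instances after reserving ``host_cpus`` for the
      host system, scaled by ``vcpu_ratio`` (1.0 means fully dedicated
      CPUs, no overcommitment).
      """
      return int((node_cpus - host_cpus) * vcpu_ratio)

  # A 32-core node with 2 CPUs reserved for the host can offer 30
  # dedicated VCPUs to instances:
  print(instance_cpu_capacity(node_cpus=32, host_cpus=2))  # -> 30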

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but
it is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC, memory is fully shared between the host system and
all the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used
by the host OS. This requires finishing the implementation of the
memory hypervisor status (set at the node group level) that changes how
free memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide about the size of the
cgroup versus the size of the VM: in particular, some overhead will
exist, because an instance and its encapsulating KVM process share the
same space. For KVM systems the physical memory allocatable to
instances should be computed by subtracting an overhead for the KVM
processes, whose value can be either statically configured or set in a
hypervisor status parameter.
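
A minimal sketch of this free-memory computation, assuming a fixed
per-process overhead (the names and the overhead value are
illustrative, not actual Ganeti parameters):

.. code-block:: python

  def kvm_allocatable_memory(node_mem, host_mem_limit, running,
                             kvm_overhead=128):
      """Memory (in MiB) that can still be promised to instances.

      ``host_mem_limit`` is the memory reserved for the host OS
      (e.g. the cgroup limit); each running KVM process is charged an
      extra ``kvm_overhead`` on top of its guest memory (``running``
      maps instance name to guest memory).
      """
      overhead = len(running) * kvm_overhead
      used = sum(running.values()) + overhead
      return node_mem - host_mem_limit - used

  # 64 GiB node, 2 GiB for the host, two 8 GiB instances running:
  print(kvm_allocatable_memory(65536, 2048,
                               {"inst1": 8192, "inst2": 8192}))
  # -> 46848 (46 GiB minus 256 MiB of KVM overhead)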

NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for
every instance is proportionate to the number of VCPUs, NUMA shouldn't
be a problem, as the hypervisors allocate memory in the appropriate
NUMA node. Work is in progress in Xen and the Linux kernel to always
allocate memory correctly even without pinning. Therefore, we don't
need to address this problem specifically; it will be solved by future
versions of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of
specifications, defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently from each
other. We extend the policy by allowing it to contain multiple
occurrences of these minimum/maximum specification pairs for the
instance resources. Each specification pair (minimum and maximum) has a
unique priority associated with it (or, in other words, specifications
are ordered), which is used by ``hspace`` (see below). The standard
specification doesn't change: there is one for the whole cluster.

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk
would be illegal.
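
A minimal sketch of this check, assuming an instance is legal if it
fits at least one specification pair (the data layout is illustrative,
not the actual policy representation):

.. code-block:: python

  # One (min, max) pair per specification, ordered by priority.
  POLICY = [
      ({"cpu": 1, "mem": 2048, "disk": 10240},
       {"cpu": 2, "mem": 2048, "disk": 409600}),
      ({"cpu": 4, "mem": 4096, "disk": 10240},
       {"cpu": 4, "mem": 4096, "disk": 819200}),
  ]

  def instance_is_legal(instance, policy=POLICY):
      """An instance is legal if every resource lies within the minimum
      and maximum of at least one specification pair."""
      return any(all(mins[r] <= instance[r] <= maxs[r] for r in mins)
                 for mins, maxs in policy)

  print(instance_is_legal({"cpu": 1, "mem": 2048, "disk": 51200}))  # True
  print(instance_is_legal({"cpu": 2, "mem": 4096, "disk": 40960}))  # False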

Ganeti will refuse to create (or modify) instances that violate
instance policy constraints, unless the flag ``--ignore-ipolicy`` is
passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments for tiered
allocation. ``hspace`` will start to allocate instances using the
maximum specification with the highest priority, then it will try to
lower the most constrained resources (without breaking the policy)
before moving to the second highest priority, and so on.

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti won't check or enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of spindle size, so it's
been moved after constrained instance sizes. If it turns out that it's
easier to implement dedicated disks with spindles as a resource, then
we will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of
an instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.
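
For instance, allocation could refuse to place a new virtual interface
on a node whose physical interfaces have no spare bandwidth; a minimal
sketch, with illustrative names and units:

.. code-block:: python

  def bandwidth_fits(nic_bandwidth, phys_nics, allocations):
      """Check whether a virtual NIC needing ``nic_bandwidth`` (Mbps)
      fits on one of the node's physical interfaces.

      ``phys_nics`` maps interface name to maximum bandwidth;
      ``allocations`` maps interface name to bandwidth already promised
      to instance NICs.
      """
      return any(allocations.get(nic, 0) + nic_bandwidth <= capacity
                 for nic, capacity in phys_nics.items())

  print(bandwidth_fits(200, {"eth0": 1000, "eth1": 1000},
                       {"eth0": 900, "eth1": 700}))  # -> True (eth1 fits)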

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for example
via openvswitch or normal QoS for bridging or routing. The bandwidth
resource represents the average bandwidth usage, so a few new back-end
parameters are needed to configure how to deal with bursts (they depend
on the actual way used to enforce the limit).
252

    
253
CPU pinning
254
-----------
255

    
256
In order to avoid unwarranted migrations between CPUs and to deal with
257
NUMA effectively we may need CPU pinning. CPU scheduling is a complex
258
topic and still under active development in Xen and the Linux kernel, so
259
we wont' try to outsmart their developers. If we need pinning it's more
260
to have predictable performance than to get the maximum performance
261
(which is best done by the hypervisor), so we'll implement a very simple
262
algorithm that allocates CPUs when an instance is assigned to a node
263
(either when it's created or when it's moved) and takes into account
264
NUMA and maybe CPU multithreading. A more refined version might run also
265
when an instance is deleted, but that would involve reassigning CPUs,
266
which could be bad with NUMA.
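
One possible shape of such a simple allocator, choosing free CPUs from
a single NUMA node when possible (purely illustrative; the data
structures and the exact strategy are assumptions, not the final
algorithm):

.. code-block:: python

  def pick_cpus(numa_topology, used_cpus, vcpus):
      """Pick ``vcpus`` host CPUs for a new instance, preferring CPUs
      that all belong to the same NUMA node.

      ``numa_topology`` maps NUMA node id to its CPU ids; ``used_cpus``
      is the set of CPUs already pinned to other instances.
      """
      for node, cpus in sorted(numa_topology.items()):
          free = [c for c in cpus if c not in used_cpus]
          if len(free) >= vcpus:
              return free[:vcpus]
      return None  # no single NUMA node can host the instance

  topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
  print(pick_cpus(topology, used_cpus={0, 1, 2}, vcpus=2))  # -> [4, 5]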

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because CPU usage
is normally well below 100% on average. There are ways to share memory
pages (e.g. KSM, transcendent memory) and disk blocks, so we could add
new parameters to overcommit memory and disks, similar to
``vcpu_ratio``.
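
Such parameters would act like ``vcpu_ratio`` does for CPUs: the
capacity a node advertises is simply scaled up. A minimal sketch (the
parameter is hypothetical):

.. code-block:: python

  def overcommitted_capacity(physical, ratio):
      """Capacity advertised to the allocator when overcommitting a
      resource (RAM, disk blocks) by ``ratio``, in the same spirit as
      ``vcpu_ratio``.
      """
      return int(physical * ratio)

  # 64 GiB of RAM overcommitted at 1.5 is reported as 96 GiB:
  print(overcommitted_capacity(65536, 1.5))  # -> 98304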

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: