==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do completely "best effort"
sharing, it's much harder to completely reserve resources for the use
of a particular instance. In particular this has to be done manually
for CPUs and disk, is implemented for RAM under Xen, but not under
KVM, and there's no provision for network level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. While some sharing will anyway need
to happen (e.g. for operations that use the host domain, or use
resources, like buses, which are unique or very scarce on host systems)
we'll strive to keep contention to a minimum, but won't try to avoid
all possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a configuration flag at node-group and cluster
level. When it's enabled, Ganeti will allocate entire disks to
instances. Though it's possible to think of ways of doing something
similar for other storage back-ends, this design targets only ``plain``
and ``drbd``. The name is generic enough in case the feature is later
extended to other back-ends.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is
added, and as part of ``cluster-verify``.
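
A minimal sketch of such a uniformity check (the helper name and the
fractional ``margin`` parameter are illustrative, not actual Ganeti
code):

.. code-block:: python

  def check_pv_sizes(pv_sizes, margin=0.01):
      """Check that all PV sizes are within `margin` of the smallest.

      pv_sizes: PV sizes (e.g. in MiB) collected from the node group;
      margin: allowed relative spread (1% by default).
      """
      if not pv_sizes:
          return True
      return max(pv_sizes) <= min(pv_sizes) * (1 + margin)

  # 1024 MiB and 1030 MiB PVs are within 1%; a 1200 MiB PV is not.
  assert check_pv_sizes([1024, 1030])
  assert not check_pv_sizes([1024, 1030, 1200])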

When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs by marking them as unallocatable; in this
way, PVs won't be shared between instance disks, and any remaining space
won't be used by mistake for anything else. The underlying LV will be
striped, when striping is allowed by the current configuration. Ganeti
will continue to track only the LVs, and query the LVM layer to figure
out which PVs are available and how much space is free.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.
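
As a rough illustration, assuming the 2% reserve is expressed as a
fraction of each PV, the number of PVs needed for a disk could be
computed as follows (hypothetical helper, not the actual code):

.. code-block:: python

  import math

  def pvs_needed(disk_size, pv_size, reserve=0.02):
      """Number of whole PVs needed to hold a disk of `disk_size`.

      Only (1 - reserve) of each PV is considered usable, to leave
      room for DRBD metadata and small variations in PV size.
      """
      usable = pv_size * (1.0 - reserve)
      return int(math.ceil(disk_size / usable))

  # A 500 GiB disk on ~280 GiB PVs needs two of them.
  print(pvs_needed(500, 280))  # -> 2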

The obvious target for this option is the plain disk template, which
doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under PVs, but this
is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than what
is strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.
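
The corresponding validation could look roughly like this (again only
an illustration of the rule, with made-up names):

.. code-block:: python

  import math

  def check_spindles(spindles, disk_size, pv_size, reserve=0.02):
      """Reject a disk whose spindle count cannot hold its size."""
      usable = pv_size * (1.0 - reserve)  # same 2% reserve as above
      needed = int(math.ceil(disk_size / usable))
      if spindles < needed:
          raise ValueError("%d spindles cannot hold a %d GiB disk" %
                           (spindles, disk_size))

  check_spindles(2, 500, 280)    # fine: two ~280 GiB PVs hold 500 GiB
  # check_spindles(1, 500, 280)  # would raise ValueError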

When ``exclusive_storage`` is not enabled, spindles are not used in free
space calculation, in allocation algorithms, or in policies. When it's
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead
of disk size for their computation. For each node, the number of all the
spindles in every LVM group is recorded, and different LVM groups are
accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. These two
parameters will be renamed to ``storage_io_use`` and
``storage_io_ratio`` to better reflect their meaning. When
``exclusive_storage`` is enabled, these parameters are ignored, as
balancing the use of storage I/O is already addressed by the exclusive
assignment of PVs.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.
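
For example, the VCPU capacity available to instances on a Xen node
could be derived as follows (a sketch; the function and parameter
names are illustrative):

.. code-block:: python

  def instance_vcpu_capacity(node_cpus, dom0_vcpus, vcpu_ratio):
      """VCPUs that can be handed out to instances on a Xen node.

      The VCPUs reserved for Domain-0 are subtracted from the
      physical CPU count before applying ``vcpu_ratio``.
      """
      return (node_cpus - dom0_vcpus) * vcpu_ratio

  # A 16-core node with one VCPU for Domain-0 and vcpu_ratio=4.0
  print(instance_vcpu_capacity(16, 1, 4.0))  # -> 60.0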

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to a
subset of the CPUs, leaving the others to instances and KVM processes.
For KVM, the number of CPUs for the host system should also be a
hypervisor parameter (set at the node group level).
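
As an illustration of the cgroups approach, the node OS could be
confined to a few CPUs roughly like this (cgroup v1 ``cpuset``
controller; the group path and CPU list are made up):

.. code-block:: python

  import os

  def confine_host_os(cgroup="/sys/fs/cgroup/cpuset/host",
                      cpus="0-1", mems="0"):
      """Restrict tasks in the `host` cpuset to the given CPUs.

      Instances and their KVM processes would live in other cpusets
      using the remaining CPUs.
      """
      os.makedirs(cgroup, exist_ok=True)
      with open(os.path.join(cgroup, "cpuset.cpus"), "w") as f:
          f.write(cpus)
      with open(os.path.join(cgroup, "cpuset.mems"), "w") as f:
          f.write(mems)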

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but it
is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc) are not in use.

KVM
~~~

Under KVM or LXC, memory is fully shared between the host system and all
the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used by
the host OS. This requires finishing the implementation of the memory
hypervisor status (set at the node group level) that changes how free
memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide about the size of the
cgroup versus the size of the VM: in particular, some overhead will
exist, because an instance and its encapsulating KVM process share the
same space. For KVM systems the physical memory allocatable to
instances should be computed by subtracting an overhead for the KVM
processes, whose value can be either statically configured or set in a
hypervisor status parameter.
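
A possible way to compute the memory a KVM node can still hand out to
instances, assuming a per-instance overhead and a host reservation
taken from such (hypothetical) parameters:

.. code-block:: python

  def allocatable_memory(node_mem, host_reserved, instance_mems,
                         kvm_overhead):
      """Memory (in MiB) still available for new instances on a node.

      node_mem: physical memory of the node
      host_reserved: memory set aside for the host OS (e.g. cgroups)
      instance_mems: memory sizes of the instances already present
      kvm_overhead: estimated overhead of each KVM process
      """
      used = sum(mem + kvm_overhead for mem in instance_mems)
      return node_mem - host_reserved - used

  # 64 GiB node, 4 GiB for the host, two 8 GiB instances,
  # 512 MiB of overhead each:
  print(allocatable_memory(65536, 4096, [8192, 8192], 512))  # -> 44032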

NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for every
instance is proportionate to the number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always allocate
memory correctly even without pinning. Therefore, we don't need to
address this problem specifically; it will be solved by future versions
of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of specifications,
defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently of each other.
We extend the policy by allowing it to hold multiple specifications,
where each specification contains the limits (minimum, maximum, and
standard) for all the resources. Each specification has a unique
priority (an integer) associated to it, which is used by ``hspace`` (see
below).

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk
would be illegal.
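
A sketch of the legality check for this example policy (the data
layout and field names are purely illustrative, and the specification
priorities are omitted, as they don't matter for the check):

.. code-block:: python

  SPECS = [
      {"cpu": (1, 2), "ram": (2, 2), "disk": (10, 400)},
      {"cpu": (4, 4), "ram": (4, 4), "disk": (10, 800)},
  ]

  def is_legal(cpu, ram, disk, specs=SPECS):
      """True if the instance matches at least one specification."""
      def fits(value, bounds):
          return bounds[0] <= value <= bounds[1]
      return any(fits(cpu, s["cpu"]) and fits(ram, s["ram"]) and
                 fits(disk, s["disk"]) for s in specs)

  print(is_legal(1, 2, 50))  # True  (first specification)
  print(is_legal(4, 4, 20))  # True  (second specification)
  print(is_legal(2, 4, 40))  # False (matches neither)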

Ganeti will refuse to create (or modify) instances that violate instance
policy constraints, unless the flag ``--ignore-ipolicy`` is passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments. For both
standard and tiered allocation, ``hspace`` will start to allocate
instances using the specification with the highest priority, then it
will fall back to the second highest priority, and so on. For tiered
allocation, it will try to lower the most constrained resources (without
breaking the policy) before going to the next specification.
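
In outline, the adjusted loop could look like this (``try_allocate``
stands in for the existing allocation step and is purely a
placeholder):

.. code-block:: python

  def allocate_by_priority(specs, try_allocate):
      """Run the allocation step for each spec, highest priority first.

      Returns the number of instances placed for each priority.
      """
      placed = {}
      for spec in sorted(specs, key=lambda s: s["priority"],
                         reverse=True):
          placed[spec["priority"]] = try_allocate(spec)
      return placed

  # Toy run: pretend 3 big and 10 small instances fit.
  print(allocate_by_priority(
      [{"priority": 1, "cpu": (1, 2)}, {"priority": 2, "cpu": (4, 4)}],
      lambda spec: 3 if spec["priority"] == 2 else 10))
  # -> {2: 3, 1: 10}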

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti won't check nor enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of the spindle size, so it
has been moved after constrained instance sizes. If it turns out that
it's easier to implement dedicated disks with spindles as a resource,
then we will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of an
instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.
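
A very rough sketch of the per-node accounting this implies (all names
are hypothetical):

.. code-block:: python

  def free_bandwidth(nic_capacities, nic_assignments):
      """Remaining bandwidth (e.g. in Mbit/s) per physical NIC.

      nic_capacities: {nic_name: maximum bandwidth}
      nic_assignments: {nic_name: [bandwidth of each virtual NIC]}
      """
      return dict((nic, cap - sum(nic_assignments.get(nic, [])))
                  for nic, cap in nic_capacities.items())

  print(free_bandwidth({"eth0": 10000, "eth1": 10000},
                       {"eth0": [1000, 2500]}))
  # -> {'eth0': 6500, 'eth1': 10000}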

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for example
via openvswitch or normal QoS for bridging or routing. The bandwidth
resource represents the average bandwidth usage, so a few new back-end
parameters are needed to configure how to deal with bursts (they depend
on the actual way used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel, so
we won't try to outsmart their developers. If we need pinning it's more
to have predictable performance than to get the maximum performance
(which is best done by the hypervisor), so we'll implement a very simple
algorithm that allocates CPUs when an instance is assigned to a node
(either when it's created or when it's moved) and takes into account
NUMA and maybe CPU multithreading. A more refined version might also run
when an instance is deleted, but that would involve reassigning CPUs,
which could be bad with NUMA.
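
A simple greedy allocation of this kind might look like the following
sketch (the NUMA topology and data layout are made up, and
multithreading is ignored):

.. code-block:: python

  def pin_instance(numa_free_cpus, vcpus):
      """Pick `vcpus` physical CPUs for a new instance.

      numa_free_cpus: {numa_node: set of free CPU ids}; the chosen
      CPUs are removed from it. The instance is kept on a single NUMA
      node whenever one has enough free CPUs.
      """
      nodes = sorted(numa_free_cpus.items(),
                     key=lambda item: len(item[1]), reverse=True)
      for node, cpus in nodes:
          if len(cpus) >= vcpus:
              chosen = sorted(cpus)[:vcpus]
              numa_free_cpus[node] = cpus - set(chosen)
              return chosen
      raise RuntimeError("no NUMA node has %d free CPUs" % vcpus)

  free = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}
  print(pin_instance(free, 2))  # -> [0, 1]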

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because CPU usage
is normally well below 100% on average. There are ways to share memory
pages (e.g. KSM, transcendent memory) and disk blocks, so we could add
new parameters to overcommit memory and disks, similar to
``vcpu_ratio``.
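
By analogy with ``vcpu_ratio``, such parameters would simply scale the
capacity reported to the allocator (names are hypothetical):

.. code-block:: python

  def overcommitted_capacity(physical, ratio):
      """Capacity reported when overcommit is allowed."""
      return physical * ratio

  # 128 GiB of RAM with a 1.5x memory overcommit ratio
  print(overcommitted_capacity(128, 1.5))  # -> 192.0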

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: