==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do completely "best effort"
sharing, it is much harder to completely reserve resources for the use
of a particular instance. In particular, this has to be done manually
for CPUs and disk, is implemented for RAM under Xen but not under KVM,
and there's no provision for network-level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. While some sharing will need to
happen anyway (e.g. for operations that use the host domain, or that
use resources, like buses, which are unique or very scarce on host
systems), we'll strive to keep contention to a minimum, but won't try
to avoid all possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a configuration flag at node-group and cluster
level. When it's enabled, Ganeti will allocate entire disks to
instances. Though it's possible to think of ways of doing something
similar for other storage back-ends, this design targets only ``plain``
and ``drbd``. The name is generic enough in case the feature is ever
extended to other back-ends.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is
added, and as part of ``cluster-verify``.
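
As a rough illustration only (``check_pv_sizes`` and the ``margin``
parameter are hypothetical names, not existing Ganeti code), the
uniformity check could look like this::

  def check_pv_sizes(pv_sizes_mib, margin=0.01):
      """Check that all PVs in a node group have the same size.

      pv_sizes_mib lists the PV sizes in MiB as reported by the LVM
      layer; margin models the new 1% tolerance parameter. Returns
      the sizes violating the tolerance (empty if the check passes).
      """
      if not pv_sizes_mib:
          return []
      reference = min(pv_sizes_mib)
      return [size for size in pv_sizes_mib
              if abs(size - reference) > reference * margin]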

When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs by marking them as unallocatable; in
this way, PVs won't be shared between instance disks, and any remaining
space won't be used by mistake for anything else. The underlying LV
will be striped, when striping is allowed by the current configuration.
Ganeti will continue to track only the LVs, and query the LVM layer to
figure out which PVs are available and how much space is free.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.
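
The resulting PV-count computation can be sketched as follows
(``pvs_needed`` is a hypothetical helper and ``reserve`` stands for
the 2% parameter mentioned above)::

  import math

  def pvs_needed(disk_size_mib, pv_size_mib, reserve=0.02):
      """Return how many whole PVs are needed to hold a disk.

      reserve models the 2% subtracted from each PV for DRBD
      compatibility and disk size variability.
      """
      usable_per_pv = pv_size_mib * (1.0 - reserve)
      return int(math.ceil(disk_size_mib / usable_per_pv))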

The obvious target for this option is the ``plain`` disk template,
which doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under the PVs, but
this is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than
strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.
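
The validation can be sketched as follows, reusing the hypothetical
``pvs_needed`` helper from the previous example (the error type is
illustrative only)::

  def check_disk_spec(disk_size_mib, spindles, pv_size_mib):
      """Reject a disk whose spindles cannot hold its size."""
      needed = pvs_needed(disk_size_mib, pv_size_mib)
      if spindles < needed:
          raise ValueError("%d spindle(s) cannot hold a %d MiB disk;"
                           " at least %d needed" %
                           (spindles, disk_size_mib, needed))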

When ``exclusive_storage`` is not enabled, spindles are not used in
free-space calculation, allocation algorithms, or policies. When it's
enabled, ``hspace``, ``hbal``, and the allocators will use spindles
instead of disk size for their computations. For each node, the number
of spindles in every LVM volume group is recorded, and different volume
groups are accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. These two
parameters will be renamed to ``storage_io_use`` and
``storage_io_ratio`` to better reflect their meaning. When
``exclusive_storage`` is enabled, these parameters are ignored, as
balancing the use of storage I/O is already addressed by the exclusive
assignment of PVs.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor itself. For Xen, this means counting the number
of VCPUs assigned to ``Domain-0``.
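
One possible way to account for ``Domain-0`` in the capacity
calculation (the function and parameter names below are made up for
this sketch)::

  def max_instance_vcpus(node_cpus, dom0_vcpus, vcpu_ratio=1.0):
      """VCPUs still assignable to instances on a Xen node.

      Domain-0's VCPUs are counted as hypervisor usage; with a
      vcpu_ratio of 1.0 the remaining CPUs are dedicated to
      instances.
      """
      return max(int(node_cpus * vcpu_ratio) - dom0_vcpus, 0)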

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to a
subset of the CPUs, leaving the other ones to instances and KVM
processes. For KVM, the number of CPUs for the host system should also
be a hypervisor parameter (set at the node group level).
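
For example, the restriction could be implemented with a ``cpuset``
cgroup; the sketch below assumes a cgroup-v1 hierarchy mounted in the
usual location and uses an illustrative group name::

  import os

  CPUSET_ROOT = "/sys/fs/cgroup/cpuset"  # cgroup-v1 mount point

  def restrict_host_os_cpus(host_cpus):
      """Confine the node OS to the given CPUs, e.g. '0-1'."""
      group = os.path.join(CPUSET_ROOT, "host")
      if not os.path.isdir(group):
          os.mkdir(group)
      # both cpuset.cpus and cpuset.mems must be set before any task
      # can be attached to the group
      with open(os.path.join(group, "cpuset.cpus"), "w") as f:
          f.write(host_cpus)
      with open(os.path.join(group, "cpuset.mems"), "w") as f:
          f.write("0")
      # the node OS processes would then be moved into the group by
      # writing their PIDs to its "tasks" file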

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but
it is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC, memory is fully shared between the host system and
all the guests, and instances can even be swapped out by the host OS.

It's not clear whether the problem can be solved by limiting the size
of the instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used
by the host OS. This requires finishing the implementation of the
memory hypervisor status (set at the node group level) that changes how
free memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide about the size of the
cgroup versus the size of the VM: in particular, some overhead will
exist because an instance and its encapsulating KVM process share the
same memory space. For KVM systems the physical memory allocatable to
instances should be computed by subtracting an overhead for the KVM
processes, whose value can be either statically configured or set in a
hypervisor status parameter.
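
A sketch of the resulting free-memory computation (all names and the
per-process overhead model are illustrative)::

  def kvm_free_memory(node_mem_mib, host_cgroup_mib, instances_mem_mib,
                      kvm_overhead_mib):
      """Memory still allocatable to instances on a KVM node.

      host_cgroup_mib is the memory reserved for the host OS via its
      cgroup; instances_mem_mib lists the memory of the running
      instances; kvm_overhead_mib is the per-KVM-process overhead.
      """
      used = sum(mem + kvm_overhead_mib for mem in instances_mem_mib)
      return node_mem_mib - host_cgroup_mib - used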

NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used by every
instance is proportional to its number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always
allocate memory correctly even without pinning. Therefore, we don't
need to address this problem specifically; it will be solved by future
versions of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of
specifications, defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently of each
other. We extend the policy by allowing it to specify multiple
specifications, where each specification contains the limits (minimum,
maximum, and standard) for all the resources. Each specification has a
unique priority (an integer) associated with it, which is used by
``hspace`` (see below).

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk
would be illegal.
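
The check itself is simple; a sketch with an illustrative
representation of the two specifications above (not the actual
instance policy format, sizes in MiB)::

  # (priority, minimum values, maximum values)
  SPECS = [
      (1, dict(cpu=4, mem=4096, disk=10240),
          dict(cpu=4, mem=4096, disk=819200)),
      (2, dict(cpu=1, mem=2048, disk=10240),
          dict(cpu=2, mem=2048, disk=409600)),
  ]

  def spec_allows(instance, specs=SPECS):
      """True if the instance fits at least one specification."""
      return any(all(lo[k] <= instance[k] <= hi[k] for k in lo)
                 for _prio, lo, hi in specs)

  # spec_allows(dict(cpu=1, mem=2048, disk=51200))   ==> True
  # spec_allows(dict(cpu=2, mem=4096, disk=40960))   ==> False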

Ganeti will refuse to create (or modify) instances that violate
instance policy constraints, unless the flag ``--ignore-ipolicy`` is
passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments. For both
standard and tiered allocation, ``hspace`` will start to allocate
instances using the specification with the highest priority, then it
will fall back to the second-highest priority, and so on. For tiered
allocation, it will try to lower the most constrained resources
(without breaking the policy) before going to the next specification.
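
The fallback order can be sketched as follows
(``cluster.try_allocate`` is a hypothetical helper that attempts to
place one more instance of the given specification)::

  def allocate_by_priority(cluster, specs):
      """Allocate instances spec by spec, highest priority first.

      specs is a list of (priority, spec) pairs; the lowest integer
      is assumed to be the highest priority. Returns the number of
      instances placed for each priority.
      """
      placed = {}
      for prio, spec in sorted(specs, key=lambda ps: ps[0]):
          placed[prio] = 0
          while cluster.try_allocate(spec):
              placed[prio] += 1
      return placed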

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti won't check or enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of the spindle size, so
this feature has been moved after constrained instance sizes. If it
turns out that it's easier to implement dedicated disks with spindles
as a resource, then we will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of
an instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done, for example, by configuring
limits via openvswitch or normal QoS for bridging or routing. The
bandwidth resource represents the average bandwidth usage, so a few new
back-end parameters are needed to configure how to deal with bursts
(they depend on the mechanism actually used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively, we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel,
so we won't try to outsmart their developers. If we need pinning, it's
more to have predictable performance than to get the maximum
performance (which is best done by the hypervisor), so we'll implement
a very simple algorithm that allocates CPUs when an instance is
assigned to a node (either when it's created or when it's moved) and
takes into account NUMA and maybe CPU multithreading. A more refined
version might also run when an instance is deleted, but that would
involve reassigning CPUs, which could be bad with NUMA.
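
A very simple allocation along these lines could look like the sketch
below (the topology representation is made up for the example)::

  def pick_cpus(numa_topology, used_cpus, vcpus):
      """Pick the requested number of free CPUs from one NUMA node.

      numa_topology maps a NUMA node id to the list of its CPU ids;
      used_cpus is the set of CPUs already pinned to other instances.
      Returns None when no single node has enough free CPUs.
      """
      for _node, cpus in sorted(numa_topology.items()):
          free = [c for c in cpus if c not in used_cpus]
          if len(free) >= vcpus:
              return free[:vcpus]
      return None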

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because CPU usage
is normally way below 100% on average. There are ways to share memory
pages (e.g. KSM, transcendent memory) and disk blocks, so we could add
new parameters to overcommit memory and disks, similar to
``vcpu_ratio``.
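
For memory, such a parameter would play the same role as
``vcpu_ratio`` does for CPUs; the capacity calculation could look like
this (the ``memory_ratio`` name is hypothetical)::

  def allocatable_memory(node_mem_mib, memory_ratio=1.0):
      """Instance memory a node may promise under overcommitment."""
      return int(node_mem_mib * memory_ratio)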

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: