==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do completely "best effort"
sharing, it is much harder to reserve resources entirely for the use
of a particular instance. In particular, this has to be done manually
for CPUs and disk, is implemented for RAM under Xen but not under KVM,
and there is no provision for network-level QoS.
16 |
|
17 |
Proposed changes |
18 |
================ |
19 |
|
20 |
We want to make it easy to partition a node between machines with |
21 |
exclusive use of hardware resources. While some sharing will anyway need |
22 |
to happen (e.g. for operations that use the host domain, or use |
23 |
resources, like buses, which are unique or very scarce on host systems) |
24 |
we'll strive to maintain contention at a minimum, but won't try to avoid |
25 |
all possible sources of it. |
26 |
|
27 |
Exclusive use of disks |
28 |
---------------------- |
29 |
|
30 |
``exclusive_storage`` is a new node parameter. When it's enabled, Ganeti |
31 |
will allocate entire disks to instances. Though it's possible to think |
32 |
of ways of doing something similar for other storage back-ends, this |
33 |
design targets only ``plain`` and ``drbd``. The name is generic enough |
34 |
in case the feature will be extended to other back-ends. The flag value |
35 |
should be homogeneous within a node-group; ``cluster-verify`` will report |
36 |
any violation of this condition. |
37 |
|
38 |
Ganeti will consider each physical volume in the destination volume |
39 |
group as a host disk (for proper isolation, an administrator should |
40 |
make sure that there aren't multiple PVs on the same physical |
41 |
disk). When ``exclusive_storage`` is enabled in a node group, all PVs |
42 |
in the node group must have the same size (within a certain margin, say |
43 |
1%, defined through a new parameter). Ganeti will check this condition |
44 |
when the ``exclusive_storage`` flag is set, whenever a new node is added |
45 |
and as part of ``cluster-verify``. |
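
As an illustration, the homogeneity check could take the shape below
(a minimal sketch; the function name and the way the margin parameter
reaches it are hypothetical):

.. code-block:: python

  def PvSizesHomogeneous(pv_sizes, margin=0.01):
    """Check that all PV sizes are within ``margin`` of each other.

    ``pv_sizes`` are the PV sizes (e.g. in mebibytes) as reported by
    the LVM layer; ``margin`` is the allowed relative spread (1%).
    """
    if not pv_sizes:
      return True
    smallest = min(pv_sizes)
    return max(pv_sizes) - smallest <= smallest * margin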
46 |
|
47 |
When creating a new disk for an instance, Ganeti will allocate the |
48 |
minimum number of PVs to hold the disk, and those PVs will be excluded |
49 |
from the pool of available PVs for further disk creations. The |
50 |
underlying LV will be striped, when striping is allowed by the current |
51 |
configuration. Ganeti will continue to track only the LVs, and query the |
52 |
LVM layer to figure out which PVs are available and how much space is |
53 |
free. Yet, creation, disk growing, and free-space reporting will ignore |
54 |
any partially allocated PVs, so that PVs won't be shared between |
55 |
instance disks. |
56 |
|
57 |
For compatibility with the DRBD template and to take into account disk |
58 |
variability, Ganeti will always subtract 2% (this will be a parameter) |
59 |
from the PV space when calculating how many PVs are needed to allocate |
60 |
an instance and when nodes report free space. |
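
Under these assumptions, the number of PVs needed for a given disk can
be computed as in this sketch (the 2% reserve corresponds to the
parameter mentioned above; names are illustrative):

.. code-block:: python

  import math

  def PvsNeeded(disk_size, pv_size, reserve=0.02):
    """Number of whole PVs needed to hold a disk of ``disk_size``.

    A fraction ``reserve`` of each PV is set aside for DRBD metadata
    and for small size differences between the PVs.
    """
    usable = pv_size * (1.0 - reserve)
    return int(math.ceil(float(disk_size) / usable))

For example, with 100 GiB PVs and the default reserve, a 300 GiB disk
needs ``ceil(300 / 98) = 4`` PVs.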
61 |
|
62 |
The obvious target for this option is plain disk template, which doesn't |
63 |
provide redundancy. An administrator can still provide resilience |
64 |
against disk failures by setting up RAID under PVs, but this is |
65 |
transparent to Ganeti. |
66 |
|
67 |
Spindles as a resource |
68 |
~~~~~~~~~~~~~~~~~~~~~~ |
69 |
|
70 |
When resources are dedicated and there are more spindles than instances |
71 |
on a node, it is natural to assign more spindles to instances than what |
72 |
is strictly needed. For this reason, we introduce a new resource: |
73 |
spindles. A spindle is a PV in LVM. The number of spindles required for |
74 |
a disk of an instance is specified together with the size. Specifying |
75 |
the number of spindles is possible only when ``exclusive_storage`` is |
76 |
enabled. It is an error to specify a number of spindles insufficient to |
77 |
contain the requested disk size. |
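
Requests can then be validated against the same computation as above
(again a sketch with hypothetical names, reusing ``PvsNeeded``):

.. code-block:: python

  def CheckSpindles(disk_size, spindles, pv_size):
    """Raise an error if ``spindles`` PVs cannot hold the disk."""
    needed = PvsNeeded(disk_size, pv_size)
    if spindles < needed:
      raise ValueError("%d spindles requested, but %d are needed to"
                       " hold a disk of %d MiB" %
                       (spindles, needed, disk_size))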
78 |
|
79 |
When ``exclusive_storage`` is not enabled, spindles are not used in free |
80 |
space calculation, in allocation algorithms, and policies. When it's |
81 |
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead |
82 |
of disk size for their computation. For each node, the number of all the |
83 |
spindles in every LVM group is recorded, and different LVM groups are |
84 |
accounted separately in allocation and balancing. |
85 |
|
86 |
There is already a concept of spindles in Ganeti. It's not related to |
87 |
any actual spindle or volume count, but it's used in ``spindle_use`` to |
88 |
measure the pressure of an instance on the storage system and in |
89 |
``spindle_ratio`` to balance the I/O load on the nodes. When |
90 |
``exclusive_storage`` is enabled, these parameters as currently defined |
91 |
won't make any sense, so their meaning will be changed in this way: |
92 |
|
93 |
- ``spindle_use`` refers to the resource, hence to the actual spindles |
94 |
(PVs in LVM), used by an instance. The values specified in the instance |
95 |
policy specifications are compared to the run-time numbers of spindle |
96 |
used by an instance. The ``spindle_use`` back-end parameter will be |
97 |
ignored. |
98 |
- ``spindle_ratio`` in instance policies and ``spindle_count`` in node |
99 |
parameters are ignored, as the exclusive assignment of PVs already |
100 |
implies a value of 1.0 for the first, and the second is replaced by |
101 |
the actual number of spindles. |
102 |
|
103 |
When ``exclusive_storage`` is disabled, the existing spindle parameters |
104 |
behave as before. |
105 |
|
106 |
Dedicated CPUs |
107 |
-------------- |
108 |
|
109 |
``vpcu_ratio`` can be used to tie the number of VCPUs to the number of |
110 |
CPUs provided by the hardware. We need to take into account the CPU |
111 |
usage of the hypervisor. For Xen, this means counting the number of |
112 |
VCPUs assigned to ``Domain-0``. |
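
For illustration, the CPU capacity available for instances could be
derived as follows (a sketch; how the ``Domain-0`` VCPU count is
obtained is left out):

.. code-block:: python

  def MaxInstanceVcpus(physical_cpus, dom0_vcpus, vcpu_ratio):
    """Upper bound on the VCPUs assignable to instances on a node.

    The VCPUs used by the hypervisor (``Domain-0`` under Xen) are
    subtracted from the physical CPU count before applying
    ``vcpu_ratio``; with dedicated CPUs the ratio would be 1.0.
    """
    return int((physical_cpus - dom0_vcpus) * vcpu_ratio)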

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to
use some of the CPUs, leaving the others to instances and KVM
processes. For KVM, the number of CPUs for the host system should also
be a hypervisor parameter (set at the node group level).
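
As an example of the ``cgroups`` approach, confining the node OS to
the first two CPUs could look roughly like this (a sketch based on the
cgroup-v1 ``cpuset`` controller; mount point and group name depend on
the system):

.. code-block:: python

  import os

  CPUSET_ROOT = "/sys/fs/cgroup/cpuset"  # assumed v1 mount point

  def RestrictHostCpus(cpus="0-1", group="host"):
    """Confine tasks placed in ``group`` to the given CPUs.

    KVM processes and instances would go into separate cpusets
    covering the remaining CPUs.
    """
    path = os.path.join(CPUSET_ROOT, group)
    if not os.path.isdir(path):
      os.mkdir(path)
    # A cpuset needs both a CPU list and a memory node list before
    # tasks can be attached to it
    with open(os.path.join(path, "cpuset.cpus"), "w") as f:
      f.write(cpus)
    with open(os.path.join(path, "cpuset.mems"), "w") as f:
      f.write("0")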

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but
it is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC, memory is fully shared between the host system and
all the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of
the instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used
by the host OS. This requires finishing the implementation of the
memory hypervisor status (set at the node group level) that changes
how free memory is computed under KVM systems. Then we have to add a
way to enforce this limit on the host system itself, rather than
leaving it as a calculation tool only.

Another problem for KVM is that we need to decide about the size of
the cgroup versus the size of the VM: some overhead will exist in
particular because an instance and its encapsulating KVM process share
the same space. For KVM systems the physical memory allocatable to
instances should be computed by subtracting an overhead for the KVM
processes, whose value can be either statically configured or set in a
hypervisor status parameter.
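
Under these assumptions, the memory still allocatable to new instances
on a KVM node would be computed roughly as follows (a sketch;
parameter names are illustrative):

.. code-block:: python

  def AllocatableMemory(node_memory, host_limit, instance_mems,
                        kvm_overhead):
    """Memory (in MiB) still allocatable to new instances.

    ``host_limit`` is the cgroup memory limit for the host OS, and
    ``kvm_overhead`` the per-instance overhead of the KVM process.
    """
    used = sum(mem + kvm_overhead for mem in instance_mems)
    return node_memory - host_limit - used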

NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for
every instance is proportionate to the number of VCPUs, NUMA shouldn't
be a problem, as the hypervisors allocate memory in the appropriate
NUMA node. Work is in progress in Xen and the Linux kernel to always
allocate memory correctly even without pinning. Therefore, we don't
need to address this problem specifically; it will be solved by future
versions of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of
specifications, defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and
spindles and any other resource, when implemented), independently from
each other. We extend the policy by allowing it to contain multiple
occurrences of these specifications, i.e. several pairs of limits for
the instance resources. Each specification pair (minimum and maximum)
has a unique priority associated with it (in other words,
specifications are ordered), which is used by ``hspace`` (see
below). The standard specification doesn't change: there is one for
the whole cluster.
182 |
|
183 |
For example, a policy could be set up to allow instances with this |
184 |
constraints: |
185 |
|
186 |
- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of |
187 |
disk space; |
188 |
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space. |
189 |
|
190 |
Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be |
191 |
legal, as an instance using 4 CPUs, 4 GB of RAM, and 20 GB of disk, |
192 |
while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk would be |
193 |
illegal. |
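
A sketch of the corresponding check, encoding the example policy above
(the data layout is illustrative, not the actual policy format):

.. code-block:: python

  EXAMPLE_SPECS = [
    # (min, max) pairs, listed in decreasing priority; units are
    # CPUs, GB of RAM, GB of disk
    {"min": {"cpu": 4, "mem": 4, "disk": 10},
     "max": {"cpu": 4, "mem": 4, "disk": 800}},
    {"min": {"cpu": 1, "mem": 2, "disk": 10},
     "max": {"cpu": 2, "mem": 2, "disk": 400}},
  ]

  def InstanceIsLegal(instance, specs=EXAMPLE_SPECS):
    """An instance is legal if it fits at least one spec pair."""
    return any(all(s["min"][r] <= instance[r] <= s["max"][r]
                   for r in ("cpu", "mem", "disk"))
               for s in specs)

  assert InstanceIsLegal({"cpu": 1, "mem": 2, "disk": 50})
  assert InstanceIsLegal({"cpu": 4, "mem": 4, "disk": 20})
  assert not InstanceIsLegal({"cpu": 2, "mem": 4, "disk": 40})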

Ganeti will refuse to create (or modify) instances that violate
instance policy constraints, unless the flag ``--ignore-ipolicy`` is
passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments for tiered
allocation. ``hspace`` will start to allocate instances using the
maximum specification with the highest priority, then it will try to
lower the most constrained resources (without breaking the policy)
before moving to the second highest priority, and so on.

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti won't check or enforce these constraints, though.
209 |
|
210 |
Implementation order |
211 |
==================== |
212 |
|
213 |
We will implement this design in the following order: |
214 |
|
215 |
- Exclusive use of disks (without spindles as a resource) |
216 |
- Constrained instance sizes |
217 |
- Spindles as a resource |
218 |
- Dedicated CPU and memory |

In this way we will always have new features that are immediately
useful. Spindles as a resource are not needed for correct capacity
calculation, as long as the allowed disk sizes are multiples of the
spindle size, so that item has been moved after constrained instance
sizes. If it turns out that it's easier to implement dedicated disks
with spindles as a resource, then we will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current
design. They may require their own design document, and must be
re-evaluated when considered for implementation, as Ganeti and the
hypervisors may change substantially in the meantime.
234 |
|
235 |
Network bandwidth |
236 |
----------------- |
237 |
|
238 |
A new resource is introduced: network bandwidth. An administrator must |
239 |
be able to assign some network bandwidth to the virtual interfaces of an |
240 |
instance, and set limits in instance policies. Also, a list of the |
241 |
physical network interfaces available for Ganeti use and their maximum |
242 |
bandwidth must be kept at node-group or node level. This information |
243 |
will be taken into account for allocation, balancing, and free-space |
244 |
calculation. |

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for
example via openvswitch or normal QoS for bridging or routing. The
bandwidth resource represents the average bandwidth usage, so a few
new back-end parameters are needed to configure how to deal with
bursts (they depend on the actual way used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively, we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel,
so we won't try to outsmart their developers. If we need pinning, it's
more to have predictable performance than to get the maximum
performance (which is best done by the hypervisor), so we'll implement
a very simple algorithm that allocates CPUs when an instance is
assigned to a node (either when it's created or when it's moved) and
takes into account NUMA and maybe CPU multithreading. A more refined
version might also run when an instance is deleted, but that would
involve reassigning CPUs, which could be bad with NUMA.
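
One possible shape of such an algorithm is sketched below (it only
illustrates the NUMA-aware placement, not multithreading):

.. code-block:: python

  def AllocateCpus(free_cpus_by_node, vcpus):
    """Pick ``vcpus`` host CPUs, preferring a single NUMA node.

    ``free_cpus_by_node`` maps each NUMA node to its unassigned
    CPUs; when no single node has enough free CPUs, the instance
    spills over several nodes.
    """
    # Among the nodes that can hold the instance, pick the fullest
    # one, keeping larger contiguous blocks free for later
    fitting = [(len(cpus), node)
               for node, cpus in free_cpus_by_node.items()
               if len(cpus) >= vcpus]
    if fitting:
      _, node = min(fitting)
      return free_cpus_by_node[node][:vcpus]
    chosen = []
    for cpus in free_cpus_by_node.values():
      chosen.extend(cpus[:vcpus - len(chosen)])
      if len(chosen) == vcpus:
        break
    return chosen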

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because CPU usage
is normally well below 100% on average. There are ways to share memory
pages (e.g. KSM, transcendent memory) and disk blocks, so we could add
new parameters to overcommit memory and disks, similar to
``vcpu_ratio``.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: