==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do a completely "best effort"
sharing, it's considerably harder to completely reserve resources for
the use of a particular instance. In particular this has to be done
manually for CPUs and disk, is implemented for RAM under Xen, but not
under KVM, and there's no provision for network level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. Some sharing will still need to
happen (e.g. for operations that use the host domain, or use resources,
like buses, which are unique or very scarce on host systems). We'll
strive to keep contention to a minimum, but won't try to avoid all
possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a configuration flag at node-group and cluster
level. When it's enabled, Ganeti will allocate entire disks to
instances. Though it's possible to think of ways of doing something
similar for other storage back-ends, this design targets only ``plain``
and ``drbd``. The name is generic enough in case the feature is
extended to other back-ends.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is added
and as part of ``cluster-verify``.

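The size check described above could look roughly like the following
sketch. The function name ``pvs_within_margin`` and its ``margin``
argument are hypothetical, and measuring the margin relative to the
smallest PV is an assumption; the design only says "within a certain
margin".

```python
def pvs_within_margin(pv_sizes, margin=0.01):
    """Return True if all PV sizes are the same within a relative margin.

    Assumption: the margin is measured against the smallest PV. The
    design does not say which size is the reference.
    """
    if not pv_sizes:
        # A node group without PVs trivially satisfies the condition.
        return True
    smallest, largest = min(pv_sizes), max(pv_sizes)
    if smallest <= 0:
        raise ValueError("PV sizes must be positive")
    return (largest - smallest) <= margin * smallest
```

Such a check would be run at the three points listed above: when the
flag is set, when a node is added, and during ``cluster-verify``.
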
When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs by marking them as unallocatable; in this
way, PVs won't be shared between instance disks, and any remaining space
won't be used by mistake for anything else. The underlying LV will be
striped, when striping is allowed by the current configuration. Ganeti
will continue to track only the LVs, and query the LVM layer to figure
out which PVs are available and how much space is free.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.

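As an illustration of the arithmetic, the sketch below computes how many
PVs a disk would need once the reserved share is subtracted. The names
``usable_pv_size``, ``pvs_needed`` and ``reserved_percent`` are
hypothetical, sizes are assumed to be integers in MiB, and integer
arithmetic is used only to keep the example free of rounding surprises.

```python
def usable_pv_size(pv_size_mib, reserved_percent=2):
    """Space of one PV that may hold instance data after the reserved cut."""
    if not 0 <= reserved_percent < 100:
        raise ValueError("reserved_percent must be in [0, 100)")
    return pv_size_mib * (100 - reserved_percent) // 100


def pvs_needed(disk_size_mib, pv_size_mib, reserved_percent=2):
    """Minimum number of whole PVs needed to hold a disk of the given size."""
    usable = usable_pv_size(pv_size_mib, reserved_percent)
    if disk_size_mib <= 0 or usable <= 0:
        raise ValueError("disk and PV sizes must be positive")
    # Ceiling division without floating point.
    return -(-disk_size_mib // usable)
```

With 100000 MiB PVs and the default 2% reservation, a 98000 MiB disk
fits on one PV, while a 98001 MiB disk already needs two.
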
The obvious target for this option is the plain disk template, which
doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under PVs, but this
is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than what
is strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.

When ``exclusive_storage`` is not enabled, spindles are not used in free
space calculation, allocation algorithms, or policies. When it's
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead
of disk size for their computation. For each node, the number of all the
spindles in every LVM group is recorded, and different LVM groups are
accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. These two
parameters will be renamed to ``storage_io_use`` and
``storage_io_ratio`` to better reflect their meaning. When
``exclusive_storage`` is enabled, such parameters are ignored, as
balancing the use of storage I/O is already addressed by the exclusive
assignment of PVs.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to use
some of the CPUs, leaving the other ones to instances and KVM processes.
For KVM, the number of CPUs for the host system should also be a
hypervisor parameter (set at the node group level).

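A minimal sketch of the resulting CPU accounting follows. The function
``allocatable_vcpus`` and its parameters are hypothetical; in
particular, subtracting the host CPUs before applying ``vcpu_ratio`` is
an assumption, since the design does not fix the exact formula.

```python
import math


def allocatable_vcpus(physical_cpus, host_cpus, vcpu_ratio=1.0):
    """VCPUs that instances may use on a node.

    host_cpus is the number of CPUs set aside for the host: the VCPUs of
    Domain-0 under Xen, or the proposed hypervisor parameter under KVM.
    Assumption: host CPUs are removed before the ratio is applied.
    """
    if not 0 <= host_cpus <= physical_cpus:
        raise ValueError("host_cpus must be between 0 and physical_cpus")
    if vcpu_ratio <= 0:
        raise ValueError("vcpu_ratio must be positive")
    return math.floor((physical_cpus - host_cpus) * vcpu_ratio)
```

With a ratio of 1.0 every VCPU is backed by its own physical CPU, which
is the dedicated case this section aims for.
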
Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but it
is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC memory is fully shared between the host system and all
the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used by
the host OS. This requires finishing the implementation of the memory
hypervisor status (set at the node group level) that changes how free
memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide about the size of the
cgroup versus the size of the VM: in particular, some overhead will
exist, because an instance and its encapsulating KVM process share the
same space. For KVM systems the physical memory allocatable to instances
should be computed by subtracting an overhead for the KVM processes,
whose value can be either statically configured or set in a hypervisor
status parameter.

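The memory accounting described above can be sketched as follows. All
names are hypothetical, and charging the KVM overhead once per instance
is an assumption: the design only says that an overhead for the KVM
processes is subtracted.

```python
def instances_fit(total_mib, host_reserved_mib, instance_sizes_mib,
                  per_instance_overhead_mib):
    """Check whether a set of instances fits in a KVM node's memory.

    host_reserved_mib models the memory limit enforced on the host OS
    (for example through a cgroup). Assumption: each KVM process costs
    a fixed overhead on top of the memory of its instance.
    """
    allocatable = total_mib - host_reserved_mib
    if allocatable < 0:
        raise ValueError("host reservation exceeds physical memory")
    needed = sum(size + per_instance_overhead_mib
                 for size in instance_sizes_mib)
    return needed <= allocatable
```

For example, on a 65536 MiB node with 4096 MiB reserved for the host and
512 MiB of overhead per instance, three 16384 MiB instances fit, but a
fourth one does not.
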
NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for every
instance is proportionate to the number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always allocate
memory correctly even without pinning. Therefore, we don't need to
address this problem specifically; it will be solved by future versions
of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of specifications,
defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently from each
other. We extend the policy by allowing it to contain multiple
specifications, where each specification contains the limits (minimum,
maximum, and standard) for all the resources. Each specification has a
unique priority (an integer) associated with it, which is used by
``hspace`` (see below).

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk
would be illegal.

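The example policy can be expressed as a small sketch in which an
instance is legal if it fits entirely inside at least one specification.
The classes ``Range`` and ``Spec`` and the function ``is_legal`` are
hypothetical and do not reflect Ganeti's actual data model; the standard
values and priorities are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: int
    high: int

    def contains(self, value):
        return self.low <= value <= self.high


@dataclass(frozen=True)
class Spec:
    cpus: Range
    memory_gb: Range
    disk_gb: Range

    def allows(self, cpus, memory_gb, disk_gb):
        # All resources must fall inside the same specification.
        return (self.cpus.contains(cpus)
                and self.memory_gb.contains(memory_gb)
                and self.disk_gb.contains(disk_gb))


def is_legal(specs, cpus, memory_gb, disk_gb):
    """An instance is legal if at least one specification allows it."""
    return any(spec.allows(cpus, memory_gb, disk_gb) for spec in specs)


# The two specifications from the example above.
POLICY = [
    Spec(Range(1, 2), Range(2, 2), Range(10, 400)),
    Spec(Range(4, 4), Range(4, 4), Range(10, 800)),
]
```

Here ``is_legal(POLICY, 2, 4, 40)`` is false even though each value is
allowed by some specification, because no single specification allows
all three together.
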
Ganeti will refuse to create (or modify) instances that violate instance
policy constraints, unless the flag ``--ignore-ipolicy`` is passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments. For both
standard and tiered allocation, ``hspace`` will start to allocate
instances using the specification with the highest priority, then it
will fall back to the second highest priority, and so on. For tiered
allocation, it will try to lower the most constrained resources (without
breaking the policy) before going to the next specification.

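The priority order can be illustrated with a heavily simplified sketch
that places standard-sized instances on a single node. This is not the
``hspace`` algorithm: the function ``allocate_by_priority`` is
hypothetical, a larger integer is assumed to mean a higher priority, and
the resource lowering done by tiered allocation is not modelled.

```python
def allocate_by_priority(free, specs):
    """Greedily place standard instances, highest priority first.

    free maps a resource name to the amount available on the node.
    specs is a list of (priority, standard) pairs, where standard maps
    each resource name to the amount used by one standard instance.
    Returns the priorities of the placed instances and what is left.
    """
    remaining = dict(free)
    placed = []
    for priority, standard in sorted(specs, key=lambda s: s[0],
                                     reverse=True):
        if not any(need > 0 for need in standard.values()):
            # A specification that uses nothing would never stop fitting.
            continue
        while all(remaining[res] >= need
                  for res, need in standard.items()):
            for res, need in standard.items():
                remaining[res] -= need
            placed.append(priority)
    return placed, remaining
```

The fallback to the next specification happens only when no further
instance of the current one fits.
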
For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti will neither check nor enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of spindle size, so it's
been moved after constrained instance sizes. If it turns out that it's
easier to implement dedicated disks with spindles as a resource, then we
will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of an
instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for example
via Open vSwitch or normal QoS for bridging or routing. The bandwidth
resource represents the average bandwidth usage, so a few new back-end
parameters are needed to configure how to deal with bursts (they depend
on the actual way used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel, so
we won't try to outsmart their developers. If we need pinning it's more
to have predictable performance than to get the maximum performance
(which is best done by the hypervisor), so we'll implement a very simple
algorithm that allocates CPUs when an instance is assigned to a node
(either when it's created or when it's moved) and takes into account
NUMA and maybe CPU multithreading. A more refined version might also run
when an instance is deleted, but that would involve reassigning CPUs,
which could be bad with NUMA.

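One possible shape for such a simple algorithm is sketched below. It is
only an illustration: the function ``pin_cpus`` and its data layout are
hypothetical, it keeps all the CPUs of an instance on one NUMA node, it
uses a best-fit choice that the design does not prescribe, and it
ignores CPU multithreading.

```python
def pin_cpus(free_cpus_by_numa_node, vcpus):
    """Pick CPUs for a new instance inside a single NUMA node.

    free_cpus_by_numa_node maps a NUMA node id to the set of its free
    CPU ids and is updated in place. Returns (node, cpu_ids), or None
    if no single NUMA node has enough free CPUs.
    """
    if vcpus <= 0:
        return None
    # Best fit: the node with the fewest free CPUs that still fits,
    # with ties broken by the lowest node id.
    candidates = [(len(cpus), node)
                  for node, cpus in free_cpus_by_numa_node.items()
                  if len(cpus) >= vcpus]
    if not candidates:
        return None
    _, node = min(candidates)
    chosen = sorted(free_cpus_by_numa_node[node])[:vcpus]
    free_cpus_by_numa_node[node] -= set(chosen)
    return node, chosen
```

The function runs only when an instance is assigned to a node, so CPUs
that are already pinned are never moved.
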
Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because average CPU
usage is normally well below 100%. There are ways to share memory pages
(e.g. KSM, transcendent memory) and disk blocks, so we could add new
parameters to overcommit memory and disks, similar to ``vcpu_ratio``.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: