==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it's easy to do a completely "best effort"
sharing, it's much harder to completely reserve resources for the use of
a particular instance. In particular this has to be done manually for
CPUs and disk, is implemented for RAM under Xen, but not under KVM, and
there's no provision for network level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. While some sharing will need to
happen anyway (e.g. for operations that use the host domain, or use
resources, like buses, which are unique or very scarce on host systems),
we'll strive to keep contention at a minimum, but won't try to avoid
all possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a new node parameter. When it's enabled, Ganeti
will allocate entire disks to instances. Though it's possible to think
of ways of doing something similar for other storage back-ends, this
design targets only ``plain`` and ``drbd``. The name is generic enough
to allow the feature to be extended to other back-ends. The flag value
should be homogeneous within a node-group; ``cluster-verify`` will report
any violation of this condition.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is added
and as part of ``cluster-verify``.

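The size check described above can be sketched as follows. This is only
an illustration, not Ganeti code: the function name
``pv_sizes_homogeneous`` and its ``margin`` argument are hypothetical,
and since the design does not say how the margin is measured, the sketch
assumes it is relative to the smallest PV.

```python
def pv_sizes_homogeneous(pv_sizes_mib, margin=0.01):
    """Return True if all PV sizes lie within `margin` of each other.

    Assumption: the margin is applied to the smallest PV, i.e. the
    check is (largest - smallest) <= margin * smallest.
    """
    if not pv_sizes_mib:
        # No PVs means there is nothing to violate the condition.
        return True
    smallest = min(pv_sizes_mib)
    largest = max(pv_sizes_mib)
    return (largest - smallest) <= margin * smallest
```

A check of this kind would run over all the PVs of a node group, so a
single oversized or undersized PV on any node makes it fail.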
When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs for further disk creations. The
underlying LV will be striped, when striping is allowed by the current
configuration. Ganeti will continue to track only the LVs, and query the
LVM layer to figure out which PVs are available and how much space is
free. Yet, creation, disk growing, and free-space reporting will ignore
any partially allocated PVs, so that PVs won't be shared between
instance disks.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.

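The arithmetic behind PV counting and free-space reporting can be
sketched like this. All names (``usable_pv_size``, ``pvs_needed``,
``reported_free_mib``, ``reserved_percent``) are hypothetical, sizes are
assumed to be integers in MiB, and the list of ``(size, allocated)``
pairs stands in for whatever Ganeti would obtain from the LVM layer.

```python
def usable_pv_size(pv_size_mib, reserved_percent=2):
    """Size of a PV after subtracting the reserved margin (integer MiB)."""
    return pv_size_mib * (100 - reserved_percent) // 100


def pvs_needed(disk_size_mib, pv_size_mib, reserved_percent=2):
    """Minimum number of whole PVs needed to hold a disk."""
    usable = usable_pv_size(pv_size_mib, reserved_percent)
    if usable <= 0:
        raise ValueError("PV has no usable space")
    # Ceiling division without floating point.
    return -(-disk_size_mib // usable)


def reported_free_mib(pvs, reserved_percent=2):
    """Free space counting only completely unallocated PVs.

    `pvs` is a list of (size_mib, allocated_mib) pairs; partially
    allocated PVs are ignored so that they are never shared.
    """
    return sum(usable_pv_size(size, reserved_percent)
               for size, allocated in pvs if allocated == 0)
```

With 100000 MiB PVs, each PV contributes 98000 MiB of usable space, so a
disk of 98001 MiB already needs two PVs.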
The obvious target for this option is the ``plain`` disk template, which
doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under PVs, but this
is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than what
is strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.

When ``exclusive_storage`` is not enabled, spindles are not used in free
space calculation, in allocation algorithms, or in policies. When it's
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead
of disk size for their computation. For each node, the number of all the
spindles in every LVM group is recorded, and different LVM groups are
accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. These two
parameters will be renamed to ``storage_io_use`` and
``storage_io_ratio`` to better reflect their meaning. When
``exclusive_storage`` is enabled, such parameters are ignored, as
balancing the use of storage I/O is already addressed by the exclusive
assignment of PVs.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to use
some of the CPUs, leaving the other ones to instances and KVM processes.
For KVM, the number of CPUs for the host system should also be a
hypervisor parameter (set at the node group level).

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but it
is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC memory is fully shared between the host system and all
the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used by
the host OS. This requires finishing the implementation of the memory
hypervisor status (set at the node group level) that changes how free
memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide about the size of the
cgroup versus the size of the VM: some overhead will in particular
exist, due to the fact that an instance and its encapsulating KVM
process share the same space. For KVM systems the physical memory
allocatable to instances should be computed by subtracting an overhead
for the KVM processes, whose value can be either statically configured
or set in a hypervisor status parameter.

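As a worked example of the computation above, the sketch below subtracts
both the memory reserved for the host OS and the KVM overhead from the
physical memory. The function and parameter names are hypothetical, and
the design does not specify whether the overhead is a node-wide value or
a per-instance one; a single node-wide value is assumed here.

```python
def kvm_allocatable_memory_mib(total_mib, host_reserved_mib, kvm_overhead_mib):
    """Memory that can be handed out to instances on a KVM node.

    Assumptions: all values are integers in MiB, and `kvm_overhead_mib`
    is one node-wide figure covering all KVM processes.
    """
    allocatable = total_mib - host_reserved_mib - kvm_overhead_mib
    # Never report a negative amount of allocatable memory.
    return max(allocatable, 0)
```

For a node with 65536 MiB of RAM, 4096 MiB reserved for the host OS and
1024 MiB of KVM overhead, 60416 MiB remain for instances.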
NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for every
instance is proportionate to the number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always allocate
memory correctly even without pinning. Therefore, we don't need to
address this problem specifically; it will be solved by future versions
of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of specifications,
defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently from each
other. We extend the policy by allowing it to contain multiple
specifications, where each specification contains the limits (minimum,
maximum, and standard) for all the resources. Each specification has a
unique priority (an integer) associated with it, which is used by
``hspace`` (see below).

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM and 40 GB of disk
would be illegal.

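The example policy above can be expressed as a small data structure to
show how a multi-specification check could work. This is a sketch under
assumptions, not the Ganeti implementation: the ``Range`` and ``Spec``
classes, the ``is_legal`` function and the priority values are
hypothetical, and the "standard" value of each limit is left out for
brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    minimum: int
    maximum: int

    def contains(self, value):
        return self.minimum <= value <= self.maximum


@dataclass(frozen=True)
class Spec:
    priority: int  # not used for the legality check itself
    cpus: Range
    memory_gb: Range
    disk_gb: Range

    def allows(self, cpus, memory_gb, disk_gb):
        # All resources must fit the limits of the same specification.
        return (self.cpus.contains(cpus)
                and self.memory_gb.contains(memory_gb)
                and self.disk_gb.contains(disk_gb))


# The two specifications from the example; priorities are made up.
POLICY = [
    Spec(priority=1, cpus=Range(1, 2), memory_gb=Range(2, 2),
         disk_gb=Range(10, 400)),
    Spec(priority=2, cpus=Range(4, 4), memory_gb=Range(4, 4),
         disk_gb=Range(10, 800)),
]


def is_legal(policy, cpus, memory_gb, disk_gb):
    """An instance is legal if at least one specification allows it."""
    return any(spec.allows(cpus, memory_gb, disk_gb) for spec in policy)
```

The instance with 2 CPUs and 4 GB of RAM is rejected because its CPU
count fits only the first specification while its memory fits only the
second; limits are never mixed across specifications.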
Ganeti will refuse to create (or modify) instances that violate instance
policy constraints, unless the flag ``--ignore-ipolicy`` is passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments. For both
standard and tiered allocation, ``hspace`` will start to allocate
instances using the specification with the highest priority, then it
will fall back to the second highest priority, and so on. For tiered
allocation, it will try to lower the most constrained resources (without
breaking the policy) before going to the next specification.

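The priority fall-back can be illustrated with a deliberately simplified
sketch. It is not how ``hspace`` is implemented: it looks at a single
node, only places instances of the "standard" size of each
specification, does not model the lowering of resources done by tiered
allocation, and assumes that a larger integer means a higher priority.
The function name ``count_allocations`` and the data layout are
hypothetical.

```python
def count_allocations(capacity, specs):
    """Count how many standard-sized instances of each specification fit.

    capacity: dict mapping resource name -> free amount on one node.
    specs: list of (priority, standard) pairs, where `standard` maps
           resource name -> amount used by one standard instance.
    Assumption: a larger integer means a higher priority.
    """
    free = dict(capacity)
    placed = {}
    # Try specifications from the highest priority to the lowest.
    for priority, standard in sorted(specs, key=lambda s: s[0], reverse=True):
        if not standard or any(amount <= 0 for amount in standard.values()):
            raise ValueError("each specification needs positive sizes")
        count = 0
        # Keep placing instances of this specification while they fit.
        while all(free.get(res, 0) >= amount
                  for res, amount in standard.items()):
            for res, amount in standard.items():
                free[res] -= amount
            count += 1
        placed[priority] = count
    return placed
```

On a node with 10 CPUs, 12 GB of RAM and 1000 GB of disk, a
specification with priority 2 and a standard size of 4 CPUs, 4 GB and
100 GB is placed twice; the remaining resources then hold two instances
of a priority 1 specification of 1 CPU, 2 GB and 50 GB.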
For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti won't check or enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of spindle size, so it's
been moved after constrained instance sizes. If it turns out that it's
easier to implement dedicated disks with spindles as a resource, then we
will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of an
instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for example
via Open vSwitch or normal QoS for bridging or routing. The bandwidth
resource represents the average bandwidth usage, so a few new back-end
parameters are needed to configure how to deal with bursts (they depend
on the actual way used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel, so
we won't try to outsmart their developers. If we need pinning, it's more
to have predictable performance than to get the maximum performance
(which is best done by the hypervisor), so we'll implement a very simple
algorithm that allocates CPUs when an instance is assigned to a node
(either when it's created or when it's moved) and takes into account
NUMA and maybe CPU multithreading. A more refined version might also run
when an instance is deleted, but that would involve reassigning CPUs,
which could be bad with NUMA.

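One possible shape for such a "very simple algorithm" is sketched below.
The design does not prescribe any particular strategy, so everything
here is an assumption: the function name ``pin_cpus``, the
``free_by_numa`` data layout, the best-fit choice of NUMA node, and the
fall-back of spreading over several NUMA nodes. CPU multithreading is
not modeled.

```python
def pin_cpus(free_by_numa, vcpus):
    """Pick physical CPUs for an instance, preferring one NUMA node.

    free_by_numa: dict mapping NUMA node id -> list of free CPU ids.
    The dict is updated in place when CPUs are assigned.
    Returns the list of chosen CPU ids, or None if too few are free.
    """
    # Prefer the NUMA node with the fewest free CPUs that still fits
    # the whole instance (best fit), to keep larger nodes available.
    candidates = [node for node, cpus in free_by_numa.items()
                  if len(cpus) >= vcpus]
    if candidates:
        node = min(candidates, key=lambda n: (len(free_by_numa[n]), n))
        chosen = free_by_numa[node][:vcpus]
        free_by_numa[node] = free_by_numa[node][vcpus:]
        return chosen

    # No single NUMA node fits: spread over several, if possible.
    if sum(len(cpus) for cpus in free_by_numa.values()) < vcpus:
        return None
    chosen = []
    for node in sorted(free_by_numa):
        take = vcpus - len(chosen)
        chosen.extend(free_by_numa[node][:take])
        free_by_numa[node] = free_by_numa[node][take:]
        if len(chosen) == vcpus:
            break
    return chosen
```

Because CPUs are only handed out when an instance arrives, nothing is
reassigned later, which matches the design's preference for predictable
behaviour over maximum performance.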
Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works as normally CPU usage
on average is way below 100%. There are ways to share memory pages
(e.g. KSM, transcendent memory) and disk blocks, so we could add new
parameters to overcommit memory and disks, similar to ``vcpu_ratio``.

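The check implied by a ratio parameter is the same for every resource,
as the sketch below shows. ``vcpu_ratio`` is the existing parameter; the
function name ``within_overcommit`` is hypothetical, and so are any
equivalent ratios for memory or disks, which this design only suggests.

```python
def within_overcommit(assigned, physical, ratio):
    """True if the assigned amount does not exceed ratio * physical.

    With ratio == 1.0 no overcommit is allowed; larger values allow
    assigning more than what the hardware provides.
    """
    return assigned <= ratio * physical
```

For example, with a ``vcpu_ratio`` of 4.0 a node with 8 physical CPUs
accepts up to 32 VCPUs in total.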
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: