==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While completely "best effort" sharing is easy,
fully reserving resources for the use of a particular instance is much
harder. In particular, this has to be done manually for CPUs and disks,
is implemented for RAM under Xen but not under KVM, and there is no
provision for network-level QoS.

Proposed changes
================

We want to make it easy to partition a node between machines with
exclusive use of hardware resources. Some sharing will still need to
happen (e.g. for operations that use the host domain, or that use
resources, like buses, which are unique or very scarce on host
systems). We'll strive to keep contention to a minimum, but won't try
to avoid all possible sources of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a new node parameter. When it's enabled, Ganeti
will allocate entire disks to instances. Though it's possible to think
of ways of doing something similar for other storage back-ends, this
design targets only ``plain`` and ``drbd``. The name is generic enough
to still fit if the feature is extended to other back-ends. The flag
value should be homogeneous within a node group; ``cluster-verify``
will report any violation of this condition.

Ganeti will consider each physical volume (PV) in the destination
volume group as a host disk (for proper isolation, an administrator
should make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is
added, and as part of ``cluster-verify``.

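The size check described above could look roughly like the following
sketch. The function name and the choice of measuring the margin
relative to the largest PV are assumptions made for illustration; this
design does not specify either.

```python
def pvs_have_same_size(pv_sizes_mib, margin=0.01):
    """Return True if all PV sizes are within `margin` of the largest one.

    `margin` is a fraction (0.01 means 1%). Comparing against the
    largest PV is an assumption of this sketch, not part of the design.
    """
    if not pv_sizes_mib:
        # No PVs means there is nothing to compare.
        return True
    largest = max(pv_sizes_mib)
    smallest = min(pv_sizes_mib)
    return (largest - smallest) <= margin * largest
```

For example, PVs of 1000 MiB and 995 MiB would pass with a 1% margin,
while PVs of 1000 MiB and 980 MiB would not.
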
When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs needed to hold the disk, and those PVs will be
excluded from the pool of available PVs for further disk creations. The
underlying LV will be striped, when striping is allowed by the current
configuration. Ganeti will continue to track only the LVs, and query the
LVM layer to figure out which PVs are available and how much space is
free. However, disk creation, disk growing, and free-space reporting
will ignore any partially allocated PVs, so that PVs won't be shared
between instance disks.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.

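As a rough illustration of this calculation, the sketch below computes
the minimum number of PVs for a disk after reserving a fraction of each
PV. The function and parameter names are hypothetical, and the sketch
assumes all PVs have the same nominal size.

```python
import math


def pvs_needed(disk_size_mib, pv_size_mib, reserved_fraction=0.02):
    """Return the minimum number of whole PVs needed to hold a disk.

    `reserved_fraction` is the share of each PV that is subtracted
    before counting (2% in the text above). All names here are
    illustrative, not actual Ganeti identifiers.
    """
    usable_mib = pv_size_mib * (1.0 - reserved_fraction)
    if usable_mib <= 0:
        raise ValueError("PV has no usable space")
    return math.ceil(disk_size_mib / usable_mib)
```

With 1000 MiB PVs, a 900 MiB disk fits in one PV, but a 1000 MiB disk
needs two, because only 980 MiB of each PV are counted as usable.
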
The obvious target for this option is the plain disk template, which
doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under the PVs, but
this is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than
strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.

When ``exclusive_storage`` is not enabled, spindles are not used in
free-space calculation, allocation algorithms, or policies. When it's
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead
of disk size for their computation. For each node, the number of
spindles in every LVM group is recorded, and different LVM groups are
accounted separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. When
``exclusive_storage`` is enabled, these parameters as currently defined
won't make any sense, so their meaning will be changed in this way:

- ``spindle_use`` refers to the resource, hence to the actual spindles
  (PVs in LVM), used by an instance. The values specified in the instance
  policy specifications are compared to the run-time number of spindles
  used by an instance. The ``spindle_use`` back-end parameter will be
  ignored.
- ``spindle_ratio`` in instance policies and ``spindle_count`` in node
  parameters are ignored, as the exclusive assignment of PVs already
  implies a value of 1.0 for the first, and the second is replaced by
  the actual number of spindles.

When ``exclusive_storage`` is disabled, the existing spindle parameters
behave as before.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to
some of the CPUs, leaving the other ones to instances and KVM processes.
For KVM, the number of CPUs for the host system should also be a
hypervisor parameter (set at the node group level).

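One possible reading of this accounting is sketched below: the CPUs set
aside for the host (``Domain-0`` VCPUs under Xen, or the host CPU count
under KVM) are subtracted from the physical CPUs before the ratio is
applied. The function name and this exact formula are assumptions; the
design only states that host CPU usage must be taken into account.

```python
import math


def max_instance_vcpus(physical_cpus, host_cpus, vcpu_ratio=1.0):
    """Return how many VCPUs may be assigned to instances on a node.

    Assumption of this sketch: host CPUs are removed from the pool
    first, then the remaining CPUs are scaled by `vcpu_ratio`.
    """
    if host_cpus > physical_cpus:
        raise ValueError("host cannot use more CPUs than the node has")
    return math.floor((physical_cpus - host_cpus) * vcpu_ratio)
```

With a ratio of 1.0, a node with 16 CPUs that reserves 2 for the host
would leave 14 VCPUs for instances.
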
Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but it
is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC, memory is fully shared between the host system and all
the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used by
the host OS. This requires finishing the implementation of the memory
hypervisor status (set at the node group level) that changes how free
memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide about the size of the
cgroup versus the size of the VM: some overhead will exist, because an
instance and its encapsulating KVM process share the same space. For
KVM systems, the physical memory allocatable to instances should be
computed by subtracting an overhead for the KVM processes, whose value
can be either statically configured or set in a hypervisor status
parameter.

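A minimal sketch of this computation follows. It assumes that both the
memory reserved for the host OS and the KVM overhead are given as single
node-wide values in MiB; the names are hypothetical and do not
correspond to existing Ganeti parameters.

```python
def allocatable_memory_mib(total_mib, host_reserved_mib, kvm_overhead_mib):
    """Return the physical memory that can be handed out to instances.

    Both reservations are treated as node-wide totals, which is an
    assumption of this sketch. The result is never negative.
    """
    available = total_mib - host_reserved_mib - kvm_overhead_mib
    return max(available, 0)
```

For a node with 65536 MiB of RAM, 2048 MiB reserved for the host OS and
1024 MiB of KVM overhead, 62464 MiB would remain for instances.
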
NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for every
instance is proportionate to the number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always allocate
memory correctly even without pinning. Therefore, we don't need to
address this problem specifically; it will be solved by future versions
of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of specifications,
defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum value for CPU, memory, and disk usage (and spindles
and any other resource, when implemented), independently from each
other. We extend the policy by allowing it to contain multiple
minimum/maximum specification pairs for the instance resources. Each
specification pair (minimum and maximum) has a unique priority
associated with it (in other words, specifications are ordered), which
is used by ``hspace`` (see below). The standard specification doesn't
change: there is one for the whole cluster.

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM, and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM, and 40 GB of disk
would be illegal.

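The example policy can be expressed as a short sketch. The data layout
(a list of minimum/maximum dictionaries) and the function names are
illustrative assumptions, not the actual representation of instance
policies in Ganeti.

```python
# Each specification is a (minimum, maximum) pair of resource limits.
# The list is ordered by priority, highest first.
POLICY = [
    ({"cpus": 4, "ram_gb": 4, "disk_gb": 10},
     {"cpus": 4, "ram_gb": 4, "disk_gb": 800}),
    ({"cpus": 1, "ram_gb": 2, "disk_gb": 10},
     {"cpus": 2, "ram_gb": 2, "disk_gb": 400}),
]


def fits_spec(instance, minimum, maximum):
    """Return True if every resource lies within one min/max pair."""
    return all(minimum[res] <= instance[res] <= maximum[res]
               for res in minimum)


def is_legal(instance, policy=POLICY):
    """An instance is legal if it fits at least one specification."""
    return any(fits_spec(instance, lo, hi) for lo, hi in policy)
```

An instance with 2 CPUs and 4 GB of RAM is rejected because no single
specification pair admits both values, even though each value is
allowed by some pair.
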
Ganeti will refuse to create (or modify) instances that violate instance
policy constraints, unless the flag ``--ignore-ipolicy`` is passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments for tiered
allocation. ``hspace`` will start to allocate instances using the
maximum specification with the highest priority, then it will try to
lower the most constrained resources (without breaking the policy)
before moving to the second highest priority, and so on.

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti won't check or enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way we always have new features that are immediately useful.
Spindles as a resource are not needed for correct capacity calculation,
as long as allowed disk sizes are multiples of spindle size, so it's
been moved after constrained instance sizes. If it turns out that it's
easier to implement dedicated disks with spindles as a resource, then we
will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of an
instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.

An additional enhancement is Ganeti enforcing the values set in the
bandwidth resource. This can be done by configuring limits, for example
via openvswitch or normal QoS for bridging or routing. The bandwidth
resource represents the average bandwidth usage, so a few new back-end
parameters are needed to configure how to deal with bursts (they depend
on the actual way used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel, so
we won't try to outsmart their developers. If we need pinning, it's more
to have predictable performance than to get the maximum performance
(which is best done by the hypervisor), so we'll implement a very simple
algorithm that allocates CPUs when an instance is assigned to a node
(either when it's created or when it's moved) and takes into account
NUMA and maybe CPU multithreading. A more refined version might also run
when an instance is deleted, but that would involve reassigning CPUs,
which could be bad with NUMA.

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because average CPU
usage is normally well below 100%. There are ways to share memory pages
(e.g. KSM, transcendent memory) and disk blocks, so we could add new
parameters to overcommit memory and disks, similar to ``vcpu_ratio``.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: