=================
Ganeti 2.3 design
=================

This document describes the major changes in Ganeti 2.3 compared to
the 2.2 version.

.. contents:: :depth: 4

As for 2.1 and 2.2 we divide the 2.3 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, os api, hooks, ...)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered as part of the
same pool for allocation purposes: DRBD instances, for example, can be
allocated on any two nodes.

This causes a problem in cases where nodes are not all equally
connected to each other. For example, if a cluster is created over two
sets of machines, each connected to its own switch, the internal
bandwidth between machines connected to the same switch might be
bigger than the bandwidth for inter-switch connections.

Moreover, some operations inside a cluster require all nodes to be
locked together for inter-node consistency, and won't scale if we
increase the number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group, the default one. Bigger
clusters, instead, will be able to have more than one group, and each
node will belong to exactly one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following
new commands/flags will be introduced::

  gnt-node group-add <group> # add a new node group
  gnt-node group-del <group> # delete an empty group
  gnt-node group-list # list node groups
  gnt-node group-rename <oldname> <newname> # rename a group
  gnt-node list/info -g <group> # list only nodes belonging to a group
  gnt-node add -g <group> # add a node to a certain group
  gnt-node modify -g <group> # move a node to a new group

Instance level changes
++++++++++++++++++++++

Instances will be able to live in only one group at a time. This is
mostly important for DRBD instances, in which case both their primary
and secondary nodes will need to be in the same group. To support this
we envision the following changes:

- The cluster will have a default group, which will initially be the
  only group in the cluster.
- Instance allocation will happen to the cluster's default group
  (which will be changeable via gnt-cluster modify or RAPI) unless a
  group is explicitly specified in the creation job (with -g or via
  RAPI). The iallocator will only be passed the nodes belonging to
  that group.
- Moving an instance between groups can only happen via an explicit
  operation, which for example in the case of DRBD will work by
  internally performing a replace-disks, a migration, and a second
  replace-disks (see the sketch after this list). It will be possible
  to clean up an interrupted group-move operation.
- Cluster verify will signal an error if an instance has been left
  mid-transition between groups.
- Inter-group instance migration/failover will check that the target
  group is able to accept the instance network- and storage-wise, and
  fail otherwise. In the future we may be able to have some parameters
  changed during the move, but in the first version we expect an
  import/export if this is not possible.
- From an allocation point of view, inter-group movements will be
  shown to an iallocator as a new allocation over the target group.
  Only in a future version may we add allocator extensions to decide
  which group the instance should be in. In the meantime we expect
  Ganeti administrators to either spread instances over groups by
  filling all groups first, or to have their own strategy based on the
  instances' needs.

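As an illustration only, the following Python sketch outlines the
three-step move for a DRBD instance; the helper names are hypothetical
and do not correspond to actual Ganeti opcodes or logical units::

  def replace_disks(instance, new_secondary):
      print("replace-disks: %s, new secondary %s" % (instance, new_secondary))

  def migrate(instance):
      print("migrate: %s" % instance)

  def move_to_group(instance, target_group):
      """Replace-disks, migrate, replace-disks, as described above."""
      # An iallocator run over the target group would pick these two nodes.
      new_primary = "node1.%s.example.com" % target_group
      new_secondary = "node2.%s.example.com" % target_group
      replace_disks(instance, new_primary)    # secondary moves into the group
      migrate(instance)                       # the in-group node becomes primary
      replace_disks(instance, new_secondary)  # rebuild the secondary in the group

  move_to_group("instance1.example.com", "group2")
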
Cluster/Internal/Config level changes
+++++++++++++++++++++++++++++++++++++

We expect the following changes for cluster management:

- Frequent multinode operations, such as os-diagnose or cluster-verify,
  will act one group at a time. The default group will be used if none
  is passed. Command line tools will have a way to easily target all
  groups, by generating one job per group.
- Groups will have a human-readable name, but will internally always
  be referenced by a UUID, which will be immutable. For example, the
  cluster object will contain the UUID of the default group, each node
  will contain the UUID of the group it belongs to, etc. This is done
  to simplify referencing while keeping it easy to handle renames and
  movements (see the sketch after this list). If we see that this
  works well, we'll transition other config objects (instances, nodes)
  to the same model.
- The addition of a new per-group lock will be evaluated, to see if we
  can transition some operations now requiring the BGL to it.
- Master candidate status will be allowed to be spread among groups.
  For the first version we won't add any restriction over how this is
  done, although in the future we may have a minimum number of master
  candidates which Ganeti will try to keep in each group, for example.

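As a small illustration of the name-versus-UUID referencing, here is a
minimal sketch; the data layout below is made up for this example and
is not the actual configuration schema::

  import uuid

  groups = {}   # group UUID -> human-readable name
  nodes = {}    # node name -> group UUID

  def add_group(name):
      group_uuid = str(uuid.uuid4())
      groups[group_uuid] = name
      return group_uuid

  def rename_group(group_uuid, new_name):
      # Renaming only touches the name; nodes keep referring to the UUID.
      groups[group_uuid] = new_name

  default_group = add_group("default")
  nodes["node1.example.com"] = default_group
  rename_group(default_group, "rack1")
  # The node still points at the same group after the rename.
  assert groups[nodes["node1.example.com"]] == "rack1"
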
Other work and future changes
+++++++++++++++++++++++++++++

Commands like gnt-cluster command/copyfile will continue to work on
the whole cluster, but it will be possible to target one group only by
specifying it.

Commands which allow selection of sets of resources (for example
gnt-instance start/stop) will be able to select them by node group as
well.

Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future
version should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration
diffusion, to allow better master scalability. For example, it could
be possible to change some all-nodes RPCs to contact each group once,
from the master, and make one node in the group perform internal
diffusion. We won't implement this in the first version, but we'll
evaluate it for the future, if we see scalability problems on big
multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, Sheepdog, Ceph)
we expect groups to be the basis for this, allowing for example a
different Sheepdog/Ceph cluster, or a different SAN, to be connected
to each group. In some cases this will mean that inter-group moves
will necessarily be performed with instance downtime, unless the
hypervisor has block-migrate functionality and we implement support
for it (this would be theoretically possible today with KVM, for
example).


Job priorities
--------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all jobs and opcodes have the same priority. Once a job has
started executing, its thread won't be released until all opcodes have
acquired their locks and done their work. When a job is finished, the
next job is selected strictly by its incoming order. This does not
mean jobs are run in their incoming order; locks and other delays can
cause them to be stalled for some time.

In some situations, e.g. an emergency shutdown, one may want to run a
job as soon as possible. This is not possible currently if there are
pending jobs in the queue.

Proposed changes
~~~~~~~~~~~~~~~~

Each opcode will be assigned a priority on submission. Opcode
priorities are integers and the lower the number, the higher the
opcode's priority is. Within the same priority, jobs and opcodes are
initially processed in their incoming order.

Submitted opcodes can have one of the priorities listed below. Other
priorities are reserved for internal use. The absolute range is
-20..+19. Opcodes submitted without a priority (e.g. by older clients)
are assigned the default priority.

- High (-10)
- Normal (0, default)
- Low (+10)

As a change from the current model, where executing a job blocks one
thread for the whole duration, the new job processor must return the
job to the queue after each opcode and also if it can't get all locks
in a reasonable timeframe. This will allow opcodes of higher priority
submitted in the meantime to be processed, or opcodes of the same
priority to try to get their locks. When added to the job queue's
worker pool, the priority is determined by the first unprocessed
opcode in the job.

If an opcode is deferred, the job will go back to the "queued" status,
even though it's just waiting to try to acquire its locks again later.

If an opcode cannot be processed after a certain number of retries or
a certain amount of time, it should increase its priority. This will
avoid starvation.

A job's priority can never go below -20. If a job hits priority -20,
it must acquire its locks in blocking mode.

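The priority scheme can be summarized in a short Python sketch; the
constant and function names are illustrative rather than Ganeti's
actual identifiers::

  PRIO_LOWEST = +19
  PRIO_LOW = +10
  PRIO_NORMAL = 0      # default for opcodes submitted without a priority
  PRIO_HIGH = -10
  PRIO_HIGHEST = -20   # from here on locks are acquired in blocking mode

  def bump_priority(current, step=1):
      """Raise an opcode's priority (lower number) to avoid starvation."""
      return max(PRIO_HIGHEST, current - step)

  def job_priority(opcodes):
      """A queued job's priority is that of its first unprocessed opcode."""
      pending = [op for op in opcodes if not op["done"]]
      return pending[0]["priority"] if pending else PRIO_NORMAL

  prio = PRIO_LOW
  for _ in range(40):            # repeated lock timeouts raise the priority
      prio = bump_priority(prio)
  assert prio == PRIO_HIGHEST    # but never beyond the -20 limit

  ops = [{"done": True, "priority": PRIO_NORMAL},
         {"done": False, "priority": PRIO_HIGH}]
  assert job_priority(ops) == PRIO_HIGH
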
Opcode priorities are synchronized to disk in order to be restored
after a restart or crash of the master daemon.

Priorities also need to be considered inside the locking library to
ensure opcodes with higher priorities get locks first. See
:ref:`locking priorities <locking-priorities>` for more details.

Worker pool
+++++++++++

To support job priorities in the job queue, the worker pool underlying
the job queue must be enhanced to support task priorities. Currently
tasks are processed in the order they are added to the queue (but, due
to their nature, they don't necessarily finish in that order). All
tasks are equal. To support tasks with higher or lower priority, a few
changes have to be made to the queue inside a worker pool.

Each task is assigned a priority when added to the queue. This
priority cannot be changed until the task is executed (this is fine,
as in all current use-cases tasks are added to a pool and then
forgotten about until they're done).

A task's priority can be compared to Unix process priorities. The
lower the priority number, the closer to the queue's front it is. A
task with priority 0 is going to be run before one with priority 10.
Tasks with the same priority are executed in the order in which they
were added.

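These ordering rules map naturally onto a heap. A minimal sketch, not
the actual worker pool code, could look like this::

  import heapq
  import itertools

  _counter = itertools.count()
  _queue = []

  def add_task(priority, task):
      # The monotonically increasing counter preserves insertion order
      # for tasks of equal priority.
      heapq.heappush(_queue, (priority, next(_counter), task))

  def pop_task():
      _priority, _order, task = heapq.heappop(_queue)
      return task

  add_task(10, "low-priority task")
  add_task(0, "first normal task")
  add_task(0, "second normal task")
  assert pop_task() == "first normal task"
  assert pop_task() == "second normal task"
  assert pop_task() == "low-priority task"
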
While a task is running it can query its own priority. If it's not yet
ready to finish, it can raise an exception to defer itself, optionally
changing its own priority (see the sketch after this list). This is
useful in the following cases:

- A task is trying to acquire locks, but those locks are still held by
  other tasks. By deferring itself, the task gives others a chance to
  run. This is especially useful when all workers are busy.
- If a task decides it hasn't gotten its locks in a long time, it can
  start to increase its own priority.
- Tasks waiting for operations running asynchronously could defer
  themselves until those operations complete.

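A minimal sketch of the defer-by-exception mechanism; the exception
class and driver function are illustrative, not the actual
implementation::

  class DeferTask(Exception):
      """Raised by a task that wants to go back to the queue."""
      def __init__(self, priority=None):
          Exception.__init__(self)
          self.priority = priority    # optionally request a new priority

  def run_task(task, priority):
      """Run a task once; return the priority to re-queue it at, or None."""
      try:
          task(priority)
          return None                 # the task finished
      except DeferTask as err:
          if err.priority is not None:
              return err.priority     # the task asked for a new priority
          return priority

  def waiting_task(priority):
      # Pretend our locks are still held elsewhere; retry at higher priority.
      raise DeferTask(priority=priority - 1)

  print(run_task(waiting_task, 0))    # -1: re-queued closer to the front
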
With these changes, the job queue will be able to implement per-job
priorities.

.. _locking-priorities:

Locking
+++++++

In order to support priorities in Ganeti's own lock classes,
``locking.SharedLock`` and ``locking.LockSet``, the internal structure
of the former class needs to be changed. The last major change in this
area was done for Ganeti 2.1 and can be found in the respective
:doc:`design document <design-2.1>`.

The plain list (``[]``) used as a queue is replaced by a heap queue,
similar to the `worker pool`_. The heap or priority queue does
automatic sorting, thereby automatically taking care of priorities.
For each priority there's a plain list with pending acquires, like the
single queue of pending acquires before this change.

When the lock is released, the code locates the list of pending
acquires for the highest priority waiting. The first condition (index
0) is notified. Once all waiting threads have received the
notification, the condition is removed from the list. If the list of
conditions is then empty, it's removed from the heap queue.

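A high-level sketch of this structure and of the notification logic,
not the actual ``locking.SharedLock`` code::

  import heapq

  class PendingQueue(object):
      def __init__(self):
          self._heap = []         # heap of (priority, list of conditions)
          self._bypriority = {}   # priority -> the same list

      def add(self, priority, condition):
          if priority not in self._bypriority:
              conditions = []
              self._bypriority[priority] = conditions
              heapq.heappush(self._heap, (priority, conditions))
          self._bypriority[priority].append(condition)

      def next_to_notify(self):
          """On release, pick the first condition of the best priority."""
          while self._heap:
              priority, conditions = self._heap[0]
              if conditions:
                  return conditions[0]      # the real code notifies this one
              heapq.heappop(self._heap)     # empty list: drop it from the heap
              del self._bypriority[priority]
          return None

  queue = PendingQueue()
  queue.add(10, "exc/threadE3")
  queue.add(0, "exc/threadE1")
  assert queue.next_to_notify() == "exc/threadE1"   # priority 0 wins over 10
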
Like before, shared acquires are grouped and skip ahead of exclusive
acquires if there's already an existing shared acquire for a priority.
To accomplish this, a separate dictionary of shared acquires per
priority is maintained.

To simplify the code and reduce memory consumption, the concept of the
"active" and "inactive" condition for shared acquires is abolished.
The lock can't predict what priorities the next acquires will use and
even keeping a cache can become computationally expensive for arguable
benefit (the underlying POSIX pipe, see ``pipe(2)``, needs to be
re-created for each notification anyway).

The following diagram shows a possible state of the internal queue
from a high-level view. Conditions are shown as (waiting) threads.
Assuming no modifications are made to the queue (e.g. more acquires or
timeouts), the lock would be acquired by the threads in this order
(concurrent acquires in parentheses): ``threadE1``, ``threadE2``,
(``threadS1``, ``threadS2``, ``threadS3``), (``threadS4``,
``threadS5``), ``threadE3``, ``threadS6``, ``threadE4``, ``threadE5``.

::

  [
    (0, [exc/threadE1, exc/threadE2, shr/threadS1/threadS2/threadS3]),
    (2, [shr/threadS4/threadS5]),
    (10, [exc/threadE3]),
    (33, [shr/threadS6, exc/threadE4, exc/threadE5]),
  ]


IPv6 support
------------

Currently Ganeti does not support IPv6. This is true for nodes as well
as instances. Since IPv4 address exhaustion is threateningly near, the
need for IPv6 is increasing, especially given that bigger and bigger
clusters are supported.

Supported IPv6 setup
~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.3 we introduce, in addition to the ordinary pure IPv4
setup, a hybrid IPv6/IPv4 mode. The latter works as follows:

- all nodes in a cluster have a primary IPv6 address
- the master has an IPv6 address
- all nodes **must** have a secondary IPv4 address

The reason for this hybrid setup is that key components that Ganeti
depends on do not, or only partially, support IPv6. More precisely,
Xen does not support instance migration via IPv6 in versions 3.4 and
4.0. Similarly, KVM does not support instance migration nor VNC access
over IPv6 at the time of this writing.

This led to the decision not to support pure IPv6 Ganeti clusters, as
very important cluster operations would not have been possible. Using
IPv4 as the secondary address does not affect any of the goals of the
IPv6 support: since secondary addresses do not need to be publicly
accessible, they need not be globally unique. In other words, one can
practically use private IPv4 secondary addresses just for
intra-cluster communication without propagating them across layer 3
boundaries.

netutils: Utilities for handling common network tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently common utility functions are kept in the utils module. Since
this module is growing bigger and bigger, network-related functions
are moved to a separate module named *netutils*. Additionally, all
these utilities will be IPv6-enabled.

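As a purely hypothetical example of the kind of IPv6-enabled helper
that could live in *netutils* (this is not the module's actual API)::

  import socket

  def is_valid_ip(address):
      """Return True if ``address`` is a valid IPv4 or IPv6 address."""
      for family in (socket.AF_INET, socket.AF_INET6):
          try:
              socket.inet_pton(family, address)
              return True
          except socket.error:
              pass
      return False

  assert is_valid_ip("192.0.2.1")
  assert is_valid_ip("2001:db8::1")
  assert not is_valid_ip("example.com")
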
Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

As mentioned above there will be two different setups in terms of IP
addressing: pure IPv4 and hybrid IPv6/IPv4. To choose between them, a
new cluster init parameter *--primary-ip-version* is introduced. This
is needed because a given name can resolve to both an IPv4 and an IPv6
address on a dual-stack host, effectively making it impossible to
infer that bit from the name alone.

Once a cluster is initialized and the primary IP version chosen, all
nodes that join have to conform to that setup. In the case of our
IPv6/IPv4 setup all nodes *must* have a secondary IPv4 address.

Furthermore we store the primary IP version in ssconf, which is
consulted every time a daemon starts, to determine the default bind
address (either *0.0.0.0* or *::*). In an IPv6/IPv4 setup we need to
bind the Ganeti daemons listening on network sockets to the IPv6
address.

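A minimal sketch of that decision; the constants and the function
below are hypothetical and not Ganeti's actual ssconf interface::

  import socket

  IP4_VERSION = 4
  IP6_VERSION = 6

  def default_bind_address(primary_ip_version):
      """Return the address family and wildcard address to listen on."""
      if primary_ip_version == IP6_VERSION:
          return socket.AF_INET6, "::"      # hybrid IPv6/IPv4 cluster
      return socket.AF_INET, "0.0.0.0"      # pure IPv4 cluster

  print(default_bind_address(IP6_VERSION))  # (AF_INET6, '::')
  print(default_bind_address(IP4_VERSION))  # (AF_INET, '0.0.0.0')
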
Node addition
~~~~~~~~~~~~~

When adding a new node to an IPv6/IPv4 cluster it must have an IPv6
address to be used as primary and an IPv4 address used as secondary.
As explained above, every time a daemon is started we use the
cluster's primary IP version to determine which address to bind to.
The only exception to this is when a node is added to the cluster. In
this case there is no ssconf available when noded is started and
therefore the correct address needs to be passed to it.

Name resolution
~~~~~~~~~~~~~~~

Since the gethostbyname*() functions do not support IPv6, name
resolution will be done by using the recommended getaddrinfo().

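A short example of such a lookup; the helper function is illustrative
only::

  import socket

  def resolve(name, family=socket.AF_UNSPEC):
      """Return the addresses ``name`` resolves to, IPv4 and/or IPv6."""
      # Each result entry is (family, socktype, proto, canonname, sockaddr).
      result = socket.getaddrinfo(name, None, family, socket.SOCK_STREAM)
      return [sockaddr[0] for (_, _, _, _, sockaddr) in result]

  # e.g. ['::1', '127.0.0.1'] on a dual-stack host
  print(resolve("localhost"))
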
IPv4-only components
~~~~~~~~~~~~~~~~~~~~

============================ =================== ====================
Component                    IPv6 Status         Planned Version
============================ =================== ====================
Xen instance migration       Not supported       Xen 4.1: libxenlight
KVM instance migration       Not supported       Unknown
KVM VNC access               Not supported       Unknown
============================ =================== ====================


Feature changes
===============


External interface changes
==========================


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: