=================
Ganeti 2.3 design
=================

This document describes the major changes in Ganeti 2.3 compared to
the 2.2 version.

.. contents:: :depth: 4

As for 2.1 and 2.2 we divide the 2.3 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, os api, hooks, ...)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered part of the same
pool for allocation purposes: DRBD instances, for example, can be
allocated on any two nodes.

This causes a problem in cases where nodes are not all equally connected
to each other. For example, if a cluster is created over two sets of
machines, each connected to its own switch, the internal bandwidth
between machines connected to the same switch might be bigger than the
bandwidth for inter-switch connections.

Moreover, some operations inside a cluster require all nodes to be
locked together for inter-node consistency, and won't scale if we
increase the number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group, the default one. Bigger
clusters, instead, will be able to have more than one group, and each
node will belong to exactly one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following new
commands/flags will be introduced::

  gnt-node group-add <group>                # add a new node group
  gnt-node group-del <group>                # delete an empty group
  gnt-node group-list                       # list node groups
  gnt-node group-rename <oldname> <newname> # rename a group
  gnt-node list/info -g <group>             # list only nodes belonging to a group
  gnt-node add -g <group>                   # add a node to a certain group
  gnt-node modify -g <group>                # move a node to a new group

Instance level changes
++++++++++++++++++++++

Instances will be able to live in only one group at a time. This is
mostly important for DRBD instances, in which case both their primary
and secondary nodes will need to be in the same group. To support this
we envision the following changes:

- The cluster will have a default group, which will initially be the
  only one.
- Instance allocation will happen in the cluster's default group (which
  will be changeable via gnt-cluster modify or RAPI) unless a group is
  explicitly specified in the creation job (with -g or via RAPI). The
  iallocator will only be passed the nodes belonging to that group.
- Moving an instance between groups will only happen via an explicit
  operation, which for example in the case of DRBD will work by
  internally performing a replace-disks, a migration, and a second
  replace-disks. It will be possible to clean up an interrupted
  group-move operation.
- Cluster verify will signal an error if an instance has been left
  mid-transition between groups.
- Inter-group instance migration/failover will check that the target
  group is able to accept the instance network- and storage-wise, and
  fail otherwise. In the future we may be able to allow some parameters
  to be changed during the move, but in the first version we expect an
  import/export if this is not possible.
- From an allocation point of view, inter-group movements will be shown
  to an iallocator as a new allocation over the target group. Only in a
  future version may we add allocator extensions to decide which group
  an instance should be in. In the meantime we expect Ganeti
  administrators to either put instances on different groups by filling
  all groups first, or to have their own strategy based on the instance
  needs.

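The DRBD move sequence above (replace-disks, migrate, replace-disks) can
be sketched as follows. This is an illustrative model only, not Ganeti's
actual implementation; the class and function names are hypothetical
stand-ins for the corresponding logical units:

```python
from dataclasses import dataclass

# Illustrative sketch of moving a DRBD instance between node groups.
# All names here are hypothetical; only the three-step sequence
# (replace-disks, migrate, replace-disks) comes from the design.

@dataclass
class Instance:
    name: str
    primary: str
    secondary: str

def move_drbd_instance(instance, target_nodes):
    """Move a DRBD instance so both of its nodes end up in the target group.

    target_nodes is a list of at least two free node names in the
    target group. Returns the list of performed steps, for illustration.
    """
    steps = []
    # 1. replace-disks: mirror the disks onto a node of the target group.
    instance.secondary = target_nodes[0]
    steps.append(("replace-disks", instance.secondary))
    # 2. migrate: the target-group node becomes the primary.
    instance.primary, instance.secondary = instance.secondary, instance.primary
    steps.append(("migrate", instance.primary))
    # 3. replace-disks again: a second target-group node becomes the
    #    new secondary, completing the move.
    instance.secondary = target_nodes[1]
    steps.append(("replace-disks", instance.secondary))
    return steps

inst = Instance("web1", primary="nodeA1", secondary="nodeA2")  # group A
move_drbd_instance(inst, ["nodeB1", "nodeB2"])                 # move to group B
assert (inst.primary, inst.secondary) == ("nodeB1", "nodeB2")
```

An interrupted move leaves the instance with nodes in two groups, which
is exactly the mid-transition state that cluster verify flags as an
error.
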
Cluster/Internal/Config level changes
+++++++++++++++++++++++++++++++++++++

We expect the following changes for cluster management:

- Frequent multinode operations, such as os-diagnose or cluster-verify,
  will act on one group at a time. The default group will be used if
  none is passed. Command line tools will have a way to easily target
  all groups, by generating one job per group.
- Groups will have a human-readable name, but will internally always be
  referenced by a UUID, which will be immutable. For example, the
  cluster object will contain the UUID of the default group, each node
  will contain the UUID of the group it belongs to, etc. This is done to
  simplify referencing while keeping it easy to handle renames and
  movements. If we see that this works well, we'll transition other
  config objects (instances, nodes) to the same model.
- We will evaluate the addition of a new per-group lock, to see whether
  some operations that currently require the BGL can be transitioned to
  it.
- Master candidate status will be allowed to be spread among groups. For
  the first version we won't add any restriction over how this is done,
  although in the future we may, for example, have a minimum number of
  master candidates which Ganeti will try to keep in each group.

Other work and future changes
+++++++++++++++++++++++++++++

Commands like gnt-cluster command/copyfile will continue to work on the
whole cluster, but it will be possible to target one group only by
specifying it.

Commands which allow selection of sets of resources (for example
gnt-instance start/stop) will be able to select them by node group as
well.

Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future version
should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration diffusion,
to improve master scalability. For example, it could be possible to
change some all-nodes RPCs to contact each group once, from the master,
and make one node in the group perform internal diffusion. We won't
implement this in the first version, but we'll evaluate it for the
future, if we see scalability problems on big multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, Sheepdog, Ceph) we
expect groups to be the basis for this, allowing for example a different
Sheepdog/Ceph cluster, or a different SAN, to be connected to each
group. In some cases this will mean that inter-group move operations
will necessarily be performed with instance downtime, unless the
hypervisor has block-migrate functionality, and we implement support for
it (this would be theoretically possible, today, with KVM, for example).


Job priorities
--------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all jobs and opcodes have the same priority. Once a job has
started executing, its thread won't be released until all opcodes have
gotten their locks and done their work. When a job is finished, the next
job is selected strictly by its incoming order. This does not mean jobs
are run in their incoming order: locks and other delays can cause them
to be stalled for some time.

In some situations, e.g. an emergency shutdown, one may want to run a
job as soon as possible. This is currently not possible if there are
pending jobs in the queue.

Proposed changes
~~~~~~~~~~~~~~~~

Each opcode will be assigned a priority on submission. Opcode priorities
are integers; the lower the number, the higher the opcode's priority.
Within the same priority, jobs and opcodes are initially processed in
their incoming order.

Submitted opcodes can have one of the priorities listed below. Other
priorities are reserved for internal use. The absolute range is
-20..+19. Opcodes submitted without a priority (e.g. by older clients)
are assigned the default priority.

- High (-10)
- Normal (0, default)
- Low (+10)

As a change from the current model, where executing a job blocks one
thread for the whole duration, the new job processor must return the job
to the queue after each opcode, and also if it can't get all locks in a
reasonable timeframe. This will allow opcodes of higher priority
submitted in the meantime to be processed, or opcodes of the same
priority to try to get their locks. When added to the job queue's
worker pool, the priority is determined by the first unprocessed opcode
in the job.

If an opcode is deferred, the job will go back to the "queued" status,
even though it's just waiting to try to acquire its locks again later.

If an opcode cannot be processed after a certain number of retries or a
certain amount of time, it should increase its priority. This will avoid
starvation.

A job's priority can never go below -20. If a job hits priority -20, it
must acquire its locks in blocking mode.

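The escalation rule above can be sketched as follows. Only the bounds
(-20..+19) and the blocking-at--20 behaviour come from this design; the
retry threshold and step size below are assumptions for illustration:

```python
# Sketch of opcode priority escalation. The -20..+19 range and the
# rule "at -20, acquire locks in blocking mode" are from the design;
# RETRIES_PER_STEP and PRIO_STEP are assumed values for illustration.

PRIO_HIGHEST = -20     # at this level, locks are taken in blocking mode
PRIO_LOWEST = 19

RETRIES_PER_STEP = 10  # assumed: bump priority after this many deferrals
PRIO_STEP = 1          # assumed: raise priority by one level per bump

def escalate(priority, retries):
    """Return (new_priority, new_retries, blocking) after a deferred attempt."""
    retries += 1
    if retries % RETRIES_PER_STEP == 0:
        # Raising priority means lowering the number, floored at -20.
        priority = max(PRIO_HIGHEST, priority - PRIO_STEP)
    # Once a job hits -20 it must take its locks in blocking mode.
    return priority, retries, priority == PRIO_HIGHEST

# A Normal-priority (0) opcode that keeps failing eventually reaches
# the blocking threshold instead of starving forever.
prio, tries, blocking = 0, 0, False
while not blocking:
    prio, tries, blocking = escalate(prio, tries)
assert prio == -20
```
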
Opcode priorities are synchronized to disk in order to be restored after
a restart or crash of the master daemon.

Priorities also need to be considered inside the locking library to
ensure opcodes with higher priorities get locks first, but the design
changes for this will be discussed in a separate section.

Worker pool
+++++++++++

To support job priorities in the job queue, the worker pool underlying
the job queue must be enhanced to support task priorities. Currently
tasks are processed in the order they are added to the queue (but, due
to their nature, they don't necessarily finish in that order). All tasks
are equal. To support tasks with higher or lower priority, a few changes
have to be made to the queue inside a worker pool.

Each task is assigned a priority when added to the queue. This priority
can not be changed until the task is executed (this is fine, as in all
current use-cases tasks are added to a pool and then forgotten about
until they're done).

A task's priority can be compared to Unix process priorities. The lower
the priority number, the closer to the queue's front it is. A task with
priority 0 is going to be run before one with priority 10. Tasks with
the same priority are executed in the order in which they were added.

While a task is running it can query its own priority. If it's not ready
yet for finishing, it can raise an exception to defer itself, optionally
changing its own priority. This is useful for the following cases:

- A task is trying to acquire locks, but those locks are still held by
  other tasks. By deferring itself, the task gives others a chance to
  run. This is especially useful when all workers are busy.
- If a task decides it hasn't gotten its locks in a long time, it can
  start to increase its own priority.
- Tasks waiting for long-running operations running asynchronously
  could defer themselves while waiting.

With these changes, the job queue will be able to implement per-job
priorities.

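The ordering rules above (lower number first, FIFO within the same
priority) can be modeled with a heap keyed on a (priority, insertion
counter) pair. This is an illustrative sketch, not Ganeti's worker pool
code:

```python
import heapq
import itertools

# Illustrative model of the priority-aware task queue: lower priority
# numbers come out first, and tasks with equal priority keep their
# insertion (FIFO) order thanks to a monotonic tiebreaker counter.

class PriorityTaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def add_task(self, task, priority=0):
        # The counter makes ordering stable for equal priorities and
        # avoids ever comparing the task objects themselves.
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop_task(self):
        _priority, _seq, task = heapq.heappop(self._heap)
        return task

queue = PriorityTaskQueue()
queue.add_task("low", priority=10)
queue.add_task("normal-1")             # default priority 0
queue.add_task("high", priority=-10)
queue.add_task("normal-2")             # same priority as normal-1
order = [queue.pop_task() for _ in range(4)]
assert order == ["high", "normal-1", "normal-2", "low"]
```

A deferring task would simply be re-added with its (possibly raised)
priority and a fresh counter value, placing it behind equal-priority
tasks that were already waiting.
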
IPv6 support
------------

Currently Ganeti does not support IPv6. This is true for nodes as well
as instances. With IPv4 address exhaustion drawing near, the need for
IPv6 is increasing, especially given that bigger and bigger clusters are
supported.

Supported IPv6 setup
~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.3 we introduce, in addition to the ordinary pure IPv4 setup,
a hybrid IPv6/IPv4 mode. The latter works as follows:

- all nodes in a cluster have a primary IPv6 address
- the master has an IPv6 address
- all nodes **must** have a secondary IPv4 address

The reason for this hybrid setup is that key components that Ganeti
depends on do not, or only partially, support IPv6. More precisely, Xen
does not support instance migration via IPv6 in versions 3.4 and 4.0.
Similarly, KVM supports neither instance migration nor VNC access over
IPv6 at the time of this writing.

This led to the decision of not supporting pure IPv6 Ganeti clusters, as
very important cluster operations would not have been possible. Using
IPv4 as the secondary address does not affect any of the goals of the
IPv6 support: since secondary addresses do not need to be publicly
accessible, they need not be globally unique. In other words, one can
practically use private IPv4 secondary addresses just for intra-cluster
communication without propagating them across layer 3 boundaries.

netutils: Utilities for handling common network tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently common utility functions are kept in the *utils* module. Since
this module keeps growing, network-related functions are moved to a
separate module named *netutils*. Additionally, all these utilities will
be IPv6-enabled.

Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

As mentioned above there will be two different setups in terms of IP
addressing: pure IPv4 and hybrid IPv6/IPv4. To choose between them, a
new cluster init parameter *--primary-ip-version* is introduced. This is
needed, as a given name can resolve to both an IPv4 and an IPv6 address
on a dual-stack host, effectively making it impossible to infer that
bit.

Once a cluster is initialized and the primary IP version chosen, all
nodes that join have to conform to that setup. In the case of our
IPv6/IPv4 setup, all nodes *must* have a secondary IPv4 address.

Furthermore, we store the primary IP version in ssconf, which is
consulted every time a daemon starts, to determine the default bind
address (either *0.0.0.0* or *::*). In an IPv6/IPv4 setup we need to
bind the Ganeti daemons listening on network sockets to the IPv6
address.

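The bind-address selection can be sketched as follows; the function and
constant names are hypothetical, but the mapping (IPv4 primary →
*0.0.0.0*, IPv6 primary → *::*) is the one described above:

```python
import socket

# Sketch of picking a daemon's default bind address from the cluster's
# primary IP version (as stored in ssconf). The names here are
# illustrative; only the 0.0.0.0/:: mapping comes from the design.

IP4_VERSION = 4
IP6_VERSION = 6

def default_bind_address(primary_ip_version):
    """Return (address family, wildcard address) for a listening socket."""
    if primary_ip_version == IP6_VERSION:
        # Hybrid IPv6/IPv4 cluster: daemons listen on the IPv6 wildcard.
        return socket.AF_INET6, "::"
    if primary_ip_version == IP4_VERSION:
        # Pure IPv4 cluster: listen on the IPv4 wildcard.
        return socket.AF_INET, "0.0.0.0"
    raise ValueError("Unknown primary IP version: %s" % primary_ip_version)

assert default_bind_address(IP6_VERSION) == (socket.AF_INET6, "::")
assert default_bind_address(IP4_VERSION) == (socket.AF_INET, "0.0.0.0")
```
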
Node addition
~~~~~~~~~~~~~

When adding a new node to an IPv6/IPv4 cluster, it must have an IPv6
address to be used as primary and an IPv4 address to be used as
secondary. As explained above, every time a daemon is started we use the
cluster primary IP version to determine which address to bind to. The
only exception to this is when a node is added to the cluster. In this
case there is no ssconf available when noded is started, and therefore
the correct address needs to be passed to it.

Name resolution
~~~~~~~~~~~~~~~

Since the gethostbyname*() functions do not support IPv6, name
resolution will be done using the recommended getaddrinfo().

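For illustration, a dual-stack-aware lookup via getaddrinfo() might look
like this; the helper name and the optional family restriction are
assumptions for the sketch, not part of the design:

```python
import socket

# Sketch of an IPv6-capable name lookup using getaddrinfo(), which
# (unlike the gethostbyname*() family) returns both A and AAAA
# results. The helper name and family parameter are illustrative only.

def resolve_host(name, family=socket.AF_UNSPEC):
    """Return the first resolved address for name.

    Pass socket.AF_INET6 or socket.AF_INET to restrict the lookup to
    one IP family; AF_UNSPEC accepts results from either family.
    """
    # Each entry is (family, type, proto, canonname, sockaddr).
    entries = socket.getaddrinfo(name, None, family, socket.SOCK_STREAM)
    # sockaddr is (host, port) for IPv4 and (host, port, flowinfo,
    # scope_id) for IPv6; the address is the first element either way.
    return entries[0][4][0]

# "localhost" resolves on any dual-stack host, in either family.
addr = resolve_host("localhost")
assert addr in ("127.0.0.1", "::1")
```
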
IPv4-only components
~~~~~~~~~~~~~~~~~~~~

============================ =================== ====================
Component                    IPv6 Status         Planned Version
============================ =================== ====================
Xen instance migration       Not supported       Xen 4.1: libxenlight
KVM instance migration       Not supported       Unknown
KVM VNC access               Not supported       Unknown
============================ =================== ====================


Feature changes
===============


External interface changes
==========================


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: