=================
Ganeti 2.3 design
=================

This document describes the major changes in Ganeti 2.3 compared to
the 2.2 version.

.. contents:: :depth: 4

As for 2.1 and 2.2 we divide the 2.3 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, os api, hooks, ...)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered as part of the
same pool, for allocation purposes: DRBD instances for example can be
allocated on any two nodes.

This does cause a problem in cases where nodes are not all equally
connected to each other. For example if a cluster is created over two
sets of machines, each connected to its own switch, the internal
bandwidth between machines connected to the same switch might be bigger
than the bandwidth for inter-switch connections.

Moreover some operations inside a cluster require all nodes to be locked
together for inter-node consistency, and won't scale if we increase the
number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group, the default one. Bigger
clusters will instead be able to have more than one group, and each node
will belong to exactly one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following new
commands/flags will be introduced::

  gnt-node group-add <group> # add a new node group
  gnt-node group-del <group> # delete an empty group
  gnt-node group-list # list node groups
  gnt-node group-rename <oldname> <newname> # rename a group
  gnt-node list/info -g <group> # list only nodes belonging to a group
  gnt-node add -g <group> # add a node to a certain group
  gnt-node modify -g <group> # move a node to a new group

Instance level changes
++++++++++++++++++++++

Instances will be able to live in only one group at a time. This is
mostly important for DRBD instances, in which case both their primary
and secondary nodes will need to be in the same group. To support this
we envision the following changes:

- The cluster will have a default group, which will initially be
- Instance allocation will happen to the cluster's default group
  (which will be changeable via gnt-cluster modify or RAPI) unless a
  group is explicitly specified in the creation job (with -g or via
  RAPI). The iallocator will only be passed the nodes belonging to that
  group.
- Moving an instance between groups can only happen via an explicit
  operation, which for example in the case of DRBD will work by
  performing internally a replace-disks, a migration, and a second
  replace-disks. It will be possible to clean up an interrupted
  group-move operation.
- Cluster verify will signal an error if an instance has been left
  mid-transition between groups.
- Inter-group instance migration/failover will check that the target
  group will be able to accept the instance network/storage wise, and
  fail otherwise. In the future we may be able to change some
  parameters during the move, but in the first version we expect an
  import/export if this is not possible.
- From an allocation point of view, inter-group movements will be
  shown to an iallocator as a new allocation over the target group.
  Only in a future version may we add allocator extensions to decide
  which group the instance should be in. In the meantime we expect
  Ganeti administrators to either put instances on different groups by
  filling all groups first, or to have their own strategy based on the
  instance needs.
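
The allocation rule above (use the default group unless one is
requested, and offer the allocator only that group's nodes) can be
sketched as follows. This is an illustration only: ``Node``,
``candidate_nodes`` and the ``uuid-a``/``uuid-b`` values are
hypothetical names, not part of the Ganeti code base.

.. code-block:: python

  from dataclasses import dataclass
  from typing import List, Optional


  @dataclass
  class Node:
      name: str
      group_uuid: str  # each node belongs to exactly one group


  def candidate_nodes(nodes: List[Node], default_group: str,
                      requested_group: Optional[str] = None) -> List[str]:
      """Return the names of the nodes an allocator would be offered."""
      # Fall back to the cluster default when no group was requested.
      if requested_group is None:
          target = default_group
      else:
          target = requested_group
      return sorted(n.name for n in nodes if n.group_uuid == target)


  nodes = [
      Node("node1", "uuid-a"),
      Node("node2", "uuid-a"),
      Node("node3", "uuid-b"),
  ]
  print(candidate_nodes(nodes, default_group="uuid-a"))
  print(candidate_nodes(nodes, default_group="uuid-a",
                        requested_group="uuid-b"))

Because the filter runs before the allocator is invoked, a DRBD
instance allocated this way cannot end up with its primary and
secondary nodes in different groups.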

Cluster/Internal/Config level changes
+++++++++++++++++++++++++++++++++++++

We expect the following changes for cluster management:

- Frequent multinode operations, such as os-diagnose or cluster-verify,
  will act on one group at a time. The default group will be used if
  none is passed. Command line tools will have a way to easily target
  all groups, by generating one job per group.
- Groups will have a human-readable name, but will internally always
  be referenced by a UUID, which will be immutable. For example the
  cluster object will contain the UUID of the default group, each node
  will contain the UUID of the group it belongs to, etc. This is done
  to simplify referencing while keeping it easy to handle renames and
  movements. If we see that this works well, we'll transition other
  config objects (instances, nodes) to the same model.
- The addition of a new per-group lock will be evaluated, if we can
  transition some operations now requiring the BGL to it.
- Master candidate status will be allowed to be spread among groups.
  For the first version we won't add any restriction over how this is
  done, although in the future we may have a minimum number of master
  candidates which Ganeti will try to keep in each group, for example.
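
To illustrate why referencing groups by an immutable UUID keeps renames
cheap, here is a minimal sketch. The ``Config`` and ``NodeGroup``
classes, their methods, and the ``"cluster-verify"`` job label are
assumptions made for this example and do not mirror the real Ganeti
configuration objects.

.. code-block:: python

  from dataclasses import dataclass, field
  from uuid import uuid4


  @dataclass
  class NodeGroup:
      name: str  # human-readable and mutable
      # The UUID is generated once and never changed afterwards.
      uuid: str = field(default_factory=lambda: str(uuid4()))


  class Config:
      def __init__(self):
          self.groups = {}      # group UUID -> NodeGroup
          self.node_group = {}  # node name -> group UUID

      def add_group(self, name):
          group = NodeGroup(name)
          self.groups[group.uuid] = group
          return group.uuid

      def rename_group(self, group_uuid, new_name):
          # Only the group object changes; nodes keep the same UUID.
          self.groups[group_uuid].name = new_name

      def jobs_for_all_groups(self, operation):
          # One (operation, group UUID) pair per group, i.e. one job each.
          return [(operation, group_uuid) for group_uuid in self.groups]


  cfg = Config()
  rack = cfg.add_group("rack1")
  cfg.node_group["node1"] = rack
  cfg.rename_group(rack, "rack-one")
  print(cfg.groups[cfg.node_group["node1"]].name)
  print(cfg.jobs_for_all_groups("cluster-verify"))

The rename touches a single object, and ``jobs_for_all_groups`` shows
how a command line tool could fan a multinode operation out into one
job per group.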

Other work and future changes
+++++++++++++++++++++++++++++

Commands like gnt-cluster command/copyfile will continue to work on the
whole cluster, but it will be possible to target one group only by
specifying it.

Commands which allow selection of sets of resources (for example
gnt-instance start/stop) will be able to select them by node group as
well.

Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future version
should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration diffusion,
to allow better master scalability. For example it could be possible
to change some all-nodes RPCs to contact each group once, from the
master, and make one node in the group perform internal diffusion. We
won't implement this in the first version, but we'll evaluate it for the
future, if we see scalability problems on big multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, sheepdog, ceph)
we expect groups to be the basis for this, allowing for example a
different sheepdog/ceph cluster, or a different SAN, to be connected to
each group. In some cases this will mean that inter-group move
operations will necessarily be performed with instance downtime, unless
the hypervisor has block-migrate functionality, and we implement support
for it (this would be theoretically possible, today, with KVM, for
example).


Job priorities
--------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. TODO: Describe current situation

Proposed changes
~~~~~~~~~~~~~~~~

.. TODO: Describe changes to job queue and potentially client programs

Worker pool
+++++++++++

To support job priorities in the job queue, the worker pool underlying
the job queue must be enhanced to support task priorities. Currently
tasks are processed in the order they are added to the queue (but, due
to their nature, they don't necessarily finish in that order). All tasks
are equal. To support tasks with higher or lower priority, a few changes
have to be made to the queue inside a worker pool.

Each task is assigned a priority when added to the queue. This priority
cannot be changed until the task is executed (this is fine as in all
current use-cases, tasks are added to a pool and then forgotten about
until they're done).

A task's priority can be compared to Unix process priorities. The lower
the priority number, the closer to the queue's front it is. A task with
priority 0 is going to be run before one with priority 10. Tasks with
the same priority are executed in the order in which they were added.
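
The ordering rules just described (lower number first, insertion order
among equals) can be sketched with a binary heap and a monotonically
increasing counter. ``PriorityTaskQueue`` is a hypothetical name used
for illustration; this document does not prescribe how the worker pool
implements its queue.

.. code-block:: python

  import heapq
  import itertools


  class PriorityTaskQueue:
      def __init__(self):
          self._heap = []
          # The counter breaks ties, so equal priorities stay FIFO.
          self._counter = itertools.count()

      def add(self, priority, task):
          entry = (priority, next(self._counter), task)
          heapq.heappush(self._heap, entry)

      def pop(self):
          """Remove and return (priority, task) from the queue's front."""
          priority, _, task = heapq.heappop(self._heap)
          return priority, task

      def __len__(self):
          return len(self._heap)


  queue = PriorityTaskQueue()
  queue.add(10, "low-priority job")
  queue.add(0, "urgent job")
  queue.add(10, "second low-priority job")
  order = [queue.pop()[1] for _ in range(len(queue))]
  print(order)

The insertion counter is what keeps equal-priority tasks in arrival
order; without it the heap would fall back to comparing the task
objects themselves.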

While a task is running it can query its own priority. If it is not yet
ready to finish, it can raise an exception to defer itself, optionally
changing its own priority. This is useful for the following cases:

- A task is trying to acquire locks, but those locks are still held by
  other tasks. By deferring itself, the task gives others a chance to
  run. This is especially useful when all workers are busy.
- If a task decides it hasn't gotten its locks in a long time, it can
  start to increase its own priority.
- Tasks waiting for asynchronous long-running operations could defer
  themselves while waiting.
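
A minimal, single-threaded sketch of deferral by exception follows. The
exception name ``DeferTask``, the ``run_all`` loop, and the two example
tasks are assumptions made for illustration; a real worker pool would
run tasks on several threads.

.. code-block:: python

  import heapq
  import itertools


  class DeferTask(Exception):
      """Raised by a task to be re-queued, optionally at a new priority."""

      def __init__(self, priority=None):
          super().__init__()
          self.priority = priority


  def run_all(tasks):
      """Run (priority, callable) pairs until none is left."""
      counter = itertools.count()
      heap = [(prio, next(counter), fn) for prio, fn in tasks]
      heapq.heapify(heap)
      while heap:
          prio, _, fn = heapq.heappop(heap)
          try:
              fn(prio)  # the task receives its current priority
          except DeferTask as err:
              new_prio = prio if err.priority is None else err.priority
              # A fresh counter value puts the task behind its peers.
              heapq.heappush(heap, (new_prio, next(counter), fn))


  log = []
  lock_is_free = [False]


  def holder(prio):
      log.append("holder releases lock")
      lock_is_free[0] = True


  def waiter(prio):
      if not lock_is_free[0]:
          log.append("waiter defers")
          raise DeferTask()
      log.append("waiter runs")


  run_all([(0, waiter), (0, holder)])
  print(log)

In this single-threaded sketch a deferred task only makes progress if
another task of equal or better priority can run in between; with
several workers, as in the real pool, the lock holder may already be
running elsewhere.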

With these changes, the job queue will be able to implement per-job
priorities.

IPv6 support
------------

Currently Ganeti does not support IPv6. This is true for nodes as well
as instances. Because IPv4 exhaustion is imminent, the need for IPv6 is
increasing, especially given that bigger and bigger clusters are
supported.

Supported IPv6 setup
~~~~~~~~~~~~~~~~~~~~

In addition to the ordinary pure IPv4 setup, Ganeti 2.3 introduces a
hybrid IPv6/IPv4 mode. The latter works as follows:

- all nodes in a cluster have a primary IPv6 address
- the master has an IPv6 address
- all nodes **must** have a secondary IPv4 address

The reason for this hybrid setup is that key components that Ganeti
depends on do not or only partially support IPv6. More precisely, Xen
does not support instance migration via IPv6 in versions 3.4 and 4.0.
Similarly, KVM supports neither instance migration nor VNC access over
IPv6 at the time of this writing.

This led to the decision not to support pure IPv6 Ganeti clusters, as
very important cluster operations would not have been possible. Using
IPv4 as secondary address does not affect any of the goals of the IPv6
support: since secondary addresses do not need to be publicly
accessible, they need not be globally unique. In other words, one can
use private IPv4 secondary addresses just for intra-cluster
communication without propagating them across layer 3 boundaries.

netutils: Utilities for handling common network tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently common util functions are kept in the utils module. Since
this module grows bigger and bigger, network-related functions are moved
to a separate module named *netutils*. Additionally all these utilities
will be IPv6-enabled.
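
As an example of what an IPv6-enabled helper might look like, the sketch
below classifies a literal address using only the standard ``socket``
module. The function name ``ip_family`` is hypothetical and is not
taken from the actual *netutils* module.

.. code-block:: python

  import socket


  def ip_family(address):
      """Return socket.AF_INET or socket.AF_INET6 for a literal address."""
      for family in (socket.AF_INET, socket.AF_INET6):
          try:
              # inet_pton() rejects text that is not valid for the family.
              socket.inet_pton(family, address)
          except OSError:
              continue
          return family
      raise ValueError("not a valid IP address: %r" % (address,))


  print(ip_family("192.0.2.1") == socket.AF_INET)
  print(ip_family("2001:db8::1") == socket.AF_INET6)

A helper of this kind lets callers treat both address families through
one code path instead of assuming dotted-quad input.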

Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

As mentioned above there will be two different setups in terms of IP
addressing: pure IPv4 and hybrid IPv6/IPv4. To choose between them, a
new cluster init parameter *--primary-ip-version* is introduced. This is
needed because a given name can resolve to both an IPv4 and an IPv6
address on a dual-stack host, effectively making it impossible to infer
the intended version.

Once a cluster is initialized and the primary IP version chosen, all
nodes that join have to conform to that setup. In the case of our
IPv6/IPv4 setup all nodes *must* have a secondary IPv4 address.

Furthermore we store the primary IP version in ssconf, which is
consulted every time a daemon starts to determine the default bind
address (either *0.0.0.0* or *::*). In an IPv6/IPv4 setup we need to
bind the Ganeti daemons listening on network sockets to the IPv6
address.
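
The mapping from primary IP version to default bind address can be
sketched as below. How the version is actually stored in and read from
ssconf is not specified here; this example assumes it is available as
the integer 4 or 6, and ``default_bind_address`` is a hypothetical name.

.. code-block:: python

  import socket


  def default_bind_address(primary_ip_version):
      """Return (address family, "any" address) for the given version."""
      if primary_ip_version == 4:
          return socket.AF_INET, "0.0.0.0"
      if primary_ip_version == 6:
          return socket.AF_INET6, "::"
      raise ValueError("unsupported IP version: %r" % (primary_ip_version,))


  family, address = default_bind_address(6)
  print(address)

A daemon would then create its listening socket with the returned
family and bind it to the returned address.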

Node addition
~~~~~~~~~~~~~

When adding a new node to an IPv6/IPv4 cluster it must have an IPv6
address to be used as primary and an IPv4 address used as secondary. As
explained above, every time a daemon is started we use the cluster
primary IP version to determine which "any" address to bind to. The
only exception to this is when a node is added to the cluster. In this
case there is no ssconf available when noded is started and therefore
the correct address needs to be passed to it.

Name resolution
~~~~~~~~~~~~~~~

Since the gethostbyname*() functions do not support IPv6, name
resolution will be done by using the recommended getaddrinfo().
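
For reference, this is roughly how family-aware resolution looks with
Python's ``socket.getaddrinfo()``. The wrapper name ``resolve`` and its
return format are assumptions made for this sketch, not the interface
Ganeti will expose.

.. code-block:: python

  import socket


  def resolve(hostname, family=socket.AF_UNSPEC):
      """Return unique (family, address) pairs for hostname."""
      infos = socket.getaddrinfo(hostname, None, family,
                                 socket.SOCK_STREAM)
      result = []
      for fam, _socktype, _proto, _canonname, sockaddr in infos:
          entry = (fam, sockaddr[0])  # sockaddr[0] is the address string
          if entry not in result:
              result.append(entry)
      return result


  # A numeric address needs no DNS lookup, so this works offline.
  print(resolve("127.0.0.1", socket.AF_INET))

Passing ``socket.AF_INET6`` instead restricts the result to IPv6
addresses, which is how the cluster's primary IP version could be
enforced during resolution.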

IPv4-only components
~~~~~~~~~~~~~~~~~~~~

============================ =================== ====================
Component                    IPv6 Status         Planned Version
============================ =================== ====================
Xen instance migration       Not supported       Xen 4.1: libxenlight
KVM instance migration       Not supported       Unknown
KVM VNC access               Not supported       Unknown
============================ =================== ====================


Feature changes
===============


External interface changes
==========================


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: