=================
Ganeti 2.3 design
=================

This document describes the major changes in Ganeti 2.3 compared to
the 2.2 version.

.. contents:: :depth: 4

As for 2.1 and 2.2 we divide the 2.3 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, os api, hooks, ...)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered part of the same
pool for allocation purposes: DRBD instances, for example, can be
allocated on any two nodes.

This causes a problem when nodes are not all equally connected to each
other. For example, if a cluster is created over two sets of machines,
each connected to its own switch, the internal bandwidth between
machines connected to the same switch might be bigger than the
bandwidth for inter-switch connections.

Moreover, some operations inside a cluster require all nodes to be
locked together for inter-node consistency, and won't scale if we
increase the number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group, the default one. Bigger
clusters, instead, will be able to have more than one group, and each
node will belong to exactly one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following new
commands/flags will be introduced::

  gnt-node group-add <group> # add a new node group
  gnt-node group-del <group> # delete an empty group
  gnt-node group-list # list node groups
  gnt-node group-rename <oldname> <newname> # rename a group
  gnt-node list/info -g <group> # list only nodes belonging to a group
  gnt-node add -g <group> # add a node to a certain group
  gnt-node modify -g <group> # move a node to a new group

Instance level changes
++++++++++++++++++++++

Instances will be able to live in only one group at a time. This is
mostly important for DRBD instances, in which case both their primary
and secondary nodes will need to be in the same group. To support this
we envision the following changes:

  - The cluster will have a default group, which will initially be
  - Instance allocation will happen to the cluster's default group
    (which will be changeable via gnt-cluster modify or RAPI) unless a
    group is explicitly specified in the creation job (with -g or via
    RAPI). The iallocator will only be passed the nodes belonging to
    that group.
  - Moving an instance between groups can only happen via an explicit
    operation, which for example in the case of DRBD will work by
    internally performing a replace-disks, a migration, and a second
    replace-disks. It will be possible to clean up an interrupted
    group-move operation.
  - Cluster verify will signal an error if an instance has been left
    mid-transition between groups.
  - Inter-group instance migration/failover will check that the target
    group will be able to accept the instance network/storage wise, and
    fail otherwise. In the future we may be able to allow some
    parameters to be changed during the move, but in the first version
    we expect an import/export if this is not possible.
  - From an allocation point of view, inter-group movements will be
    shown to an iallocator as a new allocation over the target group.
    Only in a future version may we add allocator extensions to decide
    which group the instance should be in. In the meantime we expect
    Ganeti administrators to either put instances on different groups by
    filling all groups first, or to have their own strategy based on the
    instance needs.
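
The allocation rule and the DRBD constraint described above can be
illustrated with a minimal Python 3 sketch. The ``Node`` type, the group
names and both helper functions are hypothetical and only serve the
illustration; they are not part of Ganeti's code or of the iallocator
protocol.

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    """Hypothetical, simplified view of a node: a name and its group."""
    name: str
    group: str


def candidate_nodes(nodes: List[Node], default_group: str,
                    requested_group: Optional[str] = None) -> List[Node]:
    """Return the nodes an allocator would be shown for a new instance.

    An explicitly requested group wins; otherwise the cluster's default
    group is used.
    """
    target = requested_group if requested_group is not None else default_group
    return [node for node in nodes if node.group == target]


def same_group(primary: Node, secondary: Node) -> bool:
    """Check the DRBD constraint: both nodes must share one group."""
    return primary.group == secondary.group


# Example data (assumed names, for illustration only).
NODES = [
    Node("node1", "default"),
    Node("node2", "default"),
    Node("node3", "rack2"),
    Node("node4", "rack2"),
]

default_candidates = candidate_nodes(NODES, "default")
rack2_candidates = candidate_nodes(NODES, "default", requested_group="rack2")
```

With this data, an allocation without an explicit group only sees
``node1`` and ``node2``, while a DRBD pair made of ``node1`` and
``node3`` would be rejected by ``same_group``.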

Cluster/Internal/Config level changes
+++++++++++++++++++++++++++++++++++++

We expect the following changes for cluster management:

  - Frequent multinode operations, such as os-diagnose or
    cluster-verify, will act on one group at a time. The default group
    will be used if none is passed. Command line tools will have a way
    to easily target all groups, by generating one job per group.
  - Groups will have a human-readable name, but will internally always
    be referenced by a UUID, which will be immutable. For example the
    cluster object will contain the UUID of the default group, each node
    will contain the UUID of the group it belongs to, etc. This is done
    to simplify referencing while keeping it easy to handle renames and
    movements. If we see that this works well, we'll transition other
    config objects (instances, nodes) to the same model.
  - We will evaluate adding a new per-group lock, provided we can move
    some operations that now require the BGL (Big Ganeti Lock) to it.
  - Master candidate status will be allowed to be spread among groups.
    For the first version we won't add any restriction over how this is
    done, although in the future we may have a minimum number of master
    candidates which Ganeti will try to keep in each group, for example.
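
The UUID-based referencing described above can be sketched as follows
(Python 3). All class and attribute names are assumptions made for this
example and do not reflect Ganeti's actual configuration objects; the
point is only that a rename touches a single object because every
reference goes through the immutable UUID.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeGroup:
    """Hypothetical group object: mutable name, immutable UUID."""
    name: str
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Node:
    """Hypothetical node object referencing its group by UUID."""
    name: str
    group_uuid: str


@dataclass
class Cluster:
    default_group_uuid: str
    groups: Dict[str, NodeGroup]  # keyed by group UUID
    nodes: List[Node]

    def rename_group(self, group_uuid: str, new_name: str) -> None:
        # Only the group object changes; no node needs to be updated.
        self.groups[group_uuid].name = new_name

    def nodes_by_group(self) -> Dict[str, List[str]]:
        """Node names per group UUID, e.g. as a basis for one job per group."""
        result: Dict[str, List[str]] = {g: [] for g in self.groups}
        for node in self.nodes:
            result[node.group_uuid].append(node.name)
        return result


default = NodeGroup("default")
rack2 = NodeGroup("rack2")
cluster = Cluster(
    default_group_uuid=default.uuid,
    groups={default.uuid: default, rack2.uuid: rack2},
    nodes=[Node("node1", default.uuid), Node("node2", rack2.uuid)],
)
cluster.rename_group(rack2.uuid, "rack-b")
```

After the rename, ``node2`` still resolves to its group through the
unchanged UUID and simply sees the new name.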

Other work and future changes
+++++++++++++++++++++++++++++

Commands like gnt-cluster command/copyfile will continue to work on the
whole cluster, but it will be possible to target one group only by
specifying it.

Commands which allow selection of sets of resources (for example
gnt-instance start/stop) will be able to select them by node group as
well.

Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future version
should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration diffusion,
to allow better master scalability. For example it could be possible
to change some all-nodes RPCs to contact each group once, from the
master, and make one node in the group perform internal diffusion. We
won't implement this in the first version, but we'll evaluate it for the
future, if we see scalability problems on big multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, sheepdog, ceph)
we expect groups to be the basis for this, allowing for example a
different sheepdog/ceph cluster, or a different SAN, to be connected to
each group. In some cases this will mean that inter-group move
operations will necessarily be performed with instance downtime, unless
the hypervisor has block-migrate functionality and we implement support
for it (this would be theoretically possible, today, with KVM, for
example).

Job priorities
--------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. TODO: Describe current situation

Proposed changes
~~~~~~~~~~~~~~~~

.. TODO: Describe changes to job queue and potentially client programs

Worker pool
+++++++++++

To support job priorities in the job queue, the worker pool underlying
the job queue must be enhanced to support task priorities. Currently
tasks are processed in the order they are added to the queue (but, due
to their nature, they don't necessarily finish in that order). All tasks
are equal. To support tasks with higher or lower priority, a few changes
have to be made to the queue inside a worker pool.

Each task is assigned a priority when added to the queue. This priority
cannot be changed until the task is executed (this is fine as in all
current use-cases, tasks are added to a pool and then forgotten about
until they're done).

A task's priority can be compared to Unix process priorities. The lower
the priority number, the closer to the queue's front it is. A task with
priority 0 is going to be run before one with priority 10. Tasks with
the same priority are executed in the order in which they were added.
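
One possible way to obtain this ordering is a binary heap keyed on the
priority plus a monotonically increasing sequence number. The following
Python 3 sketch uses the standard ``heapq`` module; the class name and
its methods are hypothetical and not taken from Ganeti's worker pool.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Hypothetical queue: lowest priority number first, FIFO among equals."""

    def __init__(self):
        self._heap = []
        # The sequence number breaks ties between equal priorities and
        # guarantees that the task objects themselves are never compared.
        self._counter = itertools.count()

    def add(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop(self):
        priority, _, task = heapq.heappop(self._heap)
        return priority, task

    def __len__(self):
        return len(self._heap)


queue = PriorityTaskQueue()
queue.add(10, "low-priority task")
queue.add(0, "urgent task A")
queue.add(0, "urgent task B")

order = [queue.pop()[1] for _ in range(len(queue))]
```

Here ``order`` lists both priority 0 tasks in insertion order, followed
by the priority 10 task.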

While a task is running it can query its own priority. If it's not ready
to finish yet, it can raise an exception to defer itself, optionally
changing its own priority. This is useful for the following cases:

- A task is trying to acquire locks, but those locks are still held by
  other tasks. By deferring itself, the task gives others a chance to
  run. This is especially useful when all workers are busy.
- If a task decides it hasn't gotten its locks in a long time, it can
  start to increase its own priority.
- Tasks waiting for asynchronous long-running operations could defer
  themselves until those operations have completed.
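
The deferral mechanism can be sketched with a single worker loop
(Python 3). ``DeferTask``, ``run_all`` and the two example tasks are
hypothetical names chosen for this illustration; a real worker pool
would run several such loops in threads and use real locks.

```python
import heapq
import itertools


class DeferTask(Exception):
    """Raised by a task to be re-queued, optionally with a new priority."""

    def __init__(self, priority=None):
        super().__init__()
        self.priority = priority


def run_all(tasks):
    """Run (priority, callable) pairs on one worker until all have finished.

    Each callable receives its current priority and may raise DeferTask.
    Returns the callables' names in the order in which they completed.
    """
    counter = itertools.count()
    heap = [(prio, next(counter), fn) for prio, fn in tasks]
    heapq.heapify(heap)
    finished = []
    while heap:
        prio, _, fn = heapq.heappop(heap)
        try:
            fn(prio)
        except DeferTask as err:
            new_prio = prio if err.priority is None else err.priority
            # A fresh sequence number queues the task behind its peers.
            heapq.heappush(heap, (new_prio, next(counter), fn))
        else:
            finished.append(fn.__name__)
    return finished


# Stand-in for a lock that "holder" releases when it runs.
lock_held = {"value": True}


def holder(priority):
    lock_held["value"] = False


def waiter(priority):
    if lock_held["value"]:
        raise DeferTask()  # keep the same priority, just step aside


result = run_all([(0, waiter), (0, holder)])
```

Although ``waiter`` was queued first, it defers itself once, lets
``holder`` run, and completes on its second attempt. Note that in this
single-worker sketch a deferring task must not outrank the task it is
waiting for, otherwise it would be picked again immediately.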

With these changes, the job queue will be able to implement per-job
priorities.

IPv6 support
------------

Currently Ganeti does not support IPv6. This is true for nodes as well
as instances. Because IPv4 exhaustion is threateningly near, the need
for IPv6 is increasing, especially given that bigger and bigger
clusters are supported.

Supported IPv6 setup
~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.3 we introduce, in addition to the ordinary pure IPv4
setup, a hybrid IPv6/IPv4 mode. The latter works as follows:

- all nodes in a cluster have a primary IPv6 address
- the master has an IPv6 address
- all nodes **must** have a secondary IPv4 address

The reason for this hybrid setup is that key components that Ganeti
depends on do not support IPv6, or support it only partially. More
precisely, Xen does not support instance migration via IPv6 in versions
3.4 and 4.0. Similarly, KVM supports neither instance migration nor VNC
access over IPv6 at the time of this writing.

This led to the decision not to support pure IPv6 Ganeti clusters, as
very important cluster operations would not have been possible. Using
IPv4 as secondary address does not affect any of the goals
of the IPv6 support: since secondary addresses do not need to be
publicly accessible, they need not be globally unique. In other words,
one can practically use private IPv4 secondary addresses just for
intra-cluster communication without propagating them across layer 3
boundaries.

netutils: Utilities for handling common network tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently common utility functions are kept in the utils module. Since
this module keeps growing, network-related functions are moved to a
separate module named *netutils*. Additionally all these utilities
will be IPv6-enabled.

Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

As mentioned above there will be two different setups in terms of IP
addressing: pure IPv4 and hybrid IPv6/IPv4. To choose between them, a
new cluster init parameter *--primary-ip-version* is introduced. This is
needed as a given name can resolve to both an IPv4 and an IPv6 address
on a dual-stack host, effectively making it impossible to infer the
intended IP version.

Once a cluster is initialized and the primary IP version chosen, all
nodes that join have to conform to that setup. In the case of our
IPv6/IPv4 setup all nodes *must* have a secondary IPv4 address.

Furthermore we store the primary IP version in ssconf, which is
consulted every time a daemon starts to determine the default bind
address (either *0.0.0.0* or *::*). In an IPv6/IPv4 setup we need to
bind the Ganeti daemons listening on network sockets to the IPv6
address.
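
The mapping from the stored primary IP version to a default bind
address could look like the following Python 3 sketch. The function and
constant names are hypothetical, and reading the value from ssconf is
deliberately left out: the version is simply passed in as an integer.

```python
import socket

IP4_VERSION = 4
IP6_VERSION = 6

# Address family and "any" address for each supported primary IP version.
_ANY_ADDRESS = {
    IP4_VERSION: (socket.AF_INET, "0.0.0.0"),
    IP6_VERSION: (socket.AF_INET6, "::"),
}


def default_bind(primary_ip_version):
    """Map the cluster's primary IP version to (family, any-address)."""
    try:
        return _ANY_ADDRESS[primary_ip_version]
    except KeyError:
        raise ValueError("unsupported IP version: %r" % (primary_ip_version,))


family, address = default_bind(IP6_VERSION)
```

A daemon would then create its listening socket with ``family`` and bind
it to ``address``; this step is omitted here because it depends on the
host actually having IPv6 enabled.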

Node addition
~~~~~~~~~~~~~

When adding a new node to an IPv6/IPv4 cluster it must have an IPv6
address to be used as primary and an IPv4 address used as secondary. As
explained above, every time a daemon is started we use the cluster
primary IP version to determine which wildcard ("any") address to bind
to. The only exception to this is when a node is added to the cluster.
In this case there is no ssconf available when noded is started and
therefore the correct address needs to be passed to it.
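
The address requirements for a joining node can be expressed as a small
check. This Python 3 sketch relies on the standard ``ipaddress`` module;
the function name and its error messages are assumptions made for the
example, and the addresses come from the documentation ranges.

```python
import ipaddress


def check_node_addresses(primary_ip_version, primary, secondary):
    """Return a list of problems; an empty list means the node conforms.

    The primary address must match the cluster's primary IP version and
    the secondary address must be IPv4.
    """
    problems = []
    primary_addr = ipaddress.ip_address(primary)
    secondary_addr = ipaddress.ip_address(secondary)
    if primary_addr.version != primary_ip_version:
        problems.append("primary address %s is not IPv%d"
                        % (primary, primary_ip_version))
    if secondary_addr.version != 4:
        problems.append("secondary address %s is not IPv4" % secondary)
    return problems


ok = check_node_addresses(6, "2001:db8::1", "192.0.2.10")
bad_primary = check_node_addresses(6, "192.0.2.1", "192.0.2.10")
```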

Name resolution
~~~~~~~~~~~~~~~

Since the gethostbyname*() functions do not support IPv6, name
resolution will be done by using the recommended getaddrinfo().
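
As an illustration, Python's ``socket.getaddrinfo()`` can restrict the
lookup to one address family, which is how a resolver could honour the
cluster's primary IP version. The ``resolve`` helper below is a
hypothetical Python 3 sketch, not the function Ganeti ships.

```python
import socket


def resolve(name, family=socket.AF_UNSPEC, flags=0):
    """Return the unique addresses getaddrinfo() reports for a name.

    Pass socket.AF_INET or socket.AF_INET6 as family to restrict the
    result to IPv4 or IPv6 addresses respectively.
    """
    infos = socket.getaddrinfo(name, None, family, socket.SOCK_STREAM,
                               0, flags)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        # sockaddr[0] is the address string for both IPv4 and IPv6.
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses
```

For example, ``resolve("node1.example.com", socket.AF_INET6)`` would
return only the IPv6 addresses of that (made-up) name, raising
``socket.gaierror`` if there are none.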

IPv4-only components
~~~~~~~~~~~~~~~~~~~~

============================  ===================  ====================
Component                     IPv6 Status          Planned Version
============================  ===================  ====================
Xen instance migration        Not supported        Xen 4.1: libxenlight
KVM instance migration        Not supported        Unknown
KVM VNC access                Not supported        Unknown
============================  ===================  ====================


Feature changes
===============


External interface changes
==========================


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: