=================
Ganeti 2.1 design
=================

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to avoid
changing too much of the core code, while addressing issues and adding new
features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Ganeti 2.1 will add features to help further automation of cluster
operations, further improve scalability to even bigger clusters, and make it
easier to debug the Ganeti core.

Background
==========

Overview
========

Detailed design
===============

As for 2.0 we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or all/most
  logical units
- logical unit/feature changes
- external interface changes (eg. command line, os api, hooks, ...)

Core changes
------------

Storage units modelling
~~~~~~~~~~~~~~~~~~~~~~~

Currently, Ganeti has a good model of the block devices for instances
(e.g. LVM logical volumes, files, DRBD devices, etc.) but none of the
storage pools that provide the space for these front-end devices. For
example, there are hardcoded inter-node RPC calls for volume group
listing, file storage creation/deletion, etc.

The storage units framework will implement generic handling for all
kinds of storage backends:

- LVM physical volumes
- LVM volume groups
- File-based storage directories
- any other future storage method

There will be a generic list of methods that each storage unit type
will provide, like:

- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage unit type, for
example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion
- VG consistency fixing

This will allow a much better modeling and unification of the various
RPC calls related to backend storage pools in the future. Ganeti 2.1 is
intended to add the basics of the framework, and not necessarily move
all the current VG/FileBased operations to it.
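
As an illustration only, and not the final API (the class and method names
below are made up for this sketch), the generic interface could look roughly
like this::

  class StorageUnit(object):
    """Generic operations every storage unit type would provide."""

    def List(self):
      """Return the storage units of this type found on the node."""
      raise NotImplementedError

    def GetStatus(self, name):
      """Return the status of the storage unit identified by name."""
      raise NotImplementedError


  class LvmPvStorage(StorageUnit):
    """LVM physical volumes, adding a type-specific method."""

    def List(self):
      # A real implementation would parse the output of e.g. "pvs".
      return []

    def GetStatus(self, name):
      return {"name": name, "allocatable": True}

    def SetAllocatableState(self, name, allocatable):
      """Type-specific method: enable/disable allocations on a PV."""
      # A real implementation would run "pvchange -x" on the given PV.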

Note that while we model both LVM PVs and LVM VGs, the framework will
**not** model any relationship between the different types. In other
words, we model neither inheritance nor stacking, since this is too
complex for our needs. While a ``vgreduce`` operation on an LVM VG
could actually remove a PV from it, this will not be handled at the
framework level, but at the individual operation level. The goal is
that this is a lightweight framework, for abstracting the different
storage operations, and not for modelling the storage hierarchy.


Locking improvements
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The class ``LockSet`` (see ``lib/locking.py``) is a container for one or
many ``SharedLock`` instances. It provides an interface to add/remove locks
and to acquire and subsequently release any number of those locks contained
in it.

Locks in a ``LockSet`` are always acquired in alphabetic order. Due to the
way we're using locks for nodes and instances (the single cluster lock isn't
affected by this issue) this can lead to long delays when acquiring locks if
another operation tries to acquire multiple locks but has to wait for yet
another operation.

In the following demonstration we assume we have the instance locks
``inst1``, ``inst2``, ``inst3`` and ``inst4``.

#. Operation A grabs the lock for instance ``inst4``.
#. Operation B wants to acquire all instance locks in alphabetic order, but
   it has to wait for ``inst4``.
#. Operation C tries to lock ``inst1``, but it has to wait until
   Operation B (which is trying to acquire all locks) releases the lock
   again.
#. Operation A finishes and releases the lock on ``inst4``. Operation B can
   continue and eventually releases all locks.
#. Operation C can get the ``inst1`` lock and finishes.

Technically there's no need for Operation C to wait for Operation A, and
subsequently Operation B, to finish. Operation B can't continue until
Operation A is done (it has to wait for ``inst4``) anyway.

Proposed changes
++++++++++++++++

Non-blocking lock acquiring
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Acquiring locks for OpCode execution is always done in blocking mode: the
acquiring calls won't return until the lock has successfully been acquired
(or an error occurred, although we won't cover that case here).

``SharedLock`` and ``LockSet`` must be able to be acquired in a non-blocking
way. They must support a timeout and abort trying to acquire the lock(s)
after the specified amount of time.

Retry acquiring locks
^^^^^^^^^^^^^^^^^^^^^

To prevent other operations from waiting for a long time, such as described
in the demonstration before, ``LockSet`` must not keep locks for a prolonged
period of time when trying to acquire two or more locks. Instead it should,
with an increasing timeout for acquiring all locks, release all locks again
and sleep some time if it fails to acquire all requested locks.

A good timeout value needs to be determined. In any case, ``LockSet`` should
proceed to acquire locks in blocking mode after a few (unsuccessful)
attempts to acquire all requested locks.

One proposal for the timeout is to use ``2**tries`` seconds, where ``tries``
is the number of unsuccessful tries.
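
A minimal sketch of this behaviour, assuming a hypothetical
``lockset.acquire(names, timeout=...)`` call that returns a true value on
success and ``None`` when the timeout expires (releasing any partially
acquired locks itself)::

  import time


  def acquire_with_backoff(lockset, names, max_tries=5):
    """Retry acquiring a set of locks with an increasing timeout."""
    for tries in range(max_tries):
      if lockset.acquire(names, timeout=2 ** tries):
        return True
      # Nothing is kept while we sleep, so other operations (such as
      # Operation C in the demonstration above) can make progress.
      time.sleep(1)
    # After a few unsuccessful attempts, fall back to blocking mode.
    return lockset.acquire(names)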

In the demonstration before this would allow Operation C to continue after
Operation B unsuccessfully tried to acquire all locks and released all
acquired locks (``inst1``, ``inst2`` and ``inst3``) again.

Other solutions discussed
+++++++++++++++++++++++++

There was also some discussion on going one step further and extending the
job queue (see ``lib/jqueue.py``) to select the next task for a worker
depending on whether it can acquire the necessary locks. While this may
reduce the number of necessary worker threads and/or increase throughput on
large clusters with many jobs, it also brings with it many potential
problems, such as contention and increased memory usage. As this would be an
extension of the changes proposed before, it could be implemented at a later
point in time, but we decided to stay with the simpler solution for now.

Implementation details
++++++++++++++++++++++

``SharedLock`` redesign
^^^^^^^^^^^^^^^^^^^^^^^

The current design of ``SharedLock`` is not well suited to supporting
timeouts when acquiring a lock, and it also has minor fairness issues. We
plan to address both with a redesign. A proof of concept implementation was
written and resulted in significantly simpler code.

Currently ``SharedLock`` uses two separate queues for shared and exclusive
acquires, and waiters get to run in turns. This means that if an exclusive
acquire is released, the lock will allow shared waiters to run, and vice
versa. Although it's still fair in the end, there is a slight bias towards
shared waiters in the current implementation. The same implementation with
two separate queues cannot support timeouts without adding a lot of
complexity.

Our proposed redesign changes ``SharedLock`` to have only a single queue.
There will be one condition (see Condition_ for a note about performance) in
the queue per exclusive acquire and two for all shared acquires (see below
for an explanation). The maximum queue length will always be ``2 + (number
of exclusive acquires waiting)``. The number of queue entries for shared
acquires can vary from 0 to 2.

The two conditions for shared acquires are a bit special. They will be used
in turn. When the lock is instantiated, no conditions are in the queue. As
soon as the first shared acquire arrives (and there are holder(s) or waiting
acquires; see Acquire_), the active condition is added to the queue. Until
it becomes the topmost condition in the queue and has been notified, any
shared acquire is added to this active condition. When the active condition
is notified, the conditions are swapped and further shared acquires are
added to the previously inactive condition (which has now become the active
condition). After all waiters on the previously active (now inactive) and
now notified condition have received the notification, it is removed from
the queue of pending acquires.

This means shared acquires will skip any exclusive acquire in the queue. We
believe it's better to improve parallelization of operations only asking for
shared (or read-only) locks. Exclusive operations holding the same lock
cannot be parallelized.


Acquire
*******

For exclusive acquires a new condition is created and appended to the queue.
Shared acquires are added to the active condition for shared acquires and if
the condition is not yet on the queue, it's appended.

The next step is to wait for our condition to be on the top of the queue (to
guarantee fairness). If the timeout expired, we return to the caller without
acquiring the lock. On every notification we check whether the lock has been
deleted, in which case an error is returned to the caller.

The lock can be acquired if we're on top of the queue (there is no one else
ahead of us). For an exclusive acquire, there must not be other exclusive or
shared holders. For a shared acquire, there must not be an exclusive holder.
If these conditions are all true, the lock is acquired and we return to the
caller. In any other case we wait again on the condition.
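
The acquire check itself boils down to a small predicate; as an illustration
only (the names below are not the final implementation)::

  def can_acquire(at_queue_top, shared, exclusive_holders, shared_holders):
    """Check whether a waiter may take the lock right now.

    Only the waiter at the top of the queue may acquire; an exclusive
    acquire additionally requires that nobody holds the lock at all, while
    a shared acquire only requires that there is no exclusive holder.
    """
    if not at_queue_top:
      return False
    if shared:
      return exclusive_holders == 0
    return exclusive_holders == 0 and shared_holders == 0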

If it was the last waiter on a condition, the condition is removed from the
queue.

Optimization: There's no need to touch the queue if there are no pending
acquires and no current holders. The caller can have the lock immediately.

.. image:: design-2.1-lock-acquire.png


Release
*******

First the lock removes the caller from the internal owner list. If there are
pending acquires in the queue, the first (the oldest) condition is notified.

If the first condition was the active condition for shared acquires, the
inactive condition will be made active. This ensures fairness with exclusive
locks by forcing consecutive shared acquires to wait in the queue.

.. image:: design-2.1-lock-release.png


Delete
******

The caller must either hold the lock in exclusive mode already or the lock
must be acquired in exclusive mode. Trying to delete a lock while it's held
in shared mode must fail.

After ensuring the lock is held in exclusive mode, the lock will mark itself
as deleted and continue to notify all pending acquires. They will wake up,
notice the deleted lock and return an error to the caller.


Condition
^^^^^^^^^

Note: This is not necessary for the locking changes above, but it may be a
good optimization (pending performance tests).

The existing locking code in Ganeti 2.0 uses Python's built-in
``threading.Condition`` class. Unfortunately ``Condition`` implements
timeouts by sleeping 1ms to 20ms between tries to acquire the condition lock
in non-blocking mode. This requires unnecessary context switches and
contention on the CPython GIL (Global Interpreter Lock).

By using POSIX pipes (see ``pipe(2)``) we can use the operating system's
support for timeouts on file descriptors (see ``select(2)``). A custom
condition class will have to be written for this.

On instantiation the class creates a pipe. After each notification the
previous pipe is abandoned and re-created (technically the old pipe needs to
stay around until all notifications have been delivered).

All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to
wait for notifications, optionally with a timeout. A notification will be
signalled to the waiting clients by closing the pipe. If the pipe wasn't
closed during the timeout, the waiting function returns to its caller
nonetheless.
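
A stripped-down sketch of such a condition class (illustrative only, not the
class we will actually ship; in particular it does not close the old read
end once all notified waiters have returned)::

  import os
  import select


  class PipeCondition(object):
    """Pipe-based condition; the guarding lock must be held around
    wait() and notifyAll(), just like with threading.Condition.
    """

    def __init__(self, lock):
      self._lock = lock
      self._read_fd, self._write_fd = os.pipe()

    def wait(self, timeout=None):
      read_fd = self._read_fd
      self._lock.release()
      try:
        # The operating system handles the timeout; closing the write end
        # of the pipe makes read_fd readable (EOF) and wakes us up.
        select.select([read_fd], [], [], timeout)
      finally:
        self._lock.acquire()

    def notifyAll(self):
      # Notify by closing the write end, then hand a fresh pipe to future
      # waiters; the old read end should be cleaned up once all notified
      # waiters have returned (omitted in this sketch).
      os.close(self._write_fd)
      self._read_fd, self._write_fd = os.pipe()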


Feature changes
---------------

Ganeti Confd
~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than others. In
particular they are divided between "master", "master candidates" and
"normal" nodes. (Moreover they can be offline or drained, but this is not
important for the current discussion). In general the whole configuration is
only replicated to master candidates, and some partial information is spread
to all nodes via ssconf.

This change was done so that the most frequent Ganeti operations didn't need
to contact all nodes, and so clusters could become bigger. If we want more
information to be available on all nodes, we need to add more ssconf values,
which counteracts that change, or to talk to the master node, which is not
designed to happen now and would require the master to be available.

Information such as the instance->primary_node mapping will be needed on all
nodes, and we also want to make sure services external to the cluster can
query this information as well. This information must be available at all
times, so we can't query it through RAPI, which would be a single point of
failure, as it's only available on the master.


Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which will run
on master candidates. This daemon will talk via UDP, and authenticate
messages using HMAC with a cluster-wide shared key. This key will be
generated at cluster init time, and stored on the cluster alongside the
Ganeti SSL keys, and readable only by root.

An interested client can query a value by making a request to a subset of
the cluster master candidates. It will then wait to get a few responses, and
use the one with the highest configuration serial number. Since the
configuration serial number is increased each time the ganeti config is
updated, and the serial number is included in all answers, this can be used
to make sure the most recent answer is used, in case some master candidates
are stale or in the middle of a configuration update.
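
Selecting the answer is then straightforward; a minimal sketch, assuming the
answers have already been decoded and their signatures and salts verified as
described below::

  def pick_best_answer(answers):
    """Return the confd answer carrying the highest configuration serial."""
    if not answers:
      return None
    return max(answers, key=lambda answer: answer["serial"])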

In order to prevent replay attacks, queries will contain the current unix
timestamp according to the client, and the server will verify that the
timestamp is within a 5-minute range of its own clock (this requires
synchronized clocks, which is a good idea anyway). Queries will also contain
a "salt" which they expect the answers to be sent with, and clients are
supposed to accept only answers containing the salt they generated.

The configuration daemon will be able to answer simple queries such as:

- master candidates list
- master node
- offline nodes
- instance list
- instance primary nodes

Wire protocol
^^^^^^^^^^^^^

A confd query will look like this, on the wire::

  {
    "msg": "{\"type\": 1,
             \"rsalt\": \"9aa6ce92-8336-11de-af38-001d093e835f\",
             \"protocol\": 1,
             \"query\": \"node1.example.com\"}\n",
    "salt": "1249637704",
    "hmac": "4a4139b2c3c5921f7e439469a0a45ad200aead0f"
  }

Detailed explanation of the various fields:

- 'msg' contains a JSON-encoded query, its fields are:

  - 'protocol', integer, is the confd protocol version (initially just
    constants.CONFD_PROTOCOL_VERSION, with a value of 1)
  - 'type', integer, is the query type. For example "node role by name" or
    "node primary ip by instance ip". Constants will be provided for the
    actual available query types.
  - 'query', string, is the search key. For example an ip, or a node name.
  - 'rsalt', string, is the required response salt. The client must use it
    to recognize which answer it's getting.

- 'salt' must be the current unix timestamp, according to the client.
  Servers can refuse messages which have a wrong timing, according to their
  configuration and clock.
- 'hmac' is the HMAC signature of salt+msg, made with the cluster HMAC key.

If an answer comes back (which is optional, since confd works over UDP) it
will be in this format::

  {
    "msg": "{\"status\": 0,
             \"answer\": 0,
             \"serial\": 42,
             \"protocol\": 1}\n",
    "salt": "9aa6ce92-8336-11de-af38-001d093e835f",
    "hmac": "aaeccc0dff9328fdf7967cb600b6a80a6a9332af"
  }

Where:

- 'msg' contains a JSON-encoded answer, its fields are:

  - 'protocol', integer, is the confd protocol version (initially just
    constants.CONFD_PROTOCOL_VERSION, with a value of 1)
  - 'status', integer, is the error code. Initially just 0 for 'ok' or 1 for
    'error' (in which case 'answer' contains an error detail, rather than an
    answer), but in the future it may be expanded to have more meanings
    (eg. 2: the answer is compressed)
  - 'answer' is the actual answer. Its type and meaning are query-specific.
    For example for "node primary ip by instance ip" queries it will be a
    string containing an IP address, for "node role by name" queries it will
    be an integer which encodes the role (master, candidate, drained,
    offline) according to constants.

- 'salt' is the requested salt from the query. A client can use it to
  recognize which query the answer is answering.
- 'hmac' is the HMAC signature of salt+msg, made with the cluster HMAC key.
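
As an illustration, a client could build such a query roughly as follows.
This is only a sketch: the protocol version and query type are passed in
literally, and SHA-1 is assumed as the HMAC digest purely because it matches
the length of the example signatures above::

  import hashlib
  import hmac
  import json
  import time
  import uuid


  def build_confd_query(hmac_key, query_type, search_key):
    """Build and sign a confd query in the format described above.

    hmac_key is the cluster-wide shared key, passed in as bytes.
    """
    msg = json.dumps({
        "protocol": 1,
        "type": query_type,
        "query": search_key,
        # Response salt: only answers carrying this value are accepted.
        "rsalt": str(uuid.uuid4()),
    }) + "\n"
    salt = str(int(time.time()))  # unix timestamp, for replay protection
    sig = hmac.new(hmac_key, (salt + msg).encode("utf-8"),
                   hashlib.sha1).hexdigest()
    return json.dumps({"msg": msg, "salt": salt, "hmac": sig})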


Redistribute Config
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently LURedistributeConfig triggers a copy of the updated configuration
file to all master candidates and of the ssconf files to all nodes. There
are other files which are maintained manually but which are important to
keep in sync. These are:

- rapi SSL key certificate file (rapi.pem) (on master candidates)
- rapi user/password file rapi_users (on master candidates)

Furthermore there are some files which are hypervisor specific but we may
want to keep in sync:

- the xen-hvm hypervisor uses one shared file for all vnc passwords, and
  copies the file once, during node add. This design is subject to revision
  to be able to have different passwords for different groups of instances
  via the use of hypervisor parameters, and to allow xen-hvm and kvm to use
  the same system to provide password-protected vnc sessions. In general,
  though, it would be useful if the vnc password files were copied as well,
  to avoid unwanted vnc password changes on instance failover/migrate.

Optionally, the admin may also want to ship files such as the global
xend.conf file and the network scripts to all nodes.

Proposed changes
++++++++++++++++

RedistributeConfig will be changed to also copy the rapi files, and to ask
every enabled hypervisor for a list of additional files to copy. Users will
be able to populate a file containing a list of files to be distributed;
this file will be propagated as well. Such a solution is really simple to
implement and easily usable by scripts.

This code will also be shared (via tasklets or by other means, if tasklets
are not ready for 2.1) with the AddNode and SetNodeParams LUs (so that the
relevant files will be automatically shipped to new master candidates as
they are set).
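
A rough sketch of how the list of files to distribute could be assembled
(the per-hypervisor hook name, the file paths and the location of the
user-populated list are all hypothetical, for illustration only)::

  def files_to_distribute(enabled_hypervisors,
                          upload_list_path="/etc/ganeti/upload_files"):
    """Collect the files RedistributeConfig would ship around the cluster."""
    files = set([
        "/var/lib/ganeti/rapi.pem",    # rapi SSL certificate
        "/var/lib/ganeti/rapi_users",  # rapi user/password file
    ])
    for hv in enabled_hypervisors:
      # Hypothetical per-hypervisor hook returning extra files to copy,
      # e.g. the vnc password files mentioned above.
      files.update(hv.GetAncillaryFiles())
    try:
      for line in open(upload_list_path):
        if line.strip():
          files.add(line.strip())
    except IOError:
      pass  # the optional user-provided list may be absent
    return sorted(files)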

VNC Console Password
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently just the xen-hvm hypervisor supports setting a password to connect
to the instances' VNC console, and it has one common password stored in a
file.

This doesn't allow different passwords for different instances/groups of
instances, and makes it necessary to remember to copy the file around the
cluster when the password changes.

Proposed changes
++++++++++++++++

We'll change the VNC password file to a vnc_password_file hypervisor
parameter. This way it can have a cluster default, but also a different
value for each instance. The VNC-enabled hypervisors (xen and kvm) will
publish all the password files in use through the cluster so that a
redistribute-config will ship them to all nodes (see the Redistribute Config
proposed changes above).

The current VNC_PASSWORD_FILE constant will be removed, but its value will
be used as the default HV_VNC_PASSWORD_FILE value, thus retaining backwards
compatibility with 2.0.

The code to export the list of VNC password files from the hypervisors to
RedistributeConfig will be shared between the KVM and xen-hvm hypervisors.

Disk/Net parameters
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently disks and network interfaces have a few tweakable options and all
the rest is left to a default we chose. We're finding more and more that we
need to tweak some of these parameters, for example to disable barriers for
DRBD devices, or to allow striping for the LVM volumes.

Moreover for many of these parameters it will be nice to have cluster-wide
defaults, and then be able to change them per disk/interface.

Proposed changes
++++++++++++++++

We will add new cluster-level diskparams and netparams, which will contain
all the tweakable parameters. All values which have a sensible cluster-wide
default will go into this new structure, while parameters which have unique
values will not.

Example of network parameters:

- mode: bridge/route
- link: for mode "bridge" the bridge to connect to, for mode "route" it can
  contain the routing table, or the destination interface

Example of disk parameters:

- stripe: lvm stripes
- stripe_size: lvm stripe size
- meta_flushes: drbd, enable/disable metadata "barriers"
- data_flushes: drbd, enable/disable data "barriers"

Some parameters are bound to be disk-type specific (drbd vs. lvm vs. files)
or hypervisor specific (nic models for example), but for now they will all
live in the same structure. Each component is supposed to validate only the
parameters it knows about, and ganeti itself will make sure that no
"globally unknown" parameters are added, and that no parameters have
overridden meanings for different components.

The parameters will be kept, as for the BEPARAMS, in a "default" category,
which will allow us to expand on them by creating instance "classes" in the
future. Instance classes are not a feature we plan to implement in 2.1,
though.
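
As an example of what such a structure could look like (the values below are
made up and only illustrate the cluster default plus per-object override
idea)::

  # Cluster-level defaults, kept in a "default" category as described above.
  cluster_netparams = {"default": {"mode": "bridge", "link": "xen-br0"}}
  cluster_diskparams = {"default": {"stripe": 1,
                                    "meta_flushes": True,
                                    "data_flushes": True}}


  def effective_params(cluster_params, overrides):
    """Merge per-disk/per-NIC overrides over the cluster-wide defaults."""
    params = dict(cluster_params["default"])
    params.update(overrides or {})
    return params


  # A DRBD disk disabling data flushes but keeping the other defaults:
  disk_params = effective_params(cluster_diskparams, {"data_flushes": False})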

Non-bridged instances support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently each instance NIC must be connected to a bridge, and if the bridge
is not specified the default cluster one is used. This makes it impossible
to use the vif-route xen network scripts, or other alternative mechanisms
that don't need a bridge to work.

Proposed changes
++++++++++++++++

The new "mode" network parameter will distinguish between bridged interfaces
and routed ones.

When mode is "bridge" the "link" parameter will contain the bridge the
instance should be connected to, effectively keeping the current behaviour.
The value has been migrated from a nic field to a parameter to allow easier
manipulation of the cluster default.

When mode is "route" the ip field of the interface will become mandatory, to
allow for a route to be set. In the future we may also want to accept
multiple IPs or IP/mask values for this purpose. We will evaluate possible
meanings of the link parameter to signify a routing table to be used, which
would allow for isolation between instance groups (as happens today for
different bridges).

For now we won't add a parameter to specify which network script gets called
for which instance, so in a mixed cluster the network script must be able to
handle both cases. The default kvm vif script will be changed to do so. (Xen
doesn't have a Ganeti-provided script, so nothing will be done for that
hypervisor.)

Introducing persistent UUIDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Some objects in the Ganeti configuration are tracked by their name while
also supporting renames. This creates an extra difficulty, because neither
Ganeti nor external management tools can then track the actual entity, and
due to the name change it behaves like a new one.

Proposed changes part 1
+++++++++++++++++++++++

We will change Ganeti to use UUIDs for entity tracking, but in a staggered
way. In 2.1, we will simply add a ``uuid`` attribute to each of the
instances, the nodes and the cluster itself. This will be reported on
instance creation for instances, and on node add for nodes. It will of
course be available for querying via the OpQueryNodes/Instances and cluster
information, and via RAPI as well.

Note that Ganeti will not provide any way to change this attribute.

Upgrading from Ganeti 2.0 will automatically add a ``uuid`` attribute to all
entities missing it.


Proposed changes part 2
+++++++++++++++++++++++

In the next release (e.g. 2.2), the tracking of objects will change
internally from the name to the UUID, and externally Ganeti will accept both
forms of identification; e.g. an RAPI call would be made either against
``/2/instances/foo.bar`` or against ``/2/instances/bb3b2e42…``. Since an
FQDN must have at least a dot, and dots are not valid characters in UUIDs,
we will not have namespace issues.
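
Telling the two forms apart is therefore trivial; a minimal sketch::

  def looks_like_uuid(identifier):
    """Distinguish a UUID from an FQDN in e.g. a RAPI path component.

    An FQDN must contain at least one dot, while dots never appear in
    UUIDs, so checking for a dot is enough.
    """
    return "." not in identifier

  # "/2/instances/foo.bar"      -> lookup by name
  # "/2/instances/bb3b2e42-..." -> lookup by UUID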

Another change here is that node identification (during cluster
operations/queries like master startup, "am I the master?" and similar)
could be done via UUIDs, which is more stable than the current
hostname-based scheme.

Internal tracking refers to the way the configuration is stored; a DRBD disk
of an instance refers to the node name (so that IPs can be changed easily),
but this is still a problem for name changes; thus these references will be
changed to point to the node UUID to ease renames.

The advantage of this change (after the second round of changes) is that
node rename becomes trivial, whereas today node rename would require a
complete lock of all instances.


Automated disk repairs infrastructure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Replacing defective disks in an automated fashion is quite difficult with
the current version of Ganeti. These changes will introduce additional
functionality and interfaces to simplify automating disk replacements on a
Ganeti node.

Fix node volume group
+++++++++++++++++++++

This is the most difficult addition, as it can lead to data loss if it's not
properly safeguarded.

The operation must be done only when all the other nodes that have instances
in common with the target node are fine, i.e. this is the only node with
problems, and also we have to double-check that all instances on this node
have at least a good copy of the data.

This might mean that we have to enhance the GetMirrorStatus calls, and
introduce a smarter version that can tell us more about the status of an
instance.

Stop allocation on a given PV
+++++++++++++++++++++++++++++

This is somewhat simple. First we need a "list PVs" opcode (and its
associated logical unit) and then a set PV status opcode/LU. These in
combination should allow both checking and changing the disk/PV status.

Instance disk status
++++++++++++++++++++

This new opcode or opcode change must list the instance-disk-index and node
combinations of the instance together with their status. This will allow
determining what part of the instance is broken (if any).

Repair instance
+++++++++++++++

This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in
order to fix the instance status. It only affects primary instances;
secondaries can just be moved away.

Migrate node
++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node migrate``
code and run migrate for all instances on the node.

Evacuate node
+++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node evacuate``
code and run replace-secondary with an iallocator script for all instances
on the node.


External interface changes
--------------------------

OS API
~~~~~~

The OS API of Ganeti 2.0 has been built with extensibility in mind. Since we
pass everything as environment variables it's a lot easier to send new
information to the OSes without breaking backwards compatibility. This
section of the design outlines the proposed extensions to the API and their
implementation.

API Version Compatibility Handling
++++++++++++++++++++++++++++++++++

In 2.1 there will be a new OS API version (eg. 15), which should be mostly
compatible with api 10, except for some newly added variables. Since it's
easy to not pass some variables, we'll be able to handle Ganeti 2.0 OSes by
just filtering out the newly added pieces of information. We will still
encourage OSes to declare support for the new API after checking that the
new variables don't cause any conflict for them, and we will drop api 10
support after Ganeti 2.1 has been released.

New Environment variables
+++++++++++++++++++++++++

Some variables have never been added to the OS api but would definitely be
useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable to allow
the OS to make changes relevant to the virtualization the instance is going
to use. Since this field is immutable for each instance, the OS can tailor
the install to it, without having to make sure the instance can run under
any virtualization technology.

We also want the OS to know the particular hypervisor parameters, to be able
to customize the install even more. Since the parameters can change, though,
we will pass them only as an "FYI": if an OS ties some instance
functionality to the value of a particular hypervisor parameter, manual
changes or a reinstall may be needed to adapt the instance to the new
environment. This is not a regression compared to today, because even if the
OSes are left blind about this information, sometimes they still need to
make compromises and cannot satisfy all possible parameter values.

OS Variants
+++++++++++

Currently we are witnessing some degree of "OS proliferation" just to change
a simple installation behavior. This means that the same OS gets installed
on the cluster multiple times, with different names, to customize just one
installation behavior. Usually such OSes try to share as much as possible
through symlinks, but this still causes complications on the user side,
especially when multiple parameters must be cross-matched.

For example today if you want to install debian etch, lenny or squeeze you
probably need to install the debootstrap OS multiple times, changing its
configuration file, and calling it debootstrap-etch, debootstrap-lenny or
debootstrap-squeeze. Furthermore if you have for example a "server" and a
"development" environment which install different packages/configuration
files and must be available for all installs you'll probably end up with
debootstrap-etch-server, debootstrap-etch-dev, debootstrap-lenny-server,
debootstrap-lenny-dev, etc. Crossing more than two parameters quickly
becomes unmanageable.

In order to avoid this we plan to make OSes more customizable, by allowing
each OS to declare a list of variants which can be used to customize it. The
variants list is mandatory and must be written, one variant per line, in the
new "variants.list" file inside the main os dir. At least one variant must
be supported. When choosing the OS exactly one variant will have to be
specified, and it will be encoded in the os name as <OS-name>+<variant>. As
is the case today, it will be possible to change an instance's OS at
creation or install time.
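
A sketch of the variant handling on the Ganeti side (the helper names are
illustrative only)::

  import os


  def split_os_name(name):
    """Split an OS name of the form <OS-name>+<variant>."""
    if "+" in name:
      return tuple(name.split("+", 1))
    return (name, None)


  def read_variants(os_dir):
    """Read the declared variants from 'variants.list' in the OS directory."""
    path = os.path.join(os_dir, "variants.list")
    return [line.strip() for line in open(path) if line.strip()]


  # e.g. split_os_name("debootstrap+etch") -> ("debootstrap", "etch")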

The 2.1 OS list will be the combination of each OS, plus its supported
variants. This will cause the name proliferation to remain, but at least the
internal OS code will be simplified to just parsing the passed variant,
without the need for symlinks or code duplication.

Also we expect the OSes to declare only "interesting" variants, but to
accept some non-declared ones which a user will be able to pass in by
overriding the checks ganeti does. This will be useful for allowing some
variations to be used without polluting the OS list (per-OS documentation
should list all supported variants). If a variant which is not internally
supported is forced through, the OS scripts should abort.

In the future (post 2.1) we may want to move to full-fledged parameters, all
orthogonal to each other (for example "architecture" (i386, amd64), "suite"
(lenny, squeeze, ...), etc). (As opposed to the variant, which is a single
parameter, so a different variant is needed for every combination you want
to support). In this case we envision the variants being moved inside Ganeti
and associated with lists of parameter->value associations, which will then
be passed to the OS.


IAllocator changes
~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The iallocator interface allows creation of instances without manually
specifying nodes, but instead by specifying plugins which will do the
required computations and produce a valid node list.

However, the interface is quite awkward to use:

- one cannot set a 'default' iallocator script
- one cannot use it to easily test if allocation would succeed
- some new functionality, such as rebalancing clusters and calculating
  capacity estimates, is needed

Proposed changes
++++++++++++++++

There are two areas of improvement proposed:

- improving the use of the current interface
- extending the IAllocator API to cover more automation


Default iallocator names
^^^^^^^^^^^^^^^^^^^^^^^^

The cluster will hold, for each type of iallocator, a (possibly empty)
list of modules that will be used automatically.

If the list is empty, the behaviour will remain the same.

If the list has one entry, then ganeti will behave as if '--iallocator' was
specified on the command line, i.e. use this allocator by default. If the
user passed nodes, however, those will be used in preference.

If the list has multiple entries, they will be tried in order until one
gives a successful answer.
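
A minimal sketch of this selection logic (the allocator-call interface shown
here is purely illustrative)::

  def run_default_iallocators(allocators, request, user_nodes=None):
    """Apply the default-iallocator behaviour described above.

    Each entry in `allocators` is assumed to be callable with the
    allocation request and to return a node list on success or None on
    failure.
    """
    if user_nodes:
      return user_nodes           # explicitly passed nodes take preference
    for allocator in allocators:  # empty list: behave as before
      nodes = allocator(request)
      if nodes is not None:
        return nodes
    return None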

Dry-run allocation
^^^^^^^^^^^^^^^^^^

The create instance LU will get a new 'dry-run' option that will just
simulate the placement, and return the chosen node lists after running all
the usual checks.

Cluster balancing
^^^^^^^^^^^^^^^^^

Instance adds/removals/moves can create a situation where load on the nodes
is not spread equally. For this, a new iallocator mode called ``balance``
will be implemented, in which the plugin, given the current cluster state
and a maximum number of operations, will need to compute the instance
relocations needed in order to achieve a "better" (by whatever metric the
script considers better) cluster.

Cluster capacity calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this mode, called ``capacity``, given an instance specification and the
current cluster state (similar to the ``allocate`` mode), the plugin needs
to return:

- how many instances can be allocated on the cluster with that specification
- on which nodes these will be allocated (in order)