=================
Ganeti 2.1 design
=================

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Ganeti 2.1 will add features to help further automation of cluster
operations, further improve scalability to even bigger clusters, and
make it easier to debug the Ganeti core.

Background
==========

Overview
========

Detailed design
===============

As for 2.0 we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (eg. command line, os api, hooks, ...)

Core changes
------------

Storage units modelling
~~~~~~~~~~~~~~~~~~~~~~~

Currently, Ganeti has a good model of the block devices for instances
(e.g. LVM logical volumes, files, DRBD devices, etc.) but none of the
storage pools that are providing the space for these front-end
devices. For example, there are hardcoded inter-node RPC calls for
volume group listing, file storage creation/deletion, etc.

The storage units framework will implement a generic handling for all
kinds of storage backends:

- LVM physical volumes
- LVM volume groups
- File-based storage directories
- any other future storage method

There will be a generic list of methods that each storage unit type
will provide, like:

- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage unit
type, for example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion
- VG consistency fixing

This will allow a much better modeling and unification of the various
RPC calls related to backend storage pools in the future. Ganeti 2.1 is
intended to add the basics of the framework, and not necessarily move
all the current VG/FileBased operations to it.

Note that while we model both LVM PVs and LVM VGs, the framework will
**not** model any relationship between the different types. In other
words, we model neither inheritance nor stacking, since this is
too complex for our needs. While a ``vgreduce`` operation on an LVM VG
could actually remove a PV from it, this will not be handled at the
framework level, but at the individual operation level. The goal is that
this is a lightweight framework, for abstracting the different storage
operations, and not for modelling the storage hierarchy.

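The split between generic and type-specific methods could be sketched as a
small class hierarchy. All names below (``StorageUnit``, ``LvmPvStorage``,
the dummy data) are illustrative placeholders, not the actual Ganeti API:

```python
class StorageUnit(object):
  """Generic interface every storage unit type would provide."""

  def List(self):
    """Return the names of all storage units of this type."""
    raise NotImplementedError

  def GetStatus(self, name):
    """Return the status of one storage unit."""
    raise NotImplementedError


class LvmPvStorage(StorageUnit):
  """LVM physical volumes, adding one type-specific method."""

  def __init__(self):
    # Dummy in-memory data; a real version would query pvs(8).
    self._allocatable = {"/dev/sda2": True}

  def List(self):
    return sorted(self._allocatable)

  def GetStatus(self, name):
    return {"allocatable": self._allocatable[name]}

  def SetAllocatable(self, name, allocatable):
    # Type-specific operation; a real version would call pvchange(8).
    self._allocatable[name] = allocatable
```

Other unit types (VGs, file storage directories) would subclass
``StorageUnit`` in the same way, each adding only its own specific methods.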
85 |
Locking improvements |
86 |
~~~~~~~~~~~~~~~~~~~~ |
87 |
|
88 |
Current State and shortcomings |
89 |
++++++++++++++++++++++++++++++ |
90 |
|
91 |
The class ``LockSet`` (see ``lib/locking.py``) is a container for one or |
92 |
many ``SharedLock`` instances. It provides an interface to add/remove |
93 |
locks and to acquire and subsequently release any number of those locks |
94 |
contained in it. |
95 |
|
96 |
Locks in a ``LockSet`` are always acquired in alphabetic order. Due to |
97 |
the way we're using locks for nodes and instances (the single cluster |
98 |
lock isn't affected by this issue) this can lead to long delays when |
99 |
acquiring locks if another operation tries to acquire multiple locks but |
100 |
has to wait for yet another operation. |
101 |
|
102 |
In the following demonstration we assume to have the instance locks |
103 |
``inst1``, ``inst2``, ``inst3`` and ``inst4``. |
104 |
|
105 |
#. Operation A grabs lock for instance ``inst4``. |
106 |
#. Operation B wants to acquire all instance locks in alphabetic order, |
107 |
but it has to wait for ``inst4``. |
108 |
#. Operation C tries to lock ``inst1``, but it has to wait until |
109 |
Operation B (which is trying to acquire all locks) releases the lock |
110 |
again. |
111 |
#. Operation A finishes and releases lock on ``inst4``. Operation B can |
112 |
continue and eventually releases all locks. |
113 |
#. Operation C can get ``inst1`` lock and finishes. |
114 |
|
115 |
Technically there's no need for Operation C to wait for Operation A, and |
116 |
subsequently Operation B, to finish. Operation B can't continue until |
117 |
Operation A is done (it has to wait for ``inst4``), anyway. |
118 |
|
119 |
Proposed changes |
120 |
++++++++++++++++ |
121 |
|
122 |
Non-blocking lock acquiring |
123 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
124 |
|
125 |
Acquiring locks for OpCode execution is always done in blocking mode. |
126 |
They won't return until the lock has successfully been acquired (or an |
127 |
error occurred, although we won't cover that case here). |
128 |
|
129 |
``SharedLock`` and ``LockSet`` must be able to be acquired in a |
130 |
non-blocking way. They must support a timeout and abort trying to |
131 |
acquire the lock(s) after the specified amount of time. |
132 |
|
133 |
Retry acquiring locks |
134 |
^^^^^^^^^^^^^^^^^^^^^ |
135 |
|
136 |
To prevent other operations from waiting for a long time, such as |
137 |
described in the demonstration before, ``LockSet`` must not keep locks |
138 |
for a prolonged period of time when trying to acquire two or more locks. |
139 |
Instead it should, with an increasing timeout for acquiring all locks, |
140 |
release all locks again and sleep some time if it fails to acquire all |
141 |
requested locks. |
142 |
|
143 |
A good timeout value needs to be determined. In any case should |
144 |
``LockSet`` proceed to acquire locks in blocking mode after a few |
145 |
(unsuccessful) attempts to acquire all requested locks. |
146 |
|
147 |
One proposal for the timeout is to use ``2**tries`` seconds, where |
148 |
``tries`` is the number of unsuccessful tries. |
149 |
|
150 |
In the demonstration before this would allow Operation C to continue |
151 |
after Operation B unsuccessfully tried to acquire all locks and released |
152 |
all acquired locks (``inst1``, ``inst2`` and ``inst3``) again. |
153 |
|
Other solutions discussed
+++++++++++++++++++++++++

There was also some discussion on going one step further and extending
the job queue (see ``lib/jqueue.py``) to select the next task for a
worker depending on whether it can acquire the necessary locks. While
this may reduce the number of necessary worker threads and/or increase
throughput on large clusters with many jobs, it also brings many
potential problems, such as contention and increased memory usage, with
it. As this would be an extension of the changes proposed before, it
could be implemented at a later point in time, but we decided to stay
with the simpler solution for now.

Implementation details
++++++++++++++++++++++

``SharedLock`` redesign
^^^^^^^^^^^^^^^^^^^^^^^

The current design of ``SharedLock`` is not good for supporting timeouts
when acquiring a lock and there are also minor fairness issues in it. We
plan to address both with a redesign. A proof of concept implementation
was written and resulted in significantly simpler code.

Currently ``SharedLock`` uses two separate queues for shared and
exclusive acquires and waiters get to run in turns. This means if an
exclusive acquire is released, the lock will allow shared waiters to run
and vice versa. Although it's still fair in the end there is a slight
bias towards shared waiters in the current implementation. The same
implementation with two separate queues can not support timeouts without
adding a lot of complexity.

Our proposed redesign changes ``SharedLock`` to have only one single
queue. There will be one condition (see Condition_ for a note about
performance) in the queue per exclusive acquire and two for all shared
acquires (see below for an explanation). The maximum queue length will
always be ``2 + (number of exclusive acquires waiting)``. The number of
queue entries for shared acquires can vary from 0 to 2.

The two conditions for shared acquires are a bit special. They will be
used in turn. When the lock is instantiated, no conditions are in the
queue. As soon as the first shared acquire arrives (and there are
holder(s) or waiting acquires; see Acquire_), the active condition is
added to the queue. Until it becomes the topmost condition in the queue
and has been notified, any shared acquire is added to this active
condition. When the active condition is notified, the conditions are
swapped and further shared acquires are added to the previously inactive
condition (which has now become the active condition). After all waiters
on the previously active (now inactive) and now notified condition
received the notification, it is removed from the queue of pending
acquires.

This means shared acquires will skip any exclusive acquire in the queue.
We believe it's better to improve parallelization on operations only
asking for shared (or read-only) locks. Exclusive operations holding the
same lock can not be parallelized.


Acquire
*******

For exclusive acquires a new condition is created and appended to the
queue. Shared acquires are added to the active condition for shared
acquires and if the condition is not yet on the queue, it's appended.

The next step is to wait for our condition to be on the top of the queue
(to guarantee fairness). If the timeout expired, we return to the caller
without acquiring the lock. On every notification we check whether the
lock has been deleted, in which case an error is returned to the caller.

The lock can be acquired if we're on top of the queue (there is no one
else ahead of us). For an exclusive acquire, there must not be other
exclusive or shared holders. For a shared acquire, there must not be an
exclusive holder. If these conditions are all true, the lock is
acquired and we return to the caller. In any other case we wait again on
the condition.

If it was the last waiter on a condition, the condition is removed from
the queue.

Optimization: There's no need to touch the queue if there are no pending
acquires and no current holders. The caller can have the lock
immediately.

.. image:: design-2.1-lock-acquire.png

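The acquire rules above boil down to a small predicate; the function below
is an illustrative sketch, not the actual ``SharedLock`` implementation:

```python
def can_acquire(shared, is_top_of_queue, have_exclusive_holder,
                num_shared_holders):
  """Decide whether a waiter may take the lock right now.

  A waiter may only proceed once its condition is at the top of the
  queue (fairness). Then a shared acquire conflicts only with an
  exclusive holder, while an exclusive acquire conflicts with any
  holder, shared or exclusive.
  """
  if not is_top_of_queue:
    return False
  if shared:
    return not have_exclusive_holder
  return not have_exclusive_holder and num_shared_holders == 0
```

If the predicate is false, the waiter goes back to waiting on its
condition (or gives up once its timeout expires).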
241 |
Release |
242 |
******* |
243 |
|
244 |
First the lock removes the caller from the internal owner list. If there |
245 |
are pending acquires in the queue, the first (the oldest) condition is |
246 |
notified. |
247 |
|
248 |
If the first condition was the active condition for shared acquires, the |
249 |
inactive condition will be made active. This ensures fairness with |
250 |
exclusive locks by forcing consecutive shared acquires to wait in the |
251 |
queue. |
252 |
|
253 |
.. image:: design-2.1-lock-release.png |
254 |
|
255 |
|
256 |
Delete |
257 |
****** |
258 |
|
259 |
The caller must either hold the lock in exclusive mode already or the |
260 |
lock must be acquired in exclusive mode. Trying to delete a lock while |
261 |
it's held in shared mode must fail. |
262 |
|
263 |
After ensuring the lock is held in exclusive mode, the lock will mark |
264 |
itself as deleted and continue to notify all pending acquires. They will |
265 |
wake up, notice the deleted lock and return an error to the caller. |
266 |
|
267 |
|
268 |
Condition |
269 |
^^^^^^^^^ |
270 |
|
271 |
Note: This is not necessary for the locking changes above, but it may be |
272 |
a good optimization (pending performance tests). |
273 |
|
274 |
The existing locking code in Ganeti 2.0 uses Python's built-in |
275 |
``threading.Condition`` class. Unfortunately ``Condition`` implements |
276 |
timeouts by sleeping 1ms to 20ms between tries to acquire the condition |
277 |
lock in non-blocking mode. This requires unnecessary context switches |
278 |
and contention on the CPython GIL (Global Interpreter Lock). |
279 |
|
280 |
By using POSIX pipes (see ``pipe(2)``) we can use the operating system's |
281 |
support for timeouts on file descriptors (see ``select(2)``). A custom |
282 |
condition class will have to be written for this. |
283 |
|
284 |
On instantiation the class creates a pipe. After each notification the |
285 |
previous pipe is abandoned and re-created (technically the old pipe |
286 |
needs to stay around until all notifications have been delivered). |
287 |
|
288 |
All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to |
289 |
wait for notifications, optionally with a timeout. A notification will |
290 |
be signalled to the waiting clients by closing the pipe. If the pipe |
291 |
wasn't closed during the timeout, the waiting function returns to its |
292 |
caller nonetheless. |
293 |
|
294 |
|
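A minimal sketch of such a pipe-based condition, leaving out the lock
handling and the pipe re-creation that a real implementation needs:

```python
import os
import select


class PipeCondition(object):
  """Condition variable built on a POSIX pipe.

  Waiters select(2) on the read end of the pipe; notifying closes the
  write end, which makes the read end readable (EOF) and so wakes all
  waiters at once, with the timeout handled by the kernel instead of a
  sleep/retry loop.
  """

  def __init__(self):
    self._read_fd, self._write_fd = os.pipe()

  def wait(self, timeout=None):
    """Return True if notified, False if the timeout expired."""
    readable, _, _ = select.select([self._read_fd], [], [], timeout)
    return bool(readable)

  def notify_all(self):
    # Closing the write end wakes every waiter; a full implementation
    # would then create a fresh pipe for subsequent waiters.
    os.close(self._write_fd)
```

Before notification ``wait`` simply times out; after ``notify_all`` the
read end reports EOF immediately, so every waiter returns right away.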
295 |
Node daemon availability |
296 |
~~~~~~~~~~~~~~~~~~~~~~~~ |
297 |
|
298 |
Current State and shortcomings |
299 |
++++++++++++++++++++++++++++++ |
300 |
|
301 |
Currently, when a Ganeti node suffers serious system disk damage, the |
302 |
migration/failover of an instance may not correctly shutdown the virtual |
303 |
machine on the broken node causing instances duplication. The ``gnt-node |
304 |
powercycle`` command can be used to force a node reboot and thus to |
305 |
avoid duplicated instances. This command relies on node daemon |
306 |
availability, though, and thus can fail if the node daemon has some |
307 |
pages swapped out of ram, for example. |
308 |
|
309 |
|
310 |
Proposed changes |
311 |
++++++++++++++++ |
312 |
|
313 |
The proposed solution forces node daemon to run exclusively in RAM. It |
314 |
uses python ctypes to to call ``mlockall(MCL_CURRENT | MCL_FUTURE)`` on |
315 |
the node daemon process and all its children. In addition another log |
316 |
handler has been implemented for node daemon to redirect to |
317 |
``/dev/console`` messages that cannot be written on the logfile. |
318 |
|
319 |
With these changes node daemon can successfully run basic tasks such as |
320 |
a powercycle request even when the system disk is heavily damaged and |
321 |
reading/writing to disk fails constantly. |
322 |
|
323 |
|
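The memory-locking call could be made from Python roughly as follows; the
flag values are the standard Linux ones from ``<sys/mman.h>``, and the
function name is a placeholder:

```python
import ctypes
import ctypes.util

MCL_CURRENT = 1   # lock all pages currently mapped
MCL_FUTURE = 2    # and all pages mapped from now on


def LockProcessMemory():
  """Pin this process into RAM via mlockall(2).

  Raises OSError if the kernel refuses, e.g. for lack of CAP_IPC_LOCK
  or a too-small RLIMIT_MEMLOCK.
  """
  libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
  if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
    raise OSError(ctypes.get_errno(), "mlockall failed")
```

``MCL_FUTURE`` matters here: it ensures pages allocated after the call
(including those of forked children, which inherit the locked mappings on
fork only until exec) also stay resident.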

New Features
------------

Automated Ganeti Cluster Merger
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current situation
+++++++++++++++++

Currently there's no easy way to merge two or more clusters together.
But in order to optimize resources this is a needed missing piece. The
goal of this design doc is to come up with an easy to use solution which
allows you to merge two or more clusters together.

Initial contact
+++++++++++++++

As the design of Ganeti is based on an autonomous system, Ganeti by
itself has no way to reach nodes outside of its cluster. To overcome
this situation we're required to prepare the cluster before we can go
ahead with the actual merge: We have to replace at least the ssh keys on
the affected nodes before we can do any operation within ``gnt-``
commands.

To make this an automated process we'll ask the user to provide us with
the root password of every cluster we have to merge. We use the password
to grab the current ``id_dsa`` key and then rely on that ssh key for any
further communication to be made until the cluster is fully merged.

Cluster merge
+++++++++++++

After initial contact we do the cluster merge:

1. Grab the list of nodes
2. On all nodes add our own ``id_dsa.pub`` key to ``authorized_keys``
3. Stop all instances running on the merging cluster
4. Disable ``ganeti-watcher`` as it tries to restart Ganeti daemons
5. Stop all Ganeti daemons on all merging nodes
6. Grab the ``config.data`` from the master of the merging cluster
7. Stop local ``ganeti-masterd``
8. Merge the config:

   1. Open our own cluster ``config.data``
   2. Open cluster ``config.data`` of the merging cluster
   3. Grab all nodes of the merging cluster
   4. Set ``master_candidate`` to false on all merging nodes
   5. Add the nodes to our own cluster ``config.data``
   6. Grab all the instances on the merging cluster
   7. Adjust the port if the instance has drbd layout:

      1. In ``logical_id`` (index 2)
      2. In ``physical_id`` (index 1 and 3)

   8. Add the instances to our own cluster ``config.data``

9. Start ``ganeti-masterd`` with ``--no-voting`` ``--yes-do-it``
10. ``gnt-node add --readd`` on all merging nodes
11. ``gnt-cluster redist-conf``
12. Restart ``ganeti-masterd`` normally
13. Enable ``ganeti-watcher`` again
14. Start all merging instances again

Rollback
++++++++

Until we actually (re)add any nodes we can abort and roll back the merge
at any point. After merging the config, though, we have to get the
backup copy of ``config.data`` (from another master candidate node). And
for security reasons it's a good idea to undo the ``id_dsa.pub``
distribution by going to every affected node and removing the
``id_dsa.pub`` key again. Also we have to keep in mind that we have to
start the Ganeti daemons and the instances again.

Verification
++++++++++++

Last but not least we should verify that the merge was successful.
Therefore we run ``gnt-cluster verify``, which ensures that the cluster
overall is in a healthy state. Additionally it's also possible to
compare the list of instances/nodes with a list made prior to the merge
to make sure we didn't lose any data/instance/node.

Appendix
++++++++

cluster-merge.py
^^^^^^^^^^^^^^^^

Used to merge the cluster config. This is a POC and might differ from
actual production code.

::

  #!/usr/bin/python

  import sys
  from ganeti import config
  from ganeti import constants

  # Our own configuration is opened offline; the merging cluster's
  # configuration comes from the file given on the command line.
  c_mine = config.ConfigWriter(offline=True)
  c_other = config.ConfigWriter(sys.argv[1])

  fake_id = 0
  for node in c_other.GetNodeList():
    node_info = c_other.GetNodeInfo(node)
    node_info.master_candidate = False
    c_mine.AddNode(node_info, str(fake_id))
    fake_id += 1

  for instance in c_other.GetInstanceList():
    instance_info = c_other.GetInstanceInfo(instance)
    for dsk in instance_info.disks:
      if dsk.dev_type in constants.LDS_DRBD:
        # Allocate a fresh DRBD port in the merged cluster and patch it
        # into both the logical and the physical disk IDs.
        port = c_mine.AllocatePort()
        logical_id = list(dsk.logical_id)
        logical_id[2] = port
        dsk.logical_id = tuple(logical_id)
        physical_id = list(dsk.physical_id)
        physical_id[1] = physical_id[3] = port
        dsk.physical_id = tuple(physical_id)
    c_mine.AddInstance(instance_info, str(fake_id))
    fake_id += 1


Feature changes
---------------

Ganeti Confd
~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than others.
In particular they are divided between "master", "master candidates" and
"normal". (Moreover they can be offline or drained, but this is not
important for the current discussion). In general the whole
configuration is only replicated to master candidates, and some partial
information is spread to all nodes via ssconf.

This change was done so that the most frequent Ganeti operations didn't
need to contact all nodes, and so clusters could become bigger. If we
want more information to be available on all nodes, we need to add more
ssconf values, which is counter-balancing the change, or to talk with
the master node, which is not designed to happen now, and requires its
availability.

Information such as the instance->primary_node mapping will be needed on
all nodes, and we also want to make sure services external to the
cluster can query this information as well. This information must be
available at all times, so we can't query it through RAPI, which would
be a single point of failure, as it's only available on the master.


Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which will
run on master candidates. This daemon will talk via UDP, and
authenticate messages using HMAC with a cluster-wide shared key. This
key will be generated at cluster init time, and stored on the cluster
alongside the ganeti SSL keys, and readable only by root.

An interested client can query a value by making a request to a subset
of the cluster master candidates. It will then wait to get a few
responses, and use the one with the highest configuration serial number.
Since the configuration serial number is increased each time the ganeti
config is updated, and the serial number is included in all answers,
this can be used to make sure to use the most recent answer, in case
some master candidates are stale or in the middle of a configuration
update.

In order to prevent replay attacks queries will contain the current unix
timestamp according to the client, and the server will verify that its
own timestamp is within a 5-minute range of it (this requires
synchronized clocks, which is a good idea anyway). Queries will also
contain a "salt" which they expect the answers to be sent with, and
clients are supposed to accept only answers which contain a salt
generated by them.

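The server-side timing check amounts to a few lines; ``VerifyQuerySalt``
is a hypothetical name, with ``300`` seconds standing in for the 5-minute
window:

```python
import time


def VerifyQuerySalt(salt, max_skew=300):
  """Accept a query only if its timestamp is close to our own clock.

  Rejecting timestamps more than max_skew seconds away limits the
  window in which a captured message can be replayed.
  """
  try:
    remote_time = int(salt)
  except ValueError:
    return False
  return abs(time.time() - remote_time) <= max_skew
```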
505 |
The configuration daemon will be able to answer simple queries such as: |
506 |
|
507 |
- master candidates list |
508 |
- master node |
509 |
- offline nodes |
510 |
- instance list |
511 |
- instance primary nodes |
512 |
|
513 |
Wire protocol |
514 |
^^^^^^^^^^^^^ |
515 |
|
516 |
A confd query will look like this, on the wire:: |
517 |
|
518 |
plj0{ |
519 |
"msg": "{\"type\": 1, |
520 |
\"rsalt\": \"9aa6ce92-8336-11de-af38-001d093e835f\", |
521 |
\"protocol\": 1, |
522 |
\"query\": \"node1.example.com\"}\n", |
523 |
"salt": "1249637704", |
524 |
"hmac": "4a4139b2c3c5921f7e439469a0a45ad200aead0f" |
525 |
} |
526 |
|
527 |
"plj0" is a fourcc that details the message content. It stands for plain |
528 |
json 0, and can be changed as we move on to different type of protocols |
529 |
(for example protocol buffers, or encrypted json). What follows is a |
530 |
json encoded string, with the following fields: |
531 |
|
532 |
- 'msg' contains a JSON-encoded query, its fields are: |
533 |
|
534 |
- 'protocol', integer, is the confd protocol version (initially just |
535 |
constants.CONFD_PROTOCOL_VERSION, with a value of 1) |
536 |
- 'type', integer, is the query type. For example "node role by name" |
537 |
or "node primary ip by instance ip". Constants will be provided for |
538 |
the actual available query types. |
539 |
- 'query', string, is the search key. For example an ip, or a node |
540 |
name. |
541 |
- 'rsalt', string, is the required response salt. The client must use |
542 |
it to recognize which answer it's getting. |
543 |
|
544 |
- 'salt' must be the current unix timestamp, according to the client. |
545 |
Servers can refuse messages which have a wrong timing, according to |
546 |
their configuration and clock. |
547 |
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key |
548 |
|
549 |
If an answer comes back (which is optional, since confd works over UDP) |
550 |
it will be in this format:: |
551 |
|
552 |
plj0{ |
553 |
"msg": "{\"status\": 0, |
554 |
\"answer\": 0, |
555 |
\"serial\": 42, |
556 |
\"protocol\": 1}\n", |
557 |
"salt": "9aa6ce92-8336-11de-af38-001d093e835f", |
558 |
"hmac": "aaeccc0dff9328fdf7967cb600b6a80a6a9332af" |
559 |
} |
560 |
|
561 |
Where: |
562 |
|
563 |
- 'plj0' the message type magic fourcc, as discussed above |
564 |
- 'msg' contains a JSON-encoded answer, its fields are: |
565 |
|
566 |
- 'protocol', integer, is the confd protocol version (initially just |
567 |
constants.CONFD_PROTOCOL_VERSION, with a value of 1) |
568 |
- 'status', integer, is the error code. Initially just 0 for 'ok' or |
569 |
'1' for 'error' (in which case answer contains an error detail, |
570 |
rather than an answer), but in the future it may be expanded to have |
571 |
more meanings (eg: 2, the answer is compressed) |
572 |
- 'answer', is the actual answer. Its type and meaning is query |
573 |
specific. For example for "node primary ip by instance ip" queries |
574 |
it will be a string containing an IP address, for "node role by |
575 |
name" queries it will be an integer which encodes the role (master, |
576 |
candidate, drained, offline) according to constants. |
577 |
|
578 |
- 'salt' is the requested salt from the query. A client can use it to |
579 |
recognize what query the answer is answering. |
580 |
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key |
581 |
|
582 |
|
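Packing and verifying such a message could look as follows. The function
names are placeholders, and using SHA-1 as the HMAC digest is an
assumption inferred from the 40-hex-digit signatures in the examples
above:

```python
import hashlib
import hmac
import json
import time


def PackConfdMessage(msg, hmac_key):
  """Wrap an already JSON-encoded 'msg' into the on-the-wire envelope.

  The salt is the current unix timestamp and the signature covers
  salt+msg, keyed with the cluster-wide shared key.
  """
  salt = str(int(time.time()))
  sig = hmac.new(hmac_key, (salt + msg).encode("utf-8"), hashlib.sha1)
  return "plj0" + json.dumps({"msg": msg, "salt": salt,
                              "hmac": sig.hexdigest()})


def UnpackConfdMessage(payload, hmac_key):
  """Check the fourcc and the signature before trusting the content."""
  if not payload.startswith("plj0"):
    raise ValueError("unknown message format")
  envelope = json.loads(payload[4:])
  expected = hmac.new(hmac_key,
                      (envelope["salt"] + envelope["msg"]).encode("utf-8"),
                      hashlib.sha1).hexdigest()
  if not hmac.compare_digest(expected, envelope["hmac"]):
    raise ValueError("bad hmac signature")
  return envelope["msg"]
```

The same envelope is used in both directions; only the interpretation of
the salt differs (timestamp in queries, echoed ``rsalt`` in answers).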

Redistribute Config
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently LURedistributeConfig triggers a copy of the updated
configuration file to all master candidates and of the ssconf files to
all nodes. There are other files which are maintained manually but which
are important to keep in sync. These are:

- rapi SSL key certificate file (rapi.pem) (on master candidates)
- rapi user/password file rapi_users (on master candidates)

Furthermore there are some files which are hypervisor specific but we
may want to keep in sync:

- the xen-hvm hypervisor uses one shared file for all vnc passwords, and
  copies the file once, during node add. This design is subject to
  revision to be able to have different passwords for different groups
  of instances via the use of hypervisor parameters, and to allow
  xen-hvm and kvm to use an equal system to provide password-protected
  vnc sessions. In general, though, it would be useful if the vnc
  password files were copied as well, to avoid unwanted vnc password
  changes on instance failover/migrate.

Optionally the admin may want to also ship files such as the global
xend.conf file, and the network scripts to all nodes.

Proposed changes
++++++++++++++++

RedistributeConfig will be changed to copy also the rapi files, and to
call every enabled hypervisor asking for a list of additional files to
copy. Users will have the possibility to populate a file containing a
list of files to be distributed; this file will be propagated as well.
Such a solution is really simple to implement and it's easily usable by
scripts.

This code will also be shared (via tasklets or by other means, if
tasklets are not ready for 2.1) with the AddNode and SetNodeParams LUs
(so that the relevant files will be automatically shipped to new master
candidates as they are set).

VNC Console Password
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently just the xen-hvm hypervisor supports setting a password to
connect to the instances' VNC console, and has one common password
stored in a file.

This doesn't allow different passwords for different instances/groups of
instances, and makes it necessary to remember to copy the file around
the cluster when the password changes.

Proposed changes
++++++++++++++++

We'll change the VNC password file to a vnc_password_file hypervisor
parameter. This way it can have a cluster default, but also a different
value for each instance. The VNC enabled hypervisors (xen and kvm) will
publish all the password files in use through the cluster so that a
redistribute-config will ship them to all nodes (see the Redistribute
Config proposed changes above).

The current VNC_PASSWORD_FILE constant will be removed, but its value
will be used as the default HV_VNC_PASSWORD_FILE value, thus retaining
backwards compatibility with 2.0.

The code to export the list of VNC password files from the hypervisors
to RedistributeConfig will be shared between the KVM and xen-hvm
hypervisors.

Disk/Net parameters
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently disks and network interfaces have a few tweakable options and
all the rest is left to a default we chose. We're finding that we need
more and more to tweak some of these parameters, for example to disable
barriers for DRBD devices, or allow striping for the LVM volumes.

Moreover for many of these parameters it will be nice to have
cluster-wide defaults, and then be able to change them per
disk/interface.

Proposed changes
++++++++++++++++

We will add new cluster level diskparams and netparams, which will
contain all the tweakable parameters. All values which have a sensible
cluster-wide default will go into this new structure while parameters
which have unique values will not.

Example of network parameters:

- mode: bridge/route
- link: for mode "bridge" the bridge to connect to, for mode route it
  can contain the routing table, or the destination interface

Example of disk parameters:

- stripe: lvm stripes
- stripe_size: lvm stripe size
- meta_flushes: drbd, enable/disable metadata "barriers"
- data_flushes: drbd, enable/disable data "barriers"

Some parameters are bound to be disk-type specific (drbd, vs lvm, vs
files) or hypervisor specific (nic models for example), but for now they
will all live in the same structure. Each component is supposed to
validate only the parameters it knows about, and ganeti itself will make
sure that no "globally unknown" parameters are added, and that no
parameters have overridden meanings for different components.

700 |
The parameters will be kept, as for the BEPARAMS into a "default" |
701 |
category, which will allow us to expand on by creating instance |
702 |
"classes" in the future. Instance classes is not a feature we plan |
703 |
implementing in 2.1, though. |


Global hypervisor parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently all hypervisor parameters are modifiable both globally
(cluster level) and at instance level. However, there is no other
framework to hold hypervisor-specific parameters, so if we want to add
a new class of hypervisor parameters that only makes sense on a global
level, we have to change the hvparams framework.

Proposed changes
++++++++++++++++

We add a new (global, not per-hypervisor) list of parameters which are
not changeable on a per-instance level. The create, modify and query
instance operations are changed to not allow/show these parameters.

Furthermore, to allow transition of parameters to the global list, and
to allow cleanup of inadvertently-customised parameters, the
``UpgradeConfig()`` method of instances will drop any such parameters
from their list of hvparams, such that a restart of the master daemon
is all that is needed to clean these up.
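
The cleanup step can be sketched as a simple filter over an instance's
hvparams; the ``GLOBAL_HVPARAMS`` set and the parameter names below are
hypothetical, only the filtering idea comes from the design:

```python
# Sketch of the UpgradeConfig() cleanup: drop parameters that moved to
# the global (cluster-only) list from an instance's hvparams.
# "migration_port" is a hypothetical global-only parameter.
GLOBAL_HVPARAMS = frozenset(["migration_port"])

def UpgradeHvParams(hvparams):
  """Return hvparams without any globally-managed parameters."""
  return dict((name, value) for (name, value) in hvparams.items()
              if name not in GLOBAL_HVPARAMS)

upgraded = UpgradeHvParams({"kernel_path": "/boot/vmlinuz",
                            "migration_port": 8102})
# "migration_port" is dropped; instance-level parameters survive.
```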

Also, the framework is simple enough that, should we need to replicate
it at the beparams level, we can do so easily.


Non-bridged instances support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently each instance NIC must be connected to a bridge, and if the
bridge is not specified the cluster default one is used. This makes it
impossible to use the vif-route Xen network scripts, or other
alternative mechanisms that don't need a bridge to work.

Proposed changes
++++++++++++++++

The new "mode" network parameter will distinguish between bridged
interfaces and routed ones.

When mode is "bridge" the "link" parameter will contain the bridge the
instance should be connected to, effectively keeping the behaviour we
have today. The value has been migrated from a NIC field to a parameter
to allow for easier manipulation of the cluster default.

When mode is "route" the "ip" field of the interface will become
mandatory, to allow for a route to be set. In the future we may also
want to accept multiple IPs or IP/mask values for this purpose. We will
evaluate possible meanings of the link parameter to signify a routing
table to be used, which would allow for isolation between instance
groups (as happens today with different bridges).

For now we won't add a parameter to specify which network script gets
called for which instance, so in a mixed cluster the network script must
be able to handle both cases. The default KVM vif script will be changed
to do so. (Xen doesn't have a Ganeti-provided script, so nothing will be
done for that hypervisor.)
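
To show what "handling both cases" means for a network script, here is
a rough sketch mapping the NIC parameters to the commands such a script
could run (the environment handling is omitted, and the exact
``brctl``/``ip`` invocations are assumptions of this sketch, not the
actual Ganeti script):

```python
# Sketch: build the commands a vif script would run for a NIC,
# depending on its "mode"/"link" parameters and "ip" field.
def BuildNicCommands(mode, link, interface, ip=None):
  if mode == "bridge":
    # Attach the tap/vif interface to the configured bridge.
    return [["brctl", "addif", link, interface]]
  elif mode == "route":
    if not ip:
      raise ValueError("routed NICs require the ip field")
    # Route the instance's IP to its interface; in the future "link"
    # could select a routing table here.
    return [["ip", "route", "add", "%s/32" % ip, "dev", interface]]
  raise ValueError("unknown NIC mode: %s" % mode)

commands = BuildNicCommands("route", None, "tap0", ip="192.0.2.10")
```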

Introducing persistent UUIDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Some objects in the Ganeti configuration are tracked by their name
while also supporting renames. This creates an extra difficulty,
because neither Ganeti nor external management tools can then track
the actual entity, which after a name change behaves like a new one.

Proposed changes part 1
+++++++++++++++++++++++

We will change Ganeti to use UUIDs for entity tracking, but in a
staggered way. In 2.1, we will simply add a “uuid” attribute to each
of the instances, the nodes and the cluster itself. This will be
reported at creation time for instances, and at add time for nodes. It
will of course be available for querying via OpQueryNodes/Instances and
the cluster information, and via RAPI as well.

Note that Ganeti will not provide any way to change this attribute.

Upgrading from Ganeti 2.0 will automatically add a “uuid” attribute
to all entities missing it.


Proposed changes part 2
+++++++++++++++++++++++

In the next release (e.g. 2.2), the tracking of objects will change
from the name to the UUID internally, and externally Ganeti will
accept both forms of identification; e.g. an RAPI call would be made
either against ``/2/instances/foo.bar`` or against
``/2/instances/bb3b2e42…``. Since an FQDN must have at least a dot,
and dots are not valid characters in UUIDs, we will not have namespace
issues.
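
A minimal sketch of the resulting disambiguation rule (illustrative
only, not the actual RAPI code):

```python
import re

# UUIDs contain no dots, while an FQDN always has at least one,
# so a single check separates the two namespaces.
_UUID_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
                      r"[0-9a-f]{4}-[0-9a-f]{12}$")

def ClassifyIdentifier(identifier):
  """Return "name" for FQDNs and "uuid" for UUID strings."""
  if "." in identifier:
    return "name"
  if _UUID_RE.match(identifier):
    return "uuid"
  raise ValueError("neither an FQDN nor a UUID: %s" % identifier)
```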

Another change here is that node identification (during cluster
operations/queries like master startup, “am I the master?” and
similar) could be done via UUIDs, which are more stable than the
current hostname-based scheme.

Internal tracking refers to the way the configuration is stored; a
DRBD disk of an instance refers to the node name (so that IPs can be
changed easily), but this is still a problem for name changes; thus
these references will be changed to point to the node UUID, to ease
renames.

The advantage of this change (after the second round of changes) is
that node rename becomes trivial, whereas today a node rename would
require a complete lock of all instances.


Automated disk repairs infrastructure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Replacing defective disks in an automated fashion is quite difficult
with the current version of Ganeti. These changes will introduce
additional functionality and interfaces to simplify automating disk
replacements on a Ganeti node.

Fix node volume group
+++++++++++++++++++++

This is the most difficult addition, as it can lead to data loss if not
properly safeguarded.

The operation must be done only when all the other nodes that have
instances in common with the target node are fine, i.e. this is the only
node with problems, and also we have to double-check that all instances
on this node have at least one good copy of the data.

This might mean that we have to enhance the GetMirrorStatus calls, and
introduce a smarter version that can tell us more about the status of an
instance.

Stop allocation on a given PV
+++++++++++++++++++++++++++++

This is relatively simple. First we need a "list PVs" opcode (and its
associated logical unit), and then a "set PV status" opcode/LU. These in
combination should allow both checking and changing the disk/PV status.

Instance disk status
++++++++++++++++++++

This new opcode or opcode change must list the instance-disk-index and
node combinations of the instance together with their status. This will
allow determining what part of the instance is broken (if any).

Repair instance
+++++++++++++++

This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in
order to fix the instance status. It only affects primary instances;
secondaries can just be moved away.

Migrate node
++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
migrate`` code and run migrate for all instances on the node.

Evacuate node
+++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
evacuate`` code and run replace-secondary with an iallocator script for
all instances on the node.


User-id pool
~~~~~~~~~~~~

In order to allow running different processes under unique user-ids
on a node, we introduce the user-id pool concept.

The user-id pool is a cluster-wide configuration parameter.
It is a list of user-ids and/or user-id ranges that are reserved
for running Ganeti processes (including KVM instances).
The code guarantees that on a given node a given user-id is only
handed out if there is no other process running with that user-id.

Please note that this can only be guaranteed if all processes in
the system that run under a user-id belonging to the pool are
started by reserving a user-id first. That can be accomplished
either by using the RequestUnusedUid() function to get an unused
user-id, or by implementing the same locking mechanism.

Implementation
++++++++++++++

The functions that are specific to the user-id pool feature are located
in a separate module: ``lib/uidpool.py``.

Storage
^^^^^^^

The user-id pool is a single cluster parameter. It is stored in the
*Cluster* object under the ``uid_pool`` name as a list of integer
tuples. These tuples represent the boundaries of user-id ranges.
For single user-ids, the boundaries are equal.

The internal user-id pool representation is converted into a
string: a newline-separated list of user-ids or user-id ranges.
This string representation is distributed to all the nodes via the
*ssconf* mechanism. This means that the user-id pool can be
accessed in a read-only way on any node without consulting the master
node or master candidate nodes.
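
A sketch of the conversion from the internal tuple list to the ssconf
string form (the helper name mirrors the description above, not
necessarily the real ``uidpool.py`` API):

```python
# Sketch: the in-config representation is a list of (lower, higher)
# boundary tuples; the ssconf form is a newline-separated string.
def FormatUidPool(uid_pool):
  """Format [(lower, higher), ...] as the ssconf string representation."""
  lines = []
  for (lower, higher) in uid_pool:
    if lower == higher:
      lines.append(str(lower))                 # single user-id
    else:
      lines.append("%d-%d" % (lower, higher))  # inclusive range
  return "\n".join(lines)

text = FormatUidPool([(1000, 1000), (5000, 5010)])
```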

Initial value
^^^^^^^^^^^^^

The value of the user-id pool cluster parameter can be initialized
at cluster initialization time using the

``gnt-cluster init --uid-pool <uid-pool definition> ...``

command.

As there is no sensible default value for the user-id pool parameter,
it is initialized to an empty list if no ``--uid-pool`` option is
supplied at cluster init time.

If the user-id pool is empty, the user-id pool feature is considered
to be disabled.

Manipulation
^^^^^^^^^^^^

The user-id pool cluster parameter can be modified from the
command-line with the following commands:

- ``gnt-cluster modify --uid-pool <uid-pool definition>``
- ``gnt-cluster modify --add-uids <uid-pool definition>``
- ``gnt-cluster modify --remove-uids <uid-pool definition>``

The ``--uid-pool`` option overwrites the current setting with the
supplied ``<uid-pool definition>``, while
``--add-uids``/``--remove-uids`` add/remove the listed uids
or uid-ranges from the pool.

The ``<uid-pool definition>`` should be a comma-separated list of
user-ids or user-id ranges. A range should be defined by a lower and
a higher boundary, separated by a dash. The boundaries are inclusive.
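
Parsing such a definition into the internal boundary tuples can be
sketched as follows (illustrative; the real parser lives in
``lib/uidpool.py`` and may differ):

```python
# Sketch: parse "1000,2000-2010" into [(1000, 1000), (2000, 2010)].
# The same routine handles the newline-separated ssconf form by
# passing separator="\n".
def ParseUidPoolDefinition(definition, separator=","):
  pool = []
  for item in definition.split(separator):
    item = item.strip()
    if not item:
      continue
    if "-" in item:
      lower, higher = (int(part) for part in item.split("-", 1))
    else:
      lower = higher = int(item)
    if lower > higher:
      raise ValueError("invalid range: %s" % item)
    pool.append((lower, higher))
  return pool
```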

The ``<uid-pool definition>`` is parsed into the internal
representation, sanity-checked and stored in the ``uid_pool``
attribute of the *Cluster* object.

It is also immediately converted into a string (formatted in the
input format) and distributed to all nodes via the *ssconf* mechanism.

Inspection
^^^^^^^^^^

The current value of the user-id pool cluster parameter is printed
by the ``gnt-cluster info`` command.

The output format is accepted by the ``gnt-cluster modify --uid-pool``
command.

Locking
^^^^^^^

The ``uidpool.py`` module provides a function (``RequestUnusedUid``)
for requesting an unused user-id from the pool.

This will try to find a random user-id that is not currently in use.
The algorithm is the following:

1) Randomize the list of user-ids in the user-id pool
2) Iterate over this randomized UID list
3) Create a lock file (it doesn't matter if it already exists)
4) Acquire an exclusive POSIX lock on the file, to provide mutual
   exclusion for the following non-atomic operations
5) Check if there is a process in the system with the given UID
6) If there isn't, return the UID, otherwise unlock the file and
   continue the iteration over the user-ids
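
The steps above can be sketched with POSIX file locks (the lock
directory layout and the way running UIDs are detected are assumptions
of this sketch, not the actual ``uidpool.py`` code):

```python
import fcntl
import os
import random

def RequestUnusedUidSketch(all_uids, lock_dir, uid_in_use):
  """Sketch of the algorithm: return (uid, locked file descriptor).

  uid_in_use is a callable telling whether any process currently runs
  with the given UID.
  """
  uids = list(all_uids)
  random.shuffle(uids)                                # step 1
  for uid in uids:                                    # step 2
    path = os.path.join(lock_dir, str(uid))
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600) # step 3
    fcntl.flock(fd, fcntl.LOCK_EX)                    # step 4
    if not uid_in_use(uid):                           # step 5
      return uid, fd                                  # step 6: keep the lock
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
  raise RuntimeError("no unused uid available")
```

The caller keeps the returned descriptor locked until its process is
started, then releases the lock while leaving the lock file in place.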

The user can then start a new process with this user-id.
Once a process is successfully started, the exclusive POSIX lock can
be released, but the lock file will remain in the filesystem.
The presence of such a lock file means that the given user-id is most
probably in use. The lack of a uid lock file does not guarantee that
there are no processes with that user-id.

After acquiring the exclusive POSIX lock, ``RequestUnusedUid``
always performs a check to see if there is a process running with the
given uid.

A user-id can be returned to the pool by calling the
``ReleaseUid`` function. This will remove the corresponding lock file.
Note that it doesn't check if there is any process still running
with that user-id. The removal of the lock file only means that there
are most probably no processes with the given user-id. This helps
in speeding up the process of finding a user-id that is guaranteed to
be unused.

There is a convenience function, ``ExecWithUnusedUid``, that
wraps the execution of a function (or any callable) that requires a
unique user-id. ``ExecWithUnusedUid`` takes care of requesting an
unused user-id and unlocking the lock file. It also automatically
returns the user-id to the pool if the callable raises an exception.

Code examples
+++++++++++++

Requesting a user-id from the pool:

::

  from ganeti import ssconf
  from ganeti import uidpool

  # Get the list of all user-ids in the uid-pool from ssconf
  ss = ssconf.SimpleStore()
  uid_pool = uidpool.ParseUidPool(ss.GetUidPool(), separator="\n")
  all_uids = set(uidpool.ExpandUidPool(uid_pool))

  uid = uidpool.RequestUnusedUid(all_uids)
  try:
    <start a process with the UID>
    # Once the process is started, we can release the file lock
    uid.Unlock()
  except Exception:
    # Return the UID to the pool
    uidpool.ReleaseUid(uid)
|

Releasing a user-id:

::

  from ganeti import uidpool

  uid = <get the UID the process is running under>
  <stop the process>
  uidpool.ReleaseUid(uid)


External interface changes
--------------------------

OS API
~~~~~~

The OS API of Ganeti 2.0 has been built with extensibility in mind.
Since we pass everything as environment variables, it's a lot easier to
send new information to the OSes without breaking backwards
compatibility. This section of the design outlines the proposed
extensions to the API and their implementation.

API Version Compatibility Handling
++++++++++++++++++++++++++++++++++

In 2.1 there will be a new OS API version (e.g. 15), which should be
mostly compatible with API 10, except for some newly added variables.
Since it's easy not to pass some variables, we'll be able to handle
Ganeti 2.0 OSes by just filtering out the newly added pieces of
information. We will still encourage OSes to declare support for the new
API after checking that the new variables don't cause any conflicts for
them, and we will drop API 10 support after Ganeti 2.1 is released.

New Environment variables
+++++++++++++++++++++++++

Some variables have never been added to the OS API but would definitely
be useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable
to allow the OS to make changes relevant to the virtualization the
instance is going to use. Since this field is immutable for each
instance, the OS can tailor the install to it, without having to make
sure the instance can run under any virtualization technology.

We also want the OS to know the particular hypervisor parameters, to be
able to customize the install even more. Since the parameters can
change, though, we will pass them only as an "FYI": if an OS ties some
instance functionality to the value of a particular hypervisor
parameter, manual changes or a reinstall may be needed to adapt the
instance to the new environment. This is not a regression compared to
today, because even if the OSes are left blind about this information,
sometimes they still need to make compromises and cannot satisfy all
possible parameter values.

OS Variants
+++++++++++

Currently we are witnessing some degree of "OS proliferation" just to
change a simple installation behavior. This means that the same OS gets
installed on the cluster multiple times, with different names, to
customize just one installation behavior. Usually such OSes try to share
as much as possible through symlinks, but this still causes
complications on the user side, especially when multiple parameters must
be cross-matched.

For example, today if you want to install Debian etch, lenny or squeeze
you probably need to install the debootstrap OS multiple times, changing
its configuration file, and calling it debootstrap-etch,
debootstrap-lenny or debootstrap-squeeze. Furthermore, if you have for
example a "server" and a "development" environment which install
different packages/configuration files and must be available for all
installs, you'll probably end up with debootstrap-etch-server,
debootstrap-etch-dev, debootstrap-lenny-server, debootstrap-lenny-dev,
etc. Crossing more than two parameters quickly becomes unmanageable.

In order to avoid this we plan to make OSes more customizable, by
allowing each OS to declare a list of variants which can be used to
customize it. The variants list is mandatory and must be written, one
variant per line, in the new "variants.list" file inside the main OS
dir. At least one variant must be supported. When choosing the OS,
exactly one variant will have to be specified, and it will be encoded
in the OS name as <OS-name>+<variant>. As is the case today, it will be
possible to change an instance's OS at creation or install time.
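
Decoding the <OS-name>+<variant> form is then a simple split (a sketch,
not necessarily the actual implementation):

```python
# Sketch: decode "<OS-name>+<variant>" into its two components.
def SplitOsName(name):
  """Return (os_name, variant); variant is None when not specified."""
  if "+" in name:
    os_name, variant = name.split("+", 1)
    return os_name, variant
  return name, None
```

For example, ``SplitOsName("debootstrap+lenny")`` yields the base OS
``debootstrap`` and the variant ``lenny``.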

The 2.1 OS list will be the combination of each OS, plus its supported
variants. This will cause the name proliferation to remain, but at
least the internal OS code will be simplified to just parsing the passed
variant, without the need for symlinks or code duplication.

Also, we expect the OSes to declare only "interesting" variants, but to
accept some non-declared ones which a user will be able to pass in by
overriding the checks Ganeti does. This will be useful for allowing some
variations to be used without polluting the OS list (per-OS
documentation should list all supported variants). If a variant which is
not internally supported is forced through, the OS scripts should abort.

In the future (post 2.1) we may want to move to full-fledged parameters
all orthogonal to each other (for example "architecture" (i386, amd64),
"suite" (lenny, squeeze, ...), etc). (As opposed to the variant, which
is a single parameter, where you need a different variant for each
combination you want to support.) In this case we envision the
variants to be moved inside of Ganeti and be associated with lists of
parameter->value associations, which will then be passed to the OS.


IAllocator changes
~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The iallocator interface allows creation of instances without manually
specifying nodes, but instead by specifying plugins which will do the
required computations and produce a valid node list.

However, the interface is quite awkward to use:

- one cannot set a 'default' iallocator script
- one cannot use it to easily test if allocation would succeed
- some new functionality, such as rebalancing clusters and calculating
  capacity estimates, is needed

Proposed changes
++++++++++++++++

There are two areas of improvement proposed:

- improving the use of the current interface
- extending the IAllocator API to cover more automation


Default iallocator names
^^^^^^^^^^^^^^^^^^^^^^^^

The cluster will hold, for each type of iallocator, a (possibly empty)
list of modules that will be used automatically.

If the list is empty, the behaviour will remain the same.

If the list has one entry, then Ganeti will behave as if
'--iallocator' was specified on the command line, i.e. use this
allocator by default. If the user however passed nodes, those will be
used in preference.

If the list has multiple entries, they will be tried in order until
one gives a successful answer.
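
The fallback behaviour can be sketched as a plain loop over the
configured list (the callable-based interface below is an illustration,
not the real iallocator invocation):

```python
# Sketch: try each configured iallocator in order; the first
# successful answer wins.
def RunAllocatorChain(allocators, request):
  """allocators: list of callables returning (success, result)."""
  errors = []
  for allocator in allocators:
    success, result = allocator(request)
    if success:
      return result
    errors.append(str(result))
  raise RuntimeError("all allocators failed: " + "; ".join(errors))

# Hypothetical plugins: one that fails, one that answers.
failing = lambda req: (False, "cluster full")
succeeding = lambda req: (True, ["node1", "node2"])
nodes = RunAllocatorChain([failing, succeeding], {"name": "inst1"})
```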
1188 |
|
1189 |
Dry-run allocation |
1190 |
^^^^^^^^^^^^^^^^^^ |
1191 |
|
1192 |
The create instance LU will get a new 'dry-run' option that will just |
1193 |
simulate the placement, and return the chosen node-lists after running |
1194 |
all the usual checks. |
1195 |
|
1196 |
Cluster balancing |
1197 |
^^^^^^^^^^^^^^^^^ |
1198 |
|
1199 |
Instance add/removals/moves can create a situation where load on the |
1200 |
nodes is not spread equally. For this, a new iallocator mode will be |
1201 |
implemented called ``balance`` in which the plugin, given the current |
1202 |
cluster state, and a maximum number of operations, will need to |
1203 |
compute the instance relocations needed in order to achieve a "better" |
1204 |
(for whatever the script believes it's better) cluster. |
1205 |
|
1206 |
Cluster capacity calculation |
1207 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
1208 |
|
1209 |
In this mode, called ``capacity``, given an instance specification and |
1210 |
the current cluster state (similar to the ``allocate`` mode), the |
1211 |
plugin needs to return: |
1212 |
|
1213 |
- how many instances can be allocated on the cluster with that |
1214 |
specification |
1215 |
- on which nodes these will be allocated (in order) |
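
As an illustration, a ``capacity`` answer could carry something like
the following structure (the field names here are assumptions of this
sketch; the IAllocator protocol definition is the authoritative
reference for the actual format):

```python
# Sketch only: a possible "capacity" answer, as a plain dict.
capacity_response = {
  "success": True,
  # how many instances with the given spec can still be allocated:
  "instances": 2,
  # the nodes they would land on, in allocation order:
  "nodes": [["node1", "node2"], ["node3", "node4"]],
}

# One node list per allocatable instance:
assert capacity_response["instances"] == len(capacity_response["nodes"])
```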
.. vim: set textwidth=72 : |