=================
Ganeti 2.0 design
=================

This document describes the major changes in Ganeti 2.0 compared to
the 1.2 version.

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents::

Objective
=========

Ganeti 1.2 has many scalability issues and restrictions due to its
roots as software for managing small and 'static' clusters.

Version 2.0 will attempt to remedy first the scalability issues and
then the restrictions.

Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

- only one operation at a time on the cluster [#]_
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
   and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible codebase for future developments.

Scalability problems
--------------------

Ganeti 1.2 has a single global lock, which is used for all cluster
operations. This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for example
  failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive
should run in parallel (for example because they happen on different
nodes) are actually stalling each other while waiting for the global
lock, without a real reason for that to happen.

One of the main causes of this global lock (beside the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no "master" daemon in Ganeti. Each
command tries to acquire the so called *cmd* lock and when it
succeeds, it takes complete ownership of the cluster configuration and
state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
instances per node, which is no longer true with today's hardware.

Artificial restrictions
-----------------------

Ganeti 1.2 (and previous versions) have a fixed two-disk, one-NIC per
instance model. This is a purely artificial restriction, but it
touches multiple areas (configuration, import/export, command line),
so it is better suited to a major release than a minor one.

Overview
========

In order to solve the scalability problems, a rewrite of the core
design of Ganeti is required. While the cluster operations themselves
won't change (e.g. start instance will do the same things), the way
these operations are scheduled internally will change radically.

Detailed design
===============

The changes for 2.0 can be split into roughly three areas:

- core changes that affect the design of the software
- features (or restriction removals) which do not have a wide
  impact on the design
- user-level and API-level changes which translate into differences for
  the operation of the cluster

Core changes
------------

The main changes will be switching from a per-process model to a
daemon based model, where the individual gnt-* commands will be
clients that talk to this daemon (see the design-2.0-master-daemon
document). This will allow us to get rid of the global cluster lock
for most operations, having instead a per-object lock (see
design-2.0-granular-locking). Also, the daemon will be able to queue
jobs, and this will allow the individual clients to submit jobs without
waiting for them to finish, and also see the result of old requests
(see design-2.0-job-queue).

Beside these major changes, another 'core' change, though not as
visible to the users, will be changing the model of object attribute
storage, and separating that into namespaces (such that a Xen PVM
instance will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. More details in the
design-2.0-cluster-parameters document.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use http(s) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

Master daemon
~~~~~~~~~~~~~

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

Interaction paths are between:

- (CLI tools/RAPI daemon) and the master daemon, via the so called *luxi* API
- the master daemon and the node daemons, via the node RPC

The protocol between the master daemon and the node daemons will be
changed to HTTP(S), using a simple PUT/GET of JSON-encoded
messages. This is done due to difficulties in working with the Twisted
framework and its protocols in a multithreaded environment, which we
can overcome by using a simpler stack (see the caveats section). The
protocol between the CLI/RAPI and the master daemon will be a custom
one (called *luxi*): on a UNIX socket on the master node, with rights
restricted by filesystem permissions, the CLI/RAPI will talk to the
master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.

The LUXI protocol
+++++++++++++++++

We will have two main classes of operations over the master daemon API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the OP_QUERY_* opcodes in Ganeti 1.2 (and they are
internally implemented still with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized in the query-functions)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details, see the job queue design document.

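As an illustration only, the sketch below shows how a JSON-encoded
request over the master UNIX socket could look from the client side.
The helper name, the socket path and the newline framing are
assumptions made for this example, not the final protocol definition::

  import json
  import socket

  def luxi_call(method, args, address="/var/run/ganeti/master.sock"):
      """Send one JSON-encoded request to the master daemon (sketch only).

      The method names, the socket path and the one-line-per-message
      framing are assumptions for this illustration.
      """
      request = json.dumps({"method": method, "args": args})
      sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      try:
          sock.connect(address)
          sock.sendall((request + "\n").encode("utf-8"))
          response = sock.makefile().readline()  # JSON-encoded reply
      finally:
          sock.close()
      return json.loads(response)

  # e.g. a "submit job" call would pass a list of serialized opcodes
  # and get back the new job identifier.
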

Daemon implementation
+++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other thread (pools).

There will be two other classes of threads in the daemon:

- job processing threads, part of a thread pool, and which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  to the clients

Master startup/failover
+++++++++++++++++++++++

In Ganeti 1.x there is no protection against failing over the master
to a node with stale configuration. In effect, the responsibility of
correct failovers falls on the admin. This is true both for the new
master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will happen whenever a node is ready to
transition to the master role, either at startup time or at node
failover (a sketch of the quorum check follows the list):

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

   - we have the latest configuration and job list (as
     determined by the serial number on the configuration and
     highest job ID on the job queue)

   - there is not even a single node having a newer
     configuration file

   - if we are not failing over (but just starting), the
     quorum agrees that we are the designated master

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

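The half-plus-one agreement rule of the second step could look roughly
like the sketch below; the RPC helper used to ask each node for its
configuration serial number and highest job ID is hypothetical and is
only meant to illustrate the voting logic::

  def has_quorum(my_serial, my_max_job_id, node_list, query_node_version):
      """Check the half-plus-one agreement rule (illustrative sketch).

      query_node_version is a hypothetical callable returning the
      (config_serial, max_job_id) pair of a node, or None if unreachable.
      """
      votes = 0
      for node in node_list:
          result = query_node_version(node)
          if result is None:
              continue                      # unreachable nodes do not vote
          serial, max_job_id = result
          if serial > my_serial or max_job_id > my_max_job_id:
              return False                  # someone has newer data: abort
          votes += 1
      # we need at least half plus one of all nodes to agree
      return votes >= len(node_list) // 2 + 1
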

Logging
+++++++

The logging system will be switched completely to the logging module;
currently it's logging-based, but exposes a different API, which is
just overhead. As such, the code will be switched over to standard
logging calls, and only the setup will be custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one logfile per daemon model:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard http log format for possible parsing by other tools

Since the watcher will only submit jobs to the master for startup of
the instances, its log file will contain less information than before,
mainly that it will start the instance, but not the results.

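A minimal sketch of such a "custom setup, standard calls" split, using
only the standard logging module, is shown below; the format string and
the debug flag handling are assumptions for illustration::

  import logging

  def setup_logging(logfile, debug=False):
      """Attach one logfile to the root logger for a daemon (sketch)."""
      handler = logging.FileHandler(logfile)
      handler.setFormatter(logging.Formatter(
          "%(asctime)s %(levelname)s %(message)s"))
      root = logging.getLogger("")
      root.addHandler(handler)
      root.setLevel(logging.DEBUG if debug else logging.INFO)

  # daemon code then uses only standard calls, e.g.:
  # setup_logging("/var/log/ganeti/master-daemon.log")
  # logging.info("job %s finished", job_id)
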

Caveats
+++++++

A discussed alternative is to keep the current individual processes
touching the cluster configuration model. The reasons we have not
chosen this approach are:

- the speed of reading and unserializing the cluster state
  today is not negligible; the addition of
  the job queue will make the startup cost even higher. While this
  runtime cost is low, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use twisted, all the code needs to be 'twisted-ized'; we were able
  to keep the 1.x code clean by hacking around twisted in an
  unsupported, unrecommended way, and the only alternative would have
  been to write all the code for twisted
- it has some weaknesses in working with multiple threads, since its base
  model is designed to replace thread usage by using deferred calls, so while
  it can use threads, it's less flexible in doing so

And, since we already have an HTTP server library for the RAPI, we
can just reuse that for inter-node communication.


Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a Ganeti
Cluster. In order for this to happen we need to make sure concurrently run
operations don't step on each other's toes and break the cluster.

This design addresses how we are going to deal with locking so that:

- high urgency operations are not stopped by long-running ones
- long-running operations can run in parallel
- we preserve safety (data coherency) and liveness (no deadlock, no work
  postponed indefinitely) on the cluster

Reaching the maximum possible parallelism is a Non-Goal. We have identified a
set of operations that are currently bottlenecks and need to be parallelised
and have worked on those. In the future it will be possible to address other
needs, thus making the cluster more and more parallel one step at a time.

This document only talks about parallelising Ganeti level operations, aka
Logical Units, and the locking needed for that. Any other synchronisation lock
needed internally by the code is outside its scope.

Ganeti 1.2
++++++++++

We intend to implement a Ganeti locking library, which can be used by the
various ganeti code components in order to easily, efficiently and correctly
grab the locks they need to perform their function.

The proposed library has these features:

- Internally managing all the locks, making the implementation transparent
  from their usage
- Automatically grabbing multiple locks in the right order (avoid deadlock)
- Ability to transparently handle conversion to more granularity
- Support asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a distributed
operation. In case of master failure, though, if some locks were held it means
some opcodes were in progress, so when recovery of the job queue is done it
will be possible to determine from the interrupted opcodes which operations
could have been left halfway through and thus which locks could have been
held. It is then the responsibility either of the master failover code, of the
cluster verification code, or of the admin to do what's necessary to make sure
that any leftover state is dealt with. This is not an issue from a locking
point of view because the fact that the previous master has failed means that
it cannot do any job.

A corollary of this is that a master-failover operation with both masters alive
needs to happen while no other locks are held.

The Locks
+++++++++

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and the
node locks before the config lock. Locks will need to be acquired at the same
time for multiple instances and nodes, and internal ordering will be dealt
with within the locking library, which, for simplicity, will just use
alphabetical order.

Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each time we
split a lock into more locks we'll create a "metalock", which will depend on
those sublocks and live for the time necessary for all the code to convert (or
forever, in some conditions). When a metalock exists all converted code must
acquire it in shared mode, so it can run concurrently, but still be exclusive
with old code, which acquires it exclusively.

In the beginning the only such lock will be what replaces the current "command"
lock, and will acquire all the locks in the system, before proceeding. This
lock will be called the "Big Ganeti Lock" because holding that one will avoid
any other concurrent ganeti operations.

We might also want to devise more metalocks (eg. all nodes, all nodes+config)
in order to make it easier for some parts of the code to acquire what they need
without specifying it explicitly.

In the future things like the node locks could become metalocks, should we
decide to split them into an even more fine grained approach, but this will
probably be only after the first 2.0 version has been released.

Library API
+++++++++++

All the locking will be implemented in its own class, and the locks will be
created at initialisation time, from the config file.

The API will have a way to grab one or more than one locks at the same time.
Any attempt to grab a lock while already holding one in the wrong order will be
checked for, and fail.

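A sketch of what using such a library could look like is given below;
the class and method names (``LockSet``, ``acquire``, ``release``) are
assumptions made for this example and not the final API. The point
shown is the deterministic ordering inside one level, which prevents
deadlocks between concurrent callers::

  import threading

  class LockSet(object):
      """Illustrative sketch of the proposed locking library (assumed API).

      Locks inside one level are always acquired in alphabetical order,
      so two concurrent callers cannot deadlock against each other.
      """
      def __init__(self, names):
          self._locks = dict((name, threading.Lock()) for name in names)

      def acquire(self, names):
          for name in sorted(names):      # deterministic ordering
              self._locks[name].acquire()

      def release(self, names):
          for name in sorted(names):
              self._locks[name].release()

  # hypothetical usage by a Logical Unit, respecting the level order
  # (instances before nodes before config):
  # instance_locks.acquire(["inst1.example.com"])
  # node_locks.acquire(["node1.example.com", "node2.example.com"])
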

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be added
to the list. The relevant code will need to inform the locking library of such
a change.

This needs to be compatible with every other lock in the system, especially
metalocks that guarantee to grab sets of resources without specifying them
explicitly. The implementation of this will be handled in the locking library
itself.

Of course when instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code which
removes them must own them exclusively or can queue for their ownership, and
thus deals with metalocks exactly as normal code acquiring those locks. Any
operation queueing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only fail if
the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- Try to acquire this lock set and fail if not possible
- Try to acquire one of these lock sets and return the first one you were
  able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available locks,
rather than making them just blindly queue for acquiring them. The inherent
risk, though, is that any code using the first operation, or setting a timeout
for the second one, is susceptible to starvation and thus may never be able to
get the required locks and complete certain tasks. Considering this,
providing/using these operations should not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical Unit
level. In the future we may want to split logical units into independent
"tasklets" with their own locking requirements. A different design doc (or mini
design doc) will cover the move from Logical Units to tasklets.

Lock acquisition code path
++++++++++++++++++++++++++

In general when acquiring locks we should use a code path equivalent to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoids possible deadlocks. Of course
extra care must be taken not to leave, if possible, locked structures in an
unusable state.

In order to avoid this extra indentation and code changes everywhere in the
Logical Units code, we decided to allow LUs to declare locks, and then execute
their code with their locks acquired. In the new world LUs are called like
this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation on how
locks are supposed to be declared.

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock everything"
lock to the new system, though, needs to be carefully scrutinised to be sure it
is really acquiring all the necessary locks, and none has been overlooked or
forgotten.

The code can contain other locks outside of this library, to synchronise other
threaded code (eg for the job queue) but in general these should be leaf locks
or carefully structured non-leaf ones, to avoid deadlock race conditions.


Job Queue
~~~~~~~~~

Granular locking is not enough to speed up operations, we also need a
queue to store these and to be able to process as many as possible in
parallel.

A ganeti job will consist of multiple ``OpCodes`` which are the basic
element of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
will generate N opcodes of type replace disks).


Job execution—“Life of a Ganeti job”
++++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated and
   assigned to the job. The job is then automatically replicated [#replic]_
   to all nodes in the cluster. The identifier is returned to the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job has
   to wait and the first worker finishing its work will grab it. Otherwise any
   of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC function.
   Log messages may be shown to the user. Until the job is started, it can
   also be cancelled.
#. As soon as the job is finished, its final result and status can be retrieved
   from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.

.. [#replic] We need replication in order to maintain the consistency across
   all nodes in the system; the master node only differs in the fact that
   now it is running the master daemon, but if it fails and we do a master
   failover, the jobs are still visible on the new master (though marked as
   failed).

Failures to replicate a job to other nodes will be only flagged as
errors in the master daemon log if more than half of the nodes failed,
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.

Future improvements will look into checking the consistency of the job
list and jobs themselves at master daemon startup.


Job storage
+++++++++++

Jobs are stored in the filesystem as individual files, serialized
using JSON (standard serialization mechanism in Ganeti).

The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily, since
  all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- single big file with all job data: not feasible due to difficult updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we keep
  consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a temporary file
and subsequent renaming. Except for log messages, every change in a job is
stored and replicated to other nodes.

::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)

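As an illustration of both points (JSON-serialized job files and atomic
replacement via rename), a sketch of how a job file could be written is
shown below; the job dictionary layout used in the example is an
assumption and not the final on-disk format::

  import json
  import os

  def write_job_file(queue_dir, job_id, job):
      """Atomically (re)write one job file (illustrative sketch only).

      Only the "write to a temporary file, then rename" pattern is the
      point; the job dictionary contents are assumed.
      """
      final_name = os.path.join(queue_dir, "job-%s" % job_id)
      tmp_name = final_name + ".tmp"
      fd = open(tmp_name, "w")
      try:
          fd.write(json.dumps(job))
          fd.flush()
      finally:
          fd.close()
      os.rename(tmp_name, final_name)   # atomic replace on POSIX

  # write_job_file("/var/lib/ganeti/queue", 37,
  #                {"id": 37, "status": "Queued", "ops": []})
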

Locking
+++++++

Locking in the job queue is a complicated topic. It is called from more than
one thread and must be thread-safe. For simplicity, a single lock is used for
the whole job queue.

A more detailed description can be found in doc/locking.txt.


Internal RPC
++++++++++++

RPC calls available between Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the following
operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The identifier is
  guaranteed to be unique during the lifetime of a cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The condition
  for when a job changed is defined by the fields passed and the last log
  message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if the job
  is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if the
  job has not been canceled or finished.

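To make the intended interaction pattern concrete, a client-side sketch
using these calls is given below; the ``client`` object, the field name
``"status"`` and all concrete argument values (including the elided
middle argument of ``WaitForJobChange``) are placeholders for this
illustration only::

  def run_and_wait(client, ops):
      """Submit a job and wait for its completion without polling (sketch).

      `client` stands for any object exposing the calls listed above;
      only the call names come from this document, the rest is assumed.
      """
      job_id = client.SubmitJob(ops)
      while True:
          # blocks on the master until the watched fields change or the
          # timeout expires, so the client never has to busy-poll
          (status,) = client.WaitForJobChange(job_id, ["status"], [], 60)
          if status in ("Success", "Error", "Canceled"):
              return client.QueryJobs([job_id], ["status"])
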

Job and opcode status
+++++++++++++++++++++

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but did not yet start.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to the
Error status once the master starts again.


History
+++++++

Archived jobs are kept in a separate directory,
/var/lib/ganeti/queue/archive/. This is done in order to speed up the
queue handling: by default, the jobs in the archive are not touched by
any functions. Only the current (unarchived) jobs are parsed, loaded,
and verified (if implemented) by the master daemon.


Ganeti updates
++++++++++++++

The queue has to be completely empty for Ganeti updates with changes
in the job queue structure. In order to allow this, there will be a
way to prevent new jobs entering the queue.



Object parameters
~~~~~~~~~~~~~~~~~

Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model

#. node parameters

#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances

#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of VCPUs,
   or the size of memory

#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)


The following definitions for instance parameters will be used below:

:hypervisor parameter:
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for PVM which makes no sense for HVM).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence the hypervisor
  behaviour at all.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the prefix
  "BE\_" and the whole list of parameters will exist in the set
  "BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of a LV,
  or the MAC of a NIC)

As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “null”) will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.

Cluster parameters
++++++++++++++++++

Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default value for backend parameters

Node parameters
+++++++++++++++

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

Instance parameters
+++++++++++++++++++

As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.

The “hvparams” and “beparams” are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen to be as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. instance.hvparams.vnc_console_port instead of using both
instance.hvparams.hvm_vnc_console_port and
instance.hvparams.kvm_vnc_console_port.

There are some special cases related to disks and NICs (for example):
a disk has both ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have only globally-per-instance such
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
exported as the same type).

Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:

- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)


Parameter validation
++++++++++++++++++++

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.

The hypervisor support implementation API will be extended with the
following features:

:PARAMETERS: class-level attribute holding the list of valid parameters
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing hvparams.keys() and cls.PARAMETERS; this is a class method
  that can be called from within master code (i.e. cmdlib) and should
  be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from backend.py code, and as such can
  make node-specific checks (e.g. kernel_path checking)

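A sketch of how a hypervisor support class could provide these two
checks is shown below; only the names ``PARAMETERS``,
``CheckParamSyntax`` and ``ValidateParameters`` come from the list
above, while the class name, parameter names and error handling are
assumptions for the example::

  import os

  class ExampleHypervisor(object):
      """Illustrative hypervisor support class (assumed error handling)."""

      PARAMETERS = ["kernel_path", "initrd_path"]

      @classmethod
      def CheckParamSyntax(cls, hvparams):
          # name-level check, safe to run on the master (cmdlib)
          unknown = set(hvparams.keys()) - set(cls.PARAMETERS)
          if unknown:
              raise ValueError("Unknown hypervisor parameters: %s" %
                               ", ".join(sorted(unknown)))

      def ValidateParameters(self, hvparams):
          # value-level check, run on the target node (backend.py),
          # so node-local tests such as file existence are allowed
          kernel = hvparams.get("kernel_path")
          if kernel and not os.path.isfile(kernel):
              raise ValueError("Kernel %s not found on this node" % kernel)
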

Default value application
+++++++++++++++++++++++++

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:

- ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on
  instance's hvparams and cluster's ``hvparams[instance.hypervisor]``

- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the RpcRunner
when sending an instance for activation/stop, and the sent instance
hvparams/beparams will have the final value (noded code doesn't know
about defaults).

LU code will need to self-call the transformation, if needed.

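The 'filling' itself is essentially a dictionary merge in which instance
values override the cluster-level defaults; a minimal sketch along these
lines (the final implementation may differ) is::

  class Cluster(object):
      # only the parts needed for the sketch; attribute layout as above
      def __init__(self, hvparams, beparams):
          self.hvparams = hvparams      # dict indexed by hypervisor type
          self.beparams = beparams      # dict with a 'default' entry

      def FillHV(self, instance):
          """Return the instance hvparams with cluster defaults applied."""
          filled = dict(self.hvparams.get(instance.hypervisor, {}))
          filled.update(instance.hvparams)   # instance values win
          return filled

      def FillBE(self, instance, be_type="default"):
          """Return the instance beparams with cluster defaults applied."""
          filled = dict(self.beparams.get(be_type, {}))
          filled.update(instance.beparams)
          return filled
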

Opcode changes
++++++++++++++

The parameter changes will have an impact on the OpCodes, especially on
the following ones:

- OpCreateInstance, where the new hv and be parameters will be sent as
  dictionaries; note that all hv and be parameters are now optional, as
  the values can be instead taken from the cluster
- OpQueryInstances, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
  dictionaries
- OpModifyInstance, where the modified parameters are sent as
  dictionaries

Additionally, we will need new OpCodes to modify the cluster-level
defaults for the be/hv sets of parameters.

Caveats
+++++++

One problem that might appear is that our classification is not
complete or not good enough, and we'll need to change this model. As
a last resort, we will need to roll back and keep the 1.2 style.

Another problem is that the classification of a parameter may be unclear
(e.g. ``network_port``, is this BE or HV?); in this case we'll take
the risk of having to move parameters later between classes.

Security
++++++++

The only security issue that we foresee is if some new parameters will
have sensitive values. If so, we will need to have a way to export the
config data while purging the sensitive values.

E.g. for the drbd shared secrets, we could export these with the
values replaced by an empty string.

Feature changes
---------------

The main feature-level changes will be:

- a number of disk related changes
- removal of the fixed two-disk, one-NIC per instance limitation

Disk handling changes
~~~~~~~~~~~~~~~~~~~~~

The storage options available in Ganeti 1.x were introduced based on
then-current software (first DRBD 0.7 then later DRBD 8) and the
estimated usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as 'rare'
events, and that each of them needs to be manually handled in order to ensure
data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or cluster
  size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
  automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal, but our
goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version (0.7),
  such that removing the support for the ``remote_raid1`` template and
  focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a static
  assignment (done at either instance creation time or change secondary time)
  will change the disk activation time from O(n) to O(1), which on big
  clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage types are
  ultimately backed by LVM volumes) by introducing file-based storage

Additionally, a number of smaller enhancements are also planned:

- support variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base design
changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical
  disks

- add support for DRBD8 authentication at handshake time in
  order to ensure each device connects to the correct peer

- remove the restrictions on failover only to the secondary,
  which create very strict rules on cluster allocation

DRBD minor allocation
+++++++++++++++++++++

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if we find
one that looks similar to our parameters and is already in the desired
state or not. Since this needs external commands to be run, it is very
slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that
device.

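A sketch of this static computation, assuming the simplest "first free
number" policy for illustration, could be::

  def find_free_minor(used_minors):
      """Return the first DRBD minor not yet assigned on this node.

      `used_minors` is the set of minors of all devices that should exist
      on the node according to the configuration; no external commands
      are needed, which is the whole point of the static model.
      """
      minor = 0
      while minor in used_minors:
          minor += 1
      return minor

  # e.g. find_free_minor(set([0, 1, 3])) == 2
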

At device activation, if the minor is already in use, we check if
it has our parameters; if not so, we just destroy the device (if
possible, otherwise we abort) and start it with our own
parameters.

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created drbd minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.

Removal of obsolete device types (md, drbd7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).

File-based storage support
++++++++++++++++++++++++++

This is covered by a separate design doc (*Vinales*) and
would allow us to get rid of the hard requirement for testing
clusters; it would also allow people who have SAN storage to do live
failover taking advantage of their storage solution.

Better LVM allocation
+++++++++++++++++++++

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks, however it introduces the problem that an instance could
end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate as close as possible all the storage of
one instance. We will still allow the logical volumes to spill over to
additional disks as needed.

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or at replacing individual disks, it's not easy enough to compute the
current disk map so we'll not attempt the clustering.

DRBD8 peer authentication at handshake
++++++++++++++++++++++++++++++++++++++

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this more to prevent connecting to the wrong
peer than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.


LVM self-repair (optional)
++++++++++++++++++++++++++

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` using the following
method (a sketch of the decision logic follows the list):

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected

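The decision flow above could be reduced to something like the
following sketch; the watcher-style "previous status" bookkeeping and
the ``repair_node`` hook (which would run ``vgreduce --removemissing``
and log the fact) are assumptions made only to illustrate the logic::

  def check_vg_status(prev_consistent, node_status, repair_node):
      """One iteration of the self-repair decision (illustrative only).

      node_status maps node name -> True/False (volume group consistent);
      returns the new "previous status" value to remember for next time.
      """
      bad_nodes = [name for name, ok in node_status.items() if not ok]
      if not bad_nodes:
          # everything consistent: nothing to repair, remember the good state
          return True
      if prev_consistent and len(bad_nodes) == 1:
          # exactly one node just became inconsistent: safe to auto-repair
          repair_node(bad_nodes[0])
      # more than one bad node, or it was already bad: do nothing
      return False
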
1032 |
Failover to any node |
1033 |
++++++++++++++++++++ |
1034 |
|
1035 |
With a modified disk activation sequence, we can implement the |
1036 |
*failover to any* functionality, removing many of the layout |
1037 |
restrictions of a cluster: |
1038 |
|
1039 |
- the need to reserve memory on the current secondary: this gets reduced to |
1040 |
a must to reserve memory anywhere on the cluster |
1041 |
|
1042 |
- the need to first failover and then replace secondary for an |
1043 |
instance: with failover-to-any, we can directly failover to |
1044 |
another node, which also does the replace disks at the same |
1045 |
step |
1046 |
|
1047 |
In the following, we denote the current primary by P1, the current |
1048 |
secondary by S1, and the new primary and secondaries by P2 and S2. P2 |
1049 |
is fixed to the node the user chooses, but the choice of S2 can be |
1050 |
made between P1 and S1. This choice can be constrained, depending on |
1051 |
which of P1 and S1 has failed. |
1052 |
|
1053 |
- if P1 has failed, then S1 must become S2, and live migration is not possible |
1054 |
- if S1 has failed, then P1 must become S2, and live migration could be |
1055 |
possible (in theory, but this is not a design goal for 2.0) |
1056 |
|
1057 |
The algorithm for performing the failover is straightforward: |
1058 |
|
1059 |
- verify that S2 (the node the user has chosen to keep as secondary) has |
1060 |
valid data (is consistent) |
1061 |
|
1062 |
- tear down the current DRBD association and setup a drbd pairing between |
1063 |
P2 (P2 is indicated by the user) and S2; since P2 has no data, it will |
1064 |
start resyncing from S2 |
1065 |
|
1066 |
- as soon as P2 is in state SyncTarget (i.e. after the resync has started |
1067 |
but before it has finished), we can promote it to primary role (r/w) |
1068 |
and start the instance on P2 |
1069 |
|
1070 |
- as soon as the P2?S2 sync has finished, we can remove |
1071 |
the old data on the old node that has not been chosen for |
1072 |
S2 |
1073 |
|
1074 |
Caveats: during the P2?S2 sync, a (non-transient) network error |
1075 |
will cause I/O errors on the instance, so (if a longer instance |
1076 |
downtime is acceptable) we can postpone the restart of the instance |
1077 |
until the resync is done. However, disk I/O errors on S2 will cause |
1078 |
dataloss, since we don't have a good copy of the data anymore, so in |
1079 |
this case waiting for the sync to complete is not an option. As such, |
1080 |
it is recommended that this feature is used only in conjunction with |
1081 |
proper disk monitoring. |
1082 |
|
1083 |
|
1084 |
Live migration note: While failover-to-any is possible for all choices |
1085 |
of S2, migration-to-any is possible only if we keep P1 as S2. |
1086 |
|
1087 |
Caveats |
1088 |
+++++++ |
1089 |
|
1090 |
The dynamic device model, while more complex, has an advantage: it |
1091 |
will not reuse by mistake another's instance DRBD device, since it |
1092 |
always looks for either our own or a free one. |
1093 |
|
1094 |
The static one, in contrast, will assume that given a minor number N, |
1095 |
it's ours and we can take over. This needs careful implementation such |
1096 |
that if the minor is in use, either we are able to cleanly shut it |
1097 |
down, or we abort the startup. Otherwise, it could be that we start |
1098 |
syncing between two instance's disks, causing dataloss. |
1099 |
|
1100 |
|
1101 |
Variable number of disk/NICs per instance |
1102 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
1103 |
|
1104 |
Variable number of disks |
1105 |
++++++++++++++++++++++++ |
1106 |
|
1107 |
In order to support high-security scenarios (for example read-only sda |
1108 |
and read-write sdb), we need to make a fully flexibly disk |
1109 |
definition. This has less impact that it might look at first sight: |
1110 |
only the instance creation has hardcoded number of disks, not the disk |
1111 |
handling code. The block device handling and most of the instance |
1112 |
handling code is already working with "the instance's disks" as |
1113 |
opposed to "the two disks of the instance", but some pieces are not |
1114 |
(e.g. import/export) and the code needs a review to ensure safety. |
1115 |
|
1116 |
The objective is to be able to specify the number of disks at |
1117 |
instance creation, and to be able to toggle from read-only to |
1118 |
read-write a disk afterwards. |
1119 |
|
1120 |
Variable number of NICs |
1121 |
+++++++++++++++++++++++ |
1122 |
|
1123 |
Similar to the disk change, we need to allow multiple network |
1124 |
interfaces per instance. This will affect the internal code (some |
1125 |
function will have to stop assuming that ``instance.nics`` is a list |
1126 |
of length one), the OS api which currently can export/import only one |
1127 |
instance, and the command line interface. |
1128 |
|
1129 |
Interface changes |
1130 |
----------------- |
1131 |
|
1132 |
There are two areas of interface changes: API-level changes (the OS |
1133 |
interface and the RAPI interface) and the command line interface |
1134 |
changes. |
1135 |
|
1136 |
OS interface |
1137 |
~~~~~~~~~~~~ |
1138 |
|
1139 |
The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2. The |
1140 |
interface is composed by a series of scripts which get called with certain |
1141 |
parameters to perform OS-dependent operations on the cluster. The current |
1142 |
scripts are: |
1143 |
|
1144 |
create |
1145 |
called when a new instance is added to the cluster |
1146 |
export |
1147 |
called to export an instance disk to a stream |
1148 |
import |
1149 |
called to import from a stream to a new instance |
1150 |
rename |
1151 |
called to perform the os-specific operations necessary for renaming an |
1152 |
instance |
1153 |
|
1154 |
Currently these scripts suffer from the limitations of Ganeti 1.2: for example |
1155 |
they accept exactly one block and one swap devices to operate on, rather than |
1156 |
any amount of generic block devices, they blindly assume that an instance will |
1157 |
have just one network interface to operate, they can not be configured to |
1158 |
optimise the instance for a particular hypervisor. |
1159 |
|
1160 |
Since in Ganeti 2.0 we want to support multiple hypervisors, and a non-fixed |
1161 |
number of network and disks the OS interface need to change to transmit the |
1162 |
appropriate amount of information about an instance to its managing operating |
1163 |
system, when operating on it. Moreover since some old assumptions usually used |
1164 |
in OS scripts are no longer valid we need to re-establish a common knowledge on |
1165 |
what can be assumed and what cannot be regarding Ganeti environment. |
1166 |
|
1167 |
|
1168 |
When designing the new OS API our priorities are: |
1169 |
- ease of use |
1170 |
- future extensibility |
1171 |
- ease of porting from the old api |
1172 |
- modularity |
1173 |
|
1174 |
As such we want to limit the number of scripts that must be written to support |
1175 |
an OS, and make it easy to share code between them by uniforming their input. |
1176 |
We also will leave the current script structure unchanged, as far as we can, |
1177 |
and make a few of the scripts (import, export and rename) optional. Most |
1178 |
information will be passed to the script through environment variables, for |
1179 |
ease of access and at the same time ease of using only the information a script |
1180 |
needs. |
1181 |
|
1182 |
|
1183 |
The Scripts |
1184 |
+++++++++++ |
1185 |
|
1186 |
As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs to |
1187 |
support the following functionality, through scripts: |
1188 |
|
1189 |
create: |
1190 |
used to create a new instance running that OS. This script should prepare the |
1191 |
block devices, and install them so that the new OS can boot under the |
1192 |
specified hypervisor. |
1193 |
export (optional): |
1194 |
used to export an installed instance using the given OS to a format which can |
1195 |
be used to import it back into a new instance. |
1196 |
import (optional): |
1197 |
used to import an exported instance into a new one. This script is similar to |
1198 |
create, but the new instance should have the content of the export, rather |
1199 |
than contain a pristine installation. |
1200 |
rename (optional): |
1201 |
used to perform the internal OS-specific operations needed to rename an |
1202 |
instance. |
1203 |
|
1204 |
If any optional script is not implemented Ganeti will refuse to perform the |
1205 |
given operation on instances using the non-implementing OS. Of course the |
1206 |
create script is mandatory, and it doesn't make sense to support the either the |
1207 |
export or the import operation but not both. |
1208 |
|
Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for 1.2 and
the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in 2.0 we'll
  use environment variables, as there will be a lot more information and not
  all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device the
  instance has, and import scripts once for every exported disk. Imported
  instances will be forced to have a number of disks greater or equal to the
  one of the export.
- Some scripts are not compulsory: if such a script is missing the relevant
  operations will be forbidden for instances of that OS. This makes it easier
  to distinguish between unsupported operations and no-op ones (if any).

Input
_____

Rather than using command line flags, as they do now, scripts will accept
inputs from environment variables. We expect the following input values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  this is used so that in the future we could have OSes supporting
  multiple versions and thus Ganeti can send the proper version in this
  parameter
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', 'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  The number of NICs this instance will have
DISK_<N>_PATH
  Path to the Nth disk.
DISK_<N>_ACCESS
  W if read/write, R if read only. OS scripts are not supposed to touch
  read-only disks, but the variables are still passed so the scripts know
  about them.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide', 'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop' or
  'file:blktap'
NIC_<N>_MAC
  MAC address for the Nth network interface
NIC_<N>_IP
  IP address for the Nth network interface, if available
NIC_<N>_BRIDGE
  Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes. Currently
  the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more may come
during the implementation and they will be documented in the ganeti-os-api man
page. All these variables will be available to all scripts.
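
As a purely illustrative sketch (the provisioning steps themselves are OS
specific and therefore only hinted at in comments), a create script could
consume these variables like this::

  #!/usr/bin/env python
  # Illustrative 'create' script skeleton: shows only how the variables
  # above would be consumed; real partitioning, formatting and OS
  # installation steps are OS specific and left out.
  import os
  import sys


  def env(name):
      """Return a required variable or abort with a message on stderr."""
      value = os.environ.get(name)
      if value is None:
          sys.stderr.write("missing %s in the environment\n" % name)
          sys.exit(1)
      return value


  def main():
      instance = env("INSTANCE_NAME")
      hypervisor = env("HYPERVISOR")
      api_version = env("OS_API_VERSION")

      if os.environ.get("DEBUG_LEVEL") == "1":
          sys.stderr.write("creating %s under %s (OS API %s)\n"
                           % (instance, hypervisor, api_version))

      for idx in range(int(env("DISK_COUNT"))):
          path = env("DISK_%d_PATH" % idx)
          if env("DISK_%d_ACCESS" % idx) != "W":
              continue  # read-only disks must not be touched
          # ... partition/format 'path'; install the OS on the first disk ...

      for idx in range(int(env("NIC_COUNT"))):
          mac = env("NIC_%d_MAC" % idx)
          # ... write the guest network configuration using 'mac', etc. ...


  if __name__ == "__main__":
      main()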
|
Some scripts will need a bit more information to work. These will have
per-script variables, such as for example:

OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The data
  must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The data must
  be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance name is
always needed and we could pass it on the command line. On the other hand,
though, this would force scripts to both access the environment and parse the
command line, so we'll move it to the environment for uniformity.)

|
Output/Behaviour
________________

As discussed, scripts should only send user-targeted information to stderr.
The create and import scripts are supposed to format/initialise the given
block devices and install the correct instance data. The export script is
supposed to export instance data to stdout in a format understandable by the
import script. The data will be compressed by Ganeti, so no compression should
be done. The rename script should only modify the instance's knowledge of what
its name is.
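
Given this behaviour, an export script can be very small. The following sketch
(illustrative only) simply streams the snapshot device to stdout; a matching
import script would do the reverse, reading stdin into IMPORT_DEVICE::

  #!/usr/bin/env python
  # Illustrative 'export' script: stream the snapshot device to stdout.
  # Ganeti compresses the stream itself, so no compression is done here.
  import os
  import shutil
  import sys


  def main():
      device = os.environ["EXPORT_DEVICE"]
      # Binary-safe stdout handle (sys.stdout.buffer where it exists).
      out = getattr(sys.stdout, "buffer", sys.stdout)
      with open(device, "rb") as src:
          shutil.copyfileobj(src, out)


  if __name__ == "__main__":
      main()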
|
Other declarative style features
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing a list of numbers matching the version(s)
of the API they implement. Ganeti itself will always be compatible with one
version of the API and may maintain backwards compatibility if it's feasible
to do so. The numbers are one-per-line, so an OS supporting both version 5 and
version 20 will have a file containing two lines. This is different from
Ganeti 1.2, which only supported one version number.

In addition to that, an OS will be able to declare that it supports only a
subset of the Ganeti hypervisors, by declaring them in the 'hypervisors' file.
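
For example, an OS supporting API versions 5 and 20 and only the Xen
hypervisors would ship files like the following (hypervisor names shown
purely as an example)::

  $ cat ganeti_api_version
  5
  20
  $ cat hypervisors
  xen-pvm
  xen-hvm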
|
Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just dumps all
disks and restores them. This can save work as most systems will just do this,
while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there will be
enough space to store the information we need. If we discover that this is not
the case we may want to go to a more complex API, such as storing this
information on the filesystem and providing the OS script with the path to a
file where it is encoded in some format.

|

Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti RAPI was designed and deployed with the Ganeti 1.2.5 release.
That version provides read-only access to the cluster state. A fully
functional read-write API demands significant internal changes, which are in
the pipeline for the Ganeti 2.0 release.

We decided to implement the Ganeti RAPI in a RESTful way, which is aligned
with the key features we are looking for: it is a simple, stateless, scalable
and extensible paradigm for API implementation. As transport it uses HTTP over
SSL, and we are implementing it with JSON encoding, but in a way that makes it
possible to extend and provide any other encoding.

|
Design
++++++

The Ganeti API is implemented as an independent daemon, running on the same
node and with the same permission level as the Ganeti master daemon.
Communication is done through the UNIX socket protocol provided by the Ganeti
luxi library. In order to keep communication asynchronous, RAPI processes two
types of client requests:

- queries: the server is able to answer immediately
- jobs: some time is needed

In the query case the requested data is sent back to the client in the HTTP
body. Typical examples of queries would be: list of nodes, instances, cluster
info, etc. When dealing with jobs, instead of waiting until the job completes,
the client receives a job id, an identifier which allows it to query the job's
progress in the job queue (see the job queue design doc for details).
|
1365 |
Internally, each exported object has an version identifier, which is used as a |
1366 |
state stamp in the http header E-Tag field for request/response to avoid a race |
1367 |
condition. |
1368 |
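
As an illustration of the mechanism only (paths and stamp value are made up),
a client that wants to update an object it previously fetched echoes the stamp
back, and the server refuses the change if the object was modified in the
meantime::

  GET /example-cluster/instances/instance1
      -> 200 OK, ETag: "4711"

  PUT /example-cluster/instances/instance1
      If-Match: "4711"
      -> 412 Precondition Failed, if the object changed since the GET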
|

Resource representation
+++++++++++++++++++++++

The key difference of the REST approach from other APIs is that, instead of
having one URI for all our requests, REST demands that each resource be served
separately through a unique URI. Each of them should be handled through a
limited set of stateless, standard HTTP methods: GET, POST, DELETE, PUT.

For example, in the Ganeti case we can have a set of URIs:

- /{clustername}/instances
- /{clustername}/instances/{instancename}
- /{clustername}/instances/{instancename}/tag
- /{clustername}/tag

A GET request to /{clustername}/instances will return the list of instances, a
POST to /{clustername}/instances should create a new instance, a DELETE of
/{clustername}/instances/{instancename} should delete an instance, and a GET
of /{clustername}/tag should return the cluster tags.

Each resource URI has a version prefix. The complete list of resources is TBD.

Internal encoding might be JSON, XML, or any other format. The JSON encoding
fits nicely with the Ganeti RAPI needs. The client can request a specific
representation via the Accept field in the HTTP header.
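
For illustration only (the version prefix '2', the position of the cluster
name and the response payload are all assumptions for this example), a JSON
query could then look like::

  GET /2/example-cluster/instances HTTP/1.1
  Accept: application/json

  HTTP/1.1 200 OK
  Content-Type: application/json

  ["instance1.example.com", "instance2.example.com"]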
|
The REST approach uses standard HTTP as the application protocol (not just as
a transport) for resource access. The set of possible result codes is a subset
of the standard HTTP result codes. Statelessness provides additional
reliability and transparency to operations.


Security
++++++++

With the write functionality, security becomes a much bigger issue. The Ganeti
RAPI uses basic HTTP authentication on top of an SSL connection to grant
access to an exported resource. The password is stored locally in an
Apache-style .htpasswd file. Only one level of privileges is supported.


Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled:

- extend and modify the command line syntax to support new features
- ensure consistent patterns in command line arguments to reduce
  cognitive load

The design changes that require these changes are, in no particular
order:
|
- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance,
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

As such, there are several areas of Ganeti where the command line
arguments will change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.
|
Notes about device removal/addition
+++++++++++++++++++++++++++++++++++

To avoid problems with device location changes (e.g. second network
interface of the instance becoming the first or third and the like)
the list of network/disk devices is treated as a stack, i.e. devices
can only be added/removed at the end of the list of devices of each
class (disk or network) for each instance.

gnt-instance commands
+++++++++++++++++++++

The commands for gnt-instance will be modified and extended to allow
for the new functionality:

- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is::

  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
  to. Accepts either a valid bridge name (the specified bridge must
  exist on the node(s)) as string or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.
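
As an illustration (the bridge name is a placeholder and the remaining
creation arguments are omitted), an instance with two network interfaces
could be requested as:

Example: gnt-instance add --net 0:mac=auto,bridge=xen-br0 --net 1:bridge=auto ... test-instance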
|
Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is::

  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude suffix
  (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
  which case the default disk size will be used. If the size option is
  not specified, 'auto' is assumed. This option is not valid for all
  disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - *w*: read/write access to the disk device or
  - *r*: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.
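
As an illustration (sizes are placeholders and the remaining creation
arguments are omitted), a two-disk instance with a small read-only second
disk could be requested as:

Example: gnt-instance add --disk 0:size=10G --disk 1:size=512,access=r ... test-instance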
|
Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax to the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.

Example: gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance
modify. The same device specific options as for adding devices are
used. Instead of a device number and further device options, only the
magic string remove is specified. It will always remove the last
device in the list of devices of this type for the instance specified,
e.g. if the instance has disk devices 0, 1, 2 and 3, the disk device
number 3 will be removed.

Example: gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax to the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.

Example: gnt-instance modify --disk 2:access=r test-instance

Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the --hypervisor option. The generic syntax of the hypervisor
option is as follows::

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm

:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string

The hypervisor option for an instance can be set at instance creation
time via the gnt-instance add command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.
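
As an illustration, reusing the option values from the modify example below
(the remaining creation arguments are omitted), an HVM instance booting from
a CDROM image could be created with:

Example: gnt-instance add --hypervisor xen-hvm:boot_order=cdrom:network,cdrom=/srv/boot.iso ... test-instance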
|
Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the --hypervisor option of the gnt-instance modify command.
However, the hypervisor type of an existing instance cannot be
changed, only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.

Example: gnt-instance modify --hypervisor
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

gnt-cluster commands
++++++++++++++++++++

The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:

- The init command will be extended to support the --defaults option
  to set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
++++++++++++++++

The generic format of the cluster default setting option is::

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Currently, the following cluster default options are defined (open to
further changes):

:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disksize for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of megabyte is assumed, or a positive number followed by a supported
  magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).
|
1656 |
Hypervisor cluster defaults |
1657 |
+++++++++++++++++++++++++++ |
1658 |
|
1659 |
The generic format of the hypervisor clusterwide default setting option is: |
1660 |
|
1661 |
--hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] |
1662 |
|
1663 |
:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want |
1664 |
to set, string |
1665 |
:$OPTION: cluster default option, string, |
1666 |
:$VALUE: cluster default option value, string. |
1667 |
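
As an illustration (the option values are placeholders, and it is assumed
that --hypervisor-defaults is accepted by the same init/modify commands as
--defaults), the cluster-wide and per-hypervisor defaults could be changed
with:

Example: gnt-cluster modify --defaults hypervisor=xen-pvm,disksize=10G,bridge=xen-br0

Example: gnt-cluster modify --hypervisor-defaults xen-hvm:boot_order=cdrom:network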
|

Functionality changes
---------------------

The disk storage will receive some changes, and will also remove
support for the drbd7 and md disk types. See the
design-2.0-disk-changes document.

The configuration storage will be changed, with the effect that more
data will be available on the nodes for access from outside Ganeti
(e.g. from shell scripts) and that nodes will get slightly more
awareness of the cluster configuration.

The RAPI will enable modify operations (besides the read-only queries
that are available today), so in effect almost all the operations
available today via the ``gnt-*`` commands will be available via the
remote API.

A change in the hypervisor support area will be that we will support
multiple hypervisors in parallel in the same cluster, so one could run
Xen HVM side-by-side with Xen PVM on the same cluster.

New features
------------

There will be a number of minor feature enhancements targeted to
either 2.0 or subsequent 2.x releases:

- multiple disks, with custom properties (read-only/read-write,
  exportable, etc.)
- multiple NICs

These changes will require OS API changes, details of which are in the
design-2.0-os-interface document. They will also require many command
line changes, as described in the design-2.0-commandline-parameters
document.