=================
Ganeti 2.0 design
=================

This document describes the major changes in Ganeti 2.0 compared to
the 1.2 version.

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents:: :depth: 3

Objective
=========

Ganeti 1.2 has many scalability issues and restrictions due to its
roots as software for managing small and 'static' clusters.

Version 2.0 will attempt to remedy first the scalability issues and
then the restrictions.

Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

- only one operation at a time on the cluster [#]_
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster is not allowed

It also has a number of artificial restrictions, due to historical design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
   and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible code base for future developments.

Among these problems, the single-operation-at-a-time restriction is
the biggest issue with the current version of Ganeti. It is such a big
impediment in operating bigger clusters that one is often tempted
to remove the lock just to do a simple operation like starting an
instance while an OS installation is running.
Scalability problems
--------------------

Ganeti 1.2 has a single global lock, which is used for all cluster
operations. This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for
  example failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive
should run in parallel (for example because they happen on different
nodes) are actually stalling each other while waiting for the global
lock, without a real reason for that to happen.

One of the main causes of this global lock (besides the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no long-lived process in Ganeti
that can coordinate multiple operations. Each command tries to acquire
the so called *cmd* lock and when it succeeds, it takes complete
ownership of the cluster configuration and state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
instances per node, which is no longer true with today's hardware.

Artificial restrictions
-----------------------

Ganeti 1.2 (and previous versions) have a fixed two-disk, one-NIC per
instance model. This is a purely artificial restriction, but it
touches so many areas (configuration, import/export, command line)
that fixing it is more fitted to a major release than a minor one.

Architecture issues
-------------------

The fact that each command is a separate process that reads the
cluster state, executes the command, and saves the new state is also
an issue on big clusters where the configuration data for the cluster
begins to be non-trivial in size.
Overview
========

In order to solve the scalability problems, a rewrite of the core
design of Ganeti is required. While the cluster operations themselves
won't change (e.g. start instance will do the same things), the way
these operations are scheduled internally will change radically.

The new design will change the cluster architecture to:

.. image:: arch-2.0.png

This differs from the 1.2 architecture by the addition of the master
daemon, which will be the only entity to talk to the node daemons.

Detailed design
===============

The changes for 2.0 can be split into roughly three areas:

- core changes that affect the design of the software
- features (or restriction removals) that do not have a wide impact on
  the design
- user-level and API-level changes which translate into differences for
  the operation of the cluster

Core changes
------------

The main change will be switching from a per-process model to a
daemon-based model, where the individual gnt-* commands will be
clients that talk to this daemon (see `Master daemon`_). This will
allow us to get rid of the global cluster lock for most operations,
having instead a per-object lock (see `Granular locking`_). Also, the
daemon will be able to queue jobs, and this will allow the individual
clients to submit jobs without waiting for them to finish, and also
to see the results of old requests (see `Job Queue`_).

Besides these major changes, another 'core' change, though less
visible to the users, will be changing the model of object attribute
storage, separating it into namespaces (such that a Xen PVM
instance will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. For more details see
`Object parameters`_.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use HTTP(S) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

Master daemon
~~~~~~~~~~~~~

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so called *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:

- CLI tools might access the nodes via SSH (for ``gnt-cluster copyfile``
  and ``gnt-cluster command``)
- master failover is a special case in which a non-master node will SSH
  and do node-RPC calls to the current master

The protocol between the master daemon and the node daemons will be
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S),
using a simple PUT/GET of JSON-encoded messages. This is done due to
difficulties in working with the Twisted framework and its protocols
in a multithreaded environment, which we can overcome by using a
simpler stack (see the caveats section).

The protocol between the CLI/RAPI and the master daemon will be a
custom one (called *LUXI*): on a UNIX socket on the master node, with
rights restricted by filesystem permissions, the CLI/RAPI will talk to
the master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.

For more details about the RAPI daemon see `Remote API changes`_, and
for the node daemon see `Node daemon changes`_.

The LUXI protocol
+++++++++++++++++

As described above, the protocol for making requests or queries to the
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
messages.

The choice of a UNIX socket was made in order to get rid of the need
for authentication and authorisation inside Ganeti; for 2.0, the
permissions on the Unix socket itself will determine the access
rights.

We will have two main classes of operations over this API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
still implemented internally with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized as a query function)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details of the actual operation list, see the `Job Queue`_.

Both requests and responses will consist of a JSON-encoded message
followed by the ``ETX`` character (ASCII decimal 3), which is not a
valid character in JSON messages and thus can serve as a message
delimiter. The contents of the messages will be a dictionary with two
fields:

:method:
  the name of the method called
:args:
  the arguments to the method, as a list (no keyword arguments allowed)

Responses will follow the same format, with the two fields being:

:success:
  a boolean denoting the success of the operation
:result:
  the actual result, or error message in case of failure

There are two special values for the result field:

- if the operation failed and this field is a list of length two, the
  client library will try to interpret it as an exception, the first
  element being the exception type and the second one the actual
  exception arguments; this allows a simple method of passing
  Ganeti-related exceptions across the interface
- for the *WaitForChange* call (which waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
  internally retry the call; this is done in order to differentiate
  internally between a hung master daemon and a job that simply has
  not changed

Users of the API that don't use the provided python library should
take care of the above two cases.

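As a sketch, the framing described above (a JSON message terminated by ``ETX``, plus error handling for the two-element list case) could be implemented as follows; the function names here are illustrative, not the actual Ganeti client library API:

```python
import json

ETX = chr(3)  # ASCII decimal 3, the message delimiter


def encode_request(method, args):
    # A request is a JSON dict with 'method' and 'args' (a list of
    # positional arguments), terminated by the ETX character.
    return json.dumps({"method": method, "args": args}) + ETX


def decode_response(data):
    # Strip the ETX terminator and parse the JSON payload.
    payload, _, _ = data.partition(ETX)
    msg = json.loads(payload)
    if not msg["success"]:
        result = msg["result"]
        if isinstance(result, list) and len(result) == 2:
            # Two-element list: (exception type, exception arguments).
            raise RuntimeError("%s: %s" % (result[0], result[1]))
        raise RuntimeError(result)
    return msg["result"]
```

A caller would then retry ``WaitForChange`` whenever the decoded result equals ``"nochange"``, as described above.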

Master daemon implementation
++++++++++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other thread (pools).

There will be two other classes of threads in the daemon:

- job processing threads, part of a thread pool, which are long-lived,
  started at daemon startup and terminated only at shutdown time
- client I/O threads, which are the ones that talk the local protocol
  (LUXI) to the clients, and are short-lived

Master startup/failover
+++++++++++++++++++++++

In Ganeti 1.x there is no protection against failing over the master
to a node with a stale configuration. In effect, the responsibility
for correct failovers falls on the admin. This is true both for the
new master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will run whenever a node is ready to
transition to the master role, either at startup time or at node
failover:

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      highest job ID on the job queue)

    - there is not even a single node having a newer
      configuration file

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

    - if any of the above is false, we prevent the current operation
      (i.e. we don't become the master)

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

Since due to exceptional conditions we could have a situation in which
no node can become the master due to inconsistent data, we will have
an override switch for the master daemon startup that will assume the
current node has the right data and will replicate all the
configuration files to the other nodes.

**Note**: the above algorithm is by no means an election algorithm; it
is a *confirmation* of the master role currently held by a node.

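The quorum check in step 2 could be sketched as below; the per-node report layout (a tuple of config serial, highest job ID, and believed master) and all names are assumptions for illustration, not the real Ganeti internals:

```python
def confirm_master_role(my_serial, my_top_job, my_name, reports, failover):
    # reports: one (config_serial, top_job_id, believed_master) tuple
    # per node that answered the query.
    quorum = len(reports) // 2 + 1  # at least half plus one
    agree = 0
    for serial, top_job, master in reports:
        if serial > my_serial or top_job > my_top_job:
            # Even a single node with newer data vetoes the transition.
            return False
        ok = (serial == my_serial and top_job == my_top_job)
        if not failover:
            # When just starting (not failing over), the quorum must
            # also agree that we are the designated master.
            ok = ok and master == my_name
        if ok:
            agree += 1
    return agree >= quorum
```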
Logging
+++++++

The logging system will be switched completely to the standard python
logging module; currently it's logging-based, but exposes a different
API, which is just overhead. As such, the code will be switched over
to standard logging calls, and only the setup will be custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one logfile per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools

Since the `watcher`_ will only submit jobs to the master for startup
of the instances, its log file will contain less information than
before, mainly that it will start the instance, but not the results.

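A minimal sketch of the kind of custom setup implied above (the log file names come from the list; the handler and format details are assumptions, not Ganeti's actual setup code):

```python
import logging

def setup_daemon_logging(logfile, debug=False):
    # Only the setup is custom; everything else uses plain standard
    # logging calls, e.g. logging.info("starting instance %s", name).
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.DEBUG if debug else logging.INFO)
    return root
```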
Node daemon changes
+++++++++++++++++++

The only change to the node daemon is that, since we need better
concurrency, we don't process the inter-node RPC calls in the node
daemon itself, but fork and process each request in a separate
child.

Since we don't have many calls, and we only fork (not exec), the
overhead should be minimal.

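The fork-per-request pattern can be sketched as follows (a simplification, assuming a POSIX system; the real node daemon dispatches actual RPC handlers):

```python
import os

def process_in_child(handler, request):
    # Fork for each incoming RPC request: the child runs the (possibly
    # slow) handler and exits, while the parent returns immediately to
    # accept further requests. Only fork, no exec, so the per-request
    # overhead stays low.
    pid = os.fork()
    if pid == 0:
        try:
            handler(request)
        finally:
            os._exit(0)  # never fall back into the parent's accept loop
    return pid
```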
Caveats
+++++++

A discussed alternative is to keep the current model of individual
processes touching the cluster configuration. The reasons we have not
chosen this approach are:

- the cost of reading and deserializing the cluster state
  today is not small enough that we can ignore it; the addition of
  the job queue will make the startup cost even higher. While this
  runtime cost is low, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines of "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized' and written
  in an asynchronous manner, using deferreds; while this method works,
  it's not a common way to code and it requires that the entire process
  workflow is based around a single *reactor* (Twisted's name for a
  main loop)
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that the business logic is
  inseparable from the protocol coding; we felt that this is an
  unreasonable requirement, and that a good protocol library should
  allow complete separation of low-level protocol calls and business
  logic; by comparison, the threaded approach combined with the HTTP(S)
  protocol required (for the first iteration) absolutely no changes
  from the 1.2 code, and later changes for optimizing the inter-node
  RPC calls required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is Twisted API stability: during the Ganeti 1.x
lifetime, we repeatedly had to implement workarounds for changes in
the Twisted version, so that for example 1.2 is able to use both
Twisted 2.x and 8.x.

In the end, since we already had an HTTP server library for the RAPI,
we just reused that for inter-node communication.

Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:

- we preserve data coherency
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a non-goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised, and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti-level operations,
aka Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its
scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent to their users
- automatically grabbing multiple locks in the right order (to avoid
  deadlock)
- ability to transparently handle conversion to more granularity
- support for asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a
distributed operation. Therefore, in case of master failure, the
operations currently running will be aborted and the locks will be
lost; it remains to the administrator to clean up (if needed) the
operation result (e.g. make sure an instance is either installed
correctly or removed).

A corollary of this is that a master-failover operation with both
masters alive needs to happen while no operations are running, and
therefore no locks are held.

All the locks will be represented by objects (like
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time.
Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and will fail.

The Locks
+++++++++

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks,
and the node locks before the config lock. Locks will need to be
acquired at the same time for multiple instances and nodes, and the
internal ordering will be handled by the locking library, which, for
simplicity, will just use alphabetical order.

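The ordered multi-lock acquisition can be sketched as a toy model (the real ``lockings.SharedLock`` is richer, supporting shared/exclusive modes; this sketch only shows the alphabetical-ordering rule, and its names are illustrative):

```python
import threading

class LockSet(object):
    """Toy model of ordered multi-lock acquisition.

    Locks are always taken in alphabetical order of their names, so
    two concurrent acquirers of overlapping sets cannot deadlock.
    """

    def __init__(self, names):
        self._locks = dict((name, threading.Lock()) for name in names)

    def acquire(self, names):
        # Sorting gives a globally consistent acquisition order.
        acquired = []
        for name in sorted(names):
            self._locks[name].acquire()
            acquired.append(name)
        return acquired

    def release(self, names):
        for name in names:
            self._locks[name].release()
```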
Each lock has the following three possible statuses:

- unlocked (anyone can grab the lock)
- shared (anyone can grab/have the lock, but only in shared mode)
- exclusive (no one else can grab/have the lock)

Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each
time we split a lock into several we'll create a "metalock", which
will depend on those sub-locks and live for the time necessary for all
the code to convert (or forever, in some conditions). When a metalock
exists, all converted code must acquire it in shared mode, so it can
run concurrently, but still be exclusive with old code, which acquires
it exclusively.

In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding it excludes any other concurrent Ganeti operations.

We might also want to devise more metalocks (e.g. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what they need without specifying it explicitly.

In the future things like the node locks could become metalocks, should
we decide to split them into an even more fine-grained approach, but
this will probably happen only after the first 2.0 version has been
released.

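The shared-vs-exclusive behaviour of a metalock is essentially a readers-writer lock: converted code takes it shared (and can overlap with other converted code), while unconverted code takes it exclusively. A toy sketch, assuming this semantic (Ganeti's actual implementation lives in the locking library):

```python
import threading

class MetaLock(object):
    """Toy metalock: many shared holders OR one exclusive holder."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0          # converted code, shared mode
        self._exclusive = False    # old code, exclusive mode

    def acquire(self, shared=True):
        self._cond.acquire()
        try:
            if shared:
                while self._exclusive:
                    self._cond.wait()
                self._sharers += 1
            else:
                while self._exclusive or self._sharers:
                    self._cond.wait()
                self._exclusive = True
        finally:
            self._cond.release()

    def release(self):
        self._cond.acquire()
        try:
            if self._exclusive:
                self._exclusive = False
            else:
                self._sharers -= 1
            self._cond.notify_all()
        finally:
            self._cond.release()
```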
Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created, an associated lock must
be added to the list. The relevant code will need to inform the locking
library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.

When instances or nodes disappear from the cluster, the relevant locks
must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively already, and thus deals
with metalocks exactly as normal code acquiring those locks. Any
operation queuing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on
available locks, rather than making them just blindly queue for
acquiring them. The inherent risk, though, is that any code using the
first operation, or setting a timeout for the second one, is
susceptible to starvation and thus may never be able to get the
required locks and complete certain tasks. Considering this, providing
and using these operations should not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level. In the future we may want to split logical units into
independent "tasklets" with their own locking requirements. A different
design doc (or mini design doc) will cover the move from Logical Units
to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoids possible deadlocks.
Of course, extra care must be taken not to leave locked structures in
an unusable state. Note that with Python 2.5 a simpler syntax will be
possible, but we want to keep compatibility with Python 2.4 so the new
constructs should not be used.

In order to avoid this extra indentation and code changes everywhere
in the Logical Units code, we decided to allow LUs to declare locks,
and then execute their code with their locks acquired. In the new
world LUs are called like this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because
  # sometimes we can't know which resource we need before locking the
  # previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact
documentation on how locks are supposed to be declared.

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code
to granular locking without breaking everything, and it will also
guard against a lot of common errors. Code switching from the old
"lock everything" lock to the new system, though, needs to be
carefully scrutinised to be sure it is really acquiring all the
necessary locks, and that none has been overlooked or forgotten.

The code can contain other locks outside of this library, to
synchronise other threaded code (e.g. for the job queue), but in
general these should be leaf locks or carefully structured non-leaf
ones, to avoid deadlock race conditions.

Job Queue
~~~~~~~~~

Granular locking is not enough to speed up operations; we also need a
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes``, which are the basic
elements of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
will generate N opcodes of type replace disks).

Job execution: "Life of a Ganeti job"
+++++++++++++++++++++++++++++++++++++

#. The job gets submitted by the client. A new job identifier is
   generated and assigned to the job. The job is then automatically
   replicated [#replic]_ to all nodes in the cluster. The identifier
   is returned to the client.
#. A pool of worker threads waits for new jobs. If all are busy, the
   job has to wait and the first worker finishing its work will grab
   it. Otherwise any of the waiting threads will pick up the new job.
#. The client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history
   directory. There will be a method to archive all jobs older than a
   given age.

.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that it is running the master daemon, but if it fails and we
   do a master failover, the jobs are still visible on the new master
   (though marked as failed).

Failures to replicate a job to other nodes will only be flagged as
errors in the master daemon log if more than half of the nodes failed;
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.

Future improvements will look into checking the consistency of the
job list and the jobs themselves at master daemon startup.

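The worker-pool behaviour in step 2 (long-lived threads, whichever is free first grabs the next job) can be sketched like this; the queue-based hand-off and the names are illustrative assumptions, not the actual Ganeti job queue code:

```python
import queue
import threading

def start_workers(n, job_queue, run_job):
    # A pool of long-lived worker threads: each blocks on the shared
    # queue, so whichever worker is free first picks up the next job.
    def worker():
        while True:
            job = job_queue.get()
            if job is None:  # shutdown sentinel
                return
            run_job(job)
            job_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    return threads
```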
Job storage
+++++++++++

Jobs are stored in the filesystem as individual files, serialized
using JSON (the standard serialization mechanism in Ganeti).

The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- a single big file with all job data: not feasible due to difficult
  updates
- in-process databases: it is hard to replicate the entire database to
  the other nodes, and replicating individual operations does not mean
  we keep consistency

667 |
Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a
temporary file and subsequently renaming it. Except for log messages,
every change in a job is stored and replicated to other nodes.

::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)


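The atomic-replacement rule above can be sketched in Python as
follows; ``write_job_file`` and its exact layout are illustrative
only, not the actual Ganeti code:

```python
import json
import os
import tempfile

def write_job_file(queue_dir, job_id, job_data):
    """Atomically replace a job file in the queue directory.

    The job is serialized to JSON, written to a temporary file in the
    same directory, and then renamed over the final name; rename() is
    atomic on POSIX filesystems, so readers never observe a partially
    written job file.
    """
    final_name = os.path.join(queue_dir, "job-%d" % job_id)
    fd, tmp_name = tempfile.mkstemp(dir=queue_dir, prefix=".job-tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(job_data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure data hits the disk
        os.rename(tmp_name, final_name)  # atomic replacement
    except Exception:
        os.unlink(tmp_name)  # don't leave temporary files behind
        raise
    return final_name
```

The same file content can then be shipped verbatim to the other
candidate nodes via the ``jobqueue_update`` RPC described below.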
Locking
+++++++

Locking in the job queue is a complicated topic. It is called from
more than one thread and must be thread-safe. For simplicity, a
single lock is used for the whole job queue.

A more detailed description can be found in doc/locking.txt.


Internal RPC
++++++++++++

RPC calls available between the Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job has changed is defined by the fields passed
  and the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail
  if the job has not been canceled or finished.


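To illustrate the intended semantics (serial job identifiers, cancel
only before a job starts, archive only after finalization), here is a
toy in-memory model; ``ToyMasterQueue`` is a sketch written for this
document, not the real master daemon interface:

```python
class ToyMasterQueue:
    """In-memory sketch of the client RPC semantics described above.

    Illustrative model only: job identifiers are unique for the
    queue's lifetime (cf. the "serial" file), CancelJob only succeeds
    while a job is still queued, and ArchiveJob only succeeds for
    canceled or finished jobs.
    """
    def __init__(self):
        self._serial = 0   # last job ID used
        self._jobs = {}    # job_id -> {"ops": [...], "status": ...}
        self._archive = {}

    def SubmitJob(self, ops):
        self._serial += 1
        self._jobs[self._serial] = {"ops": list(ops), "status": "Queued"}
        return self._serial

    def QueryJobs(self, job_ids, fields):
        return [[self._jobs[j][f] for f in fields] for j in job_ids]

    def CancelJob(self, job_id):
        job = self._jobs[job_id]
        if job["status"] != "Queued":  # already running or finalized
            return False
        job["status"] = "Canceled"
        return True

    def ArchiveJob(self, job_id):
        job = self._jobs[job_id]
        if job["status"] not in ("Canceled", "Success", "Error"):
            return False
        self._archive[job_id] = self._jobs.pop(job_id)
        return True
```

A client would normally call ``SubmitJob`` and then poll with
``WaitForJobChange``/``QueryJobs`` until the job reaches one of the
finalized states listed in the next section.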
Job and opcode status
+++++++++++++++++++++

Each job and each opcode has, at any time, one of the following
states:

Queued
  The job/opcode was submitted, but did not yet start.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set
to the Error status once the master is started again.


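A minimal sketch of these states and the master-restart rule (the
state names mirror the list above; the helper functions are invented
for illustration):

```python
# Job/opcode states as listed above.
QUEUED, WAITING, RUNNING = "Queued", "Waiting", "Running"
CANCELED, SUCCESS, ERROR = "Canceled", "Success", "Error"

# Finalized states: a job in one of these can no longer change and
# becomes eligible for archiving.
FINALIZED_STATES = frozenset([CANCELED, SUCCESS, ERROR])

def is_finalized(status):
    """Return True if the job/opcode can no longer change state."""
    return status in FINALIZED_STATES

def finalize_after_master_restart(jobs):
    """Apply the rule above: jobs still marked Running when the master
    restarts are set to the Error status; all others are untouched."""
    return dict((jid, ERROR if status == RUNNING else status)
                for jid, status in jobs.items())
```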
History
+++++++

Archived jobs are kept in a separate directory,
``/var/lib/ganeti/queue/archive/``. This is done in order to speed up
queue handling: by default, the jobs in the archive are not touched by
any functions. Only the current (unarchived) jobs are parsed, loaded,
and verified (if implemented) by the master daemon.


Ganeti updates
++++++++++++++

The queue has to be completely empty for Ganeti updates that change
the job queue structure. In order to allow this, there will be a way
to prevent new jobs from entering the queue.


Object parameters
~~~~~~~~~~~~~~~~~

Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model

#. node parameters

#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances

#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of
   VCPUs, or the size of memory

#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)


The following definitions for instance parameters will be used below:

:hypervisor parameter:
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for `PVM`_ which makes no sense for `HVM`_).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence at all the
  hypervisor behaviour.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the
  prefix "BE\_" and the whole list of parameters will exist in the
  set "BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of
  a LV, or the MAC of a NIC)

As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “null”) will no longer be a valid value for a parameter.
As such, only non-default parameters will be saved as part of objects
in the serialization step, reducing the size of the serialized
format.

Cluster parameters
++++++++++++++++++

Cluster parameters remain as today, attributes at the top level of
the Cluster object. In addition, two new attributes at this level
will hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default value for backend parameters


Node parameters
+++++++++++++++

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

There are three new node flags, described in a separate section "node
flags" below.


Instance parameters
+++++++++++++++++++

As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.

The “hvparams” and “beparams” are kept in two dictionaries at
instance level. Only non-default parameters are stored (but once
customized, a parameter will be kept, even with the same value as the
default one, until reset).

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. ``instance.hvparams.vnc_console_port`` instead of using both
``instance.hvparams.hvm_vnc_console_port`` and
``instance.hvparams.kvm_vnc_console_port``.

There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have only globally-per-instance such
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
exported as of the same type).

Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:

|
878 |
- name (P) |
879 |
- primary_node (P) |
880 |
- os (P) |
881 |
- hypervisor (P) |
882 |
- status (P) |
883 |
- memory (BE) |
884 |
- vcpus (BE) |
885 |
- nics (P) |
886 |
- disks (P) |
887 |
- disk_template (P) |
888 |
- network_port (P) |
889 |
- kernel_path (HV) |
890 |
- initrd_path (HV) |
891 |
- hvm_boot_order (HV) |
892 |
- hvm_acpi (HV) |
893 |
- hvm_pae (HV) |
894 |
- hvm_cdrom_image_path (HV) |
895 |
- hvm_nic_type (HV) |
896 |
- hvm_disk_type (HV) |
897 |
- vnc_bind_address (HV) |
898 |
- serial_no (P) |
899 |
|
900 |
|
901 |
Parameter validation
++++++++++++++++++++

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.

The hypervisor support implementation API will be extended with the
following features:

:PARAMETERS: class-level attribute holding the list of valid
  parameters for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a
  class method that can be called from within master code (i.e.
  cmdlib) and should be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from backend.py code, and as such can
  make node-specific checks (e.g. kernel_path checking)

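The name-only check can be sketched as follows; the class names and
method bodies are illustrative (the real hypervisor classes live in
Ganeti's hypervisor support code):

```python
class BaseHypervisor:
    """Sketch of the extended hypervisor API described above."""
    PARAMETERS = frozenset()

    @classmethod
    def CheckParamSyntax(cls, hvparams):
        """Check that all parameter *names* are known to this hypervisor.

        Safe to call from master code (cmdlib), since it only compares
        hvparams.keys() with cls.PARAMETERS and never inspects the
        node's state.
        """
        invalid = set(hvparams.keys()) - cls.PARAMETERS
        if invalid:
            raise ValueError("Invalid parameters: %s"
                             % ", ".join(sorted(invalid)))

class FakePvmHypervisor(BaseHypervisor):
    # Hypothetical parameter list, for illustration only.
    PARAMETERS = frozenset(["kernel_path", "initrd_path"])
```

Value validation (``ValidateParameters``) is deliberately separate:
it runs on the target node, where e.g. the existence of a kernel path
can actually be checked.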
Default value application
+++++++++++++++++++++++++

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:

- ``Cluster.FillHV(instance)``, which returns the 'filled' hvparams
  dict, based on the instance's hvparams and the cluster's
  ``hvparams[instance.hypervisor]``

- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  filled beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the
RpcRunner when sending an instance for activation/stop, and the sent
instance hvparams/beparams will have the final values (noded code
doesn't know about defaults).

LU code will need to self-call the transformation, if needed.

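The core of the FillHV/FillBE transformation is a simple dictionary
merge, sketched below; the helper name ``fill_dict`` and the sample
parameter values are illustrative only:

```python
def fill_dict(defaults, custom):
    """Return a copy of `defaults` overridden by `custom`.

    Sketch of the Cluster.FillHV/FillBE semantics: only non-default
    parameters are stored on the instance, so the 'filled' view is
    the cluster-level defaults updated with the instance's own
    overrides.
    """
    filled = dict(defaults)
    filled.update(custom)
    return filled

# Illustrative data only; names mimic the document's examples.
cluster_hvparams = {
    "xen-pvm": {"kernel_path": "/boot/vmlinuz-2.6-xenU",
                "initrd_path": ""},
}
instance_hvparams = {"kernel_path": "/boot/custom-kernel"}

# What noded would receive: final values, no knowledge of defaults.
filled = fill_dict(cluster_hvparams["xen-pvm"], instance_hvparams)
```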
Opcode changes
++++++++++++++

The parameter changes will have an impact on the OpCodes, especially
on the following ones:

- ``OpCreateInstance``, where the new hv and be parameters will be
  sent as dictionaries; note that all hv and be parameters are now
  optional, as the values can instead be taken from the cluster
- ``OpQueryInstances``, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
  dictionary, and ``hvparams``, respectively ``beparams``, for the
  whole dictionaries
- ``OpModifyInstance``, where the modified parameters are sent as
  dictionaries

Additionally, we will need new OpCodes to modify the cluster-level
defaults for the be/hv sets of parameters.


Caveats
+++++++

One problem that might appear is that our classification is not
complete or not good enough, and we'll need to change this model. As
a last resort, we would need to roll back and keep the 1.2 style.

Another problem is that the classification of some parameters is
unclear (e.g. ``network_port``: is this BE or HV?); in this case
we'll take the risk of having to move parameters between classes
later.

Security
++++++++

The only security issue that we foresee is if some new parameters
will have sensitive values. If so, we will need a way to export the
config data while purging the sensitive values.

E.g. for the DRBD shared secrets, we could export these with the
values replaced by an empty string.


Node flags
~~~~~~~~~~

Ganeti 2.0 adds three node flags that change the way nodes are
handled within Ganeti and the related infrastructure (iallocator
interaction, RAPI data export).

*master candidate* flag
+++++++++++++++++++++++

Ganeti 2.0 allows more scalability in operation by introducing
parallelization. However, a new bottleneck is reached: the
synchronization and replication of cluster configuration to all nodes
in the cluster.

This breaks scalability, as the speed of the replication decreases
roughly with the number of nodes in the cluster. The goal of the
master candidate flag is to change this O(n) into O(1) with respect
to job and configuration data propagation.

Only nodes having this flag set (let's call this set of nodes the
*candidate pool*) will have jobs and configuration data replicated.

The cluster will have a new parameter (runtime changeable) called
``candidate_pool_size`` which represents the number of candidates the
cluster tries to maintain (preferably automatically).

This will impact the cluster operations as follows:

- jobs and config data will be replicated only to a fixed set of
  nodes
- master fail-over will only be possible to a node in the candidate
  pool
- cluster verify needs changing to account for these two roles
- external scripts will no longer have access to the configuration
  file (this is not recommended anyway)


The caveats of this change are:

- if all candidates are lost (completely), the cluster configuration
  is lost (but it should be backed up externally to the cluster
  anyway)

- failed nodes which are candidates must be dealt with properly, so
  that we don't lose too many candidates at the same time; this will
  be reported in cluster verify

- the 'all equal' concept of Ganeti is no longer true

- the partial distribution of config data means that all nodes will
  have to revert to ssconf files for master info (as in 1.2)

Advantages:

- speed on a 100+ node simulated cluster is greatly enhanced, even
  for a simple operation; ``gnt-instance remove`` on a diskless
  instance goes from ~9 seconds to ~2 seconds

- node failure of non-candidates will have less impact on the cluster

The default value for the candidate pool size will be set to 10, but
this can be changed at cluster creation and modified any time later.

Testing on simulated big clusters with sequential and parallel jobs
shows that this value (10) is a sweet spot from a performance and
load point of view.
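The effect of the flag on replication can be sketched as follows;
node objects are modeled as plain dicts and the helper name is
invented for illustration:

```python
def replication_targets(nodes, candidate_pool_size):
    """Pick the nodes that receive job/config replication.

    Sketch only: replication goes to the master candidates, whose
    number the cluster tries to keep at candidate_pool_size (default
    10), instead of to every node -- making propagation cost O(1)
    with respect to the total cluster size.
    """
    candidates = [n for n in nodes if n.get("master_candidate")]
    return candidates[:candidate_pool_size]
```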

*offline* flag
++++++++++++++

In order to better support the situation in which nodes are offline
(e.g. for repair) without altering the cluster configuration, Ganeti
needs to be told about, and properly handle, this state for nodes.

This will result in simpler procedures, and fewer mistakes, when the
amount of node failures is high on an absolute scale (either due to a
high failure rate or simply big clusters).

Nodes having this attribute set will not be contacted for inter-node
RPC calls, will not be master candidates, and will not be able to
host instances as primaries.

Setting this attribute on a node:

- will not be allowed if the node is the master
- will not be allowed if the node has primary instances
- will cause the node to be demoted from the master candidate role
  (if it was), possibly causing another node to be promoted to that
  role

This attribute will impact the cluster operations as follows:

- querying these nodes for anything will fail instantly in the RPC
  library, with a specific RPC error (RpcResult.offline == True)

- they will be listed in the Other section of cluster verify

The code is changed in the following ways:

- RPC calls will be converted to skip such nodes:

  - RpcRunner-instance-based RPC calls are easy to convert

  - static/classmethod RPC calls are harder to convert, and were left
    alone

- the RPC results were unified so that this new result state
  (offline) can be differentiated

- master voting still queries nodes in repair, as we need to ensure
  consistency in case the (wrong) masters have old data, and nodes
  have come back from repairs

Caveats:

- some operation semantics are less clear (e.g. what to do on
  instance start with an offline secondary?); for now, these will
  just fail as if the flag is not set (but faster)
- a 2-node cluster with one node offline needs manual startup of the
  master with a special flag to skip voting (as the master can't get
  a quorum there)

One of the advantages of implementing this flag is that it will allow
future automation tools to automatically put the node in repairs and
recover from this state, and the code (should/will) handle this much
better than just timing out. So, possible future improvements (for
later versions):

- the watcher will detect nodes which fail RPC calls, will attempt to
  ssh to them, and on failure will put them offline
- the watcher will try to ssh to and query the offline nodes, and if
  successful will take them off the repair list

Alternatives considered: The RPC call model in 2.0 is, by default,
much nicer - errors are logged in the background, and job/opcode
execution is clearer, so we could simply not introduce this. However,
having this state will make both the codepaths clearer (offline
vs. temporary failure) and the operational model (it's not a node
with errors, but an offline node).
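The "fail instantly, without network traffic" behaviour can be
sketched like this; ``RpcResult`` mirrors the attribute named in the
text, while ``call_node`` and its payload handling are invented for
illustration:

```python
class RpcResult:
    """Sketch of a unified RPC result that can flag offline nodes."""
    def __init__(self, data=None, offline=False):
        self.data = data
        self.offline = offline
        self.failed = offline  # an offline node counts as a failure

def call_node(node, payload):
    """Skip offline nodes entirely: return a failed result instantly,
    without attempting any network connection."""
    if node.get("offline"):
        return RpcResult(offline=True)
    # A real implementation would contact the node daemon here; we
    # just echo the payload back to keep the sketch self-contained.
    return RpcResult(data={"echo": payload})
```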


*drained* flag
++++++++++++++

Due to the parallel execution of jobs in Ganeti 2.0, we could have
the following situation:

- gnt-node migrate + failover is run
- gnt-node evacuate is run, which schedules a long-running 6-opcode
  job for the node
- partway through, a new job comes in that runs an iallocator script,
  which finds the above node empty and a very good candidate
- gnt-node evacuate has finished, but now it has to be run again, to
  clean up the above instance(s)

In order to prevent this situation, and to be able to get nodes into
proper offline status easily, a new *drained* flag was added to the
nodes.

This flag (which actually means "is being, or was, drained, and is
expected to go offline") will prevent allocations on the node, but
otherwise all other operations (start/stop instance, query, etc.)
work without any restrictions.

Interaction between flags
+++++++++++++++++++++++++

While these flags are implemented as separate flags, they are
mutually exclusive and act, together with the master node role, as a
single *node status* value. In other words, a node is only in one of
these roles at a given time. The lack of any of these flags denotes a
regular node.

The current node status is visible in the ``gnt-cluster verify``
output, and the individual flags can be examined via separate flags
in the ``gnt-node list`` output.

These new flags will be exported in both the iallocator input message
and via RAPI; see the respective man pages for the exact names.


Feature changes
---------------

The main feature-level changes will be:

- a number of disk-related changes
- removal of the fixed two-disk, one-NIC per instance limitation


Disk handling changes
~~~~~~~~~~~~~~~~~~~~~

The storage options available in Ganeti 1.x were introduced based on
then-current software (first DRBD 0.7, then later DRBD 8) and the
estimated usage patterns. However, experience has later shown that
some assumptions made initially are not true and that more
flexibility is needed.

One main assumption made was that disk failures should be treated as
'rare' events, and that each of them needs to be manually handled in
order to ensure data safety; however, both these assumptions are
false:

- disk failures can be a common occurrence, based on usage patterns
  or cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery

Note that we still don't have fully-automated disk recovery as a
goal, but our goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), such that removing the support for the ``remote_raid1``
  template and focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a
  cluster where the DRBD namespace is controlled by Ganeti; switching
  to a static assignment (done at either instance creation time or
  change secondary time) will change the disk activation time from
  O(n) to O(1), which on big clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing
  file-based storage

Additionally, a number of smaller enhancements are also planned:

- support a variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base
design changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical disks

- add support for DRBD8 authentication at handshake time in order to
  ensure each device connects to the correct peer

- remove the restriction of failover only to the secondary, which
  creates very strict rules on cluster allocation

DRBD minor allocation
+++++++++++++++++++++

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if it finds
one that looks similar to our parameters and is already in the
desired state or not. Since this needs external commands to be run,
it is very slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration), a free minor number is computed from the list of
devices that should exist on that node and assigned to that device.

At device activation, if the minor is already in use, we check if it
has our parameters; if not, we just destroy the device (if possible,
otherwise we abort) and start it with our own parameters.

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created DRBD minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.
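The static computation of a free minor can be sketched in a few
lines; the helper name is invented, and the real code works from the
node's configured device list rather than a bare list of integers:

```python
def find_free_minor(used_minors):
    """Return the first free DRBD minor number.

    `used_minors` is the set of minors the configuration already
    assigns on this node. No device scanning or external commands
    are needed: the answer comes purely from the configuration,
    which is what turns activation from O(n) into O(1).
    """
    used = set(used_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor
```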

Removal of obsolete device types (MD, DRBD7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
DRBD7 has bad failure modes in case of dual failures (both network
and disk): it cannot propagate the error up the device stack and
instead just panics. Second, due to the asymmetry between primary and
secondary in MD+DRBD mode, we cannot do live failover (not even if we
had MD+DRBD8).

File-based storage support
++++++++++++++++++++++++++

Using files instead of logical volumes for instance storage would
allow us to get rid of the hard requirement for volume groups for
testing clusters, and it would also allow usage of SAN storage to do
live failover, taking advantage of this storage solution.


Better LVM allocation
+++++++++++++++++++++

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the
volume in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the
available physical disks; however, it introduces the problem that an
instance could end up with its (currently) two drives on two physical
disks, or (worse) that the data and metadata for a DRBD device end up
on different drives.

This is bad because it causes unneeded ``replace-disks`` operations
in case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate all the storage of one instance as
closely as possible. We will still allow the logical volumes to spill
over to additional disks as needed.

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk
time, or when replacing individual disks, it's not easy enough to
compute the current disk map, so we'll not attempt the clustering.


DRBD8 peer authentication at handshake
++++++++++++++++++++++++++++++++++++++

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this more to prevent connecting to the wrong
peer than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.


LVM self-repair (optional)
++++++++++++++++++++++++++

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing``, using the following
method:

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that can be
         used for monitoring
      #. [FUTURE] run ``replace-disks`` for all instances affected
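The decision steps above can be sketched as a small state machine;
the function name and return values are invented for illustration:

```python
def self_repair_action(prev_all_consistent, inconsistent_nodes):
    """Decide the automated repair step, following the list above.

    `prev_all_consistent` is the saved status from the previous
    check; `inconsistent_nodes` lists the nodes whose volume groups
    are currently inconsistent. Returns "nothing",
    "save-and-restart", or ("vgreduce", node).
    """
    now_all_consistent = not inconsistent_nodes
    if now_all_consistent:
        # Consistent now; if we were inconsistent before, save the
        # new status and restart tracking.
        return "nothing" if prev_all_consistent else "save-and-restart"
    if not prev_all_consistent:
        return "nothing"  # still broken, already noted last time
    if len(inconsistent_nodes) > 1:
        return "nothing"  # too risky to automate
    # Exactly one newly inconsistent node: safe to run
    # ``vgreduce --removemissing`` there (and log the occurrence).
    return ("vgreduce", inconsistent_nodes[0])
```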


Failover to any node
++++++++++++++++++++

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets
  reduced to the need to reserve memory anywhere on the cluster

- the need to first failover and then replace the secondary for an
  instance: with failover-to-any, we can directly failover to another
  node, which also does the replace disks in the same step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondaries by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could
  be possible (in theory, but this is not a design goal for 2.0)

The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary)
  has valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no
  data, it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to the
  primary role (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove the old data
  on the old node that has not been chosen for S2

Caveats: during the P2-S2 sync, a (non-transient) network error will
cause I/O errors on the instance, so (if a longer instance downtime
is acceptable) we can postpone the restart of the instance until the
resync is done. However, disk I/O errors on S2 will cause data loss,
since we don't have a good copy of the data anymore, so in this case
waiting for the sync to complete is not an option. As such, it is
recommended that this feature be used only in conjunction with proper
disk monitoring.


Live migration note: While failover-to-any is possible for all
choices of S2, migration-to-any is possible only if we keep P1 as S2.
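The S2 constraint above can be captured in a few lines; the helper
name and return convention are invented for illustration:

```python
def choose_new_secondary(p1, s1, failed_node):
    """Constrain the choice of S2 as described above.

    The failed node cannot keep a disk role, so the surviving one of
    P1/S1 must become S2. Returns (s2, live_migration_possible);
    note that for a failed S1, live migration is only possible in
    theory and is not a 2.0 design goal.
    """
    if failed_node == p1:
        return s1, False  # P1's data is gone; must sync from S1
    if failed_node == s1:
        return p1, True   # P1 survives with good data
    raise ValueError("failed node is neither P1 nor S1")
```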
|
1376 |
Caveats |
1377 |
+++++++ |
1378 |
|
1379 |
The dynamic device model, while more complex, has an advantage: it |
1380 |
will not reuse by mistake the DRBD device of another instance, since |
1381 |
it always looks for either our own or a free one. |
1382 |
|
1383 |
The static one, in contrast, will assume that given a minor number N, |
1384 |
it's ours and we can take over. This needs careful implementation such |
1385 |
that if the minor is in use, either we are able to cleanly shut it |
1386 |
down, or we abort the startup. Otherwise, it could be that we start |
1387 |
syncing between two instance's disks, causing data loss. |
1388 |
|
1389 |
|
Variable number of disk/NICs per instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Variable number of disks
++++++++++++++++++++++++

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need a fully flexible disk
definition. This has less impact than it might seem at first sight:
only the instance creation has a hard-coded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code already work with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces do not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk from read-only to
read-write afterward.

Variable number of NICs
+++++++++++++++++++++++

Similar to the disk change, we need to allow multiple network
interfaces per instance. This will affect the internal code (some
functions will have to stop assuming that ``instance.nics`` is a list
of length one), the OS API, which currently can export/import only one
instance, and the command line interface.

Interface changes
-----------------

There are two areas of interface changes: API-level changes (the OS
interface and the RAPI interface) and the command line interface
changes.

OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2. The
interface is composed of a series of scripts which get called with certain
parameters to perform OS-dependent operations on the cluster. The current
scripts are:

create
    called when a new instance is added to the cluster
export
    called to export an instance disk to a stream
import
    called to import from a stream to a new instance
rename
    called to perform the OS-specific operations necessary for renaming an
    instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example, they accept exactly one block and one swap device to operate on,
rather than any number of generic block devices; they blindly assume that an
instance will have just one network interface; and they cannot be configured
to optimise the instance for a particular hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors and a non-fixed
number of network interfaces and disks, the OS interface needs to change to
transmit the appropriate amount of information about an instance to its
managing operating system when operating on it. Moreover, since some old
assumptions commonly made in OS scripts are no longer valid, we need to
re-establish a common understanding of what can and cannot be assumed
about the Ganeti environment.

When designing the new OS API our priorities are:

- ease of use
- future extensibility
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to support
an OS, and make it easy to share code between them by making their input
uniform. We will also leave the current script structure unchanged, as far as
we can, and make a few of the scripts (import, export and rename) optional.
Most information will be passed to the scripts through environment variables,
for ease of access, and so that each script can easily use only the
information it needs.

The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs to
support the following functionality, through scripts:

create:
    used to create a new instance running that OS. This script should prepare
    the block devices, and install them so that the new OS can boot under the
    specified hypervisor.
export (optional):
    used to export an installed instance using the given OS to a format which
    can be used to import it back into a new instance.
import (optional):
    used to import an exported instance into a new one. This script is similar
    to create, but the new instance should have the content of the export,
    rather than contain a pristine installation.
rename (optional):
    used to perform the internal OS-specific operations needed to rename an
    instance.

If any optional script is not implemented, Ganeti will refuse to perform the
given operation on instances using the non-implementing OS. Of course the
create script is mandatory, and it doesn't make sense to support either the
export or the import operation but not both.

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for 1.2 and
the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in 2.0 we'll
  use environment variables, as there will be a lot more information and not
  all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device the
  instance has, and import scripts once for every exported disk. Imported
  instances will be forced to have a number of disks greater than or equal to
  that of the export.
- Some scripts are not compulsory: if such a script is missing the relevant
  operations will be forbidden for instances of that OS. This makes it easier
  to distinguish between unsupported operations and no-op ones (if any).

Input
_____

Rather than using command line flags, as they do now, scripts will accept
inputs from environment variables. We expect the following input values:

OS_API_VERSION
    The version of the OS API that the following parameters comply with;
    this is used so that in the future we could have OSes supporting
    multiple versions and thus Ganeti can send the proper version in this
    parameter
INSTANCE_NAME
    Name of the instance acted on
HYPERVISOR
    The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm',
    'kvm')
DISK_COUNT
    The number of disks this instance will have
NIC_COUNT
    The number of NICs this instance will have
DISK_<N>_PATH
    Path to the Nth disk.
DISK_<N>_ACCESS
    W if read/write, R if read only. OS scripts are not supposed to touch
    read-only disks, but they will be passed them so that they know about
    them.
DISK_<N>_FRONTEND_TYPE
    Type of the disk as seen by the instance. Can be 'scsi', 'ide', 'virtio'
DISK_<N>_BACKEND_TYPE
    Type of the disk as seen from the node. Can be 'block', 'file:loop' or
    'file:blktap'
NIC_<N>_MAC
    MAC address of the Nth network interface
NIC_<N>_IP
    IP address of the Nth network interface, if available
NIC_<N>_BRIDGE
    Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
    Type of the Nth NIC as seen by the instance. For example 'virtio',
    'rtl8139', etc.
DEBUG_LEVEL
    Whether more output should be produced, for debugging purposes. Currently
    the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
``ganeti-os-api`` man page. All these variables will be available to
all scripts.
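
As an illustration, an OS script written in Python could collect its
per-disk and per-NIC parameters like this (a minimal sketch, assuming
the variables listed above are present in the environment; the helper
name is ours, not part of the OS API):

```python
import os

def read_devices(environ=os.environ):
    """Collect the disk and NIC parameters passed by Ganeti via the
    environment, as described above."""
    disks = []
    for n in range(int(environ["DISK_COUNT"])):
        disks.append({
            "path": environ["DISK_%d_PATH" % n],
            # 'W' for read/write, 'R' for read only; scripts must not
            # touch read-only disks.
            "access": environ["DISK_%d_ACCESS" % n],
        })
    nics = []
    for n in range(int(environ["NIC_COUNT"])):
        nics.append({
            "mac": environ["NIC_%d_MAC" % n],
            # The IP is only set if available, so use .get() for it.
            "ip": environ.get("NIC_%d_IP" % n),
            "bridge": environ["NIC_%d_BRIDGE" % n],
        })
    return disks, nics
```
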

Some scripts will need a bit more information to work. These will have
per-script variables, such as for example:

OLD_INSTANCE_NAME
    rename: the name the instance should be renamed from.
EXPORT_DEVICE
    export: device to be exported, a snapshot of the actual device. The data
    must be exported to stdout.
EXPORT_INDEX
    export: sequential number of the instance device targeted.
IMPORT_DEVICE
    import: device to send the data to, part of the new instance. The data
    must be imported from stdin.
IMPORT_INDEX
    import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance name is
always needed, and we could pass it on the command line. On the other hand,
though, this would force scripts to both access the environment and parse the
command line, so we'll move it to the environment for uniformity.)


Output/Behaviour
________________

As discussed, scripts should only send user-targeted information to stderr.
The create and import scripts are supposed to format/initialise the given
block devices and install the correct instance data. The export script is
supposed to export instance data to stdout in a format understandable by the
import script. The data will be compressed by Ganeti, so no compression
should be done. The rename script should only modify the instance's knowledge
of what its name is.

Other declarative style features
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing a list of numbers matching the
version(s) of the API they implement. Ganeti itself will always be
compatible with one version of the API and may maintain backwards
compatibility if it's feasible to do so. The numbers are one per line,
so an OS supporting both version 5 and version 20 will have a file
containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

In addition, an OS will be able to declare that it supports only a
subset of the Ganeti hypervisors, by listing them in the 'hypervisors'
file.
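
A sketch of how such a file could be parsed and checked for a
compatible version (the function names are illustrative, not the
actual implementation):

```python
def read_api_versions(text):
    """Parse the contents of a 'ganeti_api_version' file: one version
    number per line, blank lines ignored."""
    return [int(line) for line in text.splitlines() if line.strip()]

def os_is_compatible(text, ganeti_version):
    # Ganeti itself is compatible with one API version; the OS is
    # usable if it declares that version among the ones it supports.
    return ganeti_version in read_api_versions(text)
```
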

Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just dumps all
disks and restores them. This can save work as most systems will just do this,
while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there will be
enough space to store the information we need. If we discover that this is not
the case we may want to go to a more complex API, such as storing that
information on the filesystem and providing the OS script with the path to a
file where it is encoded in some format.

Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti remote API (RAPI) was designed and deployed with the
Ganeti 1.2.5 release. That version provided read-only access to the
cluster state. A fully functional read-write API demands significant
internal changes, which will be implemented in version 2.0.

We decided to implement the Ganeti RAPI in a RESTful way, which is
aligned with the key features we are looking for: it is a simple,
stateless, scalable and extensible paradigm of API implementation. As
transport it uses HTTP over SSL, and we are implementing it with JSON
encoding, but in a way that makes it possible to extend and provide any
other one.

Design
++++++

The Ganeti RAPI is implemented as an independent daemon, running on the
same node and with the same permission level as the Ganeti master
daemon. Communication is done through the LUXI library to the master
daemon. In order to keep communication asynchronous, RAPI processes two
types of client requests:

- queries: the server is able to answer immediately
- job submission: some time is required for a useful response

In the query case, the requested data is sent back to the client in the
HTTP response body. Typical examples of queries would be: list of nodes,
instances, cluster info, etc.

In the case of job submission, the client receives a job ID, an
identifier which allows it to query the job's progress in the job queue
(see `Job Queue`_).

Internally, each exported object has a version identifier, which is
used as a state identifier in the HTTP ``ETag`` header for
requests/responses to avoid race conditions.

Resource representation
+++++++++++++++++++++++

The key difference between using REST and other APIs is that REST
requires separation of services via resources with unique URIs. Each
of them should have a limited amount of state and support the standard
HTTP methods: GET, POST, DELETE, PUT.

For example, in Ganeti's case we can have a set of URIs:

- ``/{clustername}/instances``
- ``/{clustername}/instances/{instancename}``
- ``/{clustername}/instances/{instancename}/tag``
- ``/{clustername}/tag``

A GET request to ``/{clustername}/instances`` will return the list of
instances, a POST to ``/{clustername}/instances`` should create a new
instance, a DELETE of ``/{clustername}/instances/{instancename}`` should
delete the instance, and a GET of ``/{clustername}/tag`` should return the
cluster tags.
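
A small sketch of how a client might build these resource URIs (the
helper below is illustrative; the version prefix mentioned next is
omitted because it is not yet defined):

```python
def resource_uri(cluster, instance=None, tag=False):
    """Build a RAPI resource URI following the scheme sketched above."""
    parts = ["", cluster]
    if instance is not None:
        parts += ["instances", instance]
    if tag:
        # Tags hang either off the cluster or off a specific instance.
        parts.append("tag")
    elif instance is None:
        # With neither instance nor tag, address the instance list.
        parts.append("instances")
    return "/".join(parts)
```
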

Each resource URI will have a version prefix. The resource IDs are to
be determined.

The internal encoding might be JSON, XML, or any other. The JSON
encoding fits the Ganeti RAPI needs nicely. The client can request a
specific representation via the ``Accept`` field in the HTTP header.

REST uses HTTP as its transport and application protocol for resource
access. The set of possible responses is a subset of the standard HTTP
responses.

The statelessness model provides additional reliability and
transparency to operations (e.g. only one request needs to be analyzed
to understand the in-progress operation, not a sequence of multiple
requests/responses).

Security
++++++++

With the write functionality, security becomes a much bigger issue.
The Ganeti RAPI uses basic HTTP authentication on top of an
SSL-secured connection to grant access to an exported resource. The
password is stored locally in an Apache-style ``.htpasswd`` file. Only
one level of privileges is supported.

Caveats
+++++++

The model detailed above for job submission requires the client to
poll periodically for updates to the job; an alternative would be to
allow the client to request a callback, or a 'wait for updates' call.

The callback model was not considered due to the following two issues:

- callbacks would require a new model of allowed callback URLs,
  together with a method of managing these
- callbacks only work when the client and the master are in the same
  security domain, and they fail in the other cases (e.g. when there is
  a firewall between the client and the RAPI daemon that only allows
  client-to-RAPI calls, which is usual in DMZ cases)

The 'wait for updates' method is not suited to the HTTP protocol,
where requests are supposed to be short-lived.
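
The polling model described above can be sketched as follows (the
``get_job_status`` callable stands in for a real RAPI job query and is
purely illustrative, as are the status strings):

```python
import time

def wait_for_job(get_job_status, job_id, interval=1.0, max_polls=100):
    """Poll a job until it finishes, as the RAPI job-submission model
    requires. Returns the final status string."""
    for _ in range(max_polls):
        status = get_job_status(job_id)
        # Anything past 'running' is considered final here; the real
        # job states are defined by the job queue design.
        if status in ("success", "error"):
            return status
        time.sleep(interval)
    raise TimeoutError("job %s did not finish in time" % job_id)
```
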

Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled:

- extend and modify command line syntax to support new features
- ensure consistent patterns in command line arguments to reduce
  cognitive load

The design changes that require these changes are, in no particular
order:

- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance,
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

As such, there are several areas of Ganeti where the command line
arguments will change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.

Notes about device removal/addition
+++++++++++++++++++++++++++++++++++

To avoid problems with device location changes (e.g. the second network
interface of the instance becoming the first or third and the like),
the list of network/disk devices is treated as a stack, i.e. devices
can only be added/removed at the end of the list of devices of each
class (disk or network) for each instance.
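
This stack discipline is simple enough to model directly (a sketch
only; the real configuration objects are more involved):

```python
def add_device(devices, dev):
    """Devices can only be appended at the end of the list."""
    devices.append(dev)
    return len(devices) - 1  # the index the new device gets

def remove_device(devices):
    """Only the last device of a class can be removed."""
    if not devices:
        raise ValueError("no device to remove")
    return devices.pop()
```
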

gnt-instance commands
+++++++++++++++++++++

The commands for gnt-instance will be modified and extended to allow
for the new functionality:

- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is::

  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
  to. Accepts either a valid bridge name (the specified bridge must
  exist on the node(s)) as string, or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.
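
A sketch of how such an option string might be parsed (illustrative
only; the real parsing lives in Ganeti's command line layer):

```python
def parse_device_option(value):
    """Parse '--net'/'--disk' values of the form
    '$DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]' into (devnum, options)."""
    if ":" in value:
        devnum, rest = value.split(":", 1)
        options = dict(pair.split("=", 1) for pair in rest.split(","))
    else:
        devnum, options = value, {}
    # 'add' and 'remove' are the magic strings used by
    # 'gnt-instance modify' instead of a device number.
    if devnum not in ("add", "remove"):
        devnum = int(devnum)
    return devnum, options
```
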

Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is::

  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude suffix
  (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
  which case the default disk size will be used. If the size option is
  not specified, 'auto' is assumed. This option is not valid for all
  disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - *w*: read/write access to the disk device, or
  - *r*: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.
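
The size values described above could be normalized to mebibytes like
this (a sketch, under the assumption that a bare number already means
mebibytes, as stated for the size option):

```python
def parse_disk_size(value):
    """Turn a size spec ('512', '10G', '256M', 'auto') into mebibytes,
    or None for 'auto' (meaning: use the default disk size)."""
    if value == "auto":
        return None
    suffixes = {"M": 1, "G": 1024}
    if value[-1].upper() in suffixes:
        return int(value[:-1]) * suffixes[value[-1].upper()]
    return int(value)  # a bare number is already in mebibytes
```
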

Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.

Example::

  gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance
modify. The same device specific options as for adding devices are
used. Instead of a device number and further device options, only the
magic string remove is specified. It will always remove the last
device in the list of devices of this type for the specified instance,
e.g. if the instance has disk devices 0, 1, 2 and 3, disk device
number 3 will be removed.

Example::

  gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.

Example::

  gnt-instance modify --disk 2:access=r

Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the ``--hypervisor`` option. The generic syntax of the hypervisor
option is as follows::

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm

:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string

The hypervisor option for an instance can be set at instance creation
time via the ``gnt-instance add`` command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.

Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the ``--hypervisor`` option of the ``gnt-instance modify``
command. However, the hypervisor type of an existing instance cannot
be changed, only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.

Example::

  gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

gnt-cluster commands
++++++++++++++++++++

The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:

- The init command will be extended to support the defaults option to
  set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
++++++++++++++++

The generic format of the cluster default setting option is::

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Currently, the following cluster default options are defined (open to
further changes):

:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disksize for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of megabyte is assumed, or a positive number followed by a supported
  magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).

Hypervisor cluster defaults
+++++++++++++++++++++++++++

The generic format of the hypervisor cluster wide default setting
option is::

  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
  to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Glossary
========

Since this document is only a delta from Ganeti 1.2, there are
some unexplained terms. Here is a non-exhaustive list.

.. _HVM:

HVM
  hardware virtualization mode, where the virtual machine is oblivious
  to the fact that it's being virtualized and all the hardware is
  emulated

.. _LU:

LogicalUnit
  the code associated with an OpCode, i.e. the code that implements the
  startup of an instance

.. _opcode:

OpCode
  a data structure encapsulating a basic cluster operation; for example,
  start instance, add instance, etc.

.. _PVM:

PVM
  para-virtualization mode, where the virtual machine knows it's being
  virtualized and as such there is no need for hardware emulation

.. _watcher:

watcher
  ``ganeti-watcher`` is a tool that should be run regularly from cron
  and takes care of restarting failed instances, restarting secondary
  DRBD devices, etc. For more details, see the man page
  ``ganeti-watcher(8)``.