=================
Ganeti 2.0 design
=================

This document describes the major changes in Ganeti 2.0 compared to
the 1.2 version.

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents:: :depth: 3

Objective
=========

Ganeti 1.2 has many scalability issues and restrictions due to its
roots as software for managing small and 'static' clusters.

Version 2.0 will attempt to remedy first the scalability issues and
then the restrictions.

Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

- only one operation at a time on the cluster [#]_
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical
design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
   and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible code base for future developments.

Among these problems, the single-operation-at-a-time restriction is
the biggest issue with the current version of Ganeti. It is such a big
impediment in operating bigger clusters that many times one is tempted
to remove the lock just to do a simple operation like start instance
while an OS installation is running.
50 |
|
51 |
Scalability problems |
52 |
-------------------- |
53 |
|
54 |
Ganeti 1.2 has a single global lock, which is used for all cluster |
55 |
operations. This has been painful at various times, for example: |
56 |
|
57 |
- It is impossible for two people to efficiently interact with a cluster |
58 |
(for example for debugging) at the same time. |
59 |
- When batch jobs are running it's impossible to do other work (for |
60 |
example failovers/fixes) on a cluster. |
61 |
|
62 |
This poses scalability problems: as clusters grow in node and instance |
63 |
size it's a lot more likely that operations which one could conceive |
64 |
should run in parallel (for example because they happen on different |
65 |
nodes) are actually stalling each other while waiting for the global |
66 |
lock, without a real reason for that to happen. |
67 |
|
68 |
One of the main causes of this global lock (beside the higher |
69 |
difficulty of ensuring data consistency in a more granular lock model) |
70 |
is the fact that currently there is no long-lived process in Ganeti |
71 |
that can coordinate multiple operations. Each command tries to acquire |
72 |
the so called *cmd* lock and when it succeeds, it takes complete |
73 |
ownership of the cluster configuration and state. |
74 |
|
75 |
Other scalability problems are due the design of the DRBD device |
76 |
model, which assumed at its creation a low (one to four) number of |
77 |
instances per node, which is no longer true with today's hardware. |
78 |
|
79 |
Artificial restrictions |
80 |
----------------------- |
81 |
|
82 |
Ganeti 1.2 (and previous versions) have a fixed two-disks, one-NIC per |
83 |
instance model. This is a purely artificial restrictions, but it |
84 |
touches multiple areas (configuration, import/export, command line) |
85 |
that it's more fitted to a major release than a minor one. |
86 |
|
87 |
Architecture issues |
88 |
------------------- |
89 |
|
90 |
The fact that each command is a separate process that reads the |
91 |
cluster state, executes the command, and saves the new state is also |
92 |
an issue on big clusters where the configuration data for the cluster |
93 |
begins to be non-trivial in size. |
94 |
|
95 |
Overview |
96 |
======== |
97 |
|
98 |
In order to solve the scalability problems, a rewrite of the core |
99 |
design of Ganeti is required. While the cluster operations themselves |
100 |
won't change (e.g. start instance will do the same things, the way |
101 |
these operations are scheduled internally will change radically. |
102 |
|
103 |
The new design will change the cluster architecture to: |
104 |
|
105 |
.. image:: arch-2.0.png |
106 |
|
107 |
This differs from the 1.2 architecture by the addition of the master |
108 |
daemon, which will be the only entity to talk to the node daemons. |
109 |
|
110 |
|
111 |
Detailed design |
112 |
=============== |
113 |
|
114 |
The changes for 2.0 can be split into roughly three areas: |
115 |
|
116 |
- core changes that affect the design of the software |
117 |
- features (or restriction removals) but which do not have a wide |
118 |
impact on the design |
119 |
- user-level and API-level changes which translate into differences for |
120 |
the operation of the cluster |
121 |
|
122 |
Core changes |
123 |
------------ |
124 |
|
125 |
The main changes will be switching from a per-process model to a |
126 |
daemon based model, where the individual gnt-* commands will be |
127 |
clients that talk to this daemon (see `Master daemon`_). This will |
128 |
allow us to get rid of the global cluster lock for most operations, |
129 |
having instead a per-object lock (see `Granular locking`_). Also, the |
130 |
daemon will be able to queue jobs, and this will allow the individual |
131 |
clients to submit jobs without waiting for them to finish, and also |
132 |
see the result of old requests (see `Job Queue`_). |
133 |
|
134 |
Beside these major changes, another 'core' change but that will not be |
135 |
as visible to the users will be changing the model of object attribute |
136 |
storage, and separate that into name spaces (such that an Xen PVM |
137 |
instance will not have the Xen HVM parameters). This will allow future |
138 |
flexibility in defining additional parameters. For more details see |
139 |
`Object parameters`_. |
140 |
|
141 |
The various changes brought in by the master daemon model and the |
142 |
read-write RAPI will require changes to the cluster security; we move |
143 |
away from Twisted and use HTTP(s) for intra- and extra-cluster |
144 |
communications. For more details, see the security document in the |
145 |
doc/ directory. |
146 |
|
147 |
Master daemon |
148 |
~~~~~~~~~~~~~ |
149 |
|
150 |
In Ganeti 2.0, we will have the following *entities*: |
151 |
|
152 |
- the master daemon (on the master node) |
153 |
- the node daemon (on all nodes) |
154 |
- the command line tools (on the master node) |
155 |
- the RAPI daemon (on the master node) |
156 |
|
157 |
The master-daemon related interaction paths are: |
158 |
|
159 |
- (CLI tools/RAPI daemon) and the master daemon, via the so called |
160 |
*LUXI* API |
161 |
- the master daemon and the node daemons, via the node RPC |
162 |
|
163 |
There are also some additional interaction paths for exceptional cases: |
164 |
|
165 |
- CLI tools might access via SSH the nodes (for ``gnt-cluster copyfile`` |
166 |
and ``gnt-cluster command``) |
167 |
- master failover is a special case when a non-master node will SSH |
168 |
and do node-RPC calls to the current master |
169 |
|
170 |
The protocol between the master daemon and the node daemons will be |
171 |
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S), |
172 |
using a simple PUT/GET of JSON-encoded messages. This is done due to |
173 |
difficulties in working with the Twisted framework and its protocols |
174 |
in a multithreaded environment, which we can overcome by using a |
175 |
simpler stack (see the caveats section). |
176 |
|
177 |
The protocol between the CLI/RAPI and the master daemon will be a |
178 |
custom one (called *LUXI*): on a UNIX socket on the master node, with |
179 |
rights restricted by filesystem permissions, the CLI/RAPI will talk to |
180 |
the master daemon using JSON-encoded messages. |
181 |
|
182 |
The operations supported over this internal protocol will be encoded |
183 |
via a python library that will expose a simple API for its |
184 |
users. Internally, the protocol will simply encode all objects in JSON |
185 |
format and decode them on the receiver side. |
186 |
|
187 |
For more details about the RAPI daemon see `Remote API changes`_, and |
188 |
for the node daemon see `Node daemon changes`_. |

.. _luxi:

The LUXI protocol
+++++++++++++++++

As described above, the protocol for making requests or queries to the
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
messages.

The choice of UNIX sockets was made in order to get rid of the need
for authentication and authorisation inside Ganeti; for 2.0, the
permissions on the Unix socket itself will determine the access
rights.

We will have two main classes of operations over this API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
still internally implemented with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized in the query functions)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details of the actual operation list, see the `Job Queue`_.

Both requests and responses will consist of a JSON-encoded message
followed by the ``ETX`` character (ASCII decimal 3), which is not a
valid character in JSON messages and thus can serve as a message
delimiter. The contents of the messages will be a dictionary with two
fields:

:method:
  the name of the method called
:args:
  the arguments to the method, as a list (no keyword arguments allowed)

Responses will follow the same format, with the two fields being:

:success:
  a boolean denoting the success of the operation
:result:
  the actual result, or error message in case of failure

There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method
  of passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
  internally retry the call; this is done in order to differentiate
  internally between a hung master daemon and a job that simply has not
  changed

Users of the API that don't use the provided python library should
take care of the above two cases.
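
Purely as an illustration of the framing and of the two special cases
above (the real client library lives in the Ganeti source tree; the
helper name and error handling here are assumptions, not the final
API)::

  # Illustrative only: a minimal LUXI-style call helper.
  import json
  import socket

  ETX = chr(3)  # ASCII 0x03, the message delimiter

  def luxi_call(sock_path, method, args):
    """Send one request and return the decoded 'result' field."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(sock_path)
    try:
      while True:
        request = {"method": method, "args": args}
        sock.sendall((json.dumps(request) + ETX).encode("utf-8"))
        # read until the ETX delimiter is seen
        data = ""
        while ETX not in data:
          chunk = sock.recv(4096)
          if not chunk:
            raise RuntimeError("Connection closed by the master daemon")
          data += chunk.decode("utf-8")
        response = json.loads(data.split(ETX, 1)[0])
        result = response["result"]
        if not response["success"]:
          # a two-element list is interpreted as (exception type, arguments)
          if isinstance(result, list) and len(result) == 2:
            raise RuntimeError("%s: %s" % (result[0], result[1]))
          raise RuntimeError(result)
        # special case: retry WaitForChange when the job has not changed
        if method == "WaitForChange" and result == "nochange":
          continue
        return result
    finally:
      sock.close()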


Master daemon implementation
++++++++++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other thread pools.

There will be two other classes of threads in the daemon (sketched
below):

- job processing threads, part of a thread pool, and which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  (LUXI) to the clients, and are short-lived
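
A highly simplified sketch of this thread layout (purely illustrative,
not the actual daemon code)::

  # Illustrative sketch of the daemon's thread layout.
  import threading
  try:
    import queue
  except ImportError:        # Python 2.x, as targeted by this design
    import Queue as queue

  job_queue = queue.Queue()

  def job_worker():
    """Long-lived job processing thread, part of the worker pool."""
    while True:
      job = job_queue.get()
      if job is None:        # shutdown sentinel
        break
      job.Run()

  def main_io_loop(pool_size=4):
    """Main I/O thread: starts the pool, then accepts client connections."""
    workers = [threading.Thread(target=job_worker) for _ in range(pool_size)]
    for thread in workers:
      thread.start()
    # ... accept LUXI connections and spawn short-lived client I/O threads,
    # which push submitted jobs onto job_queue ...
    for _ in workers:        # shutdown: one sentinel per worker
      job_queue.put(None)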

Master startup/failover
+++++++++++++++++++++++

In Ganeti 1.x there is no protection against failing over the master
to a node with stale configuration. In effect, the responsibility of
correct failovers falls on the admin. This is true both for the new
master and for the case when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will happen whenever a node is ready to
transition to the master role, either at startup time or at node
failover:

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      highest job ID on the job queue)

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

    - if any of the above is false, we prevent the current operation
      (i.e. we don't become the master)

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

Since exceptional conditions could lead to a situation in which no
node can become the master due to inconsistent data, we will have
an override switch for the master daemon startup that will assume the
current node has the right data and will replicate all the
configuration files to the other nodes.

**Note**: the above algorithm is by no means an election algorithm; it
is a *confirmation* of the master role currently held by a node.
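
A rough sketch of the confirmation step (illustrative only; the helper
names and the exact data gathered per node are assumptions, not the
final implementation)::

  # Illustrative sketch of the master-role confirmation via quorum.
  def confirm_master_role(my_name, node_list, my_serial, my_last_job_id,
                          query_node_fn, failover=False):
    """Return True if a quorum of nodes agrees we may assume the role.

    query_node_fn(node) is assumed to return a dict with the node's view
    of the configuration serial number, highest job id and designated
    master, or None if the node cannot be contacted.
    """
    needed = len(node_list) // 2 + 1     # at least half plus one
    votes = 0
    for node in node_list:
      answer = query_node_fn(node)
      if answer is None:
        continue
      up_to_date = (answer["config_serial"] <= my_serial and
                    answer["last_job_id"] <= my_last_job_id)
      master_ok = failover or answer["master"] == my_name
      if up_to_date and master_ok:
        votes += 1
    return votes >= needed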
318 |
|
319 |
Logging |
320 |
+++++++ |
321 |
|
322 |
The logging system will be switched completely to the standard python |
323 |
logging module; currently it's logging-based, but exposes a different |
324 |
API, which is just overhead. As such, the code will be switched over |
325 |
to standard logging calls, and only the setup will be custom. |
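
A minimal sketch of such a custom setup using only the standard library
(the format string and function name are just an example, not the final
implementation)::

  # Illustrative only: custom setup, standard logging calls everywhere else.
  import logging

  def setup_daemon_logging(logfile, debug=False):
    """Configure the root logger to write to the daemon's single log file."""
    handler = logging.FileHandler(logfile)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s: pid=%(process)d %(levelname)s %(message)s"))
    root = logging.getLogger("")
    root.addHandler(handler)
    if debug:
      root.setLevel(logging.DEBUG)
    else:
      root.setLevel(logging.INFO)

  # afterwards, normal code simply uses the standard calls, e.g.
  #   logging.info("starting instance %s", instance_name)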
326 |
|
327 |
With this change, we will remove the separate debug/info/error logs, |
328 |
and instead have always one logfile per daemon model: |
329 |
|
330 |
- master-daemon.log for the master daemon |
331 |
- node-daemon.log for the node daemon (this is the same as in 1.2) |
332 |
- rapi-daemon.log for the RAPI daemon logs |
333 |
- rapi-access.log, an additional log file for the RAPI that will be |
334 |
in the standard HTTP log format for possible parsing by other tools |

Since the :term:`watcher` will only submit jobs to the master for
startup of the instances, its log file will contain less information
than before, mainly that it will start the instance, but not the
results.

Node daemon changes
+++++++++++++++++++

The only change to the node daemon is that, since we need better
concurrency, we don't process the inter-node RPC calls in the node
daemon itself, but we fork and process each request in a separate
child.

Since we don't have many calls, and we only fork (not exec), the
overhead should be minimal.
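
As an illustration of the intended model (not the actual noded code),
the per-request handling could be structured roughly like this::

  # Illustrative sketch of fork-per-request handling in the node daemon.
  import os
  import signal

  # reap children automatically so finished requests do not become zombies
  signal.signal(signal.SIGCHLD, signal.SIG_IGN)

  def handle_request(request, process_fn):
    """Process one inter-node RPC request in a forked child."""
    pid = os.fork()
    if pid > 0:
      return               # parent: go back to accepting new requests
    # child: the code is already loaded, so no exec is necessary
    try:
      process_fn(request)
      os._exit(0)
    except Exception:
      os._exit(1)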

Caveats
+++++++

A discussed alternative is to keep the current individual processes
touching the cluster configuration model. The reasons we have not
chosen this approach are:

- the time spent reading and unserializing the cluster state
  today is not small enough that we can ignore it; the addition of
  the job queue will make this startup cost even higher. While this
  runtime cost is low, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines of "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized' and written
  in an asynchronous manner, using deferreds; while this method works,
  it's not a common way to code and it requires that the entire process
  workflow is based around a single *reactor* (the Twisted name for a
  main loop)
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that business logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  requirement, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTP(S) protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement workarounds many times for changes
in the Twisted version, so that for example 1.2 is able to use both
Twisted 2.x and 8.x.

In the end, since we already had an HTTP server library for the RAPI,
we just reused that for inter-node communication.


Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti Cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:

- we preserve data coherency
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a Non-Goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti level operations, aka
Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent to their usage
- automatically grabbing multiple locks in the right order (avoiding
  deadlock)
- ability to transparently handle conversion to more granularity
- support for asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a
distributed operation. Therefore, in case of master failure, the
operations currently running will be aborted and the locks will be
lost; it remains to the administrator to clean up (if needed) the
operation result (e.g. make sure an instance is either installed
correctly or removed).

A corollary of this is that a master-failover operation with both
masters alive needs to happen while no operations are running, and
therefore no locks are held.

All the locks will be represented by objects (like
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same
time. Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and will fail.


The Locks
+++++++++

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and
the node locks before the config lock. Locks will need to be acquired at
the same time for multiple instances and nodes, and the internal ordering
will be dealt with within the locking library, which, for simplicity, will
just use alphabetical order.
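
To illustrate the ordering rule only (the ``acquire``/``release`` calls
below are a stand-in for the real ``lockings.SharedLock`` API, which may
differ), acquiring a full set of locks could look like::

  # Illustrative only: take instance locks before node locks before the
  # config-file lock, alphabetically within each level, and release in
  # reverse order.  The lock objects are assumed to expose
  # acquire()/release().
  def acquire_in_order(instance_locks, node_locks, config_lock):
    """Return the acquired locks, in acquisition order."""
    acquired = []
    for name in sorted(instance_locks):
      instance_locks[name].acquire()
      acquired.append(instance_locks[name])
    for name in sorted(node_locks):
      node_locks[name].acquire()
      acquired.append(node_locks[name])
    config_lock.acquire()
    acquired.append(config_lock)
    return acquired

  def release_all(acquired):
    for lock in reversed(acquired):
      lock.release()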
473 |
|
474 |
Each lock has the following three possible statuses: |
475 |
|
476 |
- unlocked (anyone can grab the lock) |
477 |
- shared (anyone can grab/have the lock but only in shared mode) |
478 |
- exclusive (no one else can grab/have the lock) |

Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each time
we split a lock into more granular ones we'll create a "metalock", which
will depend on those sub-locks and live for the time necessary for all
the code to convert (or forever, in some conditions). When a metalock
exists all converted code must acquire it in shared mode, so it can run
concurrently, but still be exclusive with old code, which acquires it
exclusively.

In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.

We might also want to devise more metalocks (eg. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what they need without specifying it explicitly.

In the future things like the node locks could become metalocks, should
we decide to split them into an even more fine-grained approach, but
this will probably be only after the first 2.0 version has been
released.

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be
added to the list. The relevant code will need to inform the locking
library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively already, and thus deals
with metalocks exactly as normal code acquiring those locks. Any
operation queuing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available
locks, rather than making them just blindly queue for acquiring them.
The inherent risk, though, is that any code using the first operation,
or setting a timeout for the second one, is susceptible to starvation
and thus may never be able to get the required locks and complete
certain tasks. Considering this, providing/using these operations should
not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level. In the future we may want to split logical units into
independent "tasklets" with their own locking requirements. A different
design doc (or mini design doc) will cover the move from Logical Units
to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoids possible deadlocks. Of
course extra care must be taken not to leave, if possible, locked
structures in an unusable state. Note that with Python 2.5 a simpler
syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

In order to avoid this extra indentation and code changes everywhere in
the Logical Units code, we decided to allow LUs to declare locks, and
then execute their code with their locks acquired. In the new world LUs
are called like this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation
on how locks are supposed to be declared.

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock
everything" lock to the new system, though, needs to be carefully
scrutinised to be sure it is really acquiring all the necessary locks,
and none has been overlooked or forgotten.

The code can contain other locks outside of this library, to synchronise
other threaded code (eg for the job queue) but in general these should
be leaf locks or carefully structured non-leaf ones, to avoid deadlock
race conditions.


.. _jqueue-original-design:

Job Queue
~~~~~~~~~

Granular locking is not enough to speed up operations, we also need a
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes`` which are the basic
element of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
will generate N opcodes of type replace disks).


Job execution—“Life of a Ganeti job”
++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned to
   the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job
   has to wait and the first worker finishing its work will grab it.
   Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.

.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and we
   do a master failover, the jobs are still visible on the new master
   (though marked as failed).

Failures to replicate a job to other nodes will only be flagged as
errors in the master daemon log if more than half of the nodes failed;
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.

Future improvements will look into checking the consistency of the job
list and jobs themselves at master daemon startup.


Job storage
+++++++++++

Jobs are stored in the filesystem as individual files, serialized
using JSON (the standard serialization mechanism in Ganeti).

The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a temporary
file and subsequently renaming it (see the sketch after the listing
below). Except for log messages, every change in a job is stored and
replicated to other nodes.

::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)
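
The atomic update itself is the usual write-to-temporary-then-rename
pattern; a minimal sketch (illustrative, not the actual jqueue code)::

  # Illustrative sketch of an atomic job file update in the queue directory.
  import os
  import tempfile

  def write_job_file(path, data):
    """Atomically replace the file at 'path' with 'data' (a byte string)."""
    dir_name = os.path.dirname(path)
    fd, tmp_name = tempfile.mkstemp(dir=dir_name)
    try:
      os.write(fd, data)
      os.fsync(fd)
    finally:
      os.close(fd)
    os.rename(tmp_name, path)  # atomic on POSIX filesystems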


Locking
+++++++

Locking in the job queue is a complicated topic. It is called from more
than one thread and must be thread-safe. For simplicity, a single lock
is used for the whole job queue.

A more detailed description can be found in doc/locking.rst.


Internal RPC
++++++++++++

RPC calls available between the Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations (a usage sketch follows the list):

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job changed is defined by the fields passed and
  the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if
  the job has not been canceled or finished.
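
Purely as an illustration of the intended flow (the ``client`` object,
the opcode encoding and the status strings below are placeholders, not
the final API)::

  # Illustrative flow only; the real client class and opcode names differ.
  def start_instance_and_wait(client, instance_name):
    """Submit a start-instance job and follow it to completion."""
    job_id = client.SubmitJob([{"OP_ID": "OP_INSTANCE_STARTUP",
                                "instance_name": instance_name}])
    while True:
      # blocks until the watched fields (or the log) change, or a timeout;
      # the third argument is elided in the operation list above, so an
      # empty placeholder is passed here
      client.WaitForJobChange(job_id, ["status"], [], 60)
      (status,) = client.QueryJobs([job_id], ["status"])[0]
      if status in ("success", "error", "canceled"):
        break
    client.ArchiveJob(job_id)
    return status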


Job and opcode status
+++++++++++++++++++++

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but did not yet start.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to
the Error status once the master starts again.


History
+++++++

Archived jobs are kept in a separate directory,
``/var/lib/ganeti/queue/archive/``. This is done in order to speed up
the queue handling: by default, the jobs in the archive are not
touched by any functions. Only the current (unarchived) jobs are
parsed, loaded, and verified (if implemented) by the master daemon.


Ganeti updates
++++++++++++++

The queue has to be completely empty for Ganeti updates with changes
in the job queue structure. In order to allow this, there will be a
way to prevent new jobs from entering the queue.


Object parameters
~~~~~~~~~~~~~~~~~

Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model

#. node parameters

#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances

#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of VCPUs,
   or the size of memory

#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)


The following definitions for instance parameters will be used below:

:hypervisor parameter:
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for :term:`PVM` which makes no sense for :term:`HVM`).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence at all the
  hypervisor behaviour.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the prefix
  "BE\_" and the whole list of parameters will exist in the set
  "BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of a
  LV, or the MAC of a NIC)

As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “nil”) will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.

Cluster parameters
++++++++++++++++++

Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default values for backend parameters

Node parameters
+++++++++++++++

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

There are three new node flags, described in a separate section "node
flags" below.

Instance parameters
+++++++++++++++++++

As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.

The “hvparams” and “beparams” are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen to be as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. ``instance.hvparams.vnc_console_port`` instead of using both
``instance.hvparams.hvm_vnc_console_port`` and
``instance.hvparams.kvm_vnc_console_port``.

There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have only globally-per-instance such
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
exported as of the same type).

Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:

- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)


Parameter validation
++++++++++++++++++++

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.

The hypervisor support implementation API will be extended with the
following features (a sketch of their use follows the list):

:PARAMETERS: class-level attribute holding the list of valid parameters
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class
  method that can be called from within master code (i.e. cmdlib) and
  should be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from backend.py code, and as such can
  make node-specific checks (e.g. kernel_path checking)
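
A schematic example of how a hypervisor class could provide these (the
class name, parameter list and error handling are invented for
illustration only)::

  # Illustrative sketch only; the real hypervisor abstractions differ.
  import os.path

  class ExampleHypervisor(object):
    PARAMETERS = ["kernel_path", "initrd_path"]  # invented example list

    @classmethod
    def CheckParamSyntax(cls, hvparams):
      """Name-level check; safe to call from master code (cmdlib)."""
      invalid = [name for name in hvparams if name not in cls.PARAMETERS]
      if invalid:
        raise ValueError("Unknown hypervisor parameters: %s" % invalid)

    def ValidateParameters(self, hvparams):
      """Value-level check; runs on the target node (backend.py)."""
      kernel = hvparams.get("kernel_path")
      if kernel and not os.path.isfile(kernel):
        raise ValueError("Kernel %s not found on this node" % kernel)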

Default value application
+++++++++++++++++++++++++

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:

- ``Cluster.FillHV(instance)``, returns a 'filled' hvparams dict, based on
  the instance's hvparams and the cluster's ``hvparams[instance.hypervisor]``

- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the
RpcRunner when sending an instance for activation/stop, and the sent
instance hvparams/beparams will have the final values (noded code doesn't
know about defaults).

LU code will need to call the transformation itself, if needed.
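
The 'filling' itself is essentially a dictionary merge of cluster
defaults with instance overrides; schematically (illustrative only, not
the actual Cluster code)::

  # Illustrative: cluster defaults overridden by instance-level values.
  def fill_hv(cluster_hvparams, instance):
    """Return the hvparams dict the node daemon should actually see."""
    filled = dict(cluster_hvparams.get(instance.hypervisor, {}))
    filled.update(instance.hvparams)  # instance values win over defaults
    return filled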

Opcode changes
++++++++++++++

The parameter changes will have an impact on the OpCodes, especially on
the following ones:

- ``OpInstanceCreate``, where the new hv and be parameters will be sent
  as dictionaries; note that all hv and be parameters are now optional,
  as the values can instead be taken from the cluster
- ``OpInstanceQuery``, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
  dictionaries
- ``OpModifyInstance``, where the modified parameters are sent as
  dictionaries

Additionally, we will need new OpCodes to modify the cluster-level
defaults for the be/hv sets of parameters.

Caveats
+++++++

One problem that might appear is that our classification is not
complete or not good enough, and we'll need to change this model. As
a last resort, we will need to roll back and keep the 1.2 style.

Another problem is that the classification of some parameters is unclear
(e.g. ``network_port``, is this BE or HV?); in this case we'll take
the risk of having to move parameters between classes later.

Security
++++++++

The only security issue that we foresee is if some new parameters will
have sensitive values. If so, we will need to have a way to export the
config data while purging the sensitive values.

E.g. for the drbd shared secrets, we could export these with the
values replaced by an empty string.

Node flags
~~~~~~~~~~

Ganeti 2.0 adds three node flags that change the way nodes are handled
within Ganeti and the related infrastructure (iallocator interaction,
RAPI data export).

*master candidate* flag
+++++++++++++++++++++++

Ganeti 2.0 allows more scalability in operation by introducing
parallelization. However, a new bottleneck is reached, which is the
synchronization and replication of cluster configuration to all nodes
in the cluster.

This breaks scalability as the speed of the replication decreases
roughly with the number of nodes in the cluster. The goal of the
master candidate flag is to change this O(n) into O(1) with respect to
job and configuration data propagation.

Only nodes having this flag set (let's call this set of nodes the
*candidate pool*) will have jobs and configuration data replicated.

The cluster will have a new parameter (runtime changeable) called
``candidate_pool_size`` which represents the number of candidates the
cluster tries to maintain (preferably automatically).

This will impact the cluster operations as follows:

- jobs and config data will be replicated only to a fixed set of nodes
- master fail-over will only be possible to a node in the candidate pool
- cluster verify needs changing to account for these two roles
- external scripts will no longer have access to the configuration
  file (this is not recommended anyway)


The caveats of this change are:

- if all candidates are lost (completely), cluster configuration is
  lost (but it should be backed up external to the cluster anyway)

- failed nodes which are candidates must be dealt with properly, so
  that we don't lose too many candidates at the same time; this will be
  reported in cluster verify

- the 'all equal' concept of ganeti is no longer true

- the partial distribution of config data means that all nodes will
  have to revert to ssconf files for master info (as in 1.2)

Advantages:

- speed on a simulated cluster of 100+ nodes is greatly enhanced, even
  for a simple operation; ``gnt-instance remove`` on a diskless instance
  goes from ~9 seconds to ~2 seconds

- node failure of non-candidates will be less impacting on the cluster

The default value for the candidate pool size will be set to 10 but
this can be changed at cluster creation and modified any time later.

Testing on simulated big clusters with sequential and parallel jobs
shows that this value (10) is a sweet spot from a performance and load
point of view.

*offline* flag
++++++++++++++

In order to better support the situation in which nodes are offline
(e.g. for repair) without altering the cluster configuration, Ganeti
needs to be told about, and needs to properly handle, this state for
nodes.

This will result in simpler procedures, and fewer mistakes, when the
amount of node failures is high on an absolute scale (either due to a
high failure rate or simply big clusters).

Nodes having this attribute set will not be contacted for inter-node
RPC calls, will not be master candidates, and will not be able to host
instances as primaries.

Setting this attribute on a node:

- will not be allowed if the node is the master
- will not be allowed if the node has primary instances
- will cause the node to be demoted from the master candidate role (if
  it was one), possibly causing another node to be promoted to that role

This attribute will impact the cluster operations as follows:

- querying these nodes for anything will fail instantly in the RPC
  library, with a specific RPC error (RpcResult.offline == True)

- they will be listed in the Other section of cluster verify

The code is changed in the following ways:

- RPC calls were converted to skip such nodes:

  - RpcRunner-instance-based RPC calls are easy to convert

  - static/classmethod RPC calls are harder to convert, and were left
    alone

- the RPC results were unified so that this new result state (offline)
  can be differentiated

- master voting still queries in-repair nodes, as we need to ensure
  consistency in case the (wrong) masters have old data, and nodes have
  come back from repairs

Caveats:

- some operation semantics are less clear (e.g. what to do on instance
  start with an offline secondary?); for now, these will just fail as if
  the flag were not set (but faster)
- a 2-node cluster with one node offline needs manual startup of the
  master with a special flag to skip voting (as the master can't get a
  quorum there)

One of the advantages of implementing this flag is that it will allow
automation tools in the future to automatically put the node in
repairs and recover from this state, and the code (should/will) handle
this much better than just timing out. So, future possible
improvements (for later versions):

- watcher will detect nodes which fail RPC calls, will attempt to ssh
  to them, and if that fails will put them offline
- watcher will try to ssh and query the offline nodes, and if successful
  will take them off the repair list

Alternatives considered: The RPC call model in 2.0 is, by default,
much nicer - errors are logged in the background, and job/opcode
execution is clearer, so we could simply not introduce this. However,
having this state will make both the codepaths clearer (offline
vs. temporary failure) and the operational model (it's not a node with
errors, but an offline node).


*drained* flag
++++++++++++++

Due to the parallel execution of jobs in Ganeti 2.0, we could have the
following situation:

- gnt-node migrate + failover is run
- gnt-node evacuate is run, which schedules a long-running 6-opcode
  job for the node
- partway through, a new job comes in that runs an iallocator script,
  which finds the above node as empty and a very good candidate
- gnt-node evacuate has finished, but now it has to be run again, to
  clean up the above instance(s)

In order to prevent this situation, and to be able to get nodes into
proper offline status easily, a new *drained* flag was added to the
nodes.

This flag (which actually means "is being, or was drained, and is
expected to go offline") will prevent allocations on the node, but
otherwise all other operations (start/stop instance, query, etc.) will
work without any restrictions.

Interaction between flags
+++++++++++++++++++++++++

While these flags are implemented as separate flags, they are
mutually exclusive and act together with the master node role
as a single *node status* value. In other words, a node is only in one
of these roles at a given time. The lack of any of these flags denotes
a regular node.

The current node status is visible in the ``gnt-cluster verify``
output, and the individual flags can be examined via separate flags in
the ``gnt-node list`` output.

These new flags will be exported in both the iallocator input message
and via RAPI; see the respective man pages for the exact names.

Feature changes
---------------

The main feature-level changes will be:

- a number of disk related changes
- removal of the fixed two-disk, one-NIC per instance limitation

Disk handling changes
~~~~~~~~~~~~~~~~~~~~~

The storage options available in Ganeti 1.x were introduced based on
then-current software (first DRBD 0.7, then later DRBD 8) and the
estimated usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as
'rare' events, and that each of them needs to be manually handled in
order to ensure data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or
  cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal,
but our goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), such that removing the support for the ``remote_raid1``
  template and focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a
  static assignment (done at either instance creation time or change
  secondary time) will change the disk activation time from O(n) to
  O(1), which on big clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing file-based
  storage

Additionally, a number of smaller enhancements are also planned:

- support a variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base design
changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical
  disks

- add support for DRBD8 authentication at handshake time in
  order to ensure each device connects to the correct peer

- remove the restrictions on failover only to the secondary
  which creates very strict rules on cluster allocation
1238 |
|
1239 |
DRBD minor allocation |
1240 |
+++++++++++++++++++++ |
1241 |
|
1242 |
Currently, when trying to identify or activate a new DRBD (or MD) |
1243 |
device, the code scans all in-use devices in order to see if we find |
1244 |
one that looks similar to our parameters and is already in the desired |
1245 |
state or not. Since this needs external commands to be run, it is very |
1246 |
slow when more than a few devices are already present. |
1247 |
|
1248 |
Therefore, we will change the discovery model from dynamic to |
1249 |
static. When a new device is logically created (added to the |
1250 |
configuration) a free minor number is computed from the list of |
1251 |
devices that should exist on that node and assigned to that |
1252 |
device. |
1253 |
|
1254 |
At device activation, if the minor is already in use, we check if |
1255 |
it has our parameters; if not so, we just destroy the device (if |
1256 |
possible, otherwise we abort) and start it with our own |
1257 |
parameters. |
1258 |
|
1259 |
This means that we in effect take ownership of the minor space for |
1260 |
that device type; if there's a user-created DRBD minor, it will be |
1261 |
automatically removed. |
1262 |
|
1263 |
The change will have the effect of reducing the number of external |
1264 |
commands run per device from a constant number times the index of the |
1265 |
first free DRBD minor to just a constant number. |
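As an illustration, a minimal sketch of the static allocation step
(not the actual Ganeti implementation) could compute the first free
minor purely from the configuration's view of the node's devices::

  def find_free_minor(used_minors):
      """Return the smallest DRBD minor not in use on this node.

      ``used_minors`` is assumed to be the set of minors that the
      configuration says should exist on the node; no external
      commands are needed, which is what makes activation O(1).
      """
      minor = 0
      used = set(used_minors)
      while minor in used:
          minor += 1
      return minor

The computed minor would then be stored in the configuration together
with the device, so that activation only has to verify (and possibly
take over) that single minor.
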
Removal of obsolete device types (MD, DRBD7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
DRBD7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in MD+DRBD mode, we cannot do live failover (not even if we
had MD+DRBD8).

File-based storage support
++++++++++++++++++++++++++

Using files instead of logical volumes for instance storage would
allow us to get rid of the hard requirement for volume groups for
testing clusters, and it would also allow the use of SAN storage for
live failover, taking advantage of that storage solution.

Better LVM allocation
+++++++++++++++++++++

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks, however it introduces the problem that an instance
could end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate all the storage of one instance as close
together as possible. We will still allow the logical volumes to spill
over to additional disks as needed.

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or when replacing individual disks, it is not easy to compute the
current disk map, so we will not attempt the clustering in those
cases.

DRBD8 peer authentication at handshake
++++++++++++++++++++++++++++++++++++++

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this to prevent connecting to the wrong peer,
rather than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.


LVM self-repair (optional)
++++++++++++++++++++++++++

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` operation, using the
following method:

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all instances affected

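A rough sketch of this decision logic (a hypothetical helper, not the
planned implementation) could look like this::

  def self_repair_action(consistent_nodes, prev_all_consistent):
      """Decide what the optional LVM self-repair should do.

      ``consistent_nodes`` maps node name -> True/False (VG consistent);
      ``prev_all_consistent`` is the overall result of the previous run.
      Returns the new overall status and the node on which to run
      ``vgreduce --removemissing``, or None if nothing should be done.
      """
      bad_nodes = [name for name, ok in consistent_nodes.items() if not ok]
      all_consistent = not bad_nodes
      if all_consistent or not prev_all_consistent:
          # everything is fine, or the inconsistency is not new: do
          # nothing beyond saving the new status
          return all_consistent, None
      if len(bad_nodes) > 1:
          # more than one node became inconsistent: too risky to automate
          return all_consistent, None
      # exactly one node just became inconsistent: repair (and log) it
      return all_consistent, bad_nodes[0]
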
Failover to any node
++++++++++++++++++++

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this is reduced
  to the need to reserve memory anywhere on the cluster

- the need to first failover and then replace the secondary for an
  instance: with failover-to-any, we can directly failover to another
  node, which also does the disk replacement in the same step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)

The algorithm for performing the failover is straightforward (a short
sketch follows the list):

- verify that S2 (the node the user has chosen to keep as secondary)
  has valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no
  data, it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary
  role (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove the old data
  on the old node that has not been chosen for S2

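The following is a hedged, high-level sketch of that sequence; the
helper objects (``config``, ``drbd``) are hypothetical stand-ins, not
actual Ganeti APIs::

  def failover_to_any(instance, p2, config, drbd):
      """Fail over ``instance`` to the user-chosen node ``p2``."""
      p1, s1 = instance.primary_node, instance.secondary_node
      # pick S2: keep whichever of P1/S1 still holds consistent data
      s2 = s1 if not config.node_is_healthy(p1) else p1
      if not drbd.is_consistent(instance, s2):
          raise RuntimeError("no consistent data copy on %s" % s2)
      drbd.teardown_pairing(instance, p1, s1)
      drbd.setup_pairing(instance, primary=p2, secondary=s2)
      drbd.wait_for_state(instance, p2, "SyncTarget")  # resync started
      drbd.promote(instance, p2)            # P2 becomes primary (r/w)
      config.start_instance(instance, p2)
      drbd.wait_for_sync(instance)          # P2-S2 resync finished
      config.remove_old_disks(instance, keep=(p2, s2))
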
Caveats: during the P2-S2 sync, a (non-transient) network error will
cause I/O errors on the instance, so (if a longer instance downtime is
acceptable) we can postpone the restart of the instance until the
resync is done. However, disk I/O errors on S2 will cause data loss,
since we don't have a good copy of the data anymore, so in this case
waiting for the sync to complete is not an option. As such, it is
recommended that this feature be used only in conjunction with proper
disk monitoring.


Live migration note: while failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
+++++++

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake the DRBD device of another instance, since
it always looks for either our own device or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take it over. This needs careful implementation,
such that if the minor is in use, either we are able to cleanly shut
it down, or we abort the startup. Otherwise, it could happen that we
start syncing between two instances' disks, causing data loss.


Variable number of disks/NICs per instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Variable number of disks
++++++++++++++++++++++++

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make the disk definition fully
flexible. This has less impact than it might seem at first sight: only
the instance creation has a hard-coded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code are already working with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces are not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at instance
creation, and to be able to toggle a disk from read-only to read-write
afterwards.

Variable number of NICs
+++++++++++++++++++++++

Similar to the disk change, we need to allow multiple network
interfaces per instance. This will affect the internal code (some
functions will have to stop assuming that ``instance.nics`` is a list
of length one), the OS API which currently can export/import only one
instance, and the command line interface.

Interface changes
-----------------

There are two areas of interface changes: API-level changes (the OS
interface and the RAPI interface) and the command line interface
changes.

OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2.
The interface is composed of a series of scripts which get called with
certain parameters to perform OS-dependent operations on the cluster.
The current scripts are:

create
  called when a new instance is added to the cluster
export
  called to export an instance disk to a stream
import
  called to import from a stream to a new instance
rename
  called to perform the OS-specific operations necessary for renaming
  an instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example they accept exactly one block and one swap device to operate
on, rather than any number of generic block devices, they blindly
assume that an instance will have just one network interface to
operate, and they cannot be configured to optimise the instance for a
particular hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors, and a
non-fixed number of NICs and disks, the OS interface needs to change
in order to transmit the appropriate amount of information about an
instance to its managing operating system when operating on it.
Moreover, since some old assumptions usually used in OS scripts are no
longer valid, we need to re-establish a common understanding of what
can and what cannot be assumed regarding the Ganeti environment.


When designing the new OS API our priorities are:

- ease of use
- future extensibility
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to
support an OS, and make it easy to share code between them by making
their input uniform. We will also leave the current script structure
unchanged, as far as we can, and make a few of the scripts (import,
export and rename) optional. Most information will be passed to the
scripts through environment variables, for ease of access and at the
same time ease of using only the information a script needs.

The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs
to support the following functionality, through scripts:

create:
  used to create a new instance running that OS. This script should
  prepare the block devices, and install them so that the new OS can
  boot under the specified hypervisor.
export (optional):
  used to export an installed instance using the given OS to a format
  which can be used to import it back into a new instance.
import (optional):
  used to import an exported instance into a new one. This script is
  similar to create, but the new instance should have the content of
  the export, rather than contain a pristine installation.
rename (optional):
  used to perform the internal OS-specific operations needed to rename
  an instance.

If any optional script is not implemented, Ganeti will refuse to
perform the given operation on instances using the non-implementing
OS. Of course the create script is mandatory, and it doesn't make
sense to support either the export or the import operation but not
both.

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for
1.2 and the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in
  2.0 we'll use environment variables, as there will be a lot more
  information and not all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device
  the instance has, and import scripts once for every exported disk.
  Imported instances will be forced to have a number of disks greater
  than or equal to that of the export.
- Some scripts are not compulsory: if such a script is missing, the
  relevant operations will be forbidden for instances of that OS. This
  makes it easier to distinguish between unsupported operations and
  no-op ones (if any).


Input
_____

Rather than using command line flags, as they do now, scripts will
accept inputs from environment variables. We expect the following
input values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  this is used so that in the future we could have OSes supporting
  multiple versions and thus Ganeti can send the proper version in
  this parameter
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm',
  'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  The number of NICs this instance will have
DISK_<N>_PATH
  Path to the Nth disk.
DISK_<N>_ACCESS
  W if read/write, R if read only. OS scripts are not supposed to
  touch read-only disks, but they are still passed so that the scripts
  know about them.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide',
  'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop'
  or 'file:blktap'
NIC_<N>_MAC
  MAC address for the Nth network interface
NIC_<N>_IP
  IP address for the Nth network interface, if available
NIC_<N>_BRIDGE
  Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes.
  Currently the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
:manpage:`ganeti-os-api` man page. All these variables will be
available to all scripts.

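As an illustration of what a create script might do with this
environment, here is a minimal sketch; the variable names are the ones
listed above, everything else (and the use of Python rather than a
shell script) is just for illustration::

  import os

  def read_instance_env():
      """Collect the instance description passed via the environment."""
      disk_count = int(os.environ["DISK_COUNT"])
      nic_count = int(os.environ["NIC_COUNT"])
      disks = [{
          "path": os.environ["DISK_%d_PATH" % i],
          "access": os.environ["DISK_%d_ACCESS" % i],  # 'W' or 'R'
      } for i in range(disk_count)]
      nics = [{
          "mac": os.environ["NIC_%d_MAC" % i],
          "bridge": os.environ["NIC_%d_BRIDGE" % i],
      } for i in range(nic_count)]
      return os.environ["INSTANCE_NAME"], disks, nics

  name, disks, nics = read_instance_env()
  for disk in disks:
      if disk["access"] == "W":
          pass  # format the writable disk and install the OS here
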
Some scripts will need a bit more information to work. These will have
per-script variables, such as for example:

OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The
  data must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The
  data must be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance
name is always needed and we could pass it on the command line. On the
other hand, though, this would force scripts to both access the
environment and parse the command line, so we pass it in the
environment for uniformity.)


Output/Behaviour
________________

As discussed, scripts should only send user-targeted information to
stderr. The create and import scripts are supposed to
format/initialise the given block devices and install the correct
instance data. The export script is supposed to export instance data
to stdout in a format understandable by the import script. The data
will be compressed by Ganeti, so no compression should be done by the
scripts. The rename script should only modify the instance's knowledge
of what its name is.

Other declarative style features
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing the list of numbers matching the
version(s) of the API they implement. Ganeti itself will always be
compatible with one version of the API and may maintain backwards
compatibility if it's feasible to do so. The numbers are one per line,
so an OS supporting both version 5 and version 20 will have a file
containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

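For example, the 'ganeti_api_version' file of an OS supporting both of
those versions would contain just::

  5
  20
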
In addition to that, an OS will be able to declare that it supports
only a subset of the Ganeti hypervisors, by declaring them in the
'hypervisors' file.


Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just
dumps all disks and restores them. This can save work as most systems
will just do this, while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there
will be enough space to store the information we need. If we discover
that this is not the case, we may want to go to a more complex API,
such as storing that information on the filesystem and providing the
OS script with the path to a file where it is encoded in some format.


Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti remote API (RAPI) was designed and deployed with the
Ganeti 1.2.5 release. That version provided read-only access to the
cluster state. A fully functional read-write API demands significant
internal changes, which will be implemented in version 2.0.

We decided to implement the Ganeti RAPI in a RESTful way, which is
aligned with the key features we are looking for: it is a simple,
stateless, scalable and extensible paradigm for API implementation. As
transport it uses HTTP over SSL, and we are implementing it with JSON
encoding, but in a way that makes it possible to extend it and provide
any other encoding.

Design
++++++

The Ganeti RAPI is implemented as an independent daemon, running on
the same node and with the same permission level as the Ganeti master
daemon. Communication is done through the LUXI library to the master
daemon. In order to keep communication asynchronous, RAPI processes
two types of client requests:

- queries: the server is able to answer immediately
- job submission: some time is required for a useful response

In the query case, the requested data is sent back to the client in
the HTTP response body. Typical examples of queries would be: list of
nodes, instances, cluster info, etc.

In the case of job submission, the client receives a job ID, the
identifier which allows one to query the job progress in the job queue
(see `Job Queue`_).

Internally, each exported object has a version identifier, which is
used as a state identifier in the HTTP ``ETag`` header for
requests/responses to avoid race conditions.


Resource representation
+++++++++++++++++++++++

The key difference of using REST instead of other API styles is that
REST requires separation of services via resources with unique URIs.
Each of them should have a limited amount of state and support the
standard HTTP methods: GET, POST, DELETE, PUT.

For example, in Ganeti's case we can have a set of URIs:

- ``/{clustername}/instances``
- ``/{clustername}/instances/{instancename}``
- ``/{clustername}/instances/{instancename}/tag``
- ``/{clustername}/tag``

A GET request to ``/{clustername}/instances`` will return the list of
instances, a POST to ``/{clustername}/instances`` should create a new
instance, a DELETE of ``/{clustername}/instances/{instancename}``
should delete the instance, and a GET of ``/{clustername}/tag`` should
return the cluster tags.

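A client interacting with these resources could then be as simple as
the following sketch; the URIs follow the examples above, while the
host, port, cluster name and JSON payload are assumptions, and
authentication and certificate handling are omitted::

  import json
  import urllib.request

  BASE = "https://cluster.example.com:5080"   # assumed RAPI endpoint

  def get_json(path):
      """Perform a GET query and decode the JSON response body."""
      with urllib.request.urlopen(BASE + path) as resp:
          return json.loads(resp.read().decode("utf-8"))

  # query: answered immediately, the data is in the response body
  instances = get_json("/mycluster/instances")

  # job submission: a POST returns a job ID which is then polled
  request = urllib.request.Request(
      BASE + "/mycluster/instances",
      data=json.dumps({"name": "test-instance"}).encode("utf-8"),
      method="POST")
  with urllib.request.urlopen(request) as resp:
      job_id = json.loads(resp.read().decode("utf-8"))
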
Each resource URI will have a version prefix. The resource IDs are to
be determined.

The internal encoding might be JSON, XML, or any other format; the
JSON encoding fits nicely with the Ganeti RAPI needs. The client can
request a specific representation via the Accept field in the HTTP
header.

REST uses HTTP as its transport and application protocol for resource
access. The set of possible responses is a subset of standard HTTP
responses.

The statelessness model provides additional reliability and
transparency to operations (e.g. only one request needs to be analyzed
to understand the in-progress operation, not a sequence of multiple
requests/responses).


Security
++++++++

With the write functionality, security becomes a much bigger issue.
The Ganeti RAPI uses basic HTTP authentication on top of an
SSL-secured connection to grant access to an exported resource. The
password is stored locally in an Apache-style ``.htpasswd`` file. Only
one level of privileges is supported.

Caveats
+++++++

The model detailed above for job submission requires the client to
poll periodically for updates to the job; an alternative would be to
allow the client to request a callback, or a 'wait for updates' call.

The callback model was not considered due to the following two issues:

- callbacks would require a new model of allowed callback URLs,
  together with a method of managing these
- callbacks only work when the client and the master are in the same
  security domain, and they fail in the other cases (e.g. when there
  is a firewall between the client and the RAPI daemon that only
  allows client-to-RAPI calls, which is usual in DMZ cases)

The 'wait for updates' method is not suited to the HTTP protocol,
where requests are supposed to be short-lived.

Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled:

- extend and modify the command line syntax to support the new
  features
- ensure consistent patterns in the command line arguments to reduce
  the cognitive load

The design changes that require these command line changes are, in no
particular order:

- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance,
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for the CDROM device type (via ISO image).

As such, there are several areas of Ganeti where the command line
arguments will change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.

Notes about device removal/addition
+++++++++++++++++++++++++++++++++++

To avoid problems with device location changes (e.g. the second
network interface of the instance becoming the first or third and the
like), the list of network/disk devices is treated as a stack, i.e.
devices can only be added/removed at the end of the list of devices of
each class (disk or network) for each instance.

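In other words, the per-instance device lists behave like a plain
stack; a minimal illustration::

  # Disks (and NICs) are only appended to or popped from the tail,
  # so existing devices never change their index.
  disks = ["disk/0", "disk/1", "disk/2"]
  disks.append("disk/3")   # "add": the new device becomes disk/3
  disks.pop()              # "remove": drops the highest-numbered device
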
gnt-instance commands
+++++++++++++++++++++

The commands for gnt-instance will be modified and extended to allow
for the new functionality:

- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is::

  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
  to. Accepts either a valid bridge name (the specified bridge must
  exist on the node(s)) as string or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.

Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is::

  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude
  suffix (M for mebibytes, G for gibibytes). Also accepts the string
  'auto' in which case the default disk size will be used. If the size
  option is not specified, 'auto' is assumed. This option is not valid
  for all disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - *w*: read/write access to the disk device or
  - *r*: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.

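Both the ``--net`` and ``--disk`` values follow the same
``$DEVNUM[:$OPTION=$VALUE][,...]`` shape, so a single parsing helper
could cover them; the sketch below is illustrative only and is not the
planned option-parsing code (the magic ``add``/``remove`` values
described in the next sections are not handled here)::

  def parse_device_option(value):
      """Split e.g. '0:mac=auto,bridge=br0' into (devnum, options)."""
      if ":" in value:
          devnum, rest = value.split(":", 1)
          options = dict(item.split("=", 1) for item in rest.split(","))
      else:
          devnum, options = value, {}
      return int(devnum), options

  print(parse_device_option("0:mac=auto,bridge=xen-br0"))
  # (0, {'mac': 'auto', 'bridge': 'xen-br0'})
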
Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.

Example::

  gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance modify. The
same device specific options as for adding devices are used. Instead
of a device number and further device options, only the magic string
remove is specified. It will always remove the last device in the list
of devices of this type for the instance specified, e.g. if the
instance has disk devices 0, 1, 2 and 3, disk device number 3 will be
removed.

Example::

  gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.

Example::

  gnt-instance modify --disk 2:access=r test-instance

Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the ``--hypervisor`` option. The generic syntax of the hypervisor
option is as follows::

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm

:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string

The hypervisor option for an instance can be set at instance creation
time via the ``gnt-instance add`` command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.

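Parsing this value differs from the device options only in its leading
hypervisor name; a small illustrative sketch (again, not the planned
implementation), using the hypervisor and option names that appear
elsewhere in this document::

  def parse_hypervisor_option(value):
      """Split e.g. 'xen-hvm:cdrom=/srv/boot.iso' into name and options."""
      hypervisor, _, rest = value.partition(":")
      if not rest:
          return hypervisor, {}
      options = dict(item.split("=", 1) for item in rest.split(","))
      return hypervisor, options

  print(parse_hypervisor_option(
      "xen-hvm:cdrom=/srv/boot.iso,boot_order=cdrom:network"))
  # ('xen-hvm', {'cdrom': '/srv/boot.iso', 'boot_order': 'cdrom:network'})
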
Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the ``--hypervisor`` option of the ``gnt-instance modify``
command. However, the hypervisor type of an existing instance cannot
be changed, only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.

Example::

  gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

gnt-cluster commands
++++++++++++++++++++

The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:

- The init command will be extended to support the defaults option to
  set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
++++++++++++++++

The generic format of the cluster default setting option is::

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Currently, the following cluster default options are defined (open to
further changes):

:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disksize for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of megabyte is assumed, or a positive number followed by a supported
  magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).

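As an illustration, using the option format above (the option values
shown are placeholders, not recommended defaults)::

  gnt-cluster modify --defaults hypervisor=xen-pvm,disksize=10G,bridge=xen-br0
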
Hypervisor cluster defaults
+++++++++++++++++++++++++++

The generic format of the hypervisor cluster wide default setting
option is::

  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
  to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

.. vim: set textwidth=72 :