=================
Ganeti 2.0 design
=================

This document describes the major changes in Ganeti 2.0 compared to
the 1.2 version.

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents:: :depth: 3

Objective
=========

Ganeti 1.2 has many scalability issues and restrictions due to its
roots as software for managing small and 'static' clusters.

Version 2.0 will attempt to remedy first the scalability issues and
then the restrictions.

Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

- only one operation at a time on the cluster [#]_
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical
design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
   and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible code base for future developments.

Among these problems, the single-operation-at-a-time restriction is
the biggest issue with the current version of Ganeti. It is such a big
impediment in operating bigger clusters that many times one is tempted
to remove the lock just to do a simple operation like start instance
while an OS installation is running.

Scalability problems
--------------------

Ganeti 1.2 has a single global lock, which is used for all cluster
operations. This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for
  example failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive
should run in parallel (for example because they happen on different
nodes) are actually stalling each other while waiting for the global
lock, without a real reason for that to happen.

One of the main causes of this global lock (beside the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no long-lived process in Ganeti
that can coordinate multiple operations. Each command tries to acquire
the so called *cmd* lock and when it succeeds, it takes complete
ownership of the cluster configuration and state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
instances per node, which is no longer true with today's hardware.

Artificial restrictions
-----------------------

Ganeti 1.2 (and previous versions) have a fixed two-disk, one-NIC per
instance model. This is a purely artificial restriction, but it
touches so many areas (configuration, import/export, command line)
that removing it is better suited to a major release than a minor one.

Architecture issues
-------------------

The fact that each command is a separate process that reads the
cluster state, executes the command, and saves the new state is also
an issue on big clusters where the configuration data for the cluster
begins to be non-trivial in size.

Overview
========

In order to solve the scalability problems, a rewrite of the core
design of Ganeti is required. While the cluster operations themselves
won't change (e.g. start instance will do the same things), the way
these operations are scheduled internally will change radically.

The new design will change the cluster architecture to:

.. digraph:: "ganeti-2.0-architecture"

  compound=false
  concentrate=true
  mclimit=100.0
  nslimit=100.0
  edge[fontsize="8" fontname="Helvetica-Oblique"]
  node[width="0" height="0" fontsize="12" fontcolor="black" shape=rect]

  subgraph outside {
    rclient[label="external clients"]
    label="Outside the cluster"
  }

  subgraph cluster_inside {
    label="ganeti cluster"
    labeljust=l
    subgraph cluster_master_node {
      label="master node"
      rapi[label="RAPI daemon"]
      cli[label="CLI"]
      watcher[label="Watcher"]
      burnin[label="Burnin"]
      masterd[shape=record style=filled label="{ <luxi> luxi endpoint | master I/O thread | job queue | {<w1> worker| <w2> worker | <w3> worker }}"]
      {rapi;cli;watcher;burnin} -> masterd:luxi [label="LUXI" labelpos=100]
    }

    subgraph cluster_nodes {
      label="nodes"
      noded1 [shape=record label="{ RPC listener | Disk management | Network management | Hypervisor } "]
      noded2 [shape=record label="{ RPC listener | Disk management | Network management | Hypervisor } "]
      noded3 [shape=record label="{ RPC listener | Disk management | Network management | Hypervisor } "]
    }
    masterd:w2 -> {noded1;noded2;noded3} [label="node RPC"]
    cli -> {noded1;noded2;noded3} [label="SSH"]
  }

  rclient -> rapi [label="RAPI protocol"]

This differs from the 1.2 architecture by the addition of the master
daemon, which will be the only entity to talk to the node daemons.


Detailed design
===============

The changes for 2.0 can be split into roughly three areas:

- core changes that affect the design of the software
- features (or restriction removals) which do not have a wide
  impact on the design
- user-level and API-level changes which translate into differences for
  the operation of the cluster

Core changes
------------

The main changes will be switching from a per-process model to a
daemon based model, where the individual gnt-* commands will be
clients that talk to this daemon (see `Master daemon`_). This will
allow us to get rid of the global cluster lock for most operations,
having instead a per-object lock (see `Granular locking`_). Also, the
daemon will be able to queue jobs, and this will allow the individual
clients to submit jobs without waiting for them to finish, and also
see the result of old requests (see `Job Queue`_).

Beside these major changes, another 'core' change, though not as
visible to the users, will be changing the model of object attribute
storage, separating it into name spaces (such that a Xen PVM
instance will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. For more details see
`Object parameters`_.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use HTTP(s) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

Master daemon
~~~~~~~~~~~~~

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so called
  *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:

- CLI tools might access via SSH the nodes (for ``gnt-cluster copyfile``
  and ``gnt-cluster command``)
- master failover is a special case when a non-master node will SSH
  and do node-RPC calls to the current master

The protocol between the master daemon and the node daemons will be
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S),
using a simple PUT/GET of JSON-encoded messages. This is done due to
difficulties in working with the Twisted framework and its protocols
in a multithreaded environment, which we can overcome by using a
simpler stack (see the caveats section).

The protocol between the CLI/RAPI and the master daemon will be a
custom one (called *LUXI*): on a UNIX socket on the master node, with
rights restricted by filesystem permissions, the CLI/RAPI will talk to
the master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.

For more details about the RAPI daemon see `Remote API changes`_, and
for the node daemon see `Node daemon changes`_.

.. _luxi:

The LUXI protocol
+++++++++++++++++

As described above, the protocol for making requests or queries to the
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
messages.

The choice of a UNIX socket was made in order to get rid of the need
for authentication and authorisation inside Ganeti; for 2.0, the
permissions on the Unix socket itself will determine the access
rights.

We will have two main classes of operations over this API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
internally implemented still with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized in the query-functions)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details of the actual operation list, see the `Job Queue`_.

Both requests and responses will consist of a JSON-encoded message
followed by the ``ETX`` character (ASCII decimal 3), which is not a
valid character in JSON messages and thus can serve as a message
delimiter. The contents of the messages will be a dictionary with two
fields:

:method:
  the name of the method called
:args:
  the arguments to the method, as a list (no keyword arguments allowed)

Responses will follow the same format, with the two fields being:

:success:
  a boolean denoting the success of the operation
:result:
  the actual result, or error message in case of failure

There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method of
  passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
  internally retry the call; this is done in order to differentiate
  internally between a hung master daemon and a job that simply has not
  changed

Users of the API that don't use the provided python library should
take care of the above two cases.
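
For such users, a minimal, illustrative sketch (not the actual Ganeti
client library) of how a request could be framed and a response decoded
over the UNIX socket follows; the socket path and method name in the
usage comment are assumptions::

  import json
  import socket

  ETX = "\x03"  # ASCII decimal 3, the message delimiter described above

  def luxi_call(socket_path, method, args, timeout=60):
    """Send one LUXI-style request and return the decoded result.

    Only a sketch of the wire format (JSON message terminated by ETX);
    exception decoding and retries on "nochange" are left out.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    sock.connect(socket_path)
    try:
      # Request: a dictionary with "method" and "args", ETX-terminated
      sock.sendall((json.dumps({"method": method, "args": args}) + ETX).encode())

      # Read until the ETX delimiter is seen
      buf = b""
      while not buf.endswith(ETX.encode()):
        chunk = sock.recv(4096)
        if not chunk:
          raise ConnectionError("connection closed before ETX")
        buf += chunk

      response = json.loads(buf[:-1].decode())
      if not response["success"]:
        raise RuntimeError(response["result"])
      return response["result"]
    finally:
      sock.close()

  # Hypothetical usage; socket path and method name are examples only:
  # result = luxi_call("/var/run/ganeti/master.sock", "QueryJobs",
  #                    [[42], ["status"]])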


Master daemon implementation
++++++++++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other threads (pools).

There will be two other classes of threads in the daemon:

- job processing threads, part of a thread pool, and which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  (LUXI) to the clients, and are short-lived
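
A minimal sketch of this thread layout, using only the standard library
(the class name, pool size and job representation are illustrative, not
the actual masterd implementation)::

  import queue
  import threading

  class MasterDaemonSketch:
    """Toy model of the thread classes described above."""

    def __init__(self, num_workers=4):
      self.job_queue = queue.Queue()
      # Long-lived job processing threads (the worker pool)
      self.workers = [threading.Thread(target=self._worker_loop, daemon=True)
                      for _ in range(num_workers)]
      for worker in self.workers:
        worker.start()

    def _worker_loop(self):
      while True:
        job = self.job_queue.get()   # blocks until a job is available
        job()                        # run the queued job (a callable here)
        self.job_queue.task_done()

    def handle_client(self, connection):
      # Short-lived client I/O thread: speak LUXI to one client, enqueue
      # the submitted job, reply with its ID, then exit
      threading.Thread(target=self._client_io, args=(connection,),
                       daemon=True).start()

    def _client_io(self, connection):
      ...  # read request, self.job_queue.put(...), write response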

Master startup/failover
+++++++++++++++++++++++

In Ganeti 1.x there is no protection against failing over the master
to a node with stale configuration. In effect, the responsibility of
correct failovers falls on the admin. This is true both for the new
master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will happen whenever a node is ready to
transition to the master role, either at startup time or at node
failover:

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

   - we have the latest configuration and job list (as
     determined by the serial number on the configuration and
     highest job ID on the job queue)

   - if we are not failing over (but just starting), the
     quorum agrees that we are the designated master

   - if any of the above is false, we prevent the current operation
     (i.e. we don't become the master)

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

Since due to exceptional conditions we could have a situation in which
no node can become the master due to inconsistent data, we will have
an override switch for the master daemon startup that will assume the
current node has the right data and will replicate all the
configuration files to the other nodes.

**Note**: the above algorithm is by no means an election algorithm; it
is a *confirmation* of the master role currently held by a node.
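
A rough sketch of the quorum check described above; the per-node query
helper and its result fields are hypothetical, not the actual bootstrap
code::

  def confirm_master_role(my_name, node_list, my_config_serial,
                          my_last_job_id, query_node_fn):
    """Confirm (not elect) the master role held by this node.

    query_node_fn(node) is a hypothetical RPC helper returning a dict
    with "config_serial", "last_job_id" and "master" as seen by that
    node, or None if the node is unreachable.
    """
    votes = 0
    for node in node_list:
      data = query_node_fn(node)
      if data is None:
        continue  # unreachable nodes don't contribute to the quorum
      if (data["config_serial"] <= my_config_serial and
          data["last_job_id"] <= my_last_job_id and
          data["master"] == my_name):
        votes += 1

    # Quorum: at least half plus one of all nodes must agree
    return votes >= len(node_list) // 2 + 1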

Logging
+++++++

The logging system will be switched completely to the standard python
logging module; currently it's logging-based, but exposes a different
API, which is just overhead. As such, the code will be switched over
to standard logging calls, and only the setup will be custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one log file per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools
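
A minimal sketch of such a per-daemon setup with the standard logging
module (the path and format below are illustrative only)::

  import logging

  def setup_daemon_logging(daemon_name, debug=False):
    """Configure a single log file for one daemon (sketch only)."""
    handler = logging.FileHandler("/var/log/ganeti/%s.log" % daemon_name)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.DEBUG if debug else logging.INFO)

  # e.g. setup_daemon_logging("master-daemon"); logging.info("starting up")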

Since the :term:`watcher` will only submit jobs to the master for
startup of the instances, its log file will contain less information
than before, mainly that it will start the instance, but not the
results.

Node daemon changes
+++++++++++++++++++

The only change to the node daemon is that, since we need better
concurrency, we don't process the inter-node RPC calls in the node
daemon itself, but we fork and process each request in a separate
child.

Since we don't have many calls, and we only fork (not exec), the
overhead should be minimal.
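
As a rough illustration of this fork-per-request model (the request
handler and connection handling are simplified assumptions, not the
actual node daemon code)::

  import os
  import signal

  # Auto-reap finished children so the parent never blocks on them
  signal.signal(signal.SIGCHLD, signal.SIG_IGN)

  def serve_request(connection, handler):
    """Process one node RPC request in a forked child (sketch only)."""
    if os.fork() == 0:
      # Child: handle the request, then exit immediately; no exec() is
      # needed since the handler code is already loaded in the parent
      try:
        handler(connection)
      finally:
        os._exit(0)
    else:
      # Parent: the child owns this connection from now on
      connection.close()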

Caveats
+++++++

A discussed alternative is to keep the current individual processes
touching the cluster configuration model. The reasons we have not
chosen this approach are:

- the cost of reading and unserializing the cluster state
  today is not small enough that we can ignore it; the addition of
  the job queue will make the startup cost even higher. While this
  runtime cost is low today, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use twisted, all the code needs to be 'twisted-ized' and written
  in an asynchronous manner, using deferreds; while this method works,
  it's not a common way to code and it requires that the entire process
  workflow is based around a single *reactor* (Twisted name for a main
  loop)
- the more advanced granular locking that we want to implement would
  require, if written in the async-manner, deep integration with the
  Twisted stack, to such an extent that business-logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  request, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTP(S) protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is with the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement workarounds many times for changes
in the Twisted version, so that for example 1.2 is able to use both
Twisted 2.x and 8.x.

In the end, since we already had an HTTP server library for the RAPI,
we just reused that for inter-node communication.


Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti Cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:

- we preserve data coherency
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a Non-Goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti level operations, aka
Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent to their users
- automatically grabbing multiple locks in the right order (avoid
  deadlock)
- ability to transparently handle conversion to more granularity
- support asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a
distributed operation. Therefore, in case of master failure, the
operations currently running will be aborted and the locks will be
lost; it remains up to the administrator to clean up (if needed) the
operation result (e.g. make sure an instance is either installed
correctly or removed).

A corollary of this is that a master-failover operation with both
masters alive needs to happen while no operations are running, and
therefore no locks are held.

All the locks will be represented by objects (like
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more than one locks at the same
time. Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and fail.


The Locks
+++++++++

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and
the node locks before the config lock. Locks will need to be acquired at
the same time for multiple instances and nodes, and internal ordering
will be dealt with within the locking library, which, for simplicity,
will just use alphabetical order.
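
As an illustration of this ordering rule, here is a sketch of acquiring
a whole set of locks in a deadlock-safe order; it uses plain
``threading`` locks and hypothetical per-object registries, not the
actual ``lockings`` module::

  import threading

  # Hypothetical registries of per-object locks, created at init time
  INSTANCE_LOCKS = {}   # instance name -> threading.Lock()
  NODE_LOCKS = {}       # node name -> threading.Lock()
  CONFIG_LOCK = threading.Lock()

  def acquire_set(instance_names, node_names, want_config=False):
    """Acquire instance locks, then node locks, then the config lock.

    Within each level the names are sorted, so two callers asking for
    overlapping sets always lock them in the same order and cannot
    deadlock against each other.
    """
    acquired = []
    try:
      for name in sorted(instance_names):
        INSTANCE_LOCKS[name].acquire()
        acquired.append(INSTANCE_LOCKS[name])
      for name in sorted(node_names):
        NODE_LOCKS[name].acquire()
        acquired.append(NODE_LOCKS[name])
      if want_config:
        CONFIG_LOCK.acquire()
        acquired.append(CONFIG_LOCK)
      return acquired
    except Exception:
      for lock in reversed(acquired):
        lock.release()
      raise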

Each lock has the following three possible statuses:

- unlocked (anyone can grab the lock)
- shared (anyone can grab/have the lock but only in shared mode)
- exclusive (no one else can grab/have the lock)

Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each time
we split a lock into smaller ones we'll create a "metalock", which will
depend on those sub-locks and live for the time necessary for all the
code to convert (or forever, in some conditions). When a metalock exists
all converted code must acquire it in shared mode, so it can run
concurrently, but still be exclusive with old code, which acquires it
exclusively.

In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.

We might also want to devise more metalocks (eg. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what they need without specifying it explicitly.

In the future things like the node locks could become metalocks, should
we decide to split them into an even more fine grained approach, but
this will probably be only after the first 2.0 version has been
released.

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be
added to the list. The relevant code will need to inform the locking
library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively already, and thus deals
with metalocks exactly as normal code acquiring those locks. Any
operation queuing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available
locks, rather than making them just blindly queue for acquiring them.
The inherent risk, though, is that any code using the first operation,
or setting a timeout for the second one, is susceptible to starvation
and thus may never be able to get the required locks and complete
certain tasks. Considering this, providing/using these operations should
not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level. In the future we may want to split logical units in
independent "tasklets" with their own locking requirements. A different
design doc (or mini design doc) will cover the move from Logical Units
to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoids possible deadlocks. Of
course extra care must be taken not to leave, if possible, locked
structures in an unusable state. Note that with Python 2.5 a simpler
syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

In order to avoid this extra indentation and code changes everywhere in
the Logical Units code, we decided to allow LUs to declare locks, and
then execute their code with their locks acquired. In the new world LUs
are called like this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation
on how locks are supposed to be declared.
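
As a hedged illustration of this calling convention, a Logical Unit
could declare its locks roughly as follows; the attribute and helper
names are modelled on the description above and are not guaranteed to
match the final implementation::

  class LUStartupInstance:
    """Sketch of an LU declaring its locking needs up front."""

    def __init__(self, instance_name):
      self.instance_name = instance_name
      self.needed_locks = {}

    def ExpandNames(self):
      # Expand the user-supplied name into the internal resource name and
      # declare the locks this LU needs; the processor acquires them later
      self.needed_locks = {
        "instance": [self.instance_name],
        "node": [],            # filled in DeclareLocks, see below
      }

    def DeclareLocks(self, level):
      # Late declaration: we only know which nodes to lock once the
      # instance lock is held and its configuration can be read
      if level == "node":
        self.needed_locks["node"] = self._GetInstanceNodes()

    def _GetInstanceNodes(self):
      return []  # placeholder; would read the locked instance's config

    def CheckPrereq(self):
      pass  # runs with the declared locks held

    def Exec(self):
      pass  # runs with the declared locks held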

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock
everything" lock to the new system, though, needs to be carefully
scrutinised to be sure it is really acquiring all the necessary locks,
and none has been overlooked or forgotten.

The code can contain other locks outside of this library, to synchronise
other threaded code (eg for the job queue) but in general these should
be leaf locks or carefully structured non-leaf ones, to avoid deadlock
race conditions.


.. _jqueue-original-design:

Job Queue
~~~~~~~~~

Granular locking is not enough to speed up operations, we also need a
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes`` which are the basic
element of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
will generate N opcodes of type replace disks).


Job execution ("Life of a Ganeti job")
++++++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned to
   the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job
   has to wait and the first worker finishing its work will grab it.
   Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.

.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and we
   do a master failover, the jobs are still visible on the new master
   (though marked as failed).

Failures to replicate a job to other nodes will be only flagged as
errors in the master daemon log if more than half of the nodes failed,
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.

Future improvements will look into checking the consistency of the job
list and jobs themselves at master daemon startup.


Job storage
+++++++++++

Jobs are stored in the filesystem as individual files, serialized
using JSON (standard serialization mechanism in Ganeti).

The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a temporary
file and subsequent renaming. Except for log messages, every change in a
job is stored and replicated to other nodes. A sketch of this
atomic-write pattern follows the directory layout below.

::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)
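
The atomic-write rule mentioned above could look like the following
sketch (the helper is illustrative, not the actual job queue code)::

  import json
  import os
  import tempfile

  def write_job_file(queue_dir, job_id, job_data):
    """Atomically replace a job file by writing to a temporary name.

    Readers always see either the old or the new complete content,
    because the final rename is atomic on POSIX filesystems.
    """
    final_path = os.path.join(queue_dir, "job-%d" % job_id)
    fd, tmp_path = tempfile.mkstemp(dir=queue_dir)
    try:
      with os.fdopen(fd, "w") as tmp_file:
        json.dump(job_data, tmp_file)
        tmp_file.flush()
        os.fsync(tmp_file.fileno())
      os.rename(tmp_path, final_path)
    except Exception:
      os.unlink(tmp_path)
      raise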


Locking
+++++++

Locking in the job queue is a complicated topic. The queue is accessed
from more than one thread and must be thread-safe. For simplicity, a
single lock is used for the whole job queue.

A more detailed description can be found in doc/locking.rst.


Internal RPC
++++++++++++

RPC calls available between Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job changed is defined by the fields passed and
  the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if
  the job has not been canceled or finished.
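
A sketch of how a client could drive these operations, reusing the
``luxi_call`` helper sketched in the LUXI protocol section; the opcode
name, fields and status strings are purely illustrative::

  SOCKET = "/var/run/ganeti/master.sock"  # assumed path

  # Submit a one-opcode job; the call returns the new job identifier
  job_id = luxi_call(SOCKET, "SubmitJob",
                     [[{"OP_ID": "OP_INSTANCE_STARTUP",
                        "instance_name": "web1"}]])

  # A real client would use WaitForJobChange to block on the server side;
  # this simple polling loop only shows the data flow
  while True:
    (status,) = luxi_call(SOCKET, "QueryJobs", [[job_id], ["status"]])[0]
    if status in ("success", "error", "canceled"):
      break

  # Once finished (or canceled), the job can be moved to the archive
  luxi_call(SOCKET, "ArchiveJob", [job_id])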


Job and opcode status
+++++++++++++++++++++

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but did not yet start.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to
the Error status once the master starts again.


History
+++++++

Archived jobs are kept in a separate directory,
``/var/lib/ganeti/queue/archive/``. This is done in order to speed up
the queue handling: by default, the jobs in the archive are not
touched by any functions. Only the current (unarchived) jobs are
parsed, loaded, and verified (if implemented) by the master daemon.


Ganeti updates
++++++++++++++

The queue has to be completely empty for Ganeti updates with changes
in the job queue structure. In order to allow this, there will be a
way to prevent new jobs entering the queue.


Object parameters
~~~~~~~~~~~~~~~~~

Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model

#. node parameters

#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances

#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of VCPUs,
   or the size of memory

#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)


The following definitions for instance parameters will be used below:

:hypervisor parameter:
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for :term:`PVM` which makes no sense for :term:`HVM`).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence at all the
  hypervisor behaviour.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the prefix
  "BE\_" and the whole list of parameters will exist in the set
  "BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of a
  LV, or the MAC of a NIC)

As a general rule, for all kinds of parameters, "None" (or in
JSON-speak, "nil") will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.

Cluster parameters
++++++++++++++++++

Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default value for backend parameters

Node parameters
+++++++++++++++

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

There are three new node flags, described in a separate section "node
flags" below.

Instance parameters
+++++++++++++++++++

As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.

The "hvparams" and "beparams" are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen to be as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. ``instance.hvparams.vnc_console_port`` instead of using both
``instance.hvparams.hvm_vnc_console_port`` and
``instance.hvparams.kvm_vnc_console_port``.

There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have only globally-per-instance such
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
exported as being of the same type).

Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:

- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)


Parameter validation
++++++++++++++++++++

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.

The hypervisor support implementation API will be extended with the
following features:

:PARAMETERS: class-level attribute holding the list of valid parameters
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class
  method that can be called from within master code (i.e. cmdlib) and
  should be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from backend.py code, and as such can
  make node-specific checks (e.g. kernel_path checking)
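
A hedged sketch of how a hypervisor support class could provide these
two hooks; the class itself and the error handling are assumptions, not
the real hypervisor API::

  import os

  class ExampleHypervisor:
    """Illustrative hypervisor support class with the validation hooks."""

    PARAMETERS = ["kernel_path", "initrd_path", "root_path"]

    @classmethod
    def CheckParamSyntax(cls, hvparams):
      # Name-level check, safe to run on the master (no node access needed)
      unknown = set(hvparams) - set(cls.PARAMETERS)
      if unknown:
        raise ValueError("Unknown hypervisor parameters: %s" %
                         ", ".join(sorted(unknown)))

    def ValidateParameters(self, hvparams):
      # Value-level check, run on the target node, so it may inspect
      # node-local state such as the filesystem
      kernel = hvparams.get("kernel_path")
      if kernel and not os.path.isfile(kernel):
        raise ValueError("Kernel %s not found on this node" % kernel)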

Default value application
+++++++++++++++++++++++++

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:

- ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on
  instance's hvparams and cluster's ``hvparams[instance.hypervisor]``

- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the
RpcRunner when sending an instance for activation/stop, and the sent
instance hvparams/beparams will have the final value (noded code doesn't
know about defaults).

LU code will need to self-call the transformation, if needed.
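
In essence this is a dictionary merge in which instance values override
the cluster defaults; a minimal sketch (not the actual Cluster class)::

  def fill_dict(defaults, custom):
    """Return the defaults dict updated with the object's own values."""
    filled = defaults.copy()
    filled.update(custom)
    return filled

  # Hypothetical data, mirroring Cluster.FillHV(instance) described above
  cluster_hvparams = {"xen-pvm": {"kernel_path": "/boot/vmlinuz-2.6-xenU",
                                  "initrd_path": ""}}
  instance_hvparams = {"kernel_path": "/boot/custom-kernel"}

  filled = fill_dict(cluster_hvparams["xen-pvm"], instance_hvparams)
  # filled == {"kernel_path": "/boot/custom-kernel", "initrd_path": ""}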
995 |
|
996 |
Opcode changes |
997 |
++++++++++++++ |
998 |
|
999 |
The parameter changes will have impact on the OpCodes, especially on |
1000 |
the following ones: |
1001 |
|
1002 |
- ``OpInstanceCreate``, where the new hv and be parameters will be sent |
1003 |
as dictionaries; note that all hv and be parameters are now optional, |
1004 |
as the values can be instead taken from the cluster |
1005 |
- ``OpInstanceQuery``, where we have to be able to query these new |
1006 |
parameters; the syntax for names will be ``hvparam/$NAME`` and |
1007 |
``beparam/$NAME`` for querying an individual parameter out of one |
1008 |
dictionary, and ``hvparams``, respectively ``beparams``, for the whole |
1009 |
dictionaries |
1010 |
- ``OpModifyInstance``, where the the modified parameters are sent as |
1011 |
dictionaries |
1012 |
|
1013 |
Additionally, we will need new OpCodes to modify the cluster-level |
1014 |
defaults for the be/hv sets of parameters. |
1015 |
|
1016 |
Caveats |
1017 |
+++++++ |
1018 |
|
1019 |
One problem that might appear is that our classification is not |
1020 |
complete or not good enough, and we'll need to change this model. As |
1021 |
the last resort, we will need to rollback and keep 1.2 style. |
1022 |
|
1023 |
Another problem is that classification of one parameter is unclear |
1024 |
(e.g. ``network_port``, is this BE or HV?); in this case we'll take |
1025 |
the risk of having to move parameters later between classes. |
1026 |
|
1027 |
Security |
1028 |
++++++++ |
1029 |
|
1030 |
The only security issue that we foresee is if some new parameters will |
1031 |
have sensitive value. If so, we will need to have a way to export the |
1032 |
config data while purging the sensitive value. |
1033 |
|
1034 |
E.g. for the drbd shared secrets, we could export these with the |
1035 |
values replaced by an empty string. |
1036 |
|
1037 |
Node flags |
1038 |
~~~~~~~~~~ |
1039 |
|
1040 |
Ganeti 2.0 adds three node flags that change the way nodes are handled |
1041 |
within Ganeti and the related infrastructure (iallocator interaction, |
1042 |
RAPI data export). |
1043 |
|
1044 |
*master candidate* flag |
1045 |
+++++++++++++++++++++++ |
1046 |
|
1047 |
Ganeti 2.0 allows more scalability in operation by introducing |
1048 |
parallelization. However, a new bottleneck is reached that is the |
1049 |
synchronization and replication of cluster configuration to all nodes |
1050 |
in the cluster. |
1051 |
|
1052 |
This breaks scalability as the speed of the replication decreases |
1053 |
roughly with the size of the nodes in the cluster. The goal of the |
1054 |
master candidate flag is to change this O(n) into O(1) with respect to |
1055 |
job and configuration data propagation. |
1056 |
|
1057 |
Only nodes having this flag set (let's call this set of nodes the |
1058 |
*candidate pool*) will have jobs and configuration data replicated. |
1059 |
|
1060 |
The cluster will have a new parameter (runtime changeable) called |
1061 |
``candidate_pool_size`` which represents the number of candidates the |
1062 |
cluster tries to maintain (preferably automatically). |
1063 |
|
1064 |
This will impact the cluster operations as follows: |
1065 |
|
1066 |
- jobs and config data will be replicated only to a fixed set of nodes |
1067 |
- master fail-over will only be possible to a node in the candidate pool |
1068 |
- cluster verify needs changing to account for these two roles |
1069 |
- external scripts will no longer have access to the configuration |
1070 |
file (this is not recommended anyway) |
1071 |
|
1072 |
|
1073 |
The caveats of this change are: |
1074 |
|
1075 |
- if all candidates are lost (completely), cluster configuration is |
1076 |
lost (but it should be backed up external to the cluster anyway) |
1077 |
|
1078 |
- failed nodes which are candidate must be dealt with properly, so |
1079 |
that we don't lose too many candidates at the same time; this will be |
1080 |
reported in cluster verify |
1081 |
|
1082 |
- the 'all equal' concept of ganeti is no longer true |
1083 |
|
1084 |
- the partial distribution of config data means that all nodes will |
1085 |
have to revert to ssconf files for master info (as in 1.2) |
1086 |
|
1087 |
Advantages: |
1088 |
|
1089 |
- speed on a 100+ nodes simulated cluster is greatly enhanced, even |
1090 |
for a simple operation; ``gnt-instance remove`` on a diskless instance |
1091 |
remove goes from ~9seconds to ~2 seconds |
1092 |
|
1093 |
- node failure of non-candidates will be less impacting on the cluster |
1094 |
|
1095 |
The default value for the candidate pool size will be set to 10 but |
1096 |
this can be changed at cluster creation and modified any time later. |
1097 |
|
1098 |
Testing on simulated big clusters with sequential and parallel jobs |
1099 |
show that this value (10) is a sweet-spot from performance and load |
1100 |
point of view. |
1101 |
|
1102 |
*offline* flag |
1103 |
++++++++++++++ |
1104 |
|
1105 |
In order to support better the situation in which nodes are offline |
1106 |
(e.g. for repair) without altering the cluster configuration, Ganeti |
1107 |
needs to be told and needs to properly handle this state for nodes. |
1108 |
|
1109 |
This will result in simpler procedures, and less mistakes, when the |
1110 |
amount of node failures is high on an absolute scale (either due to |
1111 |
high failure rate or simply big clusters). |
1112 |
|
1113 |
Nodes having this attribute set will not be contacted for inter-node |
1114 |
RPC calls, will not be master candidates, and will not be able to host |
1115 |
instances as primaries. |
1116 |
|
1117 |
Setting this attribute on a node: |
1118 |
|
1119 |
- will not be allowed if the node is the master |
1120 |
- will not be allowed if the node has primary instances |
1121 |
- will cause the node to be demoted from the master candidate role (if |
1122 |
it was), possibly causing another node to be promoted to that role |
1123 |
|
1124 |
This attribute will impact the cluster operations as follows: |
1125 |
|
1126 |
- querying these nodes for anything will fail instantly in the RPC |
1127 |
library, with a specific RPC error (RpcResult.offline == True) |
1128 |
|
1129 |
- they will be listed in the Other section of cluster verify |
1130 |
|
1131 |
The code is changed in the following ways: |
1132 |
|
1133 |
- RPC calls were be converted to skip such nodes: |
1134 |
|
1135 |
- RpcRunner-instance-based RPC calls are easy to convert |
1136 |
|
1137 |
- static/classmethod RPC calls are harder to convert, and were left |
1138 |
alone |
1139 |
|
1140 |
- the RPC results were unified so that this new result state (offline) |
1141 |
can be differentiated |
1142 |
|
1143 |
- master voting still queries in repair nodes, as we need to ensure |
1144 |
consistency in case the (wrong) masters have old data, and nodes have |
1145 |
come back from repairs |
1146 |
|
1147 |
Caveats: |
1148 |
|
1149 |
- some operation semantics are less clear (e.g. what to do on instance |
1150 |
start with offline secondary?); for now, these will just fail as if |
1151 |
the flag is not set (but faster) |
1152 |
- 2-node cluster with one node offline needs manual startup of the |
1153 |
master with a special flag to skip voting (as the master can't get a |
1154 |
quorum there) |
1155 |
|
1156 |
One of the advantages of implementing this flag is that it will allow |
1157 |
in the future automation tools to automatically put the node in |
1158 |
repairs and recover from this state, and the code (should/will) handle |
1159 |
this much better than just timing out. So, future possible |
1160 |
improvements (for later versions): |
1161 |
|
1162 |
- watcher will detect nodes which fail RPC calls, will attempt to ssh |
1163 |
to them, if failure will put them offline |
1164 |
- watcher will try to ssh and query the offline nodes, if successful |
1165 |
will take them off the repair list |
1166 |
|
1167 |
Alternatives considered: The RPC call model in 2.0 is, by default, |
1168 |
much nicer - errors are logged in the background, and job/opcode |
1169 |
execution is clearer, so we could simply not introduce this. However, |
1170 |
having this state will make both the codepaths clearer (offline |
1171 |
vs. temporary failure) and the operational model (it's not a node with |
1172 |
errors, but an offline node). |
1173 |
|
1174 |
|
1175 |
*drained* flag |
1176 |
++++++++++++++ |
1177 |
|
1178 |
Due to parallel execution of jobs in Ganeti 2.0, we could have the |
1179 |
following situation: |
1180 |
|
1181 |
- gnt-node migrate + failover is run |
1182 |
- gnt-node evacuate is run, which schedules a long-running 6-opcode |
1183 |
job for the node |
1184 |
- partway through, a new job comes in that runs an iallocator script, |
1185 |
which finds the above node as empty and a very good candidate |
1186 |
- gnt-node evacuate has finished, but now it has to be run again, to |
1187 |
clean the above instance(s) |
1188 |
|
1189 |
In order to prevent this situation, and to be able to get nodes into |
1190 |
proper offline status easily, a new *drained* flag was added to the |
1191 |
nodes. |
1192 |
|
1193 |
This flag (which actually means "is being, or was drained, and is |
1194 |
expected to go offline"), will prevent allocations on the node, but |
1195 |
otherwise all other operations (start/stop instance, query, etc.) are |
1196 |
working without any restrictions. |
1197 |
|
1198 |
Interaction between flags |
1199 |
+++++++++++++++++++++++++ |
1200 |
|
1201 |
While these flags are implemented as separate flags, they are |
1202 |
mutually-exclusive and are acting together with the master node role |
1203 |
as a single *node status* value. In other words, a flag is only in one |
1204 |
of these roles at a given time. The lack of any of these flags denote |
1205 |
a regular node. |
1206 |
|
1207 |
The current node status is visible in the ``gnt-cluster verify`` |
1208 |
output, and the individual flags can be examined via separate flags in |
1209 |
the ``gnt-node list`` output. |
1210 |
|
1211 |
These new flags will be exported in both the iallocator input message
and via RAPI, see the respective man pages for the exact names.

Feature changes
---------------

The main feature-level changes will be:

- a number of disk related changes
- removal of the fixed two-disk, one-NIC per instance limitation

Disk handling changes
~~~~~~~~~~~~~~~~~~~~~

The storage options available in Ganeti 1.x were introduced based on
then-current software (first DRBD 0.7, then later DRBD 8) and the
estimated usage patterns. However, experience has since shown that
some assumptions made initially are not true and that more flexibility
is needed.

One main assumption made was that disk failures should be treated as
'rare' events, and that each of them needs to be manually handled in
order to ensure data safety; however, both these assumptions are
false:

- disk failures can be a common occurrence, based on usage patterns or
  cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal,
but our goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), such that removing the support for the ``remote_raid1``
  template and focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a
  cluster where the DRBD namespace is controlled by Ganeti; switching
  to a static assignment (done at either instance creation time or
  change secondary time) will change the disk activation time from
  O(n) to O(1), which on big clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing
  file-based storage

Additionally, a number of smaller enhancements are also planned:

- support a variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base
design changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical disks

- add support for DRBD8 authentication at handshake time in order to
  ensure each device connects to the correct peer

- remove the restriction of failing over only to the secondary, which
  imposes very strict rules on cluster allocation

DRBD minor allocation
+++++++++++++++++++++

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices to see whether it finds one
that matches our parameters and is already in the desired state.
Since this needs external commands to be run, it is very slow when
more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that device.

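The minor computation itself is then a simple scan over the minors
already recorded in the configuration for that node, with no external
commands involved. A minimal Python sketch of the idea, with
illustrative names only (not the actual Ganeti code)::

  def find_free_minor(used_minors):
      """Return the smallest DRBD minor not in the given set.

      used_minors is the set of minors that the configuration says
      should exist on the node; no external commands are involved.
      """
      minor = 0
      while minor in used_minors:
          minor += 1
      return minor

  # minors 0, 1 and 3 are already recorded for this node
  assert find_free_minor({0, 1, 3}) == 2
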
At device activation, if the minor is already in use, we check whether
it has our parameters; if not, we destroy the device (if possible,
otherwise we abort) and start it with our own parameters.

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created DRBD minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.

Removal of obsolete device types (MD, DRBD7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
DRBD7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in MD+DRBD mode, we cannot do live failover (not even if we
had MD+DRBD8).

File-based storage support
++++++++++++++++++++++++++

Using files instead of logical volumes for instance storage would
allow us to get rid of the hard requirement for volume groups on
testing clusters, and it would also allow the use of SAN storage for
live failover, taking advantage of this storage solution.

Better LVM allocation
+++++++++++++++++++++

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks; however, it introduces the problem that an instance
could end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate all the storage of one instance as close
together as possible. We will still allow the logical volumes to spill
over to additional disks as needed.

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or when replacing individual disks, it is not easy to compute the
current disk map, so we will not attempt the clustering.

DRBD8 peer authentication at handshake
++++++++++++++++++++++++++++++++++++++

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this more to prevent connecting to the wrong
peer than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.


LVM self-repair (optional)
++++++++++++++++++++++++++

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` operation, using the
following method (a small sketch of this logic is shown after the
list):

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected

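A minimal Python sketch of this decision logic, keyed on the current
and previous consistency status (illustrative names only, not the
actual Ganeti code)::

  def vg_selfrepair_action(inconsistent_nodes, previous_ok):
      """Return (new_status_ok, node_to_repair_or_None) for one round."""
      if not inconsistent_nodes:
          # everything consistent; just remember the good status
          return True, None
      if not previous_ok:
          # the problem was already known; do nothing this round
          return False, None
      if len(inconsistent_nodes) > 1:
          # more than one node inconsistent: too risky to automate
          return False, None
      # exactly one node just became inconsistent: repair it
      # (run ``vgreduce --removemissing`` there and log the event)
      return False, inconsistent_nodes[0]

  # example: node2 has just lost a physical disk
  print(vg_selfrepair_action(["node2"], previous_ok=True))
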
Failover to any node
++++++++++++++++++++

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this is reduced
  to the need to reserve memory anywhere on the cluster

- the need to first fail over and then replace the secondary for an
  instance: with failover-to-any, we can fail over directly to another
  node, which also replaces the disks in the same step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed; a small sketch of this choice follows
the list below:

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could
  be possible (in theory, but this is not a design goal for 2.0)

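The constraint above maps to a trivial helper; the following Python
sketch uses illustrative names only and assumes the user's choice of
P2 is handled elsewhere::

  def choose_s2(p1, s1, failed_node=None):
      """Pick the new secondary for a failover-to-any operation."""
      if failed_node == p1:
          return s1      # S1 must become S2; live migration impossible
      if failed_node == s1:
          return p1      # P1 must become S2
      # neither node has failed: both choices are valid; keeping P1 as
      # S2 also keeps migration-to-any possible
      return p1

  print(choose_s2("node1", "node2", failed_node="node1"))  # -> node2
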
The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary)
  has valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no
  data, it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary
  role (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove the old data
  on the old node that has not been chosen for S2

Caveats: during the P2-S2 sync, a (non-transient) network error will
cause I/O errors on the instance, so (if a longer instance downtime is
acceptable) we can postpone the restart of the instance until the
resync is done. However, disk I/O errors on S2 will cause data loss,
since we don't have a good copy of the data anymore, so in this case
waiting for the sync to complete is not an option. As such, it is
recommended that this feature is used only in conjunction with proper
disk monitoring.


Live migration note: while failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
+++++++

The dynamic device model, while more complex, has an advantage: it
will not mistakenly reuse the DRBD device of another instance, since
it always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.


Variable number of disks/NICs per instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Variable number of disks
++++++++++++++++++++++++

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need a fully flexible disk definition. This
has less impact than it might seem at first sight: only the instance
creation code has a hard-coded number of disks, not the disk handling
code. The block device handling and most of the instance handling code
are already working with "the instance's disks" as opposed to "the two
disks of the instance", but some pieces are not (e.g. import/export)
and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at instance
creation, and to be able to toggle a disk between read-only and
read-write afterward.

Variable number of NICs
+++++++++++++++++++++++

Similar to the disk change, we need to allow multiple network
interfaces per instance. This will affect the internal code (some
functions will have to stop assuming that ``instance.nics`` is a list
of length one), the OS API, which currently can export/import only one
instance, and the command line interface.

Interface changes
-----------------

There are two areas of interface changes: API-level changes (the OS
interface and the RAPI interface) and the command line interface
changes.

OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti
1.2. The interface is composed of a series of scripts which get called
with certain parameters to perform OS-dependent operations on the
cluster. The current scripts are:

create
  called when a new instance is added to the cluster
export
  called to export an instance disk to a stream
import
  called to import from a stream to a new instance
rename
  called to perform the OS-specific operations necessary for renaming
  an instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example, they accept exactly one block and one swap device to operate
on, rather than any number of generic block devices; they blindly
assume that an instance will have just one network interface; and they
cannot be configured to optimise the instance for a particular
hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors, and a
non-fixed number of network interfaces and disks, the OS interface
needs to change to transmit the appropriate amount of information
about an instance to its managing operating system when operating on
it. Moreover, since some old assumptions commonly used in OS scripts
are no longer valid, we need to re-establish a common understanding of
what can and cannot be assumed about the Ganeti environment.


When designing the new OS API our priorities are:

- ease of use
- future extensibility
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to
support an OS, and make it easy to share code between them by making
their input uniform. We will also leave the current script structure
unchanged, as far as we can, and make a few of the scripts (import,
export and rename) optional. Most information will be passed to the
scripts through environment variables, for ease of access and at the
same time ease of using only the information a script needs.


The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs
to support the following functionality, through scripts:

create:
  used to create a new instance running that OS. This script should
  prepare the block devices, and install them so that the new OS can
  boot under the specified hypervisor.
export (optional):
  used to export an installed instance using the given OS to a format
  which can be used to import it back into a new instance.
import (optional):
  used to import an exported instance into a new one. This script is
  similar to create, but the new instance should have the content of
  the export, rather than contain a pristine installation.
rename (optional):
  used to perform the internal OS-specific operations needed to rename
  an instance.

If any optional script is not implemented, Ganeti will refuse to
perform the given operation on instances using the non-implementing
OS. Of course the create script is mandatory, and it doesn't make
sense to support either the export or the import operation but not
both.

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for
1.2 and the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in
  2.0 we'll use environment variables, as there will be a lot more
  information and not all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device
  the instance has, and import scripts once for every exported disk.
  Imported instances will be forced to have a number of disks greater
  than or equal to that of the export.
- Some scripts are not compulsory: if such a script is missing, the
  relevant operations will be forbidden for instances of that OS. This
  makes it easier to distinguish between unsupported operations and
  no-op ones (if any).


Input
_____

Rather than using command line flags, as they do now, scripts will
accept inputs from environment variables. We expect the following
input values (a usage sketch follows the list):

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  this is used so that in the future we could have OSes supporting
  multiple versions and thus Ganeti can send the proper version in
  this parameter
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm',
  'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  The number of NICs this instance will have
DISK_<N>_PATH
  Path to the Nth disk.
DISK_<N>_ACCESS
  W if read/write, R if read only. OS scripts are not supposed to
  touch read-only disks, but they are still passed so that the scripts
  know about them.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide',
  'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop'
  or 'file:blktap'
NIC_<N>_MAC
  MAC address for the Nth network interface
NIC_<N>_IP
  IP address for the Nth network interface, if available
NIC_<N>_BRIDGE
  Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes.
  Currently the only valid values are 0 and 1.

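For illustration, a create script written in Python could consume
these variables roughly as follows (a minimal sketch; the variable set
above is still subject to change)::

  import os

  instance = os.environ["INSTANCE_NAME"]
  hypervisor = os.environ["HYPERVISOR"]
  disk_count = int(os.environ["DISK_COUNT"])

  for idx in range(disk_count):
      path = os.environ["DISK_%d_PATH" % idx]
      if os.environ["DISK_%d_ACCESS" % idx] == "R":
          continue  # read-only disks must not be touched
      # here the script would partition/format the device at 'path'
      # and install the OS so that it boots under 'hypervisor'
      print("would install %s on %s (hypervisor %s)"
            % (instance, path, hypervisor))
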
These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
:manpage:`ganeti-os-api` man page. All these variables will be
available to all scripts.

Some scripts will need some more information to work. These will have
per-script variables, such as for example (a sketch of an export
script using them follows the list):

OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The
  data must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The
  data must be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

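For illustration, an export script could combine the shared and
per-script variables like this (a sketch only; it simply dumps the
snapshot device to stdout)::

  import os
  import subprocess
  import sys

  device = os.environ["EXPORT_DEVICE"]  # snapshot of the disk to export
  index = os.environ["EXPORT_INDEX"]    # which disk of the instance

  # user-targeted information goes to stderr, the data itself to
  # stdout; Ganeti compresses the stream, so no compression here
  sys.stderr.write("exporting disk %s from %s\n" % (index, device))
  subprocess.check_call(["dd", "if=%s" % device, "bs=1M"])
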
(Rationale for INSTANCE_NAME as an environment variable: the instance
name is always needed and we could pass it on the command line. On the
other hand, though, this would force scripts to both access the
environment and parse the command line, so we'll move it for
uniformity.)


Output/Behaviour
________________

As discussed, scripts should only send user-targeted information to
stderr. The create and import scripts are supposed to format/initialise
the given block devices and install the correct instance data. The
export script is supposed to export instance data to stdout in a format
understandable by the import script. The data will be compressed by
Ganeti, so no compression should be done. The rename script should only
modify the instance's knowledge of what its name is.

Other declarative style features
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing a list of numbers matching the
version(s) of the API they implement. Ganeti itself will always be
compatible with one version of the API and may maintain backwards
compatibility if it's feasible to do so. The numbers are one per line,
so an OS supporting both version 5 and version 20 will have a file
containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

In addition to that, an OS will be able to declare that it supports
only a subset of the Ganeti hypervisors, by listing them in the
'hypervisors' file.

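As an illustration, for an OS supporting API versions 5 and 20 and
only the Xen hypervisors, the 'ganeti_api_version' file would
contain::

  5
  20

and the 'hypervisors' file::

  xen-pvm
  xen-hvm
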
Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just
dumps all disks and restores them. This can save work as most systems
will just do this, while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there
will be enough space to store the information we need. If we discover
that this is not the case, we may want to go to a more complex API,
such as storing that information on the filesystem and providing the
OS script with the path to a file where it is encoded in some format.



Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti remote API (RAPI) was designed and deployed with the
Ganeti 1.2.5 release. That version provides read-only access to the
cluster state. A fully functional read-write API demands significant
internal changes, which will be implemented in version 2.0.

We decided to implement the Ganeti RAPI in a RESTful way, which is
aligned with the key features we are looking for: it is a simple,
stateless, scalable and extensible paradigm of API implementation. As
transport it uses HTTP over SSL, and we are implementing it with JSON
encoding, but in a way that makes it possible to extend it and provide
any other encoding.

Design
++++++

The Ganeti RAPI is implemented as an independent daemon, running on
the same node and with the same permission level as the Ganeti master
daemon. Communication is done through the LUXI library to the master
daemon. In order to keep communication asynchronous, RAPI processes
two types of client requests:

- queries: the server is able to answer immediately
- job submission: some time is required for a useful response

In the query case, the requested data is sent back to the client in
the HTTP response body. Typical examples of queries would be: list of
nodes, instances, cluster info, etc.

In the case of job submission, the client receives a job ID, an
identifier which allows one to query the job's progress in the job
queue (see `Job Queue`_).

Internally, each exported object has a version identifier, which is
used as a state identifier in the HTTP header E-Tag field for
requests/responses to avoid race conditions.


Resource representation
+++++++++++++++++++++++

The key difference of using REST instead of other API styles is that
REST requires separation of services via resources with unique
URIs. Each of them should have a limited amount of state and support
the standard HTTP methods: GET, POST, DELETE, PUT.

For example, in Ganeti's case we can have a set of URIs such as:

- ``/{clustername}/instances``
- ``/{clustername}/instances/{instancename}``
- ``/{clustername}/instances/{instancename}/tag``
- ``/{clustername}/tag``

A GET request to ``/{clustername}/instances`` will return the list of
instances, a POST to ``/{clustername}/instances`` should create a new
instance, a DELETE of ``/{clustername}/instances/{instancename}``
should delete the instance, and a GET of ``/{clustername}/tag`` should
return the cluster tags.

Each resource URI will have a version prefix. The resource IDs are to
be determined.

The internal encoding might be JSON, XML, or any other. The JSON
encoding fits the Ganeti RAPI needs nicely. The client can request a
specific representation via the Accept field in the HTTP header.

REST uses HTTP as its transport and application protocol for resource
access. The set of possible responses is a subset of standard HTTP
responses.

The stateless model provides additional reliability and transparency
to operations (e.g. only one request needs to be analyzed to
understand the in-progress operation, not a sequence of multiple
requests/responses).


Security
++++++++

With the write functionality, security becomes a much bigger issue.
The Ganeti RAPI uses basic HTTP authentication on top of an
SSL-secured connection to grant access to an exported resource. The
password is stored locally in an Apache-style ``.htpasswd`` file. Only
one level of privileges is supported.

Caveats
+++++++

The model detailed above for job submission requires the client to
poll periodically for updates to the job; an alternative would be to
allow the client to request a callback, or a 'wait for updates' call.

The callback model was not chosen due to the following two issues:

- callbacks would require a new model of allowed callback URLs,
  together with a method of managing these
- callbacks only work when the client and the master are in the same
  security domain, and they fail in the other cases (e.g. when there
  is a firewall between the client and the RAPI daemon that only
  allows client-to-RAPI calls, which is usual in DMZ cases)

The 'wait for updates' method is not suited to the HTTP protocol,
where requests are supposed to be short-lived.

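Under the polling model, a client would therefore look roughly like
the sketch below. This is illustrative only: the job resource path and
the reply fields are placeholders, since the actual resource IDs and
formats are still to be determined::

  import json
  import time
  from urllib.request import urlopen

  def wait_for_job(base_url, job_id, poll_interval=5):
      """Poll a (hypothetical) job resource until it reaches a final state."""
      while True:
          with urlopen("%s/jobs/%s" % (base_url, job_id)) as reply:
              job = json.loads(reply.read())
          if job.get("status") in ("success", "error"):
              return job
          time.sleep(poll_interval)
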
Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled:

- extend and modify command line syntax to support new features
- ensure consistent patterns in command line arguments to reduce
  cognitive load

The design changes that require these modifications are, in no
particular order:

- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance,
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

As such, there are several areas of Ganeti where the command line
arguments will change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.

Notes about device removal/addition
+++++++++++++++++++++++++++++++++++

To avoid problems with device location changes (e.g. the second
network interface of the instance becoming the first or third and the
like) the list of network/disk devices is treated as a stack, i.e.
devices can only be added/removed at the end of the list of devices of
each class (disk or network) for each instance.

gnt-instance commands
+++++++++++++++++++++

The commands for gnt-instance will be modified and extended to allow
for the new functionality:

- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is::

  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes; an example follows the list):

:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected to.
  Accepts either a valid bridge name (the specified bridge must exist
  on the node(s)) as string or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.

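Example (the instance name and MAC value are illustrative only)::

  gnt-instance modify --net 0:mac=aa:00:00:11:22:33,bridge=auto test-instance
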
Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is::

  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes; an example follows the list):

:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude
  suffix (M for mebibytes, G for gibibytes). Also accepts the string
  'auto' in which case the default disk size will be used. If the size
  option is not specified, 'auto' is assumed. This option is not valid
  for all disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - *w*: read/write access to the disk device or
  - *r*: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.

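Example (illustrative values): an instance with a 10 gibibyte
read/write first disk and a read-only second disk of default size
could be requested with::

  --disk 0:size=10G --disk 1:size=auto,access=r
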
Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended to the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.

Example::

  gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance modify. The
same device specific options as for adding devices are used. Instead
of a device number and further device options, only the magic string
remove is specified. It will always remove the last device in the list
of devices of this type for the instance specified, e.g. if the
instance has disk devices 0, 1, 2 and 3, the disk device number 3 will
be removed.

Example::

  gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.

Example::

  gnt-instance modify --disk 2:access=r test-instance

Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the ``--hypervisor`` option. The generic syntax of the hypervisor
option is as follows::

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm

:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string

The hypervisor option for an instance can be set at instance creation
time via the ``gnt-instance add`` command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.

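For illustration, the hypervisor settings shown in the modify example
below would be given at creation time, with the hypervisor name
included (``xen-hvm`` is just an example here), as::

  --hypervisor xen-hvm:cdrom=/srv/boot.iso,boot_order=cdrom:network
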
Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the ``--hypervisor`` option of the ``gnt-instance modify``
command. However, the hypervisor type of an existing instance cannot
be changed; only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.

Example::

  gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

gnt-cluster commands
++++++++++++++++++++

The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:

- The init command will be extended to support the --defaults option
  to set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
++++++++++++++++

The generic format of the cluster default setting option is::

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Currently, the following cluster default options are defined (open to
further changes; an example follows the list):

:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disk size for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of mebibytes is assumed, or a positive number followed by a
  supported magnitude symbol (M for mebibytes or G for gibibytes).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).

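Example (the cluster name and bridge name are illustrative only)::

  gnt-cluster init --defaults hypervisor=xen-pvm,disksize=10G,bridge=br0 cluster.example.com
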
Hypervisor cluster defaults
+++++++++++++++++++++++++++

The generic format of the hypervisor cluster wide default setting
option is::

  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
  to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

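Example (the hypervisor name and option value reuse the instance-level
example above and are illustrative only)::

  gnt-cluster modify --hypervisor-defaults xen-hvm:boot_order=cdrom:network
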
.. vim: set textwidth=72 : |