Revision 6c2d0b44 doc/design-2.0.rst
Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

...
It also has a number of artificial restrictions, due to historical design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
   and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible code base for future developments.

Among these problems, the single-operation-at-a-time restriction is
the biggest issue with the current version of Ganeti. It is such a big
impediment in operating bigger clusters that many times one is tempted
to remove the lock just to do a simple operation like starting an
instance while an OS installation is running.


Scalability problems
--------------------

...

One of the main causes of this global lock (besides the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no long-lived process in Ganeti
that can coordinate multiple operations. Each command tries to acquire
the so-called *cmd* lock and when it succeeds, it takes complete
ownership of the cluster configuration and state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
...
touches multiple areas (configuration, import/export, command line)
that it's better suited to a major release than a minor one.

Architecture issues
-------------------

The fact that each command is a separate process that reads the
cluster state, executes the command, and saves the new state is also
an issue on big clusters where the configuration data for the cluster
begins to be non-trivial in size.

Overview
========

...

The main changes will be switching from a per-process model to a
daemon-based model, where the individual ``gnt-*`` commands will be
clients that talk to this daemon (see `Master daemon`_). This will
allow us to get rid of the global cluster lock for most operations,
having instead a per-object lock (see `Granular locking`_). Also, the
daemon will be able to queue jobs, and this will allow the individual
clients to submit jobs without waiting for them to finish, and also
see the result of old requests (see `Job Queue`_).

Besides these major changes, another 'core' change, though not as
visible to the users, will be changing the model of object attribute
storage, separating that into name spaces (such that a Xen PVM
instance will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. For more details see
`Object parameters`_.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use HTTP(S) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

...
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so-called
  *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:

- CLI tools might access the nodes via SSH (for ``gnt-cluster copyfile``
  and ``gnt-cluster command``)
- master failover is a special case when a non-master node will SSH
  and do node-RPC calls to the current master

The protocol between the master daemon and the node daemons will be
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S),
using a simple PUT/GET of JSON-encoded messages. This is done due to
difficulties in working with the Twisted framework and its protocols
in a multithreaded environment, which we can overcome by using a
simpler stack (see the caveats section).

The protocol between the CLI/RAPI and the master daemon will be a
custom one (called *LUXI*): on a UNIX socket on the master node, with
rights restricted by filesystem permissions, the CLI/RAPI will talk to
the master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a Python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.

For more details about the RAPI daemon see `Remote API changes`_, and
for the node daemon see `Node daemon changes`_.

The LUXI protocol
+++++++++++++++++

As described above, the protocol for making requests or queries to the
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
messages.

The choice of a UNIX socket was made in order to remove the need for
authentication and authorisation inside Ganeti; for 2.0, the
permissions on the UNIX socket itself will determine the access
rights.

We will have two main classes of operations over this API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
still internally implemented with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

...
- archive job (see the `Job Queue`_)
- wait for job change, which allows a client to wait without polling

For more details of the actual operation list, see the `Job Queue`_.

Both requests and responses will consist of a JSON-encoded message
followed by the ``ETX`` character (ASCII decimal 3), which is not a
valid character in JSON messages and thus can serve as a message
delimiter. The contents of the messages will be a dictionary with two
fields:

:method:
  the name of the method called
:args:
  the arguments to the method, as a list (no keyword arguments allowed)

Responses will follow the same format, with the two fields being:

:success:
  a boolean denoting the success of the operation
:result:
  the actual result, or the error message in case of failure

There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method
  of passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
  internally retry the call; this is done in order to differentiate
  internally between a hung master daemon and a job that simply has not
  changed

Users of the API that don't use the provided Python library should
take care of the above two cases.

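As an illustration of the framing described above, a minimal client-side
encoder/decoder could look like the following sketch (this is not the
actual Ganeti client library; the function names and the use of
``RuntimeError`` are illustrative):

```python
import json

ETX = "\x03"  # ASCII decimal 3, the message delimiter

def encode_request(method, args):
    # A LUXI request is a dictionary with exactly two fields.
    return json.dumps({"method": method, "args": args}) + ETX

def decode_response(data):
    # Strip the trailing ETX delimiter and parse the JSON payload.
    message = json.loads(data.rstrip(ETX))
    success, result = message["success"], message["result"]
    if not success:
        # A two-element list is interpreted as (exception type, args).
        if isinstance(result, list) and len(result) == 2:
            raise RuntimeError("%s: %s" % (result[0], result[1]))
        raise RuntimeError(result)
    return result
```

A real client would additionally retry the *WaitForChange* call when the
result equals ``nochange``, as explained above.
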
Master daemon implementation
++++++++++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
...
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  (LUXI) to the clients, and are short-lived

Master startup/failover
+++++++++++++++++++++++

...
- if we are not failing over (but just starting), the
  quorum agrees that we are the designated master

- if any of the above is false, we prevent the current operation
  (i.e. we don't become the master)

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

Since, under exceptional conditions, we could end up in a situation in
which no node can become the master due to inconsistent data, we will
have an override switch for the master daemon startup that will assume
the current node has the right data and will replicate all the
configuration files to the other nodes.

**Note**: the above algorithm is by no means an election algorithm; it
is a *confirmation* of the master role currently held by a node.

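The confirmation step can be sketched as a simple majority check over the
collected answers; the vote-gathering RPC itself is assumed to exist and
is not shown:

```python
def confirm_master_role(my_name, votes):
    """Check that a strict majority of nodes agree that *my_name* is
    the designated master.

    *votes* maps node names to the master each node believes in, or
    None if the node was unreachable.
    """
    reachable = [v for v in votes.values() if v is not None]
    agreeing = [v for v in reachable if v == my_name]
    # A strict majority of the full node list must agree; otherwise we
    # refuse to start (this is a confirmation, not an election).
    return len(agreeing) * 2 > len(votes)
```
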

Logging
+++++++

The logging system will be switched completely to the standard Python
``logging`` module; currently it's logging-based, but exposes a
different API, which is just overhead. As such, the code will be
switched over to standard logging calls, and only the setup will be
custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one log file per daemon model:

...
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools

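A minimal sketch of such a custom setup using the standard ``logging``
module (the log format shown here is an assumption, not the final one):

```python
import logging

def setup_logging(program, logfile, debug=False):
    # One log file per daemon; all levels go into the same stream.
    level = logging.DEBUG if debug else logging.INFO
    formatter = logging.Formatter(
        "%(asctime)s: " + program + " pid=%(process)d"
        " %(levelname)s %(message)s")
    handler = logging.FileHandler(logfile)
    handler.setFormatter(formatter)
    root = logging.getLogger("")
    root.setLevel(level)
    root.addHandler(handler)
    # After this, plain standard calls reach the daemon's log file,
    # e.g.: logging.info("starting instance %s", name)
```
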

Since the `watcher`_ will only submit jobs to the master for startup
of the instances, its log file will contain less information than
before, mainly that it will start the instance, but not the results.

Node daemon changes
+++++++++++++++++++

The only change to the node daemon is that, since we need better
concurrency, we don't process the inter-node RPC calls in the node
daemon itself, but we fork and process each request in a separate
child.
|
254 | 346 |
|
255 |
Since the watcher will only submit jobs to the master for startup of |
|
256 |
the instances, its log file will contain less information than before, |
|
257 |
mainly that it will start the instance, but not the results. |
|
347 |
Since we don't have many calls, and we only fork (not exec), the |
|
348 |
overhead should be minimal. |
|
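A rough sketch of the fork-per-request model (generic code, not the
actual node daemon); since only ``fork`` is used, the child inherits the
already-parsed request directly:

```python
import os

def process_request(handler, request):
    """Handle one RPC request in a forked child process."""
    pid = os.fork()
    if pid == 0:
        # Child: do the work and exit immediately via _exit(), so no
        # parent cleanup code runs in the child.
        try:
            handler(request)
            os._exit(0)
        except Exception:
            os._exit(1)
    # Parent: reap the child; a real daemon would reap asynchronously
    # (e.g. on SIGCHLD) and keep accepting requests meanwhile.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```
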

Caveats
+++++++

...
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized' and written
  in an asynchronous manner, using deferreds; while this method works,
  it's not a common way to code and it requires that the entire process
  workflow is based around a single *reactor* (the Twisted name for a
  main loop)
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that business logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  requirement, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTP(S) protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is with the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement workarounds many times for changes
in the Twisted version, so that for example 1.2 is able to use both
Twisted 2.x and 8.x.

In the end, since we already had an HTTP server library for the RAPI,
we just reused that for inter-node communication.



Granular locking
~~~~~~~~~~~~~~~~

...

This design addresses how we are going to deal with locking so that:

- we preserve data coherency
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a non-goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised, and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti-level operations,
aka Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its
scope.

Library details
+++++++++++++++


The proposed library has these features:

- internally managing all the locks, making the implementation transparent
  from their usage
- automatically grabbing multiple locks in the right order (avoiding
  deadlocks)
- ability to transparently handle conversion to more granularity
- support for asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a
distributed operation. Therefore, in case of master failure, the
operations currently running will be aborted and the locks will be
lost; it remains to the administrator to clean up (if needed) the
operation result (e.g. make sure an instance is either installed
correctly or removed).

A corollary of this is that a master-failover operation with both
masters alive needs to happen while no operations are running, and
therefore no locks are held.

All the locks will be represented by objects (like
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time.
Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and will fail.


The Locks
+++++++++

...
within the locking library, which, for simplicity, will just use
alphabetical order.

Each lock has the following three possible statuses:

- unlocked (anyone can grab the lock)
- shared (anyone can grab/have the lock but only in shared mode)
- exclusive (no one else can grab/have the lock)

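A minimal, condition-variable based sketch of such a shared/exclusive
lock (written in current Python for brevity; the real implementation
targets Python 2.4 and also needs fairness handling, which is omitted
here):

```python
import threading

class SharedLock(object):
    """A lock holdable by many readers (shared mode) or by a single
    writer (exclusive mode)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0        # number of shared holders
        self._exclusive = False  # whether an exclusive holder exists

    def acquire(self, shared=False):
        with self._cond:
            if shared:
                # Shared acquisition only has to wait for a writer.
                while self._exclusive:
                    self._cond.wait()
                self._sharers += 1
            else:
                # Exclusive acquisition waits until nobody holds it.
                while self._exclusive or self._sharers:
                    self._cond.wait()
                self._exclusive = True

    def release(self):
        with self._cond:
            if self._exclusive:
                self._exclusive = False
            else:
                self._sharers -= 1
            self._cond.notify_all()
```
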
Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each
time we split a lock into more we'll create a "metalock", which will
depend on those sub-locks and live for the time necessary for all the
code to convert (or forever, in some conditions). When a metalock
exists all converted code must acquire it in shared mode, so it can
run concurrently, but still be exclusive with old code, which acquires
it exclusively.

...
In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.

We might also want to devise more metalocks (e.g. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what it needs
...
decide to split them into an even more fine-grained approach, but this
will probably be only after the first 2.0 version has been released.

Adding/Removing locks
+++++++++++++++++++++

...
explicitly. The implementation of this will be handled in the locking
library itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively already, and thus deals
with metalocks exactly as normal code acquiring those locks. Any
operation queuing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

...
In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available locks,
...
"tasklets" with their own locking requirements. A different design doc
(or mini design doc) will cover the move from Logical Units to
tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
    ...
  finally:
    lock.release()

This makes sure we release all locks, and avoid possible deadlocks. Of
course extra care must be used not to leave, if possible, locked
structures in an unusable state. Note that with Python 2.5 a simpler
syntax (the ``with`` statement) will be possible, but we want to keep
compatibility with Python 2.4 so the new constructs should not be
used.

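The "grabbing multiple locks in the right order" feature mentioned
earlier can be sketched as a helper that always acquires in sorted
(alphabetical) name order; the helper is illustrative, not the
library's API:

```python
def acquire_in_order(locks, names):
    """Acquire the named locks in a single global (alphabetical) order.

    *locks* maps lock names to objects with acquire()/release();
    acquiring in one fixed order prevents two jobs from each waiting
    on a lock the other holds (deadlock).
    """
    acquired = []
    try:
        for name in sorted(names):
            locks[name].acquire()
            acquired.append(name)
        return acquired
    except Exception:
        # Undo a partial acquisition before propagating the error.
        for name in reversed(acquired):
            locks[name].release()
        raise
```
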

In order to avoid this extra indentation and code changes everywhere
in the Logical Units code, we decided to allow LUs to declare locks,
and then execute
...
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes``, which are the basic
elements of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
...
   of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
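
From a client's point of view, the workflow above could be sketched as
follows (``submit_job``/``query_job`` and the status names are
illustrative, not the actual API):

```python
import time

def run_job(client, opcodes, poll_interval=1.0):
    """Submit a job and wait for its final status.

    *client* is assumed to expose submit_job(opcodes), returning a job
    id, and query_job(job_id), returning a (status, result) pair.
    """
    job_id = client.submit_job(opcodes)
    # Poll until the job reaches a final status; the real protocol
    # offers a "wait for job change" call to avoid busy polling.
    while True:
        status, result = client.query_job(job_id)
        if status in ("success", "error", "canceled"):
            return status, result
        time.sleep(poll_interval)
```
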
...

Archive
+++++++

Archived jobs are kept in a separate directory,
``/var/lib/ganeti/queue/archive/``. This is done in order to speed up
the queue handling: by default, the jobs in the archive are not
touched by any functions. Only the current (unarchived) jobs are
parsed, loaded, and verified (if implemented) by the master daemon.



Ganeti updates
++++++++++++++

...
way to prevent new jobs entering the queue.


Object parameters
~~~~~~~~~~~~~~~~~

...
  a hypervisor parameter (or hypervisor-specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for `PVM`_ which makes no sense for `HVM`_).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
...
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
...
The names for hypervisor parameters in the ``instance.hvparams``
subtree should be chosen as generic as possible, especially if
specific parameters could conceivably be useful for more than one
hypervisor, e.g. ``instance.hvparams.vnc_console_port`` instead of
using both ``instance.hvparams.hvm_vnc_console_port`` and
``instance.hvparams.kvm_vnc_console_port``.

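The defaults/override behaviour of these dictionaries amounts to a
simple two-level lookup, sketched here with illustrative names (not the
actual configuration code):

```python
def fill_hvparams(cluster_hvparams, hypervisor, instance_hvparams):
    """Compute the effective hypervisor parameters of an instance:
    the cluster-level defaults for its hypervisor type, overridden
    by any instance-level values."""
    filled = dict(cluster_hvparams.get(hypervisor, {}))
    filled.update(instance_hvparams)
    return filled
```
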
There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper instance
parameters, while the latter values are migrated to the hvparams
...
... | ... | |
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class
  method that can be called from within master code (i.e. cmdlib) and
  should be safe to do so
812 | 914 |
:ValidateParameters(hvparams): verifies the values of the provided |
813 | 915 |
parameters against this hypervisor; this is a method that will be |
814 | 916 |
called on the target node, from backend.py code, and as such can |
... | ... | |
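The name-level check described above can be sketched like this. The class names here are hypothetical stand-ins — the real hypervisor abstraction lives elsewhere in Ganeti — but the ``CheckParamSyntax`` shape follows the text:

```python
class BaseHypervisor:
    """Illustrative base class for the per-hypervisor checks above."""
    PARAMETERS = frozenset()

    @classmethod
    def CheckParamSyntax(cls, hvparams):
        # Pure name checking (hvparams.keys() vs cls.PARAMETERS), so it
        # is safe to call from master code such as cmdlib.
        unknown = set(hvparams) - cls.PARAMETERS
        if unknown:
            raise ValueError("unknown hypervisor parameters: %s"
                             % ", ".join(sorted(unknown)))

class XenPvmHypervisor(BaseHypervisor):
    # Hypothetical parameter set, for illustration only
    PARAMETERS = frozenset(["kernel_path", "initrd_path"])

XenPvmHypervisor.CheckParamSyntax({"kernel_path": "/boot/vmlinuz"})  # passes
```

Value-level validation (``ValidateParameters``) would then run on the target node, where it can inspect the actual system state.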
The parameter changes will have impact on the OpCodes, especially on
the following ones:

- ``OpCreateInstance``, where the new hv and be parameters will be sent as
  dictionaries; note that all hv and be parameters are now optional, as
  the values can be instead taken from the cluster
- ``OpQueryInstances``, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
  dictionaries
- ``OpModifyInstance``, where the modified parameters are sent as
  dictionaries

Additionally, we will need new OpCodes to modify the cluster-level

...
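The field-name syntax above can be resolved mechanically; a rough sketch (the function name is hypothetical, not Ganeti code):

```python
def resolve_instance_field(field, hvparams, beparams):
    """Resolve a query field per the naming scheme above:
    'hvparam/$NAME' / 'beparam/$NAME' select one value, while
    'hvparams' / 'beparams' return the whole dictionary."""
    if field == "hvparams":
        return hvparams
    if field == "beparams":
        return beparams
    for prefix, source in (("hvparam/", hvparams), ("beparam/", beparams)):
        if field.startswith(prefix):
            return source.get(field[len(prefix):])
    raise KeyError("unknown field: %s" % field)
```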
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as 'rare'
events, and that each of them needs to be manually handled in order to ensure
data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or cluster
  size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
  automate more of the recovery

...
parameters.

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created DRBD minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.

Removal of obsolete device types (MD, DRBD7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
DRBD7 has bad failure modes in case of dual failures (both network and
disk) - it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in MD+DRBD mode, we cannot do live failover (not even if we
had MD+DRBD8).

File-based storage support
++++++++++++++++++++++++++
|
979 |
This is covered by a separate design doc (<em>Vinales</em>) and
|
|
980 |
would allow us to get rid of the hard requirement for testing
|
|
981 |
clusters; it would also allow people who have SAN storage to do live
|
|
982 |
failover taking advantage of their storage solution.
|
|
1081 |
Using files instead of logical volumes for instance storage would
|
|
1082 |
allow us to get rid of the hard requirement for volume groups for
|
|
1083 |
testing clusters and it would also allow usage of SAN storage to do
|
|
1084 |
live failover taking advantage of this storage solution.
|
|
983 | 1085 |
|
984 | 1086 |
Better LVM allocation |
985 | 1087 |
+++++++++++++++++++++ |
... | ... | |
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected

...
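The visible branches of the decision table above can be sketched as a small function. Only the steps quoted in the text are modelled; the 'yes' branches are elided in this excerpt, and the action strings are illustrative, not real Ganeti identifiers:

```python
def vg_check_action(is_consistent, was_consistent, inconsistent_nodes):
    """Decision logic for the periodic volume group check above."""
    if is_consistent or not was_consistent:
        # "no, and previous status was no": do nothing (the 'yes'
        # branches are elided in the text and treated as no-ops here)
        return "nothing"
    if len(inconsistent_nodes) > 1:
        # more than one node is inconsistent: do nothing
        return "nothing"
    # exactly one inconsistent node: reclaim the VG (and log it in a
    # form usable for monitoring)
    return "vgreduce --removemissing"
```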
- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and setup a DRBD pairing between
  P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
  start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has started
  but before it has finished), we can promote it to primary role (r/w)

...

will cause I/O errors on the instance, so (if a longer instance
downtime is acceptable) we can postpone the restart of the instance
until the resync is done. However, disk I/O errors on S2 will cause
data loss, since we don't have a good copy of the data anymore, so in
this case waiting for the sync to complete is not an option. As such,
it is recommended that this feature is used only in conjunction with
proper disk monitoring.

...
+++++++

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake the DRBD device of another instance, since
it always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.


Variable number of disk/NICs per instance

...
In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make a fully flexible disk
definition. This has less impact than it might look at first sight:
only the instance creation has a hard-coded number of disks, not the
disk handling code. The block device handling and most of the instance
handling code is already working with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces are not

...

The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk from read-only to
read-write afterward.

Variable number of NICs
+++++++++++++++++++++++

...
Similar to the disk change, we need to allow multiple network
interfaces per instance. This will affect the internal code (some
functions will have to stop assuming that ``instance.nics`` is a list
of length one), the OS API which currently can export/import only one
instance, and the command line interface.

Interface changes

...
When designing the new OS API our priorities are:

- ease of use
- future extensibility
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to support

...
  instances will be forced to have a number of disks greater or equal to the
  one of the export.
- Some scripts are not compulsory: if such a script is missing the relevant
  operations will be forbidden for instances of that OS. This makes it easier
  to distinguish between unsupported operations and no-op ones (if any).

...
inputs from environment variables. We expect the following input values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  this is used so that in the future we could have OSes supporting
  multiple versions and thus Ganeti send the proper version in this
  parameter
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', 'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  The number of NICs this instance will have
DISK_<N>_PATH
  Path to the Nth disk.
DISK_<N>_ACCESS

...
NIC_<N>_BRIDGE
  Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes. Currently
  the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
``ganeti-os-api`` man page. All these variables will be available to
all scripts.

Some scripts will need some more information to work. These will have
per-script variables, such as for example:

...
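An OS script would collect the variables above from its environment; a minimal sketch (0-based device indexing is an assumption here, and error handling is omitted — this is not the prescribed script layout):

```python
import os

def read_os_parameters(environ=None):
    """Collect the OS API input variables listed above."""
    env = os.environ if environ is None else environ
    params = {
        "api_version": int(env["OS_API_VERSION"]),
        "instance": env["INSTANCE_NAME"],
        "hypervisor": env["HYPERVISOR"],
        "debug": int(env.get("DEBUG_LEVEL", "0")),
    }
    # Expand the per-device DISK_<N>_* / NIC_<N>_* entries
    params["disks"] = [env["DISK_%d_PATH" % i]
                       for i in range(int(env["DISK_COUNT"]))]
    params["nics"] = [env["NIC_%d_BRIDGE" % i]
                      for i in range(int(env["NIC_COUNT"]))]
    return params
```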
create and import scripts are supposed to format/initialise the given block
devices and install the correct instance data. The export script is supposed to
export instance data to stdout in a format understandable by the import
script. The data will be compressed by Ganeti, so no compression should be
done. The rename script should only modify the instance's knowledge of what
its name is.

...
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing a list of numbers matching the
version(s) of the API they implement. Ganeti itself will always be
compatible with one version of the API and may maintain backwards
compatibility if it's feasible to do so. The numbers are one-per-line,
so an OS supporting both version 5 and version 20 will have a file
containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

In addition to that an OS will be able to declare that it does support only a
subset of the Ganeti hypervisors, by declaring them in the 'hypervisors' file.


Caveats/Notes

...
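The one-number-per-line format above makes the compatibility check trivial; a hedged sketch (helper names are hypothetical):

```python
def declared_api_versions(text):
    """Parse 'ganeti_api_version' contents: one version number per line."""
    return set(int(line) for line in text.splitlines() if line.strip())

def os_is_compatible(text, ganeti_api_version):
    """True when the OS declares the API version this Ganeti speaks."""
    return ganeti_api_version in declared_api_versions(text)
```

So a file containing ``5`` and ``20`` on separate lines declares support for both versions.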
Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti remote API (RAPI) was designed and deployed with the
Ganeti 1.2.5 release. That version provided read-only access to the
cluster state. A fully functional read-write API demands significant
internal changes, which will be implemented in version 2.0.

We decided to implement the Ganeti RAPI in a RESTful way, which is
aligned with the key features we are looking for: a simple, stateless,
scalable and extensible paradigm of API implementation. As transport
it uses HTTP over SSL, and we are implementing it with JSON encoding,
but in a way that makes it possible to extend and provide any other
encoding.

Design
++++++

The Ganeti RAPI is implemented as an independent daemon, running on
the same node and with the same permission level as the Ganeti master
daemon. Communication is done through the LUXI library to the master
daemon. In order to keep communication asynchronous, RAPI processes
two types of client requests:

- queries: the server is able to answer immediately
- job submission: some time is required for a useful response

In the query case the requested data is sent back to the client in
the HTTP response body. Typical examples of queries would be: list of
nodes, instances, cluster info, etc.

In the case of job submission, the client receives a job ID, the
identifier which allows it to query the job progress in the job queue
(see `Job Queue`_).

Internally, each exported object has a version identifier, which is
used as a state identifier in the HTTP header E-Tag field for
requests/responses to avoid race conditions.
|
1377 | 1485 |
|
1378 | 1486 |
Resource representation |
1379 | 1487 |
+++++++++++++++++++++++ |
1380 | 1488 |
|
1381 |
The key difference of REST approach from others API is instead having one URI
|
|
1382 |
for all our requests, REST demand separate service by resources with unique
|
|
1383 |
URI. Each of them should have limited amount of stateless and standard HTTP
|
|
1489 |
The key difference of using REST instead of others API is that REST
|
|
1490 |
requires separation of services via resources with unique URIs. Each
|
|
1491 |
of them should have limited amount of state and support standard HTTP
|
|
1384 | 1492 |
methods: GET, POST, DELETE, PUT. |
1385 | 1493 |
|

For example in Ganeti's case we can have a set of URIs:

- ``/{clustername}/instances``
- ``/{clustername}/instances/{instancename}``
- ``/{clustername}/instances/{instancename}/tag``
- ``/{clustername}/tag``

A GET request to ``/{clustername}/instances`` will return the list of
instances, a POST to ``/{clustername}/instances`` should create a new
instance, a DELETE ``/{clustername}/instances/{instancename}`` should
delete the instance, and a GET ``/{clustername}/tag`` should return
the cluster tags.

Each resource URI will have a version prefix. The resource IDs are to
be determined.
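The method-to-operation mapping above can be expressed as a small routing table; an illustrative sketch only — the handler names are hypothetical, not actual Ganeti RAPI code:

```python
import re

_ROUTES = [
    (r"^/(?P<cluster>[^/]+)/instances$",
     {"GET": "list_instances", "POST": "create_instance"}),
    (r"^/(?P<cluster>[^/]+)/instances/(?P<instance>[^/]+)$",
     {"DELETE": "delete_instance"}),
    (r"^/(?P<cluster>[^/]+)/tag$", {"GET": "get_cluster_tags"}),
]

def dispatch(method, uri):
    """Map an HTTP method and URI to a handler name plus URI variables."""
    for pattern, handlers in _ROUTES:
        match = re.match(pattern, uri)
        if match and method in handlers:
            return handlers[method], match.groupdict()
    return None, {}  # unknown resource or unsupported method
```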

Internal encoding might be JSON, XML, or any other. The JSON encoding
fits nicely with the Ganeti RAPI needs. The client can request a
specific representation via the Accept field in the HTTP header.

REST uses HTTP as its transport and application protocol for resource
access. The set of possible responses is a subset of standard HTTP
responses.

The statelessness model provides additional reliability and
transparency to operations (e.g. only one request needs to be analyzed
to understand the in-progress operation, not a sequence of multiple
requests/responses).

Security
++++++++

With the write functionality security becomes a much bigger issue.
The Ganeti RAPI uses basic HTTP authentication on top of an
SSL-secured connection to grant access to an exported resource. The
password is stored locally in an Apache-style ``.htpasswd`` file. Only
one level of privileges is supported.

Caveats
+++++++

The model detailed above for job submission requires the client to
poll periodically for updates to the job; an alternative would be to
allow the client to request a callback, or a 'wait for updates' call.

The callback model was not considered due to the following two issues:

- callbacks would require a new model of allowed callback URLs,
  together with a method of managing these
- callbacks only work when the client and the master are in the same
  security domain, and they fail in the other cases (e.g. when there is
  a firewall between the client and the RAPI daemon that only allows
  client-to-RAPI calls, which is usual in DMZ cases)

The 'wait for updates' method is not suited to the HTTP protocol,
where requests are supposed to be short-lived.
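The polling model described above looks roughly like this on the client side; ``get_status`` stands in for a RAPI query of the job resource, and the status strings are hypothetical:

```python
import time

def wait_for_job(get_status, job_id, poll_interval=1.0, timeout=60.0):
    """Poll a submitted job until it reaches a final state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status in ("success", "error"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job %s still running after %ss" % (job_id, timeout))
```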

Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled.

- extend and modify command line syntax to support new features
- ensure consistent patterns in command line arguments to reduce
  cognitive load

The design changes that require these changes are, in no particular
order:

...
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

As such, there are several areas of Ganeti where the command line
arguments will change:

- Cluster configuration

...

- handling of CDROM devices and
- handling of hypervisor specific options.

There are several areas of Ganeti where the command line arguments
will change:

- Cluster configuration

...
:--net: for network interface cards
:--disk: for disk devices

The syntax to the device specific options is similar to the generic
device options, but instead of specifying a device number like for
``gnt-instance add``, you specify the magic string ``add``. The new device
will always be appended at the end of the list of devices of this type

...
... | ... | |
1584 | 1720 |
:--net: for network interface cards |
1585 | 1721 |
:--disk: for disk devices |
1586 | 1722 |
|
1587 |
The syntax to the device specific options is similiar to the generic
|
|
1723 |
The syntax to the device specific options is similar to the generic |
|
1588 | 1724 |
device options. The device number you specify identifies the device to |
1589 | 1725 |
be modified. |
1590 | 1726 |
|
1591 |
Example: gnt-instance modify --disk 2:access=r |
|
1727 |
Example:: |
|
1728 |
|
|
1729 |
gnt-instance modify --disk 2:access=r |
|
1592 | 1730 |
|
1593 | 1731 |
Hypervisor Options |
1594 | 1732 |
++++++++++++++++++ |
... | ... | |
1596 | 1734 |
Ganeti 2.0 will support more than one hypervisor. Different |
1597 | 1735 |
hypervisors have various options that only apply to a specific |
1598 | 1736 |
hypervisor. Those hypervisor specific options are treated specially |
1599 |
via the --hypervisor option. The generic syntax of the hypervisor
|
|
1600 |
option is as follows: |
|
1737 |
via the ``--hypervisor`` option. The generic syntax of the hypervisor
|
|
1738 |
option is as follows::
|
|
1601 | 1739 |
|
1602 | 1740 |
--hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] |
1603 | 1741 |
|
... | ... | |
1608 | 1746 |
:$VALUE: hypervisor option value, string |
1609 | 1747 |
|
1610 | 1748 |
The hypervisor option for an instance can be set on instance creation |
1611 |
time via the gnt-instance add command. If the hypervisor for an
|
|
1749 |
time via the ``gnt-instance add`` command. If the hypervisor for an
|
|
1612 | 1750 |
instance is not specified upon instance creation, the default |
1613 | 1751 |
hypervisor will be used. |
1614 | 1752 |
|
... | ... | |
1616 | 1754 |
+++++++++++++++++++++++++++++++ |
1617 | 1755 |
|
1618 | 1756 |
The hypervisor parameters of an existing instance can be modified |
1619 |
using --hypervisor option of the gnt-instance modify command. However,
|
|
1620 |
the hypervisor type of an existing instance can not be changed, only
|
|
1621 |
the particular hypervisor specific option can be changed. Therefore,
|
|
1622 |
the format of the option parameters has been simplified to omit the
|
|
1623 |
hypervisor name and only contain the comma separated list of
|
|
1624 |
option-value pairs. |
|
1757 |
using ``--hypervisor`` option of the ``gnt-instance modify``
|
|
1758 |
command. However, the hypervisor type of an existing instance can not
|
|
1759 |
be changed, only the particular hypervisor specific option can be
|
|
1760 |
changed. Therefore, the format of the option parameters has been
|
|
1761 |
simplified to omit the hypervisor name and only contain the comma
|
|
1762 |
separated list of option-value pairs.
|
|
1625 | 1763 |
|
1626 |
Example: gnt-instance modify --hypervisor |
|
1627 |
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance |
|
1764 |
Example:: |
|
1765 |
|
|
1766 |
gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance |
|
1628 | 1767 |
|
1629 | 1768 |
gnt-cluster commands |
1630 | 1769 |
++++++++++++++++++++ |
... | ... | |
1664 | 1803 |
Hypervisor cluster defaults |
1665 | 1804 |
+++++++++++++++++++++++++++ |
1666 | 1805 |
|
1667 |
The generic format of the hypervisor clusterwide default setting option is: |
|
1806 |
The generic format of the hypervisor cluster wide default setting |
|
1807 |
option is:: |
|
1668 | 1808 |
|
1669 | 1809 |
--hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] |
1670 | 1810 |
|
... | ... | |
1673 | 1813 |
:$OPTION: cluster default option, string, |
1674 | 1814 |
:$VALUE: cluster default option value, string. |

Glossary
========

Since this document is only a delta from the Ganeti 1.2 design, there
are some unexplained terms. Here is a non-exhaustive list.

.. _HVM:

HVM
  hardware virtualization mode, where the virtual machine is oblivious
  to the fact that it's being virtualized and all the hardware is
  emulated

.. _LU:

LogicalUnit
  the code associated with an OpCode, i.e. the code that implements the
  startup of an instance

.. _opcode:

OpCode
  a data structure encapsulating a basic cluster operation; for example,
  start instance, add instance, etc.

.. _PVM:

PVM
  para-virtualization mode, where the virtual machine knows it's being
  virtualized and as such there is no need for hardware emulation

.. _watcher:

watcher
  ``ganeti-watcher`` is a tool that should be run regularly from cron
  and takes care of restarting failed instances, restarting secondary
  DRBD devices, etc. For more details, see the man page
  ``ganeti-watcher(8)``.