Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

...

It also has a number of artificial restrictions, due to historical design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
   and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible code base for future developments.

Among these problems, the single-operation-at-a-time restriction is
the biggest issue with the current version of Ganeti. It is such a big
impediment in operating bigger clusters that many times one is tempted
to remove the lock just to do a simple operation like start instance
while an OS installation is running.

Scalability problems
--------------------

...

One of the main causes of this global lock (beside the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no long-lived process in Ganeti
that can coordinate multiple operations. Each command tries to acquire
the so called *cmd* lock and when it succeeds, it takes complete
ownership of the cluster configuration and state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
...
touches multiple areas (configuration, import/export, command line)
that it's more fitted to a major release than a minor one.

Architecture issues
-------------------

The fact that each command is a separate process that reads the
cluster state, executes the command, and saves the new state is also
an issue on big clusters where the configuration data for the cluster
begins to be non-trivial in size.

Overview
========

...

The main changes will be switching from a per-process model to a
daemon based model, where the individual gnt-* commands will be
clients that talk to this daemon (see `Master daemon`_). This will
allow us to get rid of the global cluster lock for most operations,
having instead a per-object lock (see `Granular locking`_). Also, the
daemon will be able to queue jobs, and this will allow the individual
clients to submit jobs without waiting for them to finish, and also
see the result of old requests (see `Job Queue`_).

Beside these major changes, another 'core' change, though less visible
to the users, will be changing the model of object attribute storage,
and separating that into name spaces (such that a Xen PVM instance
will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. For more details see
`Object parameters`_.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use HTTP(S) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

...

- the command line tools (on the master node)
- the RAPI daemon (on the master node)

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so called *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:

- CLI tools might access via SSH the nodes (for ``gnt-cluster copyfile``
  and ``gnt-cluster command``)
- master failover is a special case when a non-master node will SSH
  and do node-RPC calls to the current master

The protocol between the master daemon and the node daemons will be
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S),
using a simple PUT/GET of JSON-encoded messages. This is done due to
difficulties in working with the Twisted framework and its protocols
in a multithreaded environment, which we can overcome by using a
simpler stack (see the caveats section).

The protocol between the CLI/RAPI and the master daemon will be a
custom one (called *LUXI*): on a UNIX socket on the master node, with
rights restricted by filesystem permissions, the CLI/RAPI will talk to
the master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a Python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.

For more details about the RAPI daemon see `Remote API changes`_, and
for the node daemon see `Node daemon changes`_.

The LUXI protocol
+++++++++++++++++

As described above, the protocol for making requests or queries to the
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
messages.

The choice of UNIX sockets was in order to get rid of the need for
authentication and authorisation inside Ganeti; for 2.0, the
permissions on the UNIX socket itself will determine the access
rights.

We will have two main classes of operations over this API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
internally implemented still with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

...

- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details of the actual operation list, see the `Job Queue`_.

Both requests and responses will consist of a JSON-encoded message
followed by the ``ETX`` character (ASCII decimal 3), which is not a
valid character in JSON messages and thus can serve as a message
delimiter. The contents of the messages will be a dictionary with two
fields:

:method:
  the name of the method called
:args:
  the arguments to the method, as a list (no keyword arguments allowed)

Responses will follow the same format, with the two fields being:

:success:
  a boolean denoting the success of the operation
:result:
  the actual result, or error message in case of failure

There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an exception,
  the first element being the exception type and the second one the
  actual exception arguments; this will allow a simple method of passing
  Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
  internally retry the call; this is done in order to differentiate
  internally between a hung master daemon and a job that simply has not
  changed

Users of the API that don't use the provided Python library should
take care of the above two cases.

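To make the framing concrete, here is a minimal client-side sketch of this encoding (the helper names are illustrative, not the actual Ganeti library API):

```python
import json

ETX = chr(3)  # ASCII decimal 3, the message delimiter

def encode_request(method, args):
  """Serialise a LUXI request: a JSON dict with 'method'/'args' plus ETX."""
  return json.dumps({"method": method, "args": args}) + ETX

def decode_response(data):
  """Parse a LUXI response and handle the exception special case."""
  message = json.loads(data.rstrip(ETX))
  if not message["success"]:
    result = message["result"]
    if isinstance(result, list) and len(result) == 2:
      # an (exception type, exception arguments) pair from the server
      raise RuntimeError("%s: %s" % (result[0], result[1]))
    raise RuntimeError(result)
  return message["result"]
```

A real client would additionally retry *WaitForChange* internally when the result equals ``nochange``, as described above.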
Master daemon implementation
++++++++++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
...
long-lived, started at daemon startup and terminated only at shutdown
time
- client I/O threads, which are the ones that talk the local protocol
  (LUXI) to the clients, and are short-lived

Master startup/failover
+++++++++++++++++++++++

...

- if we are not failing over (but just starting), the
  quorum agrees that we are the designated master

- if any of the above is false, we prevent the current operation
  (i.e. we don't become the master)

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

Since due to exceptional conditions we could have a situation in which
no node can become the master due to inconsistent data, we will have
an override switch for the master daemon startup that will assume the
current node has the right data and will replicate all the
configuration files to the other nodes.

**Note**: the above algorithm is by no means an election algorithm; it
is a *confirmation* of the master role currently held by a node.

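The confirmation step can be illustrated with a small voting check (a sketch only; the function and data shapes are illustrative, and the elided steps above define the actual procedure):

```python
def confirm_master(my_name, answers):
  """Return True if a majority of the answering nodes name us as master.

  answers holds one entry per queried node: the node name it believes
  to be the master, or None if that node could not be contacted.
  """
  votes = [a for a in answers if a is not None]
  agreeing = len([v for v in votes if v == my_name])
  # require a strict majority of the nodes that answered
  return bool(votes) and agreeing * 2 > len(votes)
```
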
Logging
+++++++

The logging system will be switched completely to the standard Python
logging module; currently it's logging-based, but exposes a different
API, which is just overhead. As such, the code will be switched over
to standard logging calls, and only the setup will be custom.

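For illustration, such a custom setup reduces to configuring the root logger once and then using plain ``logging`` calls everywhere (a sketch; the real setup would also handle the per-daemon log file names):

```python
import logging

def setup_logging(logfile, debug=False):
  """Point the root logger at the daemon's single log file."""
  handler = logging.FileHandler(logfile)
  handler.setFormatter(
      logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
  root = logging.getLogger("")
  root.setLevel(logging.DEBUG if debug else logging.INFO)
  root.addHandler(handler)

# after setup, all code simply uses the standard calls, e.g.:
# logging.info("starting instance %s", instance_name)
```
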
With this change, we will remove the separate debug/info/error logs,
and instead always have one logfile per daemon model:

...

- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools

Since the `watcher`_ will only submit jobs to the master for startup
of the instances, its log file will contain less information than
before, mainly that it will start the instance, but not the results.

Node daemon changes
+++++++++++++++++++

The only change to the node daemon is that, since we need better
concurrency, we don't process the inter-node RPC calls in the node
daemon itself, but we fork and process each request in a separate
child.

Since we don't have many calls, and we only fork (not exec), the
overhead should be minimal.

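On POSIX systems the fork-per-request model can be sketched as follows (``handle_request`` stands in for the real RPC dispatch; a production daemon would reap children asynchronously instead of waiting inline):

```python
import os

def serve_request(handle_request, request):
  """Run one RPC request in a forked child, node-daemon style.

  Returns the child's exit code; the parent is then free to accept
  further requests concurrently.
  """
  pid = os.fork()
  if pid == 0:
    # child: process the request and exit; no exec is needed, so the
    # fork overhead stays minimal
    try:
      handle_request(request)
      os._exit(0)
    except Exception:
      os._exit(1)
  # parent: wait here only for demonstration purposes
  _, status = os.waitpid(pid, 0)
  return os.WEXITSTATUS(status)
```
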
Caveats
+++++++

...

much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized' and written
  in an asynchronous manner, using deferreds; while this method works,
  it's not a common way to code and it requires that the entire process
  workflow is based around a single *reactor* (the Twisted name for a
  main loop)
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that business logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  requirement, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTP(S) protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is with the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement workarounds many times for changes
in the Twisted version, so that for example 1.2 is able to use both
Twisted 2.x and 8.x.

In the end, since we already had an HTTP server library for the RAPI,
we just reused that for inter-node communication.


Granular locking
~~~~~~~~~~~~~~~~

...

This design addresses how we are going to deal with locking so that:

- we preserve data coherency
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a Non-Goal. We have identified a
set of operations that are currently bottlenecks and need to be parallelised
and have worked on those. In the future it will be possible to address other
needs, thus making the cluster more and more parallel one step at a time.

This section only talks about parallelising Ganeti level operations, aka
Logical Units, and the locking needed for that. Any other synchronization lock
needed internally by the code is outside its scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation transparent
  from their usage
- automatically grabbing multiple locks in the right order (avoid deadlock)
- ability to transparently handle conversion to more granularity
- support asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a
distributed operation. Therefore, in case of master failure, the
operations currently running will be aborted and the locks will be
lost; it remains to the administrator to clean up (if needed) the
operation result (e.g. make sure an instance is either installed
correctly or removed).

A corollary of this is that a master-failover operation with both
masters alive needs to happen while no operations are running, and
therefore no locks are held.

All the locks will be represented by objects (like
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more than one locks at the same time.
Any attempt to grab a lock while already holding one in the wrong order will be
checked for, and fail.

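To make the intended semantics concrete, here is a much-simplified, non-blocking model of a shared/exclusive lock and of ordered multi-lock acquisition (a sketch only; the real ``lockings.SharedLock`` will block and be thread-safe):

```python
class SimpleSharedLock(object):
  """Toy shared/exclusive lock (single-threaded, non-blocking)."""

  def __init__(self):
    self._sharers = 0
    self._exclusive = False

  def acquire(self, shared=False):
    if self._exclusive:
      return False            # held exclusively by someone else
    if shared:
      self._sharers += 1      # any number of sharers may coexist
      return True
    if self._sharers:
      return False            # cannot go exclusive while shared
    self._exclusive = True
    return True

  def release(self, shared=False):
    if shared:
      self._sharers -= 1
    else:
      self._exclusive = False

def acquire_all(locks, shared=False):
  """Grab a dict of named locks in alphabetical order (the deadlock-free
  ordering described above), rolling everything back on failure."""
  taken = []
  for name in sorted(locks):
    if not locks[name].acquire(shared=shared):
      for lock in reversed(taken):
        lock.release(shared=shared)
      return False
    taken.append(locks[name])
  return True
```
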
The Locks
+++++++++

...

within the locking library, which, for simplicity, will just use alphabetical
order.

Each lock has the following three possible statuses:

- unlocked (anyone can grab the lock)
- shared (anyone can grab/have the lock but only in shared mode)
- exclusive (no one else can grab/have the lock)

474 |
Handling conversion to more granularity
|
364 |
475 |
+++++++++++++++++++++++++++++++++++++++
|
365 |
476 |
|
366 |
477 |
In order to convert to a more granular approach transparently each time we
|
367 |
478 |
split a lock into more we'll create a "metalock", which will depend on those
|
368 |
|
sublocks and live for the time necessary for all the code to convert (or
|
|
479 |
sub-locks and live for the time necessary for all the code to convert (or
|
369 |
480 |
forever, in some conditions). When a metalock exists all converted code must
|
370 |
481 |
acquire it in shared mode, so it can run concurrently, but still be exclusive
|
371 |
482 |
with old code, which acquires it exclusively.
|
... | ... | |

In the beginning the only such lock will be what replaces the current "command"
lock, and will acquire all the locks in the system, before proceeding. This
lock will be called the "Big Ganeti Lock" because holding that one will avoid
any other concurrent Ganeti operations.

We might also want to devise more metalocks (eg. all nodes, all nodes+config)
in order to make it easier for some parts of the code to acquire what it needs
...
decide to split them into an even more fine-grained approach, but this will
probably be only after the first 2.0 version has been released.

Adding/Removing locks
+++++++++++++++++++++

...

explicitly. The implementation of this will be handled in the locking library
itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively already, and thus deals
with metalocks exactly as normal code acquiring those locks. Any
operation queuing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

...

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you were
  able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available locks,
...
"tasklets" with their own locking requirements. A different design doc (or mini
design doc) will cover the move from Logical Units to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoid possible deadlocks. Of
course extra care must be used not to leave, if possible, locked
structures in an unusable state. Note that with Python 2.5 a simpler
syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

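For comparison, the two forms look like this with a plain ``threading.Lock`` (the ``with`` statement is the Python 2.5 syntax that the 2.0 code base deliberately avoids):

```python
import threading

lock = threading.Lock()

# Python 2.4-compatible form, as used throughout Ganeti 2.0:
lock.acquire()
try:
  result = 1 + 1  # stand-in for the code needing the lock
finally:
  lock.release()

# Python 2.5+ equivalent, shown for comparison only:
with lock:
  result += 1
```
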
In order to avoid this extra indentation and code changes everywhere in the
Logical Units code, we decided to allow LUs to declare locks, and then execute
...
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes`` which are the basic
element of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
...
   of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC function.
   Log messages may be shown to the user. Until the job is started, it can
   also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.

...

+++++++

Archived jobs are kept in a separate directory,
``/var/lib/ganeti/queue/archive/``. This is done in order to speed up
the queue handling: by default, the jobs in the archive are not
touched by any functions. Only the current (unarchived) jobs are
parsed, loaded, and verified (if implemented) by the master daemon.

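Archiving a job then amounts to moving its file out of the live queue directory, roughly like this (a sketch; the ``job-<id>`` file name is an assumption for illustration):

```python
import os

def archive_job(job_id, queue_dir="/var/lib/ganeti/queue"):
  """Move a finished job file into the archive subdirectory."""
  archive_dir = os.path.join(queue_dir, "archive")
  if not os.path.isdir(archive_dir):
    os.makedirs(archive_dir)
  name = "job-%s" % job_id  # illustrative file naming
  os.rename(os.path.join(queue_dir, name),
            os.path.join(archive_dir, name))
```
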

Ganeti updates
++++++++++++++

...
way to prevent new jobs entering the queue.


Object parameters
~~~~~~~~~~~~~~~~~

...
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for `PVM`_ which makes no sense for `HVM`_).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
...
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
...

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. ``instance.hvparams.vnc_console_port`` instead of using both
``instance.hvparams.hvm_vnc_console_port`` and
``instance.hvparams.kvm_vnc_console_port``.

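The defaulting scheme (cluster-level ``hvparams`` indexed by hypervisor type, overridden by instance-level values) can be sketched as a simple dictionary merge; the data shapes used here are illustrative:

```python
def fill_hvparams(cluster_hvparams, instance):
  """Compute the effective hypervisor parameters for one instance.

  cluster_hvparams is indexed by hypervisor type and holds the
  defaults; instance-level values take precedence.
  """
  filled = dict(cluster_hvparams.get(instance["hypervisor"], {}))
  filled.update(instance["hvparams"])
  return filled
```
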
There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
...
806 |
908 |
for this hypervisor
|
807 |
909 |
:CheckParamSyntax(hvparams): checks that the given parameters are
|
808 |
910 |
valid (as in the names are valid) for this hypervisor; usually just
|
809 |
|
comparing hvparams.keys() and cls.PARAMETERS; this is a class method
|
810 |
|
that can be called from within master code (i.e. cmdlib) and should
|
811 |
|
be safe to do so
|
|
911 |
comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class
|
|
912 |
method that can be called from within master code (i.e. cmdlib) and
|
|
913 |
should be safe to do so
|
812 |
914 |
:ValidateParameters(hvparams): verifies the values of the provided
|
813 |
915 |
parameters against this hypervisor; this is a method that will be
|
814 |
916 |
called on the target node, from backend.py code, and as such can
|
... | ... | |
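The split between the two validation hooks just described (name checks on the master, value checks on the target node) can be sketched as follows; the class and parameter names are illustrative, only the calling contexts follow the description:

```python
class SketchHypervisor:
    # the set of parameter names this hypervisor understands
    PARAMETERS = frozenset(["vnc_console_port", "boot_order"])

    @classmethod
    def CheckParamSyntax(cls, hvparams):
        # name-level check only: compares hvparams.keys() against
        # cls.PARAMETERS, so it is safe to call from master code (cmdlib)
        unknown = set(hvparams) - cls.PARAMETERS
        if unknown:
            raise ValueError("unknown parameters: %s"
                             % ", ".join(sorted(unknown)))

    def ValidateParameters(self, hvparams):
        # value-level check: may need node-local state, so it runs on
        # the target node, from backend.py code
        port = hvparams.get("vnc_console_port")
        if port is not None and not 1 <= int(port) <= 65535:
            raise ValueError("invalid VNC console port: %s" % port)
```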
839 |
941 |
The parameter changes will have impact on the OpCodes, especially on
|
840 |
942 |
the following ones:
|
841 |
943 |
|
842 |
|
- OpCreateInstance, where the new hv and be parameters will be sent as
|
|
944 |
- ``OpCreateInstance``, where the new hv and be parameters will be sent as
|
843 |
945 |
dictionaries; note that all hv and be parameters are now optional, as
|
844 |
946 |
the values can be instead taken from the cluster
|
845 |
|
- OpQueryInstances, where we have to be able to query these new
|
|
947 |
- ``OpQueryInstances``, where we have to be able to query these new
|
846 |
948 |
parameters; the syntax for names will be ``hvparam/$NAME`` and
|
847 |
949 |
``beparam/$NAME`` for querying an individual parameter out of one
|
848 |
950 |
dictionary, and ``hvparams``, respectively ``beparams``, for the whole
|
849 |
951 |
dictionaries
|
850 |
|
- OpModifyInstance, where the the modified parameters are sent as
|
|
952 |
- ``OpModifyInstance``, where the modified parameters are sent as
|
851 |
953 |
dictionaries
|
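The proposed query-field syntax can be illustrated with a small resolver (a hypothetical helper, not the actual query code):

```python
def query_field(instance, field):
    # "hvparam/$NAME" / "beparam/$NAME": one parameter out of a dictionary
    for prefix, attr in (("hvparam/", "hvparams"), ("beparam/", "beparams")):
        if field.startswith(prefix):
            return instance[attr].get(field[len(prefix):])
    # "hvparams" / "beparams": the whole dictionary
    if field in ("hvparams", "beparams"):
        return instance[field]
    raise KeyError(field)

inst = {"hvparams": {"vnc_console_port": 5901}, "beparams": {"memory": 512}}
```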
852 |
954 |
|
853 |
955 |
Additionally, we will need new OpCodes to modify the cluster-level
|
... | ... | |
891 |
993 |
assumptions made initially are not true and that more flexibility is
|
892 |
994 |
needed.
|
893 |
995 |
|
894 |
|
One main assupmtion made was that disk failures should be treated as 'rare'
|
|
996 |
One main assumption made was that disk failures should be treated as 'rare'
|
895 |
997 |
events, and that each of them needs to be manually handled in order to ensure
|
896 |
998 |
data safety; however, both these assumptions are false:
|
897 |
999 |
|
898 |
|
- disk failures can be a common occurence, based on usage patterns or cluster
|
|
1000 |
- disk failures can be a common occurrence, based on usage patterns or cluster
|
899 |
1001 |
size
|
900 |
1002 |
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
|
901 |
1003 |
automate more of the recovery
|
... | ... | |
956 |
1058 |
parameters.
|
957 |
1059 |
|
958 |
1060 |
This means that we in effect take ownership of the minor space for
|
959 |
|
that device type; if there's a user-created drbd minor, it will be
|
|
1061 |
that device type; if there's a user-created DRBD minor, it will be
|
960 |
1062 |
automatically removed.
|
961 |
1063 |
|
962 |
1064 |
The change will have the effect of reducing the number of external
|
963 |
1065 |
commands run per device from a constant number times the index of the
|
964 |
1066 |
first free DRBD minor to just a constant number.
|
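The cost difference can be seen in a toy model of the old allocation scheme, where every candidate minor had to be probed (each probe costing a constant number of external commands) until a free one was found:

```python
def first_free_minor(used_minors):
    """Old-style scan: probe minors 0, 1, 2, ... until one is free."""
    probes, minor = 0, 0
    while minor in used_minors:
        probes += 1  # each probe cost a constant number of external commands
        minor += 1
    return minor, probes

minor, probes = first_free_minor({0, 1, 2, 5})
```

With minors reserved in the configuration instead, the lookup no longer depends on how many minors are already in use.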
965 |
1067 |
|
966 |
|
Removal of obsolete device types (md, drbd7)
|
|
1068 |
Removal of obsolete device types (MD, DRBD7)
|
967 |
1069 |
++++++++++++++++++++++++++++++++++++++++++++
|
968 |
1070 |
|
969 |
1071 |
We need to remove these device types because of two issues. First,
|
970 |
|
drbd7 has bad failure modes in case of dual failures (both network and
|
|
1072 |
DRBD7 has bad failure modes in case of dual failures (both network and
|
971 |
1073 |
disk) - it cannot propagate the error up the device stack and instead
|
972 |
|
just panics. Second, due to the assymetry between primary and
|
973 |
|
secondary in md+drbd mode, we cannot do live failover (not even if we
|
974 |
|
had md+drbd8).
|
|
1074 |
just panics. Second, due to the asymmetry between primary and
|
|
1075 |
secondary in MD+DRBD mode, we cannot do live failover (not even if we
|
|
1076 |
had MD+DRBD8).
|
975 |
1077 |
|
976 |
1078 |
File-based storage support
|
977 |
1079 |
++++++++++++++++++++++++++
|
978 |
1080 |
|
979 |
|
This is covered by a separate design doc (<em>Vinales</em>) and
|
980 |
|
would allow us to get rid of the hard requirement for testing
|
981 |
|
clusters; it would also allow people who have SAN storage to do live
|
982 |
|
failover taking advantage of their storage solution.
|
|
1081 |
Using files instead of logical volumes for instance storage would
|
|
1082 |
allow us to get rid of the hard requirement for volume groups for
|
|
1083 |
testing clusters, and it would also allow the use of SAN storage to do
|
|
1084 |
live failover taking advantage of this storage solution.
|
983 |
1085 |
|
984 |
1086 |
Better LVM allocation
|
985 |
1087 |
+++++++++++++++++++++
|
... | ... | |
1030 |
1132 |
#. if no, and previous status was no, do nothing
|
1031 |
1133 |
#. if no, and previous status was yes:
|
1032 |
1134 |
#. if more than one node is inconsistent, do nothing
|
1033 |
|
#. if only one node is incosistent:
|
|
1135 |
#. if only one node is inconsistent:
|
1034 |
1136 |
#. run ``vgreduce --removemissing``
|
1035 |
|
#. log this occurence in the ganeti log in a form that
|
|
1137 |
#. log this occurrence in the Ganeti log in a form that
|
1036 |
1138 |
can be used for monitoring
|
1037 |
1139 |
#. [FUTURE] run ``replace-disks`` for all
|
1038 |
1140 |
instances affected
|
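The decision procedure above can be sketched as a pure function of the current check result, the previous status and the set of inconsistent nodes (the check itself that yields the yes/no answer is outside this excerpt):

```python
def vg_check_actions(ok_now, ok_before, inconsistent_nodes):
    """Return the actions to take, following the steps above."""
    if ok_now:
        return []                       # nothing to repair
    if not ok_before:
        return []                       # no, and previous status was no
    if len(inconsistent_nodes) != 1:
        return []                       # more than one inconsistent node
    return ["vgreduce --removemissing",
            "log occurrence for monitoring"]

actions = vg_check_actions(False, True, ["node2"])
```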
... | ... | |
1067 |
1169 |
- verify that S2 (the node the user has chosen to keep as secondary) has
|
1068 |
1170 |
valid data (is consistent)
|
1069 |
1171 |
|
1070 |
|
- tear down the current DRBD association and setup a drbd pairing between
|
|
1172 |
- tear down the current DRBD association and set up a DRBD pairing between
|
1071 |
1173 |
P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
|
1072 |
|
start resyncing from S2
|
|
1174 |
start re-syncing from S2
|
1073 |
1175 |
|
1074 |
1176 |
- as soon as P2 is in state SyncTarget (i.e. after the resync has started
|
1075 |
1177 |
but before it has finished), we can promote it to primary role (r/w)
|
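The sequence can be sketched as follows; the ``Drbd`` class is a toy simulation standing in for the real node operations, and only the ordering of the steps follows the list above:

```python
class Drbd:
    """Toy stand-in for the real DRBD node operations."""
    def __init__(self, consistent_nodes):
        self.consistent = set(consistent_nodes)
        self.log = []

    def is_consistent(self, node):
        return node in self.consistent

    def pair(self, primary, secondary):
        # tears down any previous association and starts the resync;
        # the new primary enters the SyncTarget state
        self.log.append(("pair", primary, secondary))

    def promote(self, node):
        self.log.append(("promote", node))

def replace_secondary(p2, s2, drbd):
    if not drbd.is_consistent(s2):   # S2 must hold valid data
        raise RuntimeError("S2 is not consistent, aborting")
    drbd.pair(p2, s2)                # P2 has no data, re-syncs from S2
    drbd.promote(p2)                 # safe once P2 is in SyncTarget

d = Drbd(consistent_nodes={"s2"})
replace_secondary("p2", "s2", d)
```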
... | ... | |
1083 |
1185 |
will cause I/O errors on the instance, so (if a longer instance
|
1084 |
1186 |
downtime is acceptable) we can postpone the restart of the instance
|
1085 |
1187 |
until the resync is done. However, disk I/O errors on S2 will cause
|
1086 |
|
dataloss, since we don't have a good copy of the data anymore, so in
|
|
1188 |
data loss, since we don't have a good copy of the data anymore, so in
|
1087 |
1189 |
this case waiting for the sync to complete is not an option. As such,
|
1088 |
1190 |
it is recommended that this feature is used only in conjunction with
|
1089 |
1191 |
proper disk monitoring.
|
... | ... | |
1096 |
1198 |
+++++++
|
1097 |
1199 |
|
1098 |
1200 |
The dynamic device model, while more complex, has an advantage: it
|
1099 |
|
will not reuse by mistake another's instance DRBD device, since it
|
1100 |
|
always looks for either our own or a free one.
|
|
1201 |
will not mistakenly reuse the DRBD device of another instance, since
|
|
1202 |
it always looks for either our own or a free one.
|
1101 |
1203 |
|
1102 |
1204 |
The static one, in contrast, will assume that given a minor number N,
|
1103 |
1205 |
it's ours and we can take over. This needs careful implementation such
|
1104 |
1206 |
that if the minor is in use, either we are able to cleanly shut it
|
1105 |
1207 |
down, or we abort the startup. Otherwise, it could be that we start
|
1106 |
|
syncing between two instance's disks, causing dataloss.
|
|
1208 |
syncing between two instances' disks, causing data loss.
|
1107 |
1209 |
|
1108 |
1210 |
|
1109 |
1211 |
Variable number of disk/NICs per instance
|
... | ... | |
1115 |
1217 |
In order to support high-security scenarios (for example read-only sda
|
1116 |
1218 |
and read-write sdb), we need to make a fully flexible disk
|
1117 |
1219 |
definition. This has less impact than it might seem at first sight:
|
1118 |
|
only the instance creation has hardcoded number of disks, not the disk
|
|
1220 |
only the instance creation has a hard-coded number of disks, not the disk
|
1119 |
1221 |
handling code. The block device handling and most of the instance
|
1120 |
1222 |
handling code is already working with "the instance's disks" as
|
1121 |
1223 |
opposed to "the two disks of the instance", but some pieces are not
|
... | ... | |
1123 |
1225 |
|
1124 |
1226 |
The objective is to be able to specify the number of disks at
|
1125 |
1227 |
instance creation, and to be able to toggle a disk from read-only to
|
1126 |
|
read-write a disk afterwards.
|
|
1228 |
read-write afterward.
|
1127 |
1229 |
|
1128 |
1230 |
Variable number of NICs
|
1129 |
1231 |
+++++++++++++++++++++++
|
... | ... | |
1131 |
1233 |
Similar to the disk change, we need to allow multiple network
|
1132 |
1234 |
interfaces per instance. This will affect the internal code (some
|
1133 |
1235 |
functions will have to stop assuming that ``instance.nics`` is a list
|
1134 |
|
of length one), the OS api which currently can export/import only one
|
|
1236 |
of length one), the OS API which currently can export/import only one
|
1135 |
1237 |
instance, and the command line interface.
|
1136 |
1238 |
|
1137 |
1239 |
Interface changes
|
... | ... | |
1176 |
1278 |
When designing the new OS API our priorities are:
|
1177 |
1279 |
- ease of use
|
1178 |
1280 |
- future extensibility
|
1179 |
|
- ease of porting from the old api
|
|
1281 |
- ease of porting from the old API
|
1180 |
1282 |
- modularity
|
1181 |
1283 |
|
1182 |
1284 |
As such we want to limit the number of scripts that must be written to support
|
... | ... | |
1228 |
1330 |
instances will be forced to have a number of disks greater than or equal to the
|
1229 |
1331 |
one of the export.
|
1230 |
1332 |
- Some scripts are not compulsory: if such a script is missing, the relevant
|
1231 |
|
operations will be forbidden for instances of that os. This makes it easier
|
|
1333 |
operations will be forbidden for instances of that OS. This makes it easier
|
1232 |
1334 |
to distinguish between unsupported operations and no-op ones (if any).
|
1233 |
1335 |
|
1234 |
1336 |
|
... | ... | |
1239 |
1341 |
inputs from environment variables. We expect the following input values:
|
1240 |
1342 |
|
1241 |
1343 |
OS_API_VERSION
|
1242 |
|
The version of the OS api that the following parameters comply with;
|
|
1344 |
The version of the OS API that the following parameters comply with;
|
1243 |
1345 |
this is used so that in the future we could have OSes supporting
|
1244 |
1346 |
multiple versions and thus Ganeti can send the proper version in this
|
1245 |
1347 |
parameter
|
1246 |
1348 |
INSTANCE_NAME
|
1247 |
1349 |
Name of the instance acted on
|
1248 |
1350 |
HYPERVISOR
|
1249 |
|
The hypervisor the instance should run on (eg. 'xen-pvm', 'xen-hvm', 'kvm')
|
|
1351 |
The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', 'kvm')
|
1250 |
1352 |
DISK_COUNT
|
1251 |
1353 |
The number of disks this instance will have
|
1252 |
1354 |
NIC_COUNT
|
1253 |
|
The number of nics this instance will have
|
|
1355 |
The number of NICs this instance will have
|
1254 |
1356 |
DISK_<N>_PATH
|
1255 |
1357 |
Path to the Nth disk.
|
1256 |
1358 |
DISK_<N>_ACCESS
|
... | ... | |
1268 |
1370 |
NIC_<N>_BRIDGE
|
1269 |
1371 |
Node bridge the Nth network interface will be connected to
|
1270 |
1372 |
NIC_<N>_FRONTEND_TYPE
|
1271 |
|
Type of the Nth nic as seen by the instance. For example 'virtio', 'rtl8139', etc.
|
|
1373 |
Type of the Nth NIC as seen by the instance. For example 'virtio',
|
|
1374 |
'rtl8139', etc.
|
1272 |
1375 |
DEBUG_LEVEL
|
1273 |
1376 |
Whether more output should be produced, for debugging purposes. Currently the
|
1274 |
1377 |
only valid values are 0 and 1.
|
1275 |
1378 |
|
1276 |
|
These are only the basic variables we are thinking of now, but more may come
|
1277 |
|
during the implementation and they will be documented in the ganeti-os-api man
|
1278 |
|
page. All these variables will be available to all scripts.
|
|
1379 |
These are only the basic variables we are thinking of now, but more
|
|
1380 |
may come during the implementation and they will be documented in the
|
|
1381 |
``ganeti-os-api`` man page. All these variables will be available to
|
|
1382 |
all scripts.
|
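A sketch of how such an environment could be assembled from an instance definition; the variable names follow the list above, while the helper and the data layout are illustrative:

```python
def build_os_env(instance, api_version=5, debug=0):
    """Build the environment dictionary passed to the OS scripts."""
    env = {
        "OS_API_VERSION": str(api_version),
        "INSTANCE_NAME": instance["name"],
        "HYPERVISOR": instance["hypervisor"],
        "DISK_COUNT": str(len(instance["disks"])),
        "NIC_COUNT": str(len(instance["nics"])),
        "DEBUG_LEVEL": str(debug),
    }
    # per-device variables, indexed from 0
    for n, disk in enumerate(instance["disks"]):
        env["DISK_%d_PATH" % n] = disk["path"]
        env["DISK_%d_ACCESS" % n] = disk["access"]
    for n, nic in enumerate(instance["nics"]):
        env["NIC_%d_BRIDGE" % n] = nic["bridge"]
    return env

env = build_os_env({
    "name": "test1.example.com", "hypervisor": "xen-pvm",
    "disks": [{"path": "/dev/xvda", "access": "w"}],
    "nics": [{"bridge": "xen-br0"}],
})
```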
1279 |
1383 |
|
1280 |
1384 |
Some scripts will need additional information to work. These will have
|
1281 |
1385 |
per-script variables, such as for example:
|
... | ... | |
1304 |
1408 |
create and import scripts are supposed to format/initialise the given block
|
1305 |
1409 |
devices and install the correct instance data. The export script is supposed to
|
1306 |
1410 |
export instance data to stdout in a format understandable by the import
|
1307 |
|
script. The data will be compressed by ganeti, so no compression should be
|
|
1411 |
script. The data will be compressed by Ganeti, so no compression should be
|
1308 |
1412 |
done. The rename script should only modify the instance's knowledge of what
|
1309 |
1413 |
its name is.
|
1310 |
1414 |
|
... | ... | |
1312 |
1416 |
++++++++++++++++++++++++++++++++
|
1313 |
1417 |
|
1314 |
1418 |
Similar to Ganeti 1.2, OS specifications will need to provide a
|
1315 |
|
'ganeti_api_version' containing list of numbers matching the version(s) of the
|
1316 |
|
api they implement. Ganeti itself will always be compatible with one version of
|
1317 |
|
the API and may maintain retrocompatibility if it's feasible to do so. The
|
1318 |
|
numbers are one-per-line, so an OS supporting both version 5 and version 20
|
1319 |
|
will have a file containing two lines. This is different from Ganeti 1.2, which
|
1320 |
|
only supported one version number.
|
|
1419 |
'ganeti_api_version' file containing a list of numbers matching the
|
|
1420 |
version(s) of the API they implement. Ganeti itself will always be
|
|
1421 |
compatible with one version of the API and may maintain backwards
|
|
1422 |
compatibility if it's feasible to do so. The numbers are one-per-line,
|
|
1423 |
so an OS supporting both version 5 and version 20 will have a file
|
|
1424 |
containing two lines. This is different from Ganeti 1.2, which only
|
|
1425 |
supported one version number.
|
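Parsing the one-number-per-line file and checking it against the version Ganeti speaks is simple; a sketch with illustrative helper names:

```python
def parse_api_versions(text):
    """Parse the contents of an OS's ganeti_api_version file."""
    return set(int(line) for line in text.splitlines() if line.strip())

def os_compatible(file_text, ganeti_api_version):
    """Check whether the OS declares the API version Ganeti implements."""
    return ganeti_api_version in parse_api_versions(file_text)

# an OS supporting both version 5 and version 20 ships a two-line file:
TWO_VERSIONS = "5\n20\n"
```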
1321 |
1426 |
|
1322 |
1427 |
In addition to that, an OS will be able to declare that it supports only a
|
1323 |
|
subset of the ganeti hypervisors, by declaring them in the 'hypervisors' file.
|
|
1428 |
subset of the Ganeti hypervisors, by declaring them in the 'hypervisors' file.
|
1324 |
1429 |
|
1325 |
1430 |
|
1326 |
1431 |
Caveats/Notes
|
... | ... | |
1341 |
1446 |
Remote API changes
|
1342 |
1447 |
~~~~~~~~~~~~~~~~~~
|
1343 |
1448 |
|
1344 |
|
The first Ganeti RAPI was designed and deployed with the Ganeti 1.2.5 release.
|
1345 |
|
That version provide Read-Only access to a cluster state. Fully functional
|
1346 |
|
read-write API demand significant internal changes which are in a pipeline for
|
1347 |
|
Ganeti 2.0 release.
|
|
1449 |
The first Ganeti remote API (RAPI) was designed and deployed with the
|
|
1450 |
Ganeti 1.2.5 release. That version provided read-only access to the
|
|
1451 |
cluster state. A fully functional read-write API demands significant
|
|
1452 |
internal changes which will be implemented in version 2.0.
|
1348 |
1453 |
|
1349 |
|
We decided to go with implementing the Ganeti RAPI in a RESTful way, which is
|
1350 |
|
aligned with key features we looking. It is simple, stateless, scalable and
|
1351 |
|
extensible paradigm of API implementation. As transport it uses HTTP over SSL,
|
1352 |
|
and we are implementing it in JSON encoding, but in a way it possible to extend
|
1353 |
|
and provide any other one.
|
|
1454 |
We decided to go with implementing the Ganeti RAPI in a RESTful way,
|
|
1455 |
which is aligned with the key features we are looking for. It is a simple,
|
|
1456 |
stateless, scalable and extensible paradigm of API implementation. As
|
|
1457 |
transport it uses HTTP over SSL, and we are implementing it with JSON
|
|
1458 |
encoding, but in a way that makes it possible to extend it with any other
|
|
1459 |
one.
|
1354 |
1460 |
|
1355 |
1461 |
Design
|
1356 |
1462 |
++++++
|
1357 |
1463 |
|
1358 |
|
The Ganeti API implemented as independent daemon, running on the same node
|
1359 |
|
with the same permission level as Ganeti master daemon. Communication done
|
1360 |
|
through unix socket protocol provided by Ganeti luxi library.
|
1361 |
|
In order to keep communication asynchronous RAPI process two types of client
|
1362 |
|
requests:
|
|
1464 |
The Ganeti RAPI is implemented as an independent daemon, running on the
|
|
1465 |
same node with the same permission level as the Ganeti master
|
|
1466 |
daemon. Communication is done through the LUXI library to the master
|
|
1467 |
daemon. In order to keep communication asynchronous, the RAPI processes two
|
|
1468 |
types of client requests:
|
1363 |
1469 |
|
1364 |
|
- queries: sever able to answer immediately
|
1365 |
|
- jobs: some time needed.
|
|
1470 |
- queries: server is able to answer immediately
|
|
1471 |
- job submission: some time is required for a useful response
|
1366 |
1472 |
|
1367 |
|
In the query case requested data send back to client in http body. Typical
|
1368 |
|
examples of queries would be list of nodes, instances, cluster info, etc.
|
1369 |
|
Dealing with jobs client instead of waiting until job completes receive a job
|
1370 |
|
id, the identifier which allows to query the job progress in the job queue.
|
1371 |
|
(See job queue design doc for details)
|
|
1473 |
In the query case, the requested data is sent back to the client in the HTTP
|
|
1474 |
response body. Typical examples of queries would be: list of nodes,
|
|
1475 |
instances, cluster info, etc.
|
1372 |
1476 |
|
1373 |
|
Internally, each exported object has an version identifier, which is used as a
|
1374 |
|
state stamp in the http header E-Tag field for request/response to avoid a race
|
1375 |
|
condition.
|
|
1477 |
In the case of job submission, the client receives a job ID, an
|
|
1478 |
identifier which allows it to query the job's progress in the job queue
|
|
1479 |
(see `Job Queue`_).
|
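From the client's point of view the two request types look like this; the class below is a toy stand-in (the real transport is HTTP over SSL, and real jobs finish asynchronously):

```python
class SketchRapiClient:
    """Simulated RAPI client showing queries vs. job submission."""
    def __init__(self):
        self._jobs = {}
        self._next_job = 1

    def query(self, resource):
        # answered immediately: list of nodes, instances, cluster info, ...
        return {"nodes": ["node1", "node2"]}[resource]

    def submit_job(self, opcode):
        # returns a job ID right away instead of waiting for completion
        job_id = self._next_job
        self._next_job += 1
        self._jobs[job_id] = "success"   # real jobs finish asynchronously
        return job_id

    def job_status(self, job_id):
        # the client polls the job queue with the ID it received
        return self._jobs[job_id]

client = SketchRapiClient()
nodes = client.query("nodes")
job_id = client.submit_job("OP_INSTANCE_STARTUP")
```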
|
1480 |
|
|
1481 |
Internally, each exported object has a version identifier, which is
|
|
1482 |
used as a state identifier in the HTTP header E-Tag field for
|
|
1483 |
requests/responses to avoid race conditions.
|
1376 |
1484 |
|
1377 |
1485 |
|
1378 |
1486 |
Resource representation
|
1379 |
1487 |
+++++++++++++++++++++++
|
1380 |
1488 |
|
1381 |
|
The key difference of REST approach from others API is instead having one URI
|
1382 |
|
for all our requests, REST demand separate service by resources with unique
|
1383 |
|
URI. Each of them should have limited amount of stateless and standard HTTP
|
|
1489 |
The key difference of using REST instead of other API styles is that REST
|
|
1490 |
requires separation of services via resources with unique URIs. Each
|
|
1491 |
of them should have a limited amount of state and support the standard HTTP
|
1384 |
1492 |
methods: GET, POST, DELETE, PUT.
|
1385 |
1493 |
|
1386 |
|
For example in Ganeti case we can have a set of URI:
|
1387 |
|
- /{clustername}/instances
|
1388 |
|
- /{clustername}/instances/{instancename}
|
1389 |
|
- /{clustername}/instances/{instancename}/tag
|
1390 |
|
- /{clustername}/tag
|
|
1494 |
For example, in Ganeti's case we can have a set of URIs:
|
|
1495 |
|
|
1496 |
- ``/{clustername}/instances``
|
|
1497 |
- ``/{clustername}/instances/{instancename}``
|
|
1498 |
- ``/{clustername}/instances/{instancename}/tag``
|
|
1499 |
- ``/{clustername}/tag``
|
1391 |
1500 |
|
1392 |
|
A GET request to /{clustername}/instances will return list of instances, a POST
|
1393 |
|
to /{clustername}/instances should create new instance, a DELETE
|
1394 |
|
/{clustername}/instances/{instancename} should delete instance, a GET
|
1395 |
|
/{clustername}/tag get cluster tag
|
|
1501 |
A GET request to ``/{clustername}/instances`` will return the list of
|
|
1502 |
instances, a POST to ``/{clustername}/instances`` should create a new
|
|
1503 |
instance, a DELETE ``/{clustername}/instances/{instancename}`` should
|
|
1504 |
delete the instance, and a GET ``/{clustername}/tag`` should return the
|
|
1505 |
cluster tags.
|
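This method/URI mapping amounts to a small dispatch table; the handler names below are illustrative:

```python
# (HTTP method, URI template) -> handler; purely illustrative names
ROUTES = {
    ("GET", "/{clustername}/instances"): "list_instances",
    ("POST", "/{clustername}/instances"): "create_instance",
    ("DELETE", "/{clustername}/instances/{instancename}"): "delete_instance",
    ("GET", "/{clustername}/tag"): "get_cluster_tags",
}

def dispatch(method, uri_template):
    """Resolve a request to its handler name."""
    return ROUTES[(method, uri_template)]
```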
1396 |
1506 |
|
1397 |
|
Each resource URI has a version prefix. The complete list of resources id TBD.
|
|
1507 |
Each resource URI will have a version prefix. The resource IDs are to
|
|
1508 |
be determined.
|
1398 |
1509 |
|
1399 |
|
Internal encoding might be JSON, XML, or any other. The JSON encoding fits
|
1400 |
|
nicely in Ganeti RAPI needs. Specific representation client can request with
|
1401 |
|
Accept field in the HTTP header.
|
|
1510 |
Internal encoding might be JSON, XML, or any other. The JSON encoding
|
|
1511 |
fits the Ganeti RAPI needs nicely. The client can request a specific
|
|
1512 |
representation via the Accept field in the HTTP header.
|
1402 |
1513 |
|
1403 |
|
The REST uses standard HTTP as application protocol (not just as a transport)
|
1404 |
|
for resources access. Set of possible result codes is a subset of standard HTTP
|
1405 |
|
results. The stateless provide additional reliability and transparency to
|
1406 |
|
operations.
|
|
1514 |
REST uses HTTP as its transport and application protocol for resource
|
|
1515 |
access. The set of possible responses is a subset of standard HTTP
|
|
1516 |
responses.
|
|
1517 |
|
|
1518 |
The stateless model provides additional reliability and
|
|
1519 |
transparency to operations (e.g. only one request needs to be analyzed
|
|
1520 |
to understand the in-progress operation, not a sequence of multiple
|
|
1521 |
requests/responses).
|
1407 |
1522 |
|
1408 |
1523 |
|
1409 |
1524 |
Security
|
1410 |
1525 |
++++++++
|
1411 |
1526 |
|
1412 |
|
With the write functionality security becomes much bigger an issue. The Ganeti
|
1413 |
|
RAPI uses basic HTTP authentication on top of SSL connection to grant access to
|
1414 |
|
an exported resource. The password stores locally in Apache-style .htpasswd
|
1415 |
|
file. Only one level of privileges is supported.
|
|
1527 |
With the write functionality, security becomes a much bigger issue.
|
|
1528 |
The Ganeti RAPI uses basic HTTP authentication on top of an
|
|
1529 |
SSL-secured connection to grant access to an exported resource. The
|
|
1530 |
password is stored locally in an Apache-style ``.htpasswd`` file. Only
|
|
1531 |
one level of privileges is supported.
|
|
1532 |
|
|
1533 |
Caveats
|
|
1534 |
+++++++
|
|
1535 |
|
|
1536 |
The model detailed above for job submission requires the client to
|
|
1537 |
poll periodically for updates to the job; an alternative would be to
|
|
1538 |
allow the client to request a callback, or a 'wait for updates' call.
|
|
1539 |
|
|
1540 |
The callback model was not chosen due to the following two issues:
|
1416 |
1541 |
|
|
1542 |
- callbacks would require a new model of allowed callback URLs,
|
|
1543 |
together with a method of managing these
|
|
1544 |
- callbacks only work when the client and the master are in the same
|
|
1545 |
security domain, and they fail in the other cases (e.g. when there is
|
|
1546 |
a firewall between the client and the RAPI daemon that only allows
|
|
1547 |
client-to-RAPI calls, which is usual in DMZ cases)
|
|
1548 |
|
|
1549 |
The 'wait for updates' method is not suited to the HTTP protocol,
|
|
1550 |
where requests are supposed to be short-lived.
|
1417 |
1551 |
|
1418 |
1552 |
Command line changes
|
1419 |
1553 |
~~~~~~~~~~~~~~~~~~~~
|
1420 |
1554 |
|
1421 |
1555 |
Ganeti 2.0 introduces several new features as well as new ways to
|
1422 |
1556 |
handle instance resources like disks or network interfaces. This
|
1423 |
|
requires some noticable changes in the way commandline arguments are
|
|
1557 |
requires some noticeable changes in the way command line arguments are
|
1424 |
1558 |
handled.
|
1425 |
1559 |
|
1426 |
|
- extend and modify commandline syntax to support new features
|
1427 |
|
- ensure consistent patterns in commandline arguments to reduce cognitive load
|
|
1560 |
- extend and modify command line syntax to support new features
|
|
1561 |
- ensure consistent patterns in command line arguments to reduce
|
|
1562 |
cognitive load
|
1428 |
1563 |
|
1429 |
1564 |
The design changes that require these changes are, in no particular
|
1430 |
1565 |
order:
|
... | ... | |
1437 |
1572 |
cluster, each supporting different parameters,
|
1438 |
1573 |
- support for device type CDROM (via ISO image)
|
1439 |
1574 |
|
1440 |
|
As such, there are several areas of Ganeti where the commandline
|
|
1575 |
As such, there are several areas of Ganeti where the command line
|
1441 |
1576 |
arguments will change:
|
1442 |
1577 |
|
1443 |
1578 |
- Cluster configuration
|
... | ... | |
1452 |
1587 |
- handling of CDROM devices and
|
1453 |
1588 |
- handling of hypervisor specific options.
|
1454 |
1589 |
|
1455 |
|
There are several areas of Ganeti where the commandline arguments will change:
|
|
1590 |
There are several areas of Ganeti where the command line arguments
|
|
1591 |
will change:
|
1456 |
1592 |
|
1457 |
1593 |
- Cluster configuration
|
1458 |
1594 |
|
... | ... | |
1552 |
1688 |
:--net: for network interface cards
|
1553 |
1689 |
:--disk: for disk devices
|
1554 |
1690 |
|
1555 |
|
The syntax to the device specific options is similiar to the generic
|
|
1691 |
The syntax of the device-specific options is similar to the generic
|
1556 |
1692 |
device options, but instead of specifying a device number like for
|
1557 |
1693 |
``gnt-instance add``, you specify the magic string ``add``. The new device
|
1558 |
1694 |
will always be appended at the end of the list of devices of this type
|
... | ... | |
1584 |
1720 |
:--net: for network interface cards
|
1585 |
1721 |
:--disk: for disk devices
|
1586 |
1722 |
|
1587 |
|
The syntax to the device specific options is similiar to the generic
|
|
1723 |
The syntax of the device-specific options is similar to the generic
|
1588 |
1724 |
device options. The device number you specify identifies the device to
|
1589 |
1725 |
be modified.
|
1590 |
1726 |
|
1591 |
|
Example: gnt-instance modify --disk 2:access=r
|
|
1727 |
Example::
|
|
1728 |
|
|
1729 |
gnt-instance modify --disk 2:access=r
|
1592 |
1730 |
|
1593 |
1731 |
Hypervisor Options
|
1594 |
1732 |
++++++++++++++++++
|
... | ... | |
1596 |
1734 |
Ganeti 2.0 will support more than one hypervisor. Different
|
1597 |
1735 |
hypervisors have various options that only apply to a specific
|
1598 |
1736 |
hypervisor. Those hypervisor specific options are treated specially
|
1599 |
|
via the --hypervisor option. The generic syntax of the hypervisor
|
1600 |
|
option is as follows:
|
|
1737 |
via the ``--hypervisor`` option. The generic syntax of the hypervisor
|
|
1738 |
option is as follows::
|
1601 |
1739 |
|
1602 |
1740 |
--hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
|
1603 |
1741 |
|
... | ... | |
1608 |
1746 |
:$VALUE: hypervisor option value, string
|
1609 |
1747 |
|
1610 |
1748 |
The hypervisor option for an instance can be set on instance creation
|
1611 |
|
time via the gnt-instance add command. If the hypervisor for an
|
|
1749 |
time via the ``gnt-instance add`` command. If the hypervisor for an
|
1612 |
1750 |
instance is not specified upon instance creation, the default
|
1613 |
1751 |
hypervisor will be used.
|
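Parsing this generic syntax into a hypervisor name plus an option dictionary is straightforward; a sketch (not the actual Ganeti parser):

```python
def parse_hypervisor_option(value):
    """Parse $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] into a pair."""
    hypervisor, _, rest = value.partition(":")
    options = {}
    if rest:
        for pair in rest.split(","):
            name, _, val = pair.partition("=")
            options[name] = val
    return hypervisor, options

hv, opts = parse_hypervisor_option(
    "xen-hvm:cdrom=/srv/boot.iso,boot_order=cdrom")
```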
1614 |
1752 |
|
... | ... | |
1616 |
1754 |
+++++++++++++++++++++++++++++++
|
1617 |
1755 |
|
1618 |
1756 |
The hypervisor parameters of an existing instance can be modified
|
1619 |
|
using --hypervisor option of the gnt-instance modify command. However,
|
1620 |
|
the hypervisor type of an existing instance can not be changed, only
|
1621 |
|
the particular hypervisor specific option can be changed. Therefore,
|
1622 |
|
the format of the option parameters has been simplified to omit the
|
1623 |
|
hypervisor name and only contain the comma separated list of
|
1624 |
|
option-value pairs.
|
|
1757 |
using ``--hypervisor`` option of the ``gnt-instance modify``
|
|
1758 |
command. However, the hypervisor type of an existing instance cannot
|
|
1759 |
be changed, only the particular hypervisor-specific options can be
|
|
1760 |
changed. Therefore, the format of the option parameters has been
|
|
1761 |
simplified to omit the hypervisor name and only contain the comma
|
|
1762 |
separated list of option-value pairs.
|
1625 |
1763 |
|
1626 |
|
Example: gnt-instance modify --hypervisor
|
1627 |
|
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance
|
|
1764 |
Example::
|
|
1765 |
|
|
1766 |
gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance
|
1628 |
1767 |
|
1629 |
1768 |
gnt-cluster commands
|
1630 |
1769 |
++++++++++++++++++++
|
... | ... | |
1664 |
1803 |
Hypervisor cluster defaults
|
1665 |
1804 |
+++++++++++++++++++++++++++
|
1666 |
1805 |
|
1667 |
|
The generic format of the hypervisor clusterwide default setting option is:
|
|
1806 |
The generic format of the hypervisor cluster-wide default setting
|
|
1807 |
option is::
|
1668 |
1808 |
|
1669 |
1809 |
--hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
|
1670 |
1810 |
|
... | ... | |
1673 |
1813 |
:$OPTION: cluster default option, string,
|
1674 |
1814 |
:$VALUE: cluster default option value, string.
|
1675 |
1815 |
|
|
1816 |
Glossary
|
|
1817 |
========
|
1676 |
1818 |
|
1677 |
|
Functionality changes
|
1678 |
|
---------------------
|
|
1819 |
Since this document is only a delta from Ganeti 1.2, there are
|
|
1820 |
some unexplained terms. Here is a non-exhaustive list.
|
1679 |
1821 |
|
1680 |
|
The disk storage will receive some changes, and will also remove
|
1681 |
|
support for the drbd7 and md disk types. See the
|
1682 |
|
design-2.0-disk-changes document.
|
|
1822 |
.. _HVM:
|
1683 |
1823 |
|
1684 |
|
The configuration storage will be changed, with the effect that more
|
1685 |
|
data will be available on the nodes for access from outside ganeti
|
1686 |
|
(e.g. from shell scripts) and that nodes will get slightly more
|
1687 |
|
awareness of the cluster configuration.
|
|
1824 |
HVM
|
|
1825 |
hardware virtualization mode, where the virtual machine is oblivious
|
|
1826 |
to the fact that it's being virtualized and all the hardware is emulated
|
1688 |
1827 |
|
1689 |
|
The RAPI will enable modify operations (beside the read-only queries
|
1690 |
|
that are available today), so in effect almost all the operations
|
1691 |
|
available today via the ``gnt-*`` commands will be available via the
|
1692 |
|
remote API.
|
|
1828 |
.. _LU:
|
1693 |
1829 |
|
1694 |
|
A change in the hypervisor support area will be that we will support
|
1695 |
|
multiple hypervisors in parallel in the same cluster, so one could run
|
1696 |
|
Xen HVM side-by-side with Xen PVM on the same cluster.
|
|
1830 |
LogicalUnit
|
|
1831 |
the code associated with an OpCode, e.g. the code that implements the
|
|
1832 |
startup of an instance
|
1697 |
1833 |
|
1698 |
|
New features
|
1699 |
|
------------
|
|
1834 |
.. _opcode:
|
|
1835 |
|
|
1836 |
OpCode
|
|
1837 |
a data structure encapsulating a basic cluster operation; for example,
|
|
1838 |
start instance, add instance, etc.
|
|
1839 |
|
|
1840 |
.. _PVM:
|
1700 |
1841 |
|
1701 |
|
There will be a number of minor feature enhancements targeted to
|
1702 |
|
either 2.0 or subsequent 2.x releases:
|
|
1842 |
PVM
|
|
1843 |
para-virtualization mode, where the virtual machine knows it's being
|
|
1844 |
virtualized and as such there is no need for hardware emulation
|
1703 |
1845 |
|
1704 |
|
- multiple disks, with custom properties (read-only/read-write, exportable,
|
1705 |
|
etc.)
|
1706 |
|
- multiple NICs
|
|
1846 |
.. _watcher:
|
1707 |
1847 |
|
1708 |
|
These changes will require OS API changes, details are in the
|
1709 |
|
design-2.0-os-interface document. And they will also require many
|
1710 |
|
command line changes, see the design-2.0-commandline-parameters
|
1711 |
|
document.
|
|
1848 |
watcher
|
|
1849 |
``ganeti-watcher`` is a tool that should be run regularly from cron
|
|
1850 |
and takes care of restarting failed instances, restarting secondary
|
|
1851 |
DRBD devices, etc. For more details, see the man page
|
|
1852 |
``ganeti-watcher(8)``.
|