Revision 7faf5110
b/NEWS
------

- Added ``--ignore-size`` to the ``gnt-instance activate-disks`` command
  to allow using the pre-2.0.2 behaviour in activation, if any existing
  instances have mismatched disk sizes in the configuration
- Added ``gnt-cluster repair-disk-sizes`` command to check and update
  any configuration mismatches for disk sizes
- Added ``gnt-master cluster-failover --no-voting`` to allow master
  failover to work on two-node clusters
- Fixed the ``--net`` option of ``gnt-backup import``, which was
  unusable
...
- the watcher now also restarts the node daemon and the rapi daemon if
  they died
- fixed the watcher to handle full and drained queue cases
- hooks export more instance data in the environment, which helps if
  hook scripts need to take action based on the instance's properties
  (no longer need to query back into ganeti)
- instance failovers when the instance is stopped do not check for free
  RAM, so that failing over a stopped instance is possible in low memory
  situations
...

- all commands are executed by a daemon (``ganeti-masterd``) and the
  various ``gnt-*`` commands are just front-ends to it
- all the commands are entered into, and executed from a job queue,
  see the ``gnt-job(8)`` manpage
- the RAPI daemon supports read-write operations, secured by basic
  HTTP authentication on top of HTTPS
- DRBD version 0.7 support has been removed, DRBD 8 is the only
  supported version (when migrating from Ganeti 1.2 to 2.0, you need
  to migrate to DRBD 8 first while still running Ganeti 1.2)
...
- Change the default reboot type in ``gnt-instance reboot`` to "hard"
- Reuse the old instance mac address by default on instance import, if
  the instance name is the same.
- Handle situations in which the node info rpc returns incomplete
  results (issue 46)
- Add checks for tcp/udp ports collisions in ``gnt-cluster verify``
- Improved version of batcher:

...
- new ``--hvm-nic-type`` and ``--hvm-disk-type`` flags to control the
  type of disk exported to fully virtualized instances.
- provide access to the serial console of HVM instances
- instance auto_balance flag, set by default. If turned off it will
  avoid warnings on cluster verify if there is not enough memory to
  fail over an instance. In the future it will prevent automatically
  failing it over when we support that.
- batcher tool for instance creation, see ``tools/README.batcher``
- ``gnt-instance reinstall --select-os`` to interactively select a new
  operating system when reinstalling an instance.
...
Version 1.2.0
-------------

- Log the ``xm create`` output to the node daemon log on failure (to
  help diagnose the error)
- In debug mode, log the output of all failed external commands
- Change parsing of lvm commands to ignore stderr

...
  reboots
- Removed dependency on debian's patched fping that uses the
  non-standard ``-S`` option
- Now the OS definitions are searched for in multiple, configurable
  paths (easier for distros to package)
- Some changes to the hooks infrastructure (especially the new
  post-configuration update hook)
- Other small bugfixes
b/doc/admin.rst
---------------

you want to remove Ganeti completely, you need to also undo some of
the SSH changes and log directories:

- ``rm -rf /var/log/ganeti /srv/ganeti`` (replace with the correct
  paths)
- remove from ``/root/.ssh`` the keys that Ganeti added (check
  the ``authorized_keys`` and ``id_dsa`` files)
- regenerate the host's SSH keys (check the OpenSSH startup scripts)
b/doc/design-2.0.rst
--------------------

- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical
design:

- fixed number of disks (two) per instance
- fixed number of NICs
...

- It is impossible for two people to efficiently interact with a
  cluster (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for
  example failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive
...

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so called
  *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:
...
There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method
  of passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
...
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that business-logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  request, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTPS protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is with the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement many workarounds to changes
...
Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti Cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:

...
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a Non-Goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti level operations,
aka Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its
scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent to their users
- automatically grabbing multiple locks in the right order (avoid
  deadlock)
- ability to transparently handle conversion to more granularity
- support asynchronous operation (future goal)

...
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time.
Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and fail.


The Locks
...
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks,
and the node locks before the config lock. Locks will need to be
acquired at the same time for multiple instances and nodes, and
internal ordering will be handled within the locking library, which,
for simplicity, will just use alphabetical order.
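
As an illustration only (this is not the library's actual code), a
minimal sketch of deadlock-safe acquisition under these rules could
look like the following; the ``instance_locks`` and ``node_locks``
dictionaries are hypothetical stand-ins for the per-object locks::

  import threading

  # Hypothetical per-object locks, normally created from the config file.
  instance_locks = dict((name, threading.Lock())
                        for name in ("inst1", "inst2"))
  node_locks = dict((name, threading.Lock())
                    for name in ("node1", "node2"))
  config_lock = threading.Lock()

  def acquire_all(instances, nodes):
    """Acquire locks in the fixed global order: instances, nodes, config.

    Within each level, alphabetical order keeps concurrent acquirers
    from deadlocking each other.
    """
    acquired = []
    for name in sorted(instances):
      instance_locks[name].acquire()
      acquired.append(instance_locks[name])
    for name in sorted(nodes):
      node_locks[name].acquire()
      acquired.append(node_locks[name])
    config_lock.acquire()
    acquired.append(config_lock)
    return acquired

  def release_all(acquired):
    # Release in reverse order of acquisition.
    for lock in reversed(acquired):
      lock.release()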

Each lock has the following three possible statuses:

...
Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each
time we split a lock into more we'll create a "metalock", which will
depend on those sub-locks and live for the time necessary for all the
code to convert (or forever, in some conditions). When a metalock
exists all converted code must acquire it in shared mode, so it can
run concurrently, but still be exclusive with old code, which acquires
it exclusively.
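
To make the semantics concrete, here is a rough sketch of such a
metalock using only the standard library (an assumption for
illustration; the real library will use its own ``SharedLock``
implementation)::

  import threading

  class MetaLock(object):
    """Illustrative shared/exclusive lock.

    Converted code acquires it shared (many holders at once); old code
    acquires it exclusively, blocking out everybody else.
    """

    def __init__(self):
      self._cond = threading.Condition()
      self._sharers = 0
      self._exclusive = False

    def acquire(self, shared=False):
      self._cond.acquire()
      try:
        if shared:
          while self._exclusive:
            self._cond.wait()
          self._sharers += 1
        else:
          while self._exclusive or self._sharers:
            self._cond.wait()
          self._exclusive = True
      finally:
        self._cond.release()

    def release(self):
      self._cond.acquire()
      try:
        if self._exclusive:
          self._exclusive = False
        else:
          self._sharers -= 1
        self._cond.notify_all()
      finally:
        self._cond.release()

Note that this naive version can starve an exclusive acquirer if
sharers keep arriving; a production implementation would need a
fairness policy on top.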

In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.

We might also want to devise more metalocks (eg. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what it needs without specifying it explicitly.

In the future things like the node locks could become metalocks,
should we decide to split them into an even more fine-grained
approach, but this will probably be only after the first 2.0 version
has been released.

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must
be added to the list. The relevant code will need to inform the
locking library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.
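
A hypothetical registration interface (names and signatures are ours,
purely for illustration) might look like::

  import threading

  class LockRegistry(object):
    """Illustrative registry of per-object locks."""

    def __init__(self):
      self._meta = threading.Lock()   # protects the registry itself
      self._locks = {}

    def add(self, name):
      # Called e.g. when a new instance or node is created.
      self._meta.acquire()
      try:
        self._locks[name] = threading.Lock()
      finally:
        self._meta.release()

    def remove(self, name):
      # Called when an instance or node disappears from the cluster.
      self._meta.acquire()
      try:
        del self._locks[name]
      finally:
        self._meta.release()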

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
...
Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block till the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on
available locks, rather than making them just blindly queue for
acquiring them. The inherent risk, though, is that any code using the
first operation, or setting a timeout for the second one, is
susceptible to starvation and thus may never be able to get the
required locks and complete certain tasks. Considering this,
providing/using these operations should not be among our first
priorities.
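
As a sketch of the second kind of operation, a timeout-based
acquisition over plain ``threading`` locks (our own naming, not the
planned API) could be::

  def try_acquire_all(locks, timeout):
    """Try to acquire every lock in order, giving up after the timeout.

    Returns True with all locks held, or False with none held.
    """
    acquired = []
    for lock in locks:
      if lock.acquire(timeout=timeout):
        acquired.append(lock)
      else:
        # Roll back: holding a partial set forever invites deadlocks.
        for held in reversed(acquired):
          held.release()
        return False
    return True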

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level. In the future we may want to split logical units into
independent "tasklets" with their own locking requirements. A
different design doc (or mini design doc) will cover the move from
Logical Units to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
...
syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

In order to avoid this extra indentation and code changes everywhere
in the Logical Units code, we decided to allow LUs to declare locks,
and then execute their code with their locks acquired. In the new
world LUs are called like this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  ...
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

|
582 |
The Processor and the LogicalUnit class will contain exact documentation on how
|
|
583 |
locks are supposed to be declared. |
|
593 |
The Processor and the LogicalUnit class will contain exact documentation |
|
594 |
on how locks are supposed to be declared.
|
|
584 | 595 |
|
585 | 596 |
Caveats |
586 | 597 |
+++++++ |
587 | 598 |
|
588 | 599 |
This library will provide an easy upgrade path to bring all the code to |
589 | 600 |
granular locking without breaking everything, and it will also guarantee |
590 |
against a lot of common errors. Code switching from the old "lock everything"
|
|
591 |
lock to the new system, though, needs to be carefully scrutinised to be sure it
|
|
592 |
is really acquiring all the necessary locks, and none has been overlooked or
|
|
593 |
forgotten. |
|
601 |
against a lot of common errors. Code switching from the old "lock |
|
602 |
everything" lock to the new system, though, needs to be carefully
|
|
603 |
scrutinised to be sure it is really acquiring all the necessary locks,
|
|
604 |
and none has been overlooked or forgotten.
|
|
594 | 605 |
|
595 |
The code can contain other locks outside of this library, to synchronise other |
|
596 |
threaded code (eg for the job queue) but in general these should be leaf locks |
|
597 |
or carefully structured non-leaf ones, to avoid deadlock race conditions. |
|
606 |
The code can contain other locks outside of this library, to synchronise |
|
607 |
other threaded code (eg for the job queue) but in general these should |
|
608 |
be leaf locks or carefully structured non-leaf ones, to avoid deadlock |
|
609 |
race conditions. |
|
598 | 610 |
|
599 | 611 |
|
600 | 612 |
Job Queue |
... | ... | |
Job execution—“Life of a Ganeti job”
++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned
   to the client.
#. A pool of worker threads waits for new jobs. If all are busy, the
   job has to wait and the first worker finishing its work will grab
   it. Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history
   directory. There will be a method to archive all jobs older than a
   given age.

.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and
   we do a master failover, the jobs are still visible on the new
   master (though marked as failed).

Failures to replicate a job to other nodes will be only flagged as
errors in the master daemon log if more than half of the nodes failed,
...

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a
temporary file and subsequent renaming. Except for log messages, every
change in a job is stored and replicated to other nodes.
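
A minimal sketch of such an atomic update in Python (illustrative; the
real code must also handle error paths and directory syncing)::

  import os
  import tempfile

  def atomic_write(path, data):
    """Replace *path* atomically with the bytes in *data*.

    Writing to a temporary file in the same directory and renaming it
    over the target means readers always see either the old or the new
    contents, never a partial write.
    """
    fd, tmpname = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
      os.write(fd, data)
      os.fsync(fd)
    finally:
      os.close(fd)
    os.rename(tmpname, path)  # atomic on POSIX filesystems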

::

...
Locking
+++++++

Locking in the job queue is a complicated topic. It is called from
more than one thread and must be thread-safe. For simplicity, a single
lock is used for the whole job queue.

A more detailed description can be found in doc/locking.rst.

...
Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job changed is defined by the fields passed and
  the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail
  if the job has not been canceled or finished.
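
To make the flow concrete, a hypothetical client session built on these
five calls might look like the following (``client``, ``opcode``,
``timeout`` and the exact status values are stand-ins, not the actual
client library API)::

  # Submit a job consisting of a single prepared opcode.
  job_id = client.SubmitJob([opcode])

  prev = None
  while True:
    # Blocks server-side until the watched fields change or the
    # timeout expires, instead of busy-polling QueryJobs.
    result = client.WaitForJobChange(job_id, ["status"], prev, timeout)
    if result != "nochange":
      prev = result
    status = client.QueryJobs([job_id], ["status"])[0][0]
    if status in ("success", "error", "canceled"):
      break

  client.ArchiveJob(job_id)  # move the finished job into .../archive/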
|
732 | 747 |
|
733 | 748 |
|
734 | 749 |
Job and opcode status |
... | ... | |
749 | 764 |
Error |
750 | 765 |
The job/opcode was aborted with an error. |
751 | 766 |
|
752 |
If the master is aborted while a job is running, the job will be set to the
|
|
753 |
Error status once the master started again. |
|
767 |
If the master is aborted while a job is running, the job will be set to |
|
768 |
the Error status once the master started again.
|
|
754 | 769 |
|
755 | 770 |
|
756 | 771 |
History |
... | ... | |
810 | 825 |
|
811 | 826 |
For example: memory, vcpus, auto_balance |
812 | 827 |
|
813 |
All these parameters will be encoded into constants.py with the prefix "BE\_" |
|
814 |
and the whole list of parameters will exist in the set "BES_PARAMETERS" |
|
828 |
All these parameters will be encoded into constants.py with the prefix |
|
829 |
"BE\_" and the whole list of parameters will exist in the set |
|
830 |
"BES_PARAMETERS" |
|
815 | 831 |
|
816 | 832 |
:proper parameter: |
817 |
a parameter whose value is unique to the instance (e.g. the name of a LV,
|
|
818 |
or the MAC of a NIC) |
|
833 |
a parameter whose value is unique to the instance (e.g. the name of a |
|
834 |
LV, or the MAC of a NIC)
|
|
819 | 835 |
|
820 | 836 |
As a general rule, for all kind of parameters, “None” (or in |
821 | 837 |
JSON-speak, “nil”) will no longer be a valid value for a parameter. As |
... | ... | |
932 | 948 |
- ``Cluster.FillBE(instance, be_type="default")``, which returns the |
933 | 949 |
beparams dict, based on the instance and cluster beparams |
934 | 950 |
|
935 |
The FillHV/BE transformations will be used, for example, in the RpcRunner
|
|
936 |
when sending an instance for activation/stop, and the sent instance
|
|
937 |
hvparams/beparams will have the final value (noded code doesn't know
|
|
938 |
about defaults). |
|
951 |
The FillHV/BE transformations will be used, for example, in the |
|
952 |
RpcRunner when sending an instance for activation/stop, and the sent
|
|
953 |
instance hvparams/beparams will have the final value (noded code doesn't
|
|
954 |
know about defaults).
|
|
939 | 955 |
|
940 | 956 |
LU code will need to self-call the transformation, if needed. |
941 | 957 |
|
... | ... | |
945 | 961 |
The parameter changes will have impact on the OpCodes, especially on |
946 | 962 |
the following ones: |
947 | 963 |
|
948 |
- ``OpCreateInstance``, where the new hv and be parameters will be sent as
|
|
949 |
dictionaries; note that all hv and be parameters are now optional, as
|
|
950 |
the values can be instead taken from the cluster |
|
964 |
- ``OpCreateInstance``, where the new hv and be parameters will be sent |
|
965 |
as dictionaries; note that all hv and be parameters are now optional,
|
|
966 |
as the values can be instead taken from the cluster
|
|
951 | 967 |
- ``OpQueryInstances``, where we have to be able to query these new |
952 | 968 |
parameters; the syntax for names will be ``hvparam/$NAME`` and |
953 | 969 |
``beparam/$NAME`` for querying an individual parameter out of one |
... | ... | |
1093 | 1109 |
Caveats: |
1094 | 1110 |
|
1095 | 1111 |
- some operation semantics are less clear (e.g. what to do on instance |
1096 |
start with offline secondary?); for now, these will just fail as if the
|
|
1097 |
flag is not set (but faster) |
|
1112 |
start with offline secondary?); for now, these will just fail as if |
|
1113 |
the flag is not set (but faster)
|
|
1098 | 1114 |
- 2-node cluster with one node offline needs manual startup of the |
1099 | 1115 |
master with a special flag to skip voting (as the master can't get a |
1100 | 1116 |
quorum there) |
... | ... | |
1133 | 1149 |
clean the above instance(s) |
1134 | 1150 |
|
1135 | 1151 |
In order to prevent this situation, and to be able to get nodes into |
1136 |
proper offline status easily, a new *drained* flag was added to the nodes. |
|
1152 |
proper offline status easily, a new *drained* flag was added to the |
|
1153 |
nodes. |
|
1137 | 1154 |
|
1138 | 1155 |
This flag (which actually means "is being, or was drained, and is |
1139 | 1156 |
expected to go offline"), will prevent allocations on the node, but |
... | ... | |
1173 | 1190 |
assumptions made initially are not true and that more flexibility is |
1174 | 1191 |
needed. |
1175 | 1192 |
|
1176 |
One main assumption made was that disk failures should be treated as 'rare'
|
|
1177 |
events, and that each of them needs to be manually handled in order to ensure
|
|
1178 |
data safety; however, both these assumptions are false: |
|
1193 |
One main assumption made was that disk failures should be treated as |
|
1194 |
'rare' events, and that each of them needs to be manually handled in
|
|
1195 |
order to ensure data safety; however, both these assumptions are false:
|
|
1179 | 1196 |
|
1180 |
- disk failures can be a common occurrence, based on usage patterns or cluster
|
|
1181 |
size |
|
1182 |
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
|
|
1183 |
automate more of the recovery |
|
1197 |
- disk failures can be a common occurrence, based on usage patterns or |
|
1198 |
cluster size
|
|
1199 |
- our disk setup is robust enough (referring to DRBD8 + LVM) that we |
|
1200 |
could automate more of the recovery
|
|
1184 | 1201 |
|
1185 |
Note that we still don't have fully-automated disk recovery as a goal, but our
|
|
1186 |
goal is to reduce the manual work needed. |
|
1202 |
Note that we still don't have fully-automated disk recovery as a goal, |
|
1203 |
but our goal is to reduce the manual work needed.
|
|
1187 | 1204 |
|
1188 | 1205 |
As such, we plan the following main changes: |
1189 | 1206 |
|
1190 |
- DRBD8 is much more flexible and stable than its previous version (0.7),
|
|
1191 |
such that removing the support for the ``remote_raid1`` template and
|
|
1192 |
focusing only on DRBD8 is easier |
|
1207 |
- DRBD8 is much more flexible and stable than its previous version |
|
1208 |
(0.7), such that removing the support for the ``remote_raid1``
|
|
1209 |
template and focusing only on DRBD8 is easier
|
|
1193 | 1210 |
|
1194 |
- dynamic discovery of DRBD devices is not actually needed in a cluster that
|
|
1195 |
where the DRBD namespace is controlled by Ganeti; switching to a static
|
|
1196 |
assignment (done at either instance creation time or change secondary time)
|
|
1197 |
will change the disk activation time from O(n) to O(1), which on big
|
|
1198 |
clusters is a significant gain |
|
1211 |
- dynamic discovery of DRBD devices is not actually needed in a cluster |
|
1212 |
that where the DRBD namespace is controlled by Ganeti; switching to a
|
|
1213 |
static assignment (done at either instance creation time or change
|
|
1214 |
secondary time) will change the disk activation time from O(n) to
|
|
1215 |
O(1), which on big clusters is a significant gain
|
|
1199 | 1216 |
|
1200 |
- remove the hard dependency on LVM (currently all available storage types are |
|
1201 |
ultimately backed by LVM volumes) by introducing file-based storage |
|
1217 |
- remove the hard dependency on LVM (currently all available storage |
|
1218 |
types are ultimately backed by LVM volumes) by introducing file-based |
|
1219 |
storage |
|
1202 | 1220 |
|
1203 | 1221 |
Additionally, a number of smaller enhancements are also planned: |
1204 | 1222 |
- support variable number of disks |
... | ... | |
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets
  reduced to a must to reserve memory anywhere on the cluster

- the need to first failover and then replace secondary for an
  instance: with failover-to-any, we can directly failover to
...
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could
  be possible (in theory, but this is not a design goal for 2.0)

...
- verify that S2 (the node the user has chosen to keep as secondary)
  has valid data (is consistent)

- tear down the current DRBD association and setup a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no
  data, it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary
  role (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
...
OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti
1.2. The interface is composed of a series of scripts which get called
with certain parameters to perform OS-dependent operations on the
cluster. The current scripts are:

create
  called when a new instance is added to the cluster
...
  called to perform the os-specific operations necessary for renaming
  an instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example they accept exactly one block and one swap device to operate
on, rather than any number of generic block devices; they blindly
assume that an instance will have just one network interface to
operate, and they cannot be configured to optimise the instance for a
particular hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors, and a
non-fixed number of networks and disks, the OS interface needs to
change to transmit the appropriate amount of information about an
instance to its managing operating system, when operating on it.
Moreover, since some old assumptions usually used in OS scripts are no
longer valid, we need to re-establish a common knowledge on what can
be assumed and what cannot be regarding the Ganeti environment.


When designing the new OS API our priorities are:
...
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written
to support an OS, and make it easy to share code between them by
making their input uniform. We also will leave the current script
structure unchanged, as far as we can, and make a few of the scripts
(import, export and rename) optional. Most information will be passed
to the script through environment variables, for ease of access and
at the same time ease of using only the information a script needs.
|
1471 | 1492 |
|
1472 | 1493 |
|
1473 | 1494 |
The Scripts |
1474 | 1495 |
+++++++++++ |
1475 | 1496 |
|
1476 |
As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs to
|
|
1477 |
support the following functionality, through scripts: |
|
1497 |
As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs |
|
1498 |
to support the following functionality, through scripts:
|
|
1478 | 1499 |
|
1479 | 1500 |
create: |
1480 |
used to create a new instance running that OS. This script should prepare the
|
|
1481 |
block devices, and install them so that the new OS can boot under the
|
|
1482 |
specified hypervisor. |
|
1501 |
used to create a new instance running that OS. This script should |
|
1502 |
prepare the block devices, and install them so that the new OS can
|
|
1503 |
boot under the specified hypervisor.
|
|
1483 | 1504 |
export (optional): |
1484 |
used to export an installed instance using the given OS to a format which can
|
|
1485 |
be used to import it back into a new instance. |
|
1505 |
used to export an installed instance using the given OS to a format |
|
1506 |
which can be used to import it back into a new instance.
|
|
1486 | 1507 |
import (optional): |
1487 |
used to import an exported instance into a new one. This script is similar to
|
|
1488 |
create, but the new instance should have the content of the export, rather
|
|
1489 |
than contain a pristine installation. |
|
1508 |
used to import an exported instance into a new one. This script is |
|
1509 |
similar to create, but the new instance should have the content of the
|
|
1510 |
export, rather than contain a pristine installation.
|
|
1490 | 1511 |
rename (optional): |
1491 |
used to perform the internal OS-specific operations needed to rename an
|
|
1492 |
instance. |
|
1512 |
used to perform the internal OS-specific operations needed to rename |
|
1513 |
an instance.
|
|
1493 | 1514 |
|
1494 |
If any optional script is not implemented Ganeti will refuse to perform the
|
|
1495 |
given operation on instances using the non-implementing OS. Of course the
|
|
1496 |
create script is mandatory, and it doesn't make sense to support the either the
|
|
1497 |
export or the import operation but not both. |
|
1515 |
If any optional script is not implemented Ganeti will refuse to perform |
|
1516 |
the given operation on instances using the non-implementing OS. Of
|
|
1517 |
course the create script is mandatory, and it doesn't make sense to
|
|
1518 |
support the either the export or the import operation but not both.
|
|

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for
1.2 and the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in
  2.0 we'll use environment variables, as there will be a lot more
  information and not all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device
  the instance has, and import scripts once for every exported disk.
  Imported instances will be forced to have a number of disks greater
  or equal to the one of the export.
- Some scripts are not compulsory: if such a script is missing the
  relevant operations will be forbidden for instances of that OS. This
  makes it easier to distinguish between unsupported operations and
  no-op ones (if any).
1530 |
the instance has, and import scripts once for every exported disk. |
|
1531 |
Imported instances will be forced to have a number of disks greater or |
|
1532 |
equal to the one of the export. |
|
1533 |
- Some scripts are not compulsory: if such a script is missing the |
|
1534 |
relevant operations will be forbidden for instances of that OS. This |
|
1535 |
makes it easier to distinguish between unsupported operations and |
|
1536 |
no-op ones (if any). |
|
1515 | 1537 |
|
1516 | 1538 |
|
1517 | 1539 |
Input |
1518 | 1540 |
_____ |
1519 | 1541 |
|
1520 |
Rather than using command line flags, as they do now, scripts will accept |
|
1521 |
inputs from environment variables. We expect the following input values: |
|
1542 |
Rather than using command line flags, as they do now, scripts will |
|
1543 |
accept inputs from environment variables. We expect the following input |
|
1544 |
values: |
|
1522 | 1545 |
|
1523 | 1546 |
OS_API_VERSION |
1524 | 1547 |
The version of the OS API that the following parameters comply with; |
... | ... | |
1528 | 1551 |
INSTANCE_NAME |
1529 | 1552 |
Name of the instance acted on |
1530 | 1553 |
HYPERVISOR |
1531 |
The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', 'kvm') |
|
1554 |
The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', |
|
1555 |
'kvm') |
|
1532 | 1556 |
DISK_COUNT |
1533 | 1557 |
The number of disks this instance will have |
1534 | 1558 |
NIC_COUNT |
... | ... | |
1539 | 1563 |
W if read/write, R if read only. OS scripts are not supposed to touch |
1540 | 1564 |
read-only disks, but will be passed them to know. |
1541 | 1565 |
DISK_<N>_FRONTEND_TYPE |
1542 |
Type of the disk as seen by the instance. Can be 'scsi', 'ide', 'virtio' |
|
1566 |
Type of the disk as seen by the instance. Can be 'scsi', 'ide', |
|
1567 |
'virtio' |
|
1543 | 1568 |
DISK_<N>_BACKEND_TYPE |
1544 | 1569 |
Type of the disk as seen from the node. Can be 'block', 'file:loop' or |
1545 | 1570 |
'file:blktap' |
... | ... | |
1553 | 1578 |
Type of the Nth NIC as seen by the instance. For example 'virtio', |
1554 | 1579 |
'rtl8139', etc. |
1555 | 1580 |
DEBUG_LEVEL |
1556 |
Whether more out should be produced, for debugging purposes. Currently the
|
|
1557 |
only valid values are 0 and 1. |
|
1581 |
Whether more out should be produced, for debugging purposes. Currently |
|
1582 |
the only valid values are 0 and 1.
|
|
1558 | 1583 |
|
1559 | 1584 |
These are only the basic variables we are thinking of now, but more |
1560 | 1585 |
may come during the implementation and they will be documented in the |
... | ... | |
1567 | 1592 |
OLD_INSTANCE_NAME |
1568 | 1593 |
rename: the name the instance should be renamed from. |
1569 | 1594 |
EXPORT_DEVICE |
1570 |
export: device to be exported, a snapshot of the actual device. The data must be exported to stdout. |
|
1595 |
export: device to be exported, a snapshot of the actual device. The |
|
1596 |
data must be exported to stdout. |
|
1571 | 1597 |
EXPORT_INDEX |
1572 | 1598 |
export: sequential number of the instance device targeted. |
1573 | 1599 |
IMPORT_DEVICE |
1574 |
import: device to send the data to, part of the new instance. The data must be imported from stdin. |
|
1600 |
import: device to send the data to, part of the new instance. The data |
|
1601 |
must be imported from stdin. |
|
1575 | 1602 |
IMPORT_INDEX |
1576 | 1603 |
import: sequential number of the instance device targeted. |
1577 | 1604 |
|
1578 |
(Rationale for INSTANCE_NAME as an environment variable: the instance name is |
|
1579 |
always needed and we could pass it on the command line. On the other hand, |
|
1580 |
though, this would force scripts to both access the environment and parse the |
|
1581 |
command line, so we'll move it for uniformity.) |
|
1605 |
(Rationale for INSTANCE_NAME as an environment variable: the instance |
|
1606 |
name is always needed and we could pass it on the command line. On the |
|
1607 |
other hand, though, this would force scripts to both access the |
|
1608 |
environment and parse the command line, so we'll move it for |
|
1609 |
uniformity.) |
|
1582 | 1610 |
|
1583 | 1611 |
|
1584 | 1612 |
Output/Behaviour |
1585 | 1613 |
________________ |
1586 | 1614 |
|
1587 |
As discussed scripts should only send user-targeted information to stderr. The
|
|
1588 |
create and import scripts are supposed to format/initialise the given block
|
|
1589 |
devices and install the correct instance data. The export script is supposed to
|
|
1590 |
export instance data to stdout in a format understandable by the the import
|
|
1591 |
script. The data will be compressed by Ganeti, so no compression should be
|
|
1592 |
done. The rename script should only modify the instance's knowledge of what
|
|
1593 |
its name is. |
|
1615 |
As discussed scripts should only send user-targeted information to |
|
1616 |
stderr. The create and import scripts are supposed to format/initialise
|
|
1617 |
the given block devices and install the correct instance data. The
|
|
1618 |
export script is supposed to export instance data to stdout in a format
|
|
1619 |
understandable by the the import script. The data will be compressed by
|
|
1620 |
Ganeti, so no compression should be done. The rename script should only
|
|
1621 |
modify the instance's knowledge of what its name is.
|
|
1594 | 1622 |
|
1595 | 1623 |
Other declarative style features |
1596 | 1624 |
++++++++++++++++++++++++++++++++ |
... | ... | |
1604 | 1632 |
containing two lines. This is different from Ganeti 1.2, which only |
1605 | 1633 |
supported one version number. |
1606 | 1634 |
|
1607 |
In addition to that an OS will be able to declare that it does support only a |
|
1608 |
subset of the Ganeti hypervisors, by declaring them in the 'hypervisors' file. |
|
1635 |
In addition to that an OS will be able to declare that it does support |
|
1636 |
only a subset of the Ganeti hypervisors, by declaring them in the |
|
1637 |
'hypervisors' file. |
|
1609 | 1638 |
|
1610 | 1639 |
|
1611 | 1640 |
Caveats/Notes |
1612 | 1641 |
+++++++++++++ |
1613 | 1642 |
|
1614 |
We might want to have a "default" import/export behaviour that just dumps all
|
|
1615 |
disks and restores them. This can save work as most systems will just do this,
|
|
1616 |
while allowing flexibility for different systems. |
|
1643 |
We might want to have a "default" import/export behaviour that just |
|
1644 |
dumps all disks and restores them. This can save work as most systems
|
|
1645 |
will just do this, while allowing flexibility for different systems.
|
|
1617 | 1646 |
|
1618 |
Environment variables are limited in size, but we expect that there will be
|
|
1619 |
enough space to store the information we need. If we discover that this is not
|
|
1620 |
the case we may want to go to a more complex API such as storing those
|
|
1621 |
information on the filesystem and providing the OS script with the path to a
|
|
1622 |
file where they are encoded in some format. |
|
1647 |
Environment variables are limited in size, but we expect that there will |
|
1648 |
be enough space to store the information we need. If we discover that
|
|
1649 |
this is not the case we may want to go to a more complex API such as
|
|
1650 |
storing those information on the filesystem and providing the OS script
|
|
1651 |
with the path to a file where they are encoded in some format.
|
|
1623 | 1652 |
|
1624 | 1653 |
|
1625 | 1654 |
|
b/doc/design-2.1.rst
--------------------

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4

...
... | ... | |
=========

Ganeti 2.1 will add features to help further automation of cluster
operations, further improve scalability to even bigger clusters, and
make it easier to debug the Ganeti core.

Background
==========

[...]

As for 2.0 we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

[...]
- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage type, for
example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion

[...]
++++++++++++++++++++++++++++++

The class ``LockSet`` (see ``lib/locking.py``) is a container for one or
many ``SharedLock`` instances. It provides an interface to add/remove
locks and to acquire and subsequently release any number of those locks
contained in it.

Locks in a ``LockSet`` are always acquired in alphabetic order. Due to
the way we're using locks for nodes and instances (the single cluster
lock isn't affected by this issue) this can lead to long delays when
acquiring locks if another operation tries to acquire multiple locks but
has to wait for yet another operation.

In the following demonstration we assume we have the instance locks
``inst1``, ``inst2``, ``inst3`` and ``inst4``.

#. Operation A grabs the lock for instance ``inst4``.
#. Operation B wants to acquire all instance locks in alphabetic order,
   but it has to wait for ``inst4``.
#. Operation C tries to lock ``inst1``, but it has to wait until
   Operation B (which is trying to acquire all locks) releases the lock
   again.

[...]
Non-blocking lock acquiring
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Acquiring locks for OpCode execution is always done in blocking mode.
Such calls won't return until the lock has successfully been acquired
(or an error occurred, although we won't cover that case here).

``SharedLock`` and ``LockSet`` must be able to be acquired in a
non-blocking way. They must support a timeout and abort trying to
acquire the lock(s) after the specified amount of time.

Retry acquiring locks
^^^^^^^^^^^^^^^^^^^^^

To prevent other operations from waiting for a long time, such as
described in the demonstration before, ``LockSet`` must not keep locks
for a prolonged period of time when trying to acquire two or more locks.
Instead it should, with an increasing timeout for acquiring all locks,
release all locks again and sleep some time if it fails to acquire all
requested locks.

A good timeout value needs to be determined. In any case, ``LockSet``
should proceed to acquire locks in blocking mode after a few
(unsuccessful) attempts to acquire all requested locks.

One proposal for the timeout is to use ``2**tries`` seconds, where
``tries`` is the number of unsuccessful tries.

In the demonstration before, this would allow Operation C to continue
after Operation B unsuccessfully tried to acquire all locks and released
all acquired locks (``inst1``, ``inst2`` and ``inst3``) again.
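
As an illustration of the proposed behaviour, here is a minimal sketch
of such a retry loop using modern Python's ``threading`` API. It is not
Ganeti's actual ``LockSet`` implementation; the helper name and
structure are assumptions made for this example::

  import threading
  import time

  def acquire_all(locks, names, max_tries=3):
      """Acquire all named locks in alphabetic order, backing off on
      failure; 'locks' maps each name to a threading.Lock."""
      wanted = sorted(names)
      for tries in range(max_tries):
          deadline = time.time() + 2 ** tries  # proposed 2**tries timeout
          acquired = []
          for name in wanted:
              remaining = deadline - time.time()
              if remaining > 0 and locks[name].acquire(timeout=remaining):
                  acquired.append(name)
              else:
                  # Could not get every lock in time: release and retry
                  for done in acquired:
                      locks[done].release()
                  acquired = []
                  break
          if acquired:
              return acquired
      # After a few unsuccessful attempts, fall back to blocking mode
      for name in wanted:
          locks[name].acquire()
      return wanted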

Other solutions discussed
+++++++++++++++++++++++++

There was also some discussion of going one step further and extending
the job queue (see ``lib/jqueue.py``) to select the next task for a
worker depending on whether it can acquire the necessary locks. While
this may reduce the number of necessary worker threads and/or increase
throughput on large clusters with many jobs, it also brings with it
potential problems, such as contention and increased memory usage. As
this would be an extension of the changes proposed before, it could be
implemented at a later point in time, but we decided to stay with the
simpler solution for now.

Implementation details
++++++++++++++++++++++

[...]

The current design of ``SharedLock`` is not good for supporting timeouts
when acquiring a lock and there are also minor fairness issues in it. We
plan to address both with a redesign. A proof of concept implementation
was written and resulted in significantly simpler code.

Currently ``SharedLock`` uses two separate queues for shared and
exclusive acquires and waiters get to run in turns. This means if an
exclusive acquire is released, the lock will allow shared waiters to run
and vice versa. Although it's still fair in the end, there is a slight
bias towards shared waiters in the current implementation. The same
implementation with two shared queues cannot support timeouts without
adding a lot of complexity.

Our proposed redesign changes ``SharedLock`` to have a single queue.
There will be one condition (see Condition_ for a note about
performance) in the queue per exclusive acquire and two for all shared
acquires (see below for an explanation). The maximum queue length will
always be ``2 + (number of exclusive acquires waiting)``. The number of
queue entries for shared acquires can vary from 0 to 2.

The two conditions for shared acquires are a bit special. They will be
used in turn. When the lock is instantiated, no conditions are in the
queue. As soon as the first shared acquire arrives (and there are
holder(s) or waiting acquires; see Acquire_), the active condition is
added to the queue. Until it becomes the topmost condition in the queue
and has been notified, any shared acquire is added to this active
condition. When the active condition is notified, the conditions are
swapped and further shared acquires are added to the previously inactive
condition (which has now become the active condition). After all waiters
on the previously active (now inactive) and now notified condition have
received the notification, it is removed from the queue of pending
acquires.

This means shared acquires will skip any exclusive acquire in the queue.
We believe it's better to improve parallelization on operations only
asking for shared (or read-only) locks. Exclusive operations holding the
same lock cannot be parallelized.


Acquire
*******

For exclusive acquires a new condition is created and appended to the
queue. Shared acquires are added to the active condition for shared
acquires and if the condition is not yet on the queue, it's appended.

The next step is to wait for our condition to be on the top of the queue
(to guarantee fairness). If the timeout expires, we return to the caller
without acquiring the lock. On every notification we check whether the
lock has been deleted, in which case an error is returned to the caller.

The lock can be acquired if we're on top of the queue (there is no one
else ahead of us). For an exclusive acquire, there must not be other
exclusive or shared holders. For a shared acquire, there must not be an
exclusive holder. If these conditions are all true, the lock is acquired
and we return to the caller. In any other case we wait again on the
condition.

If it was the last waiter on a condition, the condition is removed from
the queue.

Optimization: There's no need to touch the queue if there are no pending
acquires and no current holders. The caller can have the lock
immediately.
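
The following is a condensed sketch of this acquire/release logic in
modern Python -- illustrative only, not the actual Ganeti code. All
names are invented for the example, and it leaves out the two-condition
swap for shared acquires described earlier::

  import collections
  import threading

  class SharedLockSketch:
      def __init__(self):
          self._lock = threading.Lock()
          self._queue = []                       # pending conditions, FIFO
          self._waiters = collections.Counter()  # waiter count per condition
          self._shared_cond = None               # condition shared acquires join
          self._exc_holder = None
          self._shared_holders = set()

      def _can_acquire(self, shared):
          if shared:
              return self._exc_holder is None
          return self._exc_holder is None and not self._shared_holders

      def acquire(self, shared=False, timeout=None):
          me = threading.current_thread()
          with self._lock:
              # Optimization: no pending acquires and no blocking holders
              if not self._queue and self._can_acquire(shared):
                  self._grant(me, shared)
                  return True
              if shared:
                  # All shared acquires join one queue entry, so they
                  # skip exclusive acquires queued behind it
                  if self._shared_cond is None:
                      self._shared_cond = threading.Condition(self._lock)
                      self._queue.append(self._shared_cond)
                  cond = self._shared_cond
              else:
                  cond = threading.Condition(self._lock)
                  self._queue.append(cond)
              self._waiters[cond] += 1
              try:
                  while True:
                      # Wait until topmost in the queue (fairness), then
                      # until the current holders allow the acquire
                      if self._queue[0] is cond and self._can_acquire(shared):
                          self._grant(me, shared)
                          return True
                      if not cond.wait(timeout):
                          return False  # timeout expired
              finally:
                  self._waiters[cond] -= 1
                  if not self._waiters[cond]:
                      self._queue.remove(cond)
                      if cond is self._shared_cond:
                          self._shared_cond = None
                      if self._queue:
                          self._queue[0].notify_all()

      def _grant(self, me, shared):
          if shared:
              self._shared_holders.add(me)
          else:
              self._exc_holder = me

      def release(self):
          with self._lock:
              me = threading.current_thread()
              if self._exc_holder is me:
                  self._exc_holder = None
              else:
                  self._shared_holders.discard(me)
              if self._queue:
                  self._queue[0].notify_all()  # notify the oldest condition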

.. image:: design-2.1-lock-acquire.png

[...]
Release
*******

First the lock removes the caller from the internal owner list. If there
are pending acquires in the queue, the first (the oldest) condition is
notified.

If the first condition was the active condition for shared acquires, the
inactive condition will be made active. This ensures fairness with
exclusive locks by forcing consecutive shared acquires to wait in the
queue.

.. image:: design-2.1-lock-release.png

[...]
Delete
******

The caller must either hold the lock in exclusive mode already or the
lock must be acquired in exclusive mode. Trying to delete a lock while
it's held in shared mode must fail.

After ensuring the lock is held in exclusive mode, the lock will mark
itself as deleted and continue to notify all pending acquires. They will
wake up, notice the deleted lock and return an error to the caller.


Condition
^^^^^^^^^

Note: This is not necessary for the locking changes above, but it may be
a good optimization (pending performance tests).

The existing locking code in Ganeti 2.0 uses Python's built-in
``threading.Condition`` class. Unfortunately ``Condition`` implements
timeouts by sleeping 1ms to 20ms between tries to acquire the condition
lock in non-blocking mode. This requires unnecessary context switches
and contention on the CPython GIL (Global Interpreter Lock).

By using POSIX pipes (see ``pipe(2)``) we can use the operating system's
support for timeouts on file descriptors (see ``select(2)``). A custom
condition class will have to be written for this.

On instantiation the class creates a pipe. After each notification the
previous pipe is abandoned and re-created (technically the old pipe
needs to stay around until all notifications have been delivered).

All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to
wait for notifications, optionally with a timeout. A notification will
be signalled to the waiting clients by closing the pipe. If the pipe
wasn't closed during the timeout, the waiting function returns to its
caller nonetheless.
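
A minimal sketch of such a pipe-based condition follows; it is
illustrative only (locking around the state, and reclaiming old read
ends once every waiter has woken up, are left out)::

  import os
  import select

  class PipeCondition:
      def __init__(self):
          # Waiters select(2) on the read end; closing the write end
          # delivers EOF, which select(2) reports as "readable"
          self._read_fd, self._write_fd = os.pipe()

      def wait(self, timeout=None):
          """Return True if notified, False if the timeout expired."""
          ready, _, _ = select.select([self._read_fd], [], [], timeout)
          return bool(ready)

      def notify_all(self):
          # Waking all waiters means closing the pipe; a fresh pipe is
          # created for the next round of waiters. The old read end must
          # stay open until every waiter has seen the notification.
          os.close(self._write_fd)
          self._read_fd, self._write_fd = os.pipe()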


Feature changes

[...]

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than others.
In particular they are divided between "master", "master candidates" and
"normal". (Moreover they can be offline or drained, but this is not
important for the current discussion). In general the whole
configuration is only replicated to master candidates, and some partial
information is spread to all nodes via ssconf.

This change was done so that the most frequent Ganeti operations didn't
need to contact all nodes, and so clusters could become bigger. If we
want more information to be available on all nodes, we need to add more
ssconf values, which works against that change, or to talk with the
master node, which is not designed to happen now, and requires its
availability.

Information such as the instance->primary_node mapping will be needed on
all nodes, and we also want to make sure services external to the
cluster can query this information as well. This information must be
available at all times, so we can't query it through RAPI, which would
be a single point of failure, as it's only available on the master.


Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which will
run on master candidates. This daemon will talk via UDP, and
authenticate messages using HMAC with a cluster-wide shared key. This
key will be generated at cluster init time, and stored on the cluster
alongside the ganeti SSL keys, and readable only by root.

An interested client can query a value by making a request to a subset
of the cluster master candidates. It will then wait to get a few
responses, and use the one with the highest configuration serial number.
Since the configuration serial number is increased each time the ganeti
config is updated, and the serial number is included in all answers,
this can be used to make sure to use the most recent answer, in case
some master candidates are stale or in the middle of a configuration
update.
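
For example, a client gathering several UDP responses might pick the
authoritative one like the following sketch (the field name 'serial' is
an assumption, not part of the format described here)::

  def best_answer(answers):
      """Pick the response with the highest configuration serial."""
      return max(answers, key=lambda a: a["serial"]) if answers else None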

In order to prevent replay attacks, queries will contain the current
unix timestamp according to the client, and the server will verify that
its timestamp is in the same 5-minute range (this requires synchronized
clocks, which is a good idea anyway). Queries will also contain a "salt"
which they expect the answers to be sent with, and clients are supposed
to accept only answers which contain salt generated by them.

The configuration daemon will be able to answer simple queries such as:

[...]

- 'protocol', integer, is the confd protocol version (initially just
  constants.CONFD_PROTOCOL_VERSION, with a value of 1)
- 'type', integer, is the query type. For example "node role by name"
  or "node primary ip by instance ip". Constants will be provided for
  the actual available query types.
- 'query', string, is the search key. For example an ip, or a node
  name.
- 'rsalt', string, is the required response salt. The client must use
  it to recognize which answer it's getting.

- 'salt' must be the current unix timestamp, according to the client.
  Servers can refuse messages which have a wrong timing, according to
  their configuration and clock.
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key

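To make the flow concrete, here is a sketch of how a client could build
such a signed query. The constant, helper name and digest choice are
assumptions for this example, not the actual Ganeti code::

  import hashlib
  import hmac
  import json
  import time

  CONFD_PROTOCOL_VERSION = 1  # stands in for constants.CONFD_PROTOCOL_VERSION

  def build_query(hmac_key, query_type, search_key, rsalt):
      """Return a signed confd query; hmac_key must be bytes."""
      msg = json.dumps({
          "protocol": CONFD_PROTOCOL_VERSION,
          "type": query_type,    # e.g. the "node role by name" constant
          "query": search_key,   # e.g. an ip or a node name
          "rsalt": rsalt,        # salt the client expects in the answer
      })
      salt = str(int(time.time()))  # current unix timestamp
      signature = hmac.new(hmac_key, (salt + msg).encode(),
                           hashlib.sha1).hexdigest()
      return json.dumps({"msg": msg, "salt": salt, "hmac": signature})
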
If an answer comes back (which is optional, since confd works over UDP)
it will be in this format::

  {
    "msg": "{\"status\": 0,
[...]

- 'protocol', integer, is the confd protocol version (initially just
  constants.CONFD_PROTOCOL_VERSION, with a value of 1)
- 'status', integer, is the error code. Initially just 0 for 'ok' or 1
  for 'error' (in which case answer contains an error detail, rather
  than an answer), but in the future it may be expanded to have more
  meanings (e.g. 2, the answer is compressed)
- 'answer' is the actual answer. Its type and meaning are query
  specific. For example, for "node primary ip by instance ip" queries
  it will be a string containing an IP address, for "node role by name"
  queries it will be an integer which encodes the role (master,
  candidate, drained, offline) according to constants.

- 'salt' is the requested salt from the query. A client can use it to
  recognize what query the answer is answering.
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key

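Correspondingly, a client validating an answer might look like the
following sketch (again an assumption-laden illustration, not the actual
implementation)::

  import hashlib
  import hmac
  import json

  def check_answer(hmac_key, rsalt, packet):
      """Return the decoded answer, or None if validation fails."""
      data = json.loads(packet)
      expected = hmac.new(hmac_key, (data["salt"] + data["msg"]).encode(),
                          hashlib.sha1).hexdigest()
      # Reject packets not signed with the cluster key
      if not hmac.compare_digest(expected, data["hmac"]):
          return None
      msg = json.loads(data["msg"])
      # Accept only answers carrying the salt generated for this query
      if msg.get("salt") != rsalt:
          return None
      return msg
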
[...]

Current State and shortcomings
++++++++++++++++++++++++++++++