Revision 7faf5110 doc/design-2.0.rst
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical
design:

- fixed number of disks (two) per instance
- fixed number of NICs

...

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for
  example failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive

...

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so called
  *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:

...

There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method of
  passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will

...

- the more advanced granular locking that we want to implement would
  require, if written in the async-manner, deep integration with the
  Twisted stack, to such an extent that business-logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  request, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTPS protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is with the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement many times workarounds to changes

...

Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti Cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:


...

- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a Non-Goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti level operations, aka
Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent from their usage
- automatically grabbing multiple locks in the right order (avoid
  deadlock)
- ability to transparently handle conversion to more granularity
- support asynchronous operation (future goal)

...

``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same
time. Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and fail.


The Locks

...

- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and
the node locks before the config lock. Locks will need to be acquired at
the same time for multiple instances and nodes, and internal ordering
will be dealt with within the locking library, which, for simplicity,
will just use alphabetical order.
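These ordering rules amount to sorting lock requests by level and then
by name; a minimal sketch (hypothetical names, not the actual
``lockings`` module API):

```python
# Hypothetical sketch of the ordering rules above: instance locks sort
# before node locks, which sort before the config lock, and names are
# compared alphabetically within each level.
LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = 0, 1, 2


class OrderViolation(Exception):
    pass


class LockManager:
    def __init__(self):
        self._held = []  # (level, name) pairs, in acquisition order

    def acquire(self, requests):
        """Acquire a set of (level, name) locks in canonical order."""
        for req in sorted(requests):
            if self._held and req <= self._held[-1]:
                # Grabbing a lock that sorts before one already held
                # could deadlock, so it is checked for and fails.
                raise OrderViolation("out-of-order acquisition: %r" % (req,))
            self._held.append(req)
        return list(self._held)


mgr = LockManager()
mgr.acquire([(LEVEL_NODE, "node2"), (LEVEL_INSTANCE, "inst-b"),
             (LEVEL_INSTANCE, "inst-a"), (LEVEL_CONFIG, "config")])
```

Whatever order the caller passes, the locks are taken as inst-a,
inst-b, node2 and finally the config lock; a later request for another
instance lock would fail the order check.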

Each lock has the following three possible statuses:

...

Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each time
we split a lock into more we'll create a "metalock", which will depend
on those sub-locks and live for the time necessary for all the code to
convert (or forever, in some conditions). When a metalock exists all
converted code must acquire it in shared mode, so it can run
concurrently, but still be exclusive with old code, which acquires it
exclusively.
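The shared/exclusive behaviour can be sketched with a condition
variable (an illustrative stand-in, not the real
``lockings.SharedLock`` implementation):

```python
import threading


class SharedLock:
    """Sketch of a shared/exclusive lock: converted code acquires the
    metalock shared (many holders at once), old code exclusively."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0
        self._exclusive = False

    def acquire(self, shared=False):
        with self._cond:
            if shared:
                # Sharers only wait for an exclusive holder.
                while self._exclusive:
                    self._cond.wait()
                self._sharers += 1
            else:
                # Exclusive waits for everyone else.
                while self._exclusive or self._sharers:
                    self._cond.wait()
                self._exclusive = True

    def release(self):
        with self._cond:
            if self._exclusive:
                self._exclusive = False
            else:
                self._sharers -= 1
            self._cond.notify_all()
```

Two converted LUs can hold the metalock shared at the same time, while
unconverted code taking it exclusively serialises against both.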
In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.

We might also want to devise more metalocks (eg. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what it needs without specifying it explicitly.

In the future things like the node locks could become metalocks, should
we decide to split them into an even more fine grained approach, but
this will probably be only after the first 2.0 version has been
released.

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be
added to the list. The relevant code will need to inform the locking
library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code

...

+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block till the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available
locks, rather than making them just blindly queue for acquiring them.
The inherent risk, though, is that any code using the first operation,
or setting a timeout for the second one, is susceptible to starvation
and thus may never be able to get the required locks and complete
certain tasks. Considering this, providing/using these operations should
not be among our first priorities.
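The two operation types could look roughly like this (a sketch over
plain ``threading.Lock`` objects, not a proposed API):

```python
import threading


def try_acquire_set(locks):
    """Try to acquire every lock in the set; on failure release what
    was already taken and report failure instead of blocking."""
    taken = []
    for lock in locks:
        if lock.acquire(blocking=False):
            taken.append(lock)
        else:
            for held in taken:
                held.release()
            return False
    return True


def acquire_first(lock_sets, rounds=100):
    """select/poll-like: return the index of the first lock set that
    could be fully acquired, or None once the polling rounds (standing
    in for a timeout) are exhausted."""
    for _ in range(rounds):
        for idx, locks in enumerate(lock_sets):
            if try_acquire_set(locks):
                return idx
    return None
```

Both helpers illustrate why starvation is a risk: a caller that keeps
failing the non-blocking attempt simply never gets the locks.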

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level. In the future we may want to split logical units into
independent "tasklets" with their own locking requirements. A different
design doc (or mini design doc) will cover the move from Logical Units
to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:

...

syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

In order to avoid this extra indentation and code changes everywhere in
the Logical Units code, we decided to allow LUs to declare locks, and
then execute their code with their locks acquired. In the new world LUs
are called like this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared

...

  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation
on how locks are supposed to be declared.
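The calling convention might be sketched as follows (hypothetical class
bodies; the real Processor and LogicalUnit carry much more state):

```python
class LockManager:
    """Stand-in for the locking library, recording held locks."""

    def __init__(self):
        self.held = set()

    def acquire(self, names):
        self.held.update(names)
        return list(names)

    def release(self, names):
        self.held.difference_update(names)


class LogicalUnit:
    def ExpandNames(self):
        # user passed names are expanded into the needed lock names
        self.needed_locks = ["instance-a", "node1"]

    def Exec(self):
        return "done"


class Processor:
    def __init__(self, manager):
        self.manager = manager

    def ExecLU(self, lu):
        """Declare, acquire, run, release: the LU code itself never
        has to nest inside the try/finally."""
        lu.ExpandNames()
        acquired = self.manager.acquire(lu.needed_locks)
        try:
            return lu.Exec()
        finally:
            self.manager.release(acquired)
```

The try/finally lives once, in the Processor, instead of being repeated
in every Logical Unit.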

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock
everything" lock to the new system, though, needs to be carefully
scrutinised to be sure it is really acquiring all the necessary locks,
and none has been overlooked or forgotten.
The code can contain other locks outside of this library, to synchronise
other threaded code (eg for the job queue) but in general these should
be leaf locks or carefully structured non-leaf ones, to avoid deadlock
race conditions.


Job Queue

...

Job execution—“Life of a Ganeti job”
++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned to
   the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job
   has to wait and the first worker finishing its work will grab it.
   Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.
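Steps one and two amount to an id generator feeding a queue consumed by
a fixed thread pool; a minimal sketch (replication and on-disk storage
omitted):

```python
import itertools
import queue
import threading

_ids = itertools.count(1)
jobs = queue.Queue()
results = {}


def submit(ops):
    """Assign a new job identifier, enqueue the job, return the id."""
    job_id = next(_ids)
    jobs.put((job_id, ops))
    return job_id


def worker():
    while True:
        job_id, ops = jobs.get()  # blocks until a job is available
        results[job_id] = ["%s: success" % op for op in ops]
        jobs.task_done()


# A fixed pool of workers; whichever is free picks up the next job.
pool = [threading.Thread(target=worker, daemon=True) for _ in range(3)]
for thread in pool:
    thread.start()
```

``submit()`` returns immediately with the identifier; the client then
waits for status updates as in steps three and four.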
.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and we
   do a master failover, the jobs are still visible on the new master
   (though marked as failed).

Failures to replicate a job to other nodes will be only flagged as
errors in the master daemon log if more than half of the nodes failed,

...


- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical
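Such a consistency check could be as simple as comparing a checksum of
the serialized job files per node (an illustrative helper, not part of
the design):

```python
import hashlib
import json


def jobs_checksum(job_files):
    """Checksum a mapping of job file name -> file contents; since all
    nodes should hold identical job files, equal checksums mean the
    queues are consistent."""
    payload = json.dumps(job_files, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


master = {"job-1": '{"status": "success"}', "job-2": '{"status": "queued"}'}
replica = {"job-1": '{"status": "success"}', "job-2": '{"status": "queued"}'}
consistent = jobs_checksum(master) == jobs_checksum(replica)
```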

The other possible choices that were discussed and discounted were:

- single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a temporary
file and subsequent renaming. Except for log messages, every change in a
job is stored and replicated to other nodes.
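The write-to-temporary-then-rename pattern might look like this (the
helper name is illustrative; it relies on POSIX rename being atomic):

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace *path* atomically: write a temporary file in the same
    directory, flush it to disk, then rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as temp_file:
            temp_file.write(data)
            temp_file.flush()
            os.fsync(temp_file.fileno())
        os.rename(tmp, path)  # atomic on POSIX filesystems
    except Exception:
        os.unlink(tmp)
        raise
```

Readers either see the old file or the complete new one, never a
half-written job.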

::

...

Locking
+++++++

Locking in the job queue is a complicated topic. It is called from more
than one thread and must be thread-safe. For simplicity, a single lock
is used for the whole job queue.

A more detailed description can be found in doc/locking.rst.

...

Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job changed is defined by the fields passed and
  the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if
  the job has not been canceled or finished.
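The semantics of these calls can be sketched with an in-memory stand-in
for the master side (hypothetical; the real exchange happens over the
LUXI socket):

```python
class FakeMaster:
    """In-memory sketch of the master side of the client RPC."""

    def __init__(self):
        self._jobs = {}
        self._next_id = 1

    def SubmitJob(self, ops):
        # Identifiers are unique for the lifetime of the "cluster".
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = {"ops": ops, "status": "queued"}
        return job_id

    def QueryJobs(self, job_ids, fields):
        return [[self._jobs[jid][f] for f in fields] for jid in job_ids]

    def CancelJob(self, job_id):
        # May fail if the job is already running, canceled or finished.
        job = self._jobs[job_id]
        if job["status"] != "queued":
            return False
        job["status"] = "canceled"
        return True
```

A queued job can be canceled once; a second CancelJob on the same job
reports failure, mirroring the "may fail" wording above.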


Job and opcode status

...

Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to
the Error status once the master is started again.


History

...


For example: memory, vcpus, auto_balance

All these parameters will be encoded into constants.py with the prefix
"BE\_" and the whole list of parameters will exist in the set
"BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of a
  LV, or the MAC of a NIC)

As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “nil”) will no longer be a valid value for a parameter. As

...

- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the
RpcRunner when sending an instance for activation/stop, and the sent
instance hvparams/beparams will have the final value (noded code doesn't
know about defaults).
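The fill transformation is essentially a dictionary merge of cluster
defaults and instance overrides; a sketch with made-up parameter
values:

```python
# Illustrative cluster-level beparams, keyed by be_type.
CLUSTER_BEPARAMS = {
    "default": {"memory": 128, "vcpus": 1, "auto_balance": True},
}


def fill_be(instance_beparams, be_type="default"):
    """Return the effective beparams for an instance: cluster defaults
    overridden by whatever the instance itself specifies."""
    result = dict(CLUSTER_BEPARAMS[be_type])
    result.update(instance_beparams)  # instance values take precedence
    return result
```

Here ``fill_be({"memory": 512})`` keeps the cluster's vcpus and
auto_balance defaults but uses the instance's memory value, giving the
final dict that would be sent to noded.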

LU code will need to self-call the transformation, if needed.

...

The parameter changes will have an impact on the OpCodes, especially on
the following ones:

- ``OpCreateInstance``, where the new hv and be parameters will be sent
  as dictionaries; note that all hv and be parameters are now optional,
  as the values can be instead taken from the cluster
- ``OpQueryInstances``, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one

...

Caveats:

- some operation semantics are less clear (e.g. what to do on instance
  start with offline secondary?); for now, these will just fail as if
  the flag is not set (but faster)
- 2-node cluster with one node offline needs manual startup of the
  master with a special flag to skip voting (as the master can't get a
  quorum there)
1133 | 1149 |
clean the above instance(s) |
1134 | 1150 |
|
1135 | 1151 |
In order to prevent this situation, and to be able to get nodes into |
1136 |
proper offline status easily, a new *drained* flag was added to the nodes. |
|
1152 |
proper offline status easily, a new *drained* flag was added to the |
|
1153 |
nodes. |
|
1137 | 1154 |
|

This flag (which actually means "is being, or was drained, and is
expected to go offline"), will prevent allocations on the node, but

...

assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as
'rare' events, and that each of them needs to be manually handled in
order to ensure data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or
  cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery
Note that we still don't have fully-automated disk recovery as a goal,
but our goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), such that removing the support for the ``remote_raid1``
  template and focusing only on DRBD8 is easier
- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a
  static assignment (done at either instance creation time or change
  secondary time) will change the disk activation time from O(n) to
  O(1), which on big clusters is a significant gain
- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing file-based
  storage
1203 | 1221 |
Additionally, a number of smaller enhancements are also planned: |
1204 | 1222 |
- support variable number of disks |
... | ... | |
1326 | 1344 |
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets
  reduced to a requirement to reserve memory anywhere on the cluster

- the need to first failover and then replace secondary for an
  instance: with failover-to-any, we can directly failover to

...

made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)

...

- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no data,
  it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary role
  (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove the old data on
  the old node that has not been chosen for

...

OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2.
The interface is composed of a series of scripts which get called with
certain parameters to perform OS-dependent operations on the cluster.
The current scripts are:

create
  called when a new instance is added to the cluster

...

rename
  called to perform the os-specific operations necessary for renaming an
  instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example they accept exactly one block device and one swap device to
operate on, rather than any number of generic block devices; they
blindly assume that an instance will have just one network interface to
operate; and they cannot be configured to optimise the instance for a
particular hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors and a
non-fixed number of networks and disks, the OS interface needs to change
to transmit the appropriate amount of information about an instance to
its managing operating system, when operating on it. Moreover, since
some old assumptions usually used in OS scripts are no longer valid, we
need to re-establish a common understanding of what can and cannot be
assumed regarding the Ganeti environment.

When designing the new OS API our priorities are:

...

- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to
support an OS, and make it easy to share code between them by making
their input uniform. We will also leave the current script structure
unchanged, as far as we can, and make a few of the scripts (import,
export and rename) optional. Most information will be passed to the
script through environment variables, for ease of access and at the same
time ease of using only the information a script needs.

The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs
to support the following functionality, through scripts:

create:
  used to create a new instance running that OS. This script should
  prepare the block devices, and install them so that the new OS can
  boot under the specified hypervisor.
export (optional):
  used to export an installed instance using the given OS to a format
  which can be used to import it back into a new instance.
import (optional):
  used to import an exported instance into a new one. This script is
  similar to create, but the new instance should have the content of the
  export, rather than contain a pristine installation.
rename (optional):
  used to perform the internal OS-specific operations needed to rename
  an instance.

If any optional script is not implemented, Ganeti will refuse to perform
the given operation on instances using the non-implementing OS. Of
course the create script is mandatory, and it doesn't make sense to
support either the export or the import operation but not both.

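A minimal sketch of how this presence rule could be enforced; the helper
names and the assumption that the per-OS scripts live as executables in
an OS directory are illustrative only, not part of the design:

```shell
#!/bin/sh
# Hypothetical sketch (not Ganeti code) of the optional-script rule:
# an operation is supported only if the OS provides an executable
# script for it. Helper names and the directory layout are assumptions.
set -eu

op_supported() {
    os_dir=$1
    op=$2
    # the operation's script must exist and be executable
    [ -x "$os_dir/$op" ]
}

# export and import only make sense together
export_import_supported() {
    op_supported "$1" export && op_supported "$1" import
}
```

A caller would check ``op_supported "$os_dir" rename`` before queueing a
rename, and forbid the operation otherwise.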
Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for 1.2
and the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in 2.0
  we'll use environment variables, as there will be a lot more
  information and not all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device
  the instance has, and import scripts once for every exported disk.
  Imported instances will be forced to have a number of disks greater or
  equal to the one of the export.
- Some scripts are not compulsory: if such a script is missing the
  relevant operations will be forbidden for instances of that OS. This
  makes it easier to distinguish between unsupported operations and
  no-op ones (if any).

Input
_____

Rather than using command line flags, as they do now, scripts will
accept inputs from environment variables. We expect the following input
values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  ...
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm',
  'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  ...
DISK_<N>_ACCESS
  W if read/write, R if read only. OS scripts are not supposed to touch
  read-only disks, but they are passed so the scripts know about them.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide',
  'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop' or
  'file:blktap'

...

NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes.
  Currently the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
...

OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The
  data must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The
  data must be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance
name is always needed and we could pass it on the command line. On the
other hand, though, this would force scripts to both access the
environment and parse the command line, so we'll move it for
uniformity.)

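To make the environment-based input concrete, here is a hedged sketch of
how a create-style script could consume the variables listed above. The
demo default values and the "would format" action are placeholders; only
variables documented above are read:

```shell
#!/bin/sh
# Illustrative sketch only: a 'create'-style script reading its input
# from the environment rather than from command-line flags. The default
# values below are demo assumptions so the sketch runs standalone.
set -eu

INSTANCE_NAME=${INSTANCE_NAME:-demo1}
DISK_COUNT=${DISK_COUNT:-2}
DISK_0_ACCESS=${DISK_0_ACCESS:-W}
DISK_1_ACCESS=${DISK_1_ACCESS:-R}

format_writable_disks() {
    # Walk the DISK_COUNT disks; read-only (R) disks are passed for
    # information only and must not be touched.
    i=0
    while [ "$i" -lt "$DISK_COUNT" ]; do
        access=$(eval "printf '%s' \"\$DISK_${i}_ACCESS\"")
        if [ "$access" = "W" ]; then
            # user-targeted information goes to stderr
            echo "instance $INSTANCE_NAME: would format disk $i" >&2
        fi
        i=$((i + 1))
    done
}

format_writable_disks
```

Running it with real values is then just a matter of exporting the
variables before invoking the script, e.g.
``DISK_COUNT=1 DISK_0_ACCESS=W ./create``.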
Output/Behaviour
________________

As discussed scripts should only send user-targeted information to
stderr. The create and import scripts are supposed to format/initialise
the given block devices and install the correct instance data. The
export script is supposed to export instance data to stdout in a format
understandable by the import script. The data will be compressed by
Ganeti, so no compression should be done. The rename script should only
modify the instance's knowledge of what its name is.

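The stdout/stdin convention could look like the following sketch; the
use of ``dd`` and the function names are illustrative assumptions, and
the important point is that the data flows uncompressed through the
standard streams:

```shell
#!/bin/sh
# Minimal sketch of the export/import data path. Ganeti compresses the
# stream itself, so the scripts move raw, uncompressed data only.
set -eu

do_export() {
    # dump the snapshot device named by EXPORT_DEVICE to stdout
    dd if="$EXPORT_DEVICE" bs=64k 2>/dev/null
}

do_import() {
    # fill the new instance's device named by IMPORT_DEVICE from stdin
    dd of="$IMPORT_DEVICE" bs=64k 2>/dev/null
}
```

Ganeti would connect the two ends, compressing in between, roughly as
``do_export | compress | transfer | decompress | do_import``.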
Other declarative style features
++++++++++++++++++++++++++++++++

... containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

In addition to that, an OS will be able to declare that it supports only
a subset of the Ganeti hypervisors, by declaring them in the
'hypervisors' file.

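Assuming, purely for illustration, that the 'hypervisors' file lists one
hypervisor name per line (the actual format is not specified here), the
support check could be as simple as:

```shell
#!/bin/sh
# Hypothetical check against the 'hypervisors' file. The one-name-per-
# line file format is an assumption made for this sketch, not part of
# the design above.
set -eu

os_supports_hypervisor() {
    hv=$1
    os_dir=$2
    # -x: match the whole line, -F: fixed string, -q: exit status only
    grep -qxF "$hv" "$os_dir/hypervisors"
}
```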
Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just
dumps all disks and restores them. This can save work as most systems
will just do this, while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there will
be enough space to store the information we need. If we discover that
this is not the case we may want to go to a more complex API, such as
storing that information on the filesystem and providing the OS script
with the path to a file where it is encoded in some format.