Revision 7faf5110

b/NEWS
- Added ``--ignore-size`` to the ``gnt-instance activate-disks`` command
  to allow using the pre-2.0.2 behaviour in activation, if any existing
  instances have mismatched disk sizes in the configuration
- Added ``gnt-cluster repair-disk-sizes`` command to check and update
  any configuration mismatches for disk sizes
- Added ``gnt-master cluster-failover --no-voting`` to allow master
  failover to work on two-node clusters
- Fixed the ``--net`` option of ``gnt-backup import``, which was unusable
......
- the watcher now also restarts the node daemon and the rapi daemon if
  they died
- fixed the watcher to handle full and drained queue cases
- hooks export more instance data in the environment, which helps if
  hook scripts need to take action based on the instance's properties
  (no longer need to query back into ganeti)
- instance failovers when the instance is stopped do not check for free
  RAM, so that failing over a stopped instance is possible in low memory
  situations
......

  - all commands are executed by a daemon (``ganeti-masterd``) and the
    various ``gnt-*`` commands are just front-ends to it
  - all the commands are entered into, and executed from a job queue,
    see the ``gnt-job(8)`` manpage
  - the RAPI daemon supports read-write operations, secured by basic
    HTTP authentication on top of HTTPS
  - DRBD version 0.7 support has been removed, DRBD 8 is the only
    supported version (when migrating from Ganeti 1.2 to 2.0, you need
    to migrate to DRBD 8 first while still running Ganeti 1.2)
......
- Change the default reboot type in ``gnt-instance reboot`` to "hard"
- Reuse the old instance MAC address by default on instance import, if
  the instance name is the same.
- Handle situations in which the node info rpc returns incomplete
  results (issue 46)
- Add checks for tcp/udp port collisions in ``gnt-cluster verify``
- Improved version of batcher:
......
- new ``--hvm-nic-type`` and ``--hvm-disk-type`` flags to control the
  type of NIC and disk exported to fully virtualized instances.
- provide access to the serial console of HVM instances
- instance auto_balance flag, set by default. If turned off it will
  avoid warnings on cluster verify if there is not enough memory to fail
  over an instance. In the future it will prevent automatically failing
  it over when we support that.
- batcher tool for instance creation, see ``tools/README.batcher``
- ``gnt-instance reinstall --select-os`` to interactively select a new
  operating system when reinstalling an instance.
......
Version 1.2.0
-------------

- Log the ``xm create`` output to the node daemon log on failure (to
  help diagnose the error)
- In debug mode, log the output of all failed external commands
- Change parsing of lvm commands to ignore stderr
......
  reboots
- Removed dependency on Debian's patched fping that uses the
  non-standard ``-S`` option
- Now the OS definitions are searched for in multiple, configurable
  paths (easier for distros to package)
- Some changes to the hooks infrastructure (especially the new
  post-configuration update hook)
- Other small bugfixes
b/doc/admin.rst
you want to remove Ganeti completely, you need to also undo some of
the SSH changes and log directories:

- ``rm -rf /var/log/ganeti /srv/ganeti`` (replace with the correct
  paths)
- remove from ``/root/.ssh`` the keys that Ganeti added (check
  the ``authorized_keys`` and ``id_dsa`` files)
- regenerate the host's SSH keys (check the OpenSSH startup scripts)
b/doc/design-2.0.rst
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical
design:

- fixed number of disks (two) per instance
- fixed number of NICs
......

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for
  example failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive
......

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so-called
  *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:
......
There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method of
  passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
......
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that business logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  request, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTPS protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is with the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement many workarounds to changes
......
Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:

......
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a Non-Goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti-level operations, aka
Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent from their usage
- automatically grabbing multiple locks in the right order (avoid
  deadlock)
- ability to transparently handle conversion to more granularity
- support asynchronous operation (future goal)
......
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time.
Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and will fail.


The Locks
......
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and
the node locks before the config lock. Locks will need to be acquired at
the same time for multiple instances and nodes, and internal ordering
will be dealt with within the locking library, which, for simplicity,
will just use alphabetical order.
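
As a rough illustration of this ordering rule (not the actual library
code; the class and names below are made up for this sketch), acquiring
several locks could look like this::

  import threading

  # lower level number is acquired first: instances, then nodes, then
  # the config lock
  LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = range(3)

  class OrderedLocks(object):
    """Toy container that always acquires locks in a fixed order."""

    def __init__(self):
      self._locks = {}  # (level, name) -> threading.Lock

    def add(self, level, name):
      self._locks[(level, name)] = threading.Lock()

    def acquire(self, names_per_level):
      """names_per_level: dict mapping level -> iterable of lock names."""
      acquired = []
      # sorting by (level, name) gives: all instance locks before node
      # locks, node locks before the config lock, and alphabetical order
      # within each level
      for key in sorted((level, name)
                        for level, names in names_per_level.items()
                        for name in names):
        self._locks[key].acquire()
        acquired.append(key)
      return acquired

    def release(self, acquired):
      for key in reversed(acquired):
        self._locks[key].release()

  # usage sketch:
  #   locks = OrderedLocks()
  #   locks.add(LEVEL_INSTANCE, "inst1"); locks.add(LEVEL_NODE, "node1")
  #   held = locks.acquire({LEVEL_INSTANCE: ["inst1"],
  #                         LEVEL_NODE: ["node1"]})
  #   try: ... finally: locks.release(held)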

Each lock has the following three possible statuses:

......
Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently each time
we split a lock into more we'll create a "metalock", which will depend
on those sub-locks and live for the time necessary for all the code to
convert (or forever, in some conditions). When a metalock exists all
converted code must acquire it in shared mode, so it can run
concurrently, but still be exclusive with old code, which acquires it
exclusively.

In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.
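
As a rough sketch of this shared-vs-exclusive convention (the class
below is a simplified stand-in written for this illustration, not
Ganeti's actual lock implementation)::

  import threading

  class IllustrativeSharedLock(object):
    """Tiny shared/exclusive lock, just enough for the metalock example."""

    def __init__(self):
      self._cond = threading.Condition()
      self._shared_holders = 0
      self._exclusive = False

    def acquire(self, shared=0):
      self._cond.acquire()
      try:
        if shared:
          while self._exclusive:
            self._cond.wait()
          self._shared_holders += 1
        else:
          while self._exclusive or self._shared_holders:
            self._cond.wait()
          self._exclusive = True
      finally:
        self._cond.release()

    def release(self):
      self._cond.acquire()
      try:
        if self._exclusive:
          self._exclusive = False
        else:
          self._shared_holders -= 1
        self._cond.notify_all()
      finally:
        self._cond.release()

  # The "Big Ganeti Lock" metalock: old, unconverted code acquires it
  # exclusively, converted code acquires it in shared mode and can
  # therefore run concurrently with other converted code.
  BGL = IllustrativeSharedLock()

  def old_style_operation():
    BGL.acquire()          # exclusive: nothing else runs concurrently
    try:
      pass                 # ... whole-cluster work ...
    finally:
      BGL.release()

  def converted_operation():
    BGL.acquire(shared=1)  # shared: other converted code may run too
    try:
      pass                 # ... work protected by its own sub-locks ...
    finally:
      BGL.release()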

We might also want to devise more metalocks (e.g. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what it needs without specifying it explicitly.

In the future things like the node locks could become metalocks, should
we decide to split them into an even more fine-grained approach, but
this will probably be only after the first 2.0 version has been
released.

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be
added to the list. The relevant code will need to inform the locking
library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
......
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available
locks, rather than making them just blindly queue for acquiring them.
The inherent risk, though, is that any code using the first operation,
or setting a timeout for the second one, is susceptible to starvation
and thus may never be able to get the required locks and complete
certain tasks. Considering this, providing/using these operations should
not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level. In the future we may want to split logical units into
independent "tasklets" with their own locking requirements. A different
design doc (or mini design doc) will cover the move from Logical Units
to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
......
syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

In order to avoid this extra indentation and code changes everywhere in
the Logical Units code, we decided to allow LUs to declare locks, and
then execute their code with their locks acquired. In the new world LUs
are called like this::

  # user-passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
......
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation
on how locks are supposed to be declared.

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock
everything" lock to the new system, though, needs to be carefully
scrutinised to be sure it is really acquiring all the necessary locks,
and none has been overlooked or forgotten.

The code can contain other locks outside of this library, to synchronise
other threaded code (e.g. for the job queue) but in general these should
be leaf locks or carefully structured non-leaf ones, to avoid deadlock
race conditions.


Job Queue
......
Job execution—“Life of a Ganeti job”
++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned to
   the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job
   has to wait and the first worker finishing its work will grab it.
   Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.

.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and we
   do a master failover, the jobs are still visible on the new master
   (though marked as failed).

Failures to replicate a job to other nodes will be only flagged as
errors in the master daemon log if more than half of the nodes failed,
......

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a temporary
file and subsequent renaming. Except for log messages, every change in a
job is stored and replicated to other nodes.
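
A minimal sketch of this write-and-rename pattern (illustrative only;
the real queue code may differ in details such as temporary file naming
and fsync policy)::

  import os
  import tempfile

  def atomic_write(path, data):
    """Replace ``path`` atomically: write a temp file, then rename it."""
    fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
      os.write(fd, data)
      os.fsync(fd)  # make sure the data hits the disk before the rename
    finally:
      os.close(fd)
    os.rename(tmp_name, path)  # atomic replacement on POSIX filesystems

  # e.g. atomic_write("/path/to/queue/job-42", serialized_job_contents)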

::

......
Locking
+++++++

Locking in the job queue is a complicated topic. It is called from more
than one thread and must be thread-safe. For simplicity, a single lock
is used for the whole job queue.

A more detailed description can be found in doc/locking.rst.
......
Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations (a usage sketch follows this list):

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job changed is defined by the fields passed and
  the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if
  the job has not been canceled or finished.
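
A hedged sketch of how a client might drive these calls (the ``client``
object below is hypothetical and simply assumed to expose the operations
listed above as methods; the real client library and its return values
may differ)::

  import time

  def run_job(client, opcodes, poll_interval=1.0):
    """Submit a list of opcodes and poll until the job finishes.

    Sketch only: assumes QueryJobs returns one row of field values per
    job id and that the final status names are "success", "error" and
    "canceled" (the real names may differ).
    """
    job_id = client.SubmitJob(opcodes)
    while True:
      (status,) = client.QueryJobs([job_id], ["status"])[0]
      if status in ("success", "error", "canceled"):
        break
      time.sleep(poll_interval)
    # the job has finished or was canceled, so it may be archived now
    client.ArchiveJob(job_id)
    return status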


Job and opcode status
......
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to
the Error status once the master is started again.


History
......

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the prefix
  "BE\_" and the whole list of parameters will exist in the set
  "BES_PARAMETERS" (see the illustrative sketch after this list)

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of a
  LV, or the MAC of a NIC)
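
Purely as an illustration of that naming convention (the exact constant
names in ``constants.py`` may differ), the encoding could look like::

  # hypothetical excerpt, following the "BE_" prefix convention above
  BE_MEMORY = "memory"
  BE_VCPUS = "vcpus"
  BE_AUTO_BALANCE = "auto_balance"

  BES_PARAMETERS = frozenset([
    BE_MEMORY,
    BE_VCPUS,
    BE_AUTO_BALANCE,
    ])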

As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “nil”) will no longer be a valid value for a parameter. As
......
- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the
RpcRunner when sending an instance for activation/stop, and the sent
instance hvparams/beparams will have the final value (noded code doesn't
know about defaults).
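
Conceptually the fill operation is just a dictionary merge; a minimal
sketch (not the real ``Cluster.FillBE`` implementation, which may handle
more cases) could be::

  def fill_be_params(cluster_beparams, instance_beparams, be_type="default"):
    """Return the effective beparams for an instance.

    Cluster-level defaults (selected by ``be_type``) are overridden by
    any values set on the instance itself.
    """
    result = dict(cluster_beparams.get(be_type, {}))
    result.update(instance_beparams)
    return result

  # e.g. with cluster defaults {"default": {"memory": 128, "vcpus": 1}}
  # and instance overrides {"memory": 512}, the effective parameters
  # are {"memory": 512, "vcpus": 1}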

LU code will need to self-call the transformation, if needed.
......
The parameter changes will have an impact on the OpCodes, especially on
the following ones:

- ``OpCreateInstance``, where the new hv and be parameters will be sent
  as dictionaries; note that all hv and be parameters are now optional,
  as the values can be instead taken from the cluster
- ``OpQueryInstances``, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
......
Caveats:

- some operation semantics are less clear (e.g. what to do on instance
  start with offline secondary?); for now, these will just fail as if
  the flag is not set (but faster)
- 2-node cluster with one node offline needs manual startup of the
  master with a special flag to skip voting (as the master can't get a
  quorum there)
......
  clean the above instance(s)

In order to prevent this situation, and to be able to get nodes into
proper offline status easily, a new *drained* flag was added to the
nodes.

This flag (which actually means "is being, or was drained, and is
expected to go offline") will prevent allocations on the node, but
......
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as
'rare' events, and that each of them needs to be manually handled in
order to ensure data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or
  cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal,
but our goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), such that removing the support for the ``remote_raid1``
  template and focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a
  static assignment (done at either instance creation time or change
  secondary time) will change the disk activation time from O(n) to
  O(1), which on big clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing file-based
  storage

Additionally, a number of smaller enhancements are also planned:
- support variable number of disks
......
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets reduced
  to a need to reserve memory anywhere on the cluster

- the need to first failover and then replace secondary for an
  instance: with failover-to-any, we can directly failover to
......
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)
......
- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no data,
  it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary role
  (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
......
OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2.
The interface is composed of a series of scripts which get called with
certain parameters to perform OS-dependent operations on the cluster.
The current scripts are:

create
  called when a new instance is added to the cluster
......
  called to perform the os-specific operations necessary for renaming an
  instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example they accept exactly one block and one swap device to operate
on, rather than any number of generic block devices; they blindly assume
that an instance will have just one network interface; and they cannot
be configured to optimise the instance for a particular hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors, and a
non-fixed number of networks and disks, the OS interface needs to change
to transmit the appropriate amount of information about an instance to
its managing operating system, when operating on it. Moreover, since
some old assumptions usually used in OS scripts are no longer valid, we
need to re-establish a common understanding of what can and cannot be
assumed regarding the Ganeti environment.


When designing the new OS API our priorities are:
......
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to
support an OS, and make it easy to share code between them by uniforming
their input. We also will leave the current script structure unchanged,
as far as we can, and make a few of the scripts (import, export and
rename) optional. Most information will be passed to the script through
environment variables, for ease of access and at the same time ease of
using only the information a script needs.


The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs
to support the following functionality, through scripts:

create:
  used to create a new instance running that OS. This script should
  prepare the block devices, and install them so that the new OS can
  boot under the specified hypervisor.
export (optional):
  used to export an installed instance using the given OS to a format
  which can be used to import it back into a new instance.
import (optional):
  used to import an exported instance into a new one. This script is
  similar to create, but the new instance should have the content of the
  export, rather than contain a pristine installation.
rename (optional):
  used to perform the internal OS-specific operations needed to rename
  an instance.

If any optional script is not implemented Ganeti will refuse to perform
the given operation on instances using the non-implementing OS. Of
course the create script is mandatory, and it doesn't make sense to
support either the export or the import operation but not both.

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for 1.2
and the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in 2.0
  we'll use environment variables, as there will be a lot more
  information and not all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device
  the instance has, and import scripts once for every exported disk.
  Imported instances will be forced to have a number of disks greater
  than or equal to that of the export.
- Some scripts are not compulsory: if such a script is missing the
  relevant operations will be forbidden for instances of that OS. This
  makes it easier to distinguish between unsupported operations and
  no-op ones (if any).


Input
_____

Rather than using command line flags, as they do now, scripts will
accept inputs from environment variables. We expect the following input
values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
......
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm',
  'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
......
  W if read/write, R if read only. OS scripts are not supposed to touch
  read-only disks, but these are still passed so the scripts know about
  them.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide',
  'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop' or
  'file:blktap'
......
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes.
  Currently the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
......
OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The
  data must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The data
  must be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance
name is always needed and we could pass it on the command line. On the
other hand, though, this would force scripts to both access the
environment and parse the command line, so we'll move it for
uniformity.)
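
As a hedged example of what consuming these variables might look like
(OS definitions can be written in any language; this sketch uses Python
and touches only the variables documented above)::

  #!/usr/bin/env python
  # Sketch of a hypothetical "create" script: it only reads the
  # documented environment variables and reports what it would do.
  import os
  import sys

  def main():
    name = os.environ["INSTANCE_NAME"]
    hypervisor = os.environ["HYPERVISOR"]
    disk_count = int(os.environ["DISK_COUNT"])
    debug = int(os.environ.get("DEBUG_LEVEL", "0"))

    for idx in range(disk_count):
      frontend = os.environ["DISK_%d_FRONTEND_TYPE" % idx]
      backend = os.environ["DISK_%d_BACKEND_TYPE" % idx]
      if debug:
        # user-targeted/debug output goes to stderr, never to stdout
        sys.stderr.write("%s (%s): disk %d frontend=%s backend=%s\n" %
                         (name, hypervisor, idx, frontend, backend))
      # ... format the device and install the OS here ...
    return 0

  if __name__ == "__main__":
    sys.exit(main())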


Output/Behaviour
________________

As discussed, scripts should only send user-targeted information to
stderr. The create and import scripts are supposed to format/initialise
the given block devices and install the correct instance data. The
export script is supposed to export instance data to stdout in a format
understandable by the import script. The data will be compressed by
Ganeti, so no compression should be done. The rename script should only
modify the instance's knowledge of what its name is.

Other declarative style features
++++++++++++++++++++++++++++++++
......
containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

In addition to that an OS will be able to declare that it does support
only a subset of the Ganeti hypervisors, by declaring them in the
'hypervisors' file.


Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just
dumps all disks and restores them. This can save work as most systems
will just do this, while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there will
be enough space to store the information we need. If we discover that
this is not the case we may want to go to a more complex API such as
storing that information on the filesystem and providing the OS script
with the path to a file where it is encoded in some format.
b/doc/design-2.1.rst
This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4
......
=========

Ganeti 2.1 will add features to help further automation of cluster
operations, further improve scalability to even bigger clusters, and
make it easier to debug the Ganeti core.

Background
==========
......

As for 2.0 we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)
......
- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage type, for
example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion
......
++++++++++++++++++++++++++++++

The class ``LockSet`` (see ``lib/locking.py``) is a container for one or
many ``SharedLock`` instances. It provides an interface to add/remove
locks and to acquire and subsequently release any number of those locks
contained in it.

Locks in a ``LockSet`` are always acquired in alphabetic order. Due to
the way we're using locks for nodes and instances (the single cluster
lock isn't affected by this issue) this can lead to long delays when
acquiring locks if another operation tries to acquire multiple locks but
has to wait for yet another operation.

In the following demonstration we assume we have the instance locks
``inst1``, ``inst2``, ``inst3`` and ``inst4``.

#. Operation A grabs the lock for instance ``inst4``.
#. Operation B wants to acquire all instance locks in alphabetic order,
   but it has to wait for ``inst4``.
#. Operation C tries to lock ``inst1``, but it has to wait until
   Operation B (which is trying to acquire all locks) releases the lock
   again.
......
Non-blocking lock acquiring
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Acquiring locks for OpCode execution is always done in blocking mode.
The acquire calls won't return until the lock has successfully been
acquired (or an error occurred, although we won't cover that case here).

``SharedLock`` and ``LockSet`` must be able to be acquired in a
non-blocking way. They must support a timeout and abort trying to
acquire the lock(s) after the specified amount of time.
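
A rough sketch of the timeout semantics wanted here (a standalone toy
lock for illustration, not the proposed ``SharedLock`` redesign)::

  import threading
  import time

  class TimedLock(object):
    """Exclusive-only toy lock whose acquire() accepts a timeout."""

    def __init__(self):
      self._cond = threading.Condition()
      self._held = False

    def acquire(self, timeout=None):
      """Return True if acquired, False if the timeout expired first."""
      deadline = None
      if timeout is not None:
        deadline = time.time() + timeout
      self._cond.acquire()
      try:
        while self._held:
          if deadline is None:
            self._cond.wait()
          else:
            remaining = deadline - time.time()
            if remaining <= 0:
              return False
            self._cond.wait(remaining)
        self._held = True
        return True
      finally:
        self._cond.release()

    def release(self):
      self._cond.acquire()
      try:
        self._held = False
        self._cond.notify()
      finally:
        self._cond.release()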

Retry acquiring locks
^^^^^^^^^^^^^^^^^^^^^

To prevent other operations from waiting for a long time, such as
described in the demonstration before, ``LockSet`` must not keep locks
for a prolonged period of time when trying to acquire two or more locks.
Instead it should, with an increasing timeout for acquiring all locks,
release all locks again and sleep some time if it fails to acquire all
requested locks.

A good timeout value needs to be determined. In any case, ``LockSet``
should proceed to acquire locks in blocking mode after a few
(unsuccessful) attempts to acquire all requested locks.

One proposal for the timeout is to use ``2**tries`` seconds, where
``tries`` is the number of unsuccessful tries.
147 149

  
148
In the demonstration before this would allow Operation C to continue after
149
Operation B unsuccessfully tried to acquire all locks and released all
150
acquired locks (``inst1``, ``inst2`` and ``inst3``) again.
150
In the demonstration before this would allow Operation C to continue
151
after Operation B unsuccessfully tried to acquire all locks and released
152
all acquired locks (``inst1``, ``inst2`` and ``inst3``) again.
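
A minimal sketch of such a retry loop, assuming a hypothetical
``lockset.acquire(names, timeout=...)`` call that returns the acquired
names on success and ``None`` if the timeout expired (the real
``LockSet`` interface may differ)::

  import time

  def acquire_all_with_backoff(lockset, names, max_tries=5):
    # Hypothetical helper, not part of the Ganeti code base.
    for tries in range(max_tries):
      # Increasing timeout for acquiring all requested locks.
      acquired = lockset.acquire(names, timeout=2 ** tries)
      if acquired is not None:
        return acquired
      # Nothing is held at this point, so other operations (such as
      # Operation C in the demonstration) can make progress while we
      # sleep before the next attempt.
      time.sleep(1)
    # After a few unsuccessful attempts, fall back to blocking mode.
    return lockset.acquire(names)

With ``max_tries=5`` (an arbitrary value for this sketch) the
acquisition timeouts would be 1, 2, 4, 8 and 16 seconds before the
final blocking attempt.
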
151 153

  
152 154
Other solutions discussed
153 155
+++++++++++++++++++++++++
154 156

  
155
There was also some discussion on going one step further and extend the job
156
queue (see ``lib/jqueue.py``) to select the next task for a worker depending
157
on whether it can acquire the necessary locks. While this may reduce the
158
number of necessary worker threads and/or increase throughput on large
159
clusters with many jobs, it also brings many potential problems, such as
160
contention and increased memory usage, with it. As this would be an
161
extension of the changes proposed before it could be implemented at a later
162
point in time, but we decided to stay with the simpler solution for now.
157
There was also some discussion on going one step further and extending
158
the job queue (see ``lib/jqueue.py``) to select the next task for a
159
worker depending on whether it can acquire the necessary locks. While
160
this may reduce the number of necessary worker threads and/or increase
161
throughput on large clusters with many jobs, it also brings with it
162
many potential problems, such as contention and increased memory usage.
163
As this would be an extension of the changes proposed before, it could be
164
implemented at a later point in time, but we decided to stay with the
165
simpler solution for now.
163 166

  
164 167
Implementation details
165 168
++++++++++++++++++++++
......
169 172

  
170 173
The current design of ``SharedLock`` is not good for supporting timeouts
171 174
when acquiring a lock and there are also minor fairness issues in it. We
172
plan to address both with a redesign. A proof of concept implementation was
173
written and resulted in significantly simpler code.
174

  
175
Currently ``SharedLock`` uses two separate queues for shared and exclusive
176
acquires and waiters get to run in turns. This means if an exclusive acquire
177
is released, the lock will allow shared waiters to run and vice versa.
178
Although it's still fair in the end there is a slight bias towards shared
179
waiters in the current implementation. The same implementation with two
180
shared queues can not support timeouts without adding a lot of complexity.
181

  
182
Our proposed redesign changes ``SharedLock`` to have only one single queue.
183
There will be one condition (see Condition_ for a note about performance) in
184
the queue per exclusive acquire and two for all shared acquires (see below for
185
an explanation). The maximum queue length will always be ``2 + (number of
186
exclusive acquires waiting)``. The number of queue entries for shared acquires
187
can vary from 0 to 2.
188

  
189
The two conditions for shared acquires are a bit special. They will be used
190
in turn. When the lock is instantiated, no conditions are in the queue. As
191
soon as the first shared acquire arrives (and there are holder(s) or waiting
192
acquires; see Acquire_), the active condition is added to the queue. Until
193
it becomes the topmost condition in the queue and has been notified, any
194
shared acquire is added to this active condition. When the active condition
195
is notified, the conditions are swapped and further shared acquires are
196
added to the previously inactive condition (which has now become the active
197
condition). After all waiters on the previously active (now inactive) and
198
now notified condition received the notification, it is removed from the
199
queue of pending acquires.
200

  
201
This means shared acquires will skip any exclusive acquire in the queue. We
202
believe it's better to improve parallelization on operations only asking for
203
shared (or read-only) locks. Exclusive operations holding the same lock can
204
not be parallelized.
175
plan to address both with a redesign. A proof of concept implementation
176
was written and resulted in significantly simpler code.
177

  
178
Currently ``SharedLock`` uses two separate queues for shared and
179
exclusive acquires and waiters get to run in turns. This means if an
180
exclusive acquire is released, the lock will allow shared waiters to run
181
and vice versa. Although it's still fair in the end, there is a slight
182
bias towards shared waiters in the current implementation. The same
183
implementation with its two separate queues cannot support timeouts
184
without adding a lot of complexity.
185

  
186
Our proposed redesign changes ``SharedLock`` to have only one single
187
queue.  There will be one condition (see Condition_ for a note about
188
performance) in the queue per exclusive acquire and two for all shared
189
acquires (see below for an explanation). The maximum queue length will
190
always be ``2 + (number of exclusive acquires waiting)``. The number of
191
queue entries for shared acquires can vary from 0 to 2.
192

  
193
The two conditions for shared acquires are a bit special. They will be
194
used in turn. When the lock is instantiated, no conditions are in the
195
queue. As soon as the first shared acquire arrives (and there are
196
holder(s) or waiting acquires; see Acquire_), the active condition is
197
added to the queue. Until it becomes the topmost condition in the queue
198
and has been notified, any shared acquire is added to this active
199
condition. When the active condition is notified, the conditions are
200
swapped and further shared acquires are added to the previously inactive
201
condition (which has now become the active condition). After all waiters
202
on the previously active (now inactive) and now notified condition
203
received the notification, it is removed from the queue of pending
204
acquires.
205

  
206
This means shared acquires will skip any exclusive acquire in the queue.
207
We believe it's better to improve parallelization on operations only
208
asking for shared (or read-only) locks. Exclusive operations holding the
209
same lock cannot be parallelized.
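
The following sketch (invented names, not the actual implementation)
shows the shape of the proposed queue: every exclusive acquire appends
its own condition, while shared acquires reuse one of two conditions,
keeping the queue length at most ``2 + (number of exclusive acquires
waiting)``::

  import threading

  class _SingleQueueSketch(object):
    def __init__(self):
      self.__mutex = threading.Lock()
      self.__queue = []                # pending conditions, oldest first
      self.__shared_conds = [threading.Condition(self.__mutex),
                             threading.Condition(self.__mutex)]
      self.__active_shared = 0         # index of the active condition

    def add_exclusive_waiter(self):
      # One new condition per exclusive acquire.
      cond = threading.Condition(self.__mutex)
      self.__queue.append(cond)
      return cond

    def add_shared_waiter(self):
      # All shared acquires share the currently active condition.
      cond = self.__shared_conds[self.__active_shared]
      if cond not in self.__queue:
        self.__queue.append(cond)
      return cond

    def notify_active_shared(self):
      # When the active condition is notified, the two shared
      # conditions are swapped; later shared acquires queue up behind
      # any exclusive acquires already waiting.
      self.__active_shared = 1 - self.__active_shared
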
205 210

  
206 211

  
207 212
Acquire
208 213
*******
209 214

  
210
For exclusive acquires a new condition is created and appended to the queue.
211
Shared acquires are added to the active condition for shared acquires and if
212
the condition is not yet on the queue, it's appended.
215
For exclusive acquires a new condition is created and appended to the
216
queue.  Shared acquires are added to the active condition for shared
217
acquires and if the condition is not yet on the queue, it's appended.
213 218

  
214
The next step is to wait for our condition to be on the top of the queue (to
215
guarantee fairness). If the timeout expired, we return to the caller without
216
acquiring the lock. On every notification we check whether the lock has been
217
deleted, in which case an error is returned to the caller.
219
The next step is to wait for our condition to be on the top of the queue
220
(to guarantee fairness). If the timeout expired, we return to the caller
221
without acquiring the lock. On every notification we check whether the
222
lock has been deleted, in which case an error is returned to the caller.
218 223

  
219
The lock can be acquired if we're on top of the queue (there is no one else
220
ahead of us). For an exclusive acquire, there must not be other exclusive or
221
shared holders. For a shared acquire, there must not be an exclusive holder.
222
If these conditions are all true, the lock is acquired and we return to the
223
caller. In any other case we wait again on the condition.
224
The lock can be acquired if we're on top of the queue (there is no one
225
else ahead of us). For an exclusive acquire, there must not be other
226
exclusive or shared holders. For a shared acquire, there must not be an
227
exclusive holder.  If these conditions are all true, the lock is
228
acquired and we return to the caller. In any other case we wait again on
229
the condition.
224 230

  
225
If it was the last waiter on a condition, the condition is removed from the
226
queue.
231
If it was the last waiter on a condition, the condition is removed from
232
the queue.
227 233

  
228 234
Optimization: There's no need to touch the queue if there are no pending
229
acquires and no current holders. The caller can have the lock immediately.
235
acquires and no current holders. The caller can have the lock
236
immediately.
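
The rules above can be condensed into a small, self-contained predicate
(a helper invented purely for illustration, not the actual acquire
code)::

  def _can_acquire(shared, first_in_queue, excl_holder, num_shared):
    # Fairness: only the topmost condition in the queue may proceed.
    if not first_in_queue:
      return False
    if shared:
      # A shared acquire is only blocked by an exclusive holder.
      return not excl_holder
    # An exclusive acquire is blocked by any holder, shared or exclusive.
    return not excl_holder and num_shared == 0

  # A shared acquire can join existing shared holders...
  assert _can_acquire(True, True, False, 3)
  # ...while an exclusive acquire has to wait for them to release.
  assert not _can_acquire(False, True, False, 3)
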
230 237

  
231 238
.. image:: design-2.1-lock-acquire.png
232 239

  
......
234 241
Release
235 242
*******
236 243

  
237
First the lock removes the caller from the internal owner list. If there are
238
pending acquires in the queue, the first (the oldest) condition is notified.
244
First the lock removes the caller from the internal owner list. If there
245
are pending acquires in the queue, the first (the oldest) condition is
246
notified.
239 247

  
240 248
If the first condition was the active condition for shared acquires, the
241
inactive condition will be made active. This ensures fairness with exclusive
242
locks by forcing consecutive shared acquires to wait in the queue.
249
inactive condition will be made active. This ensures fairness with
250
exclusive locks by forcing consecutive shared acquires to wait in the
251
queue.
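
A hedged sketch of these release steps, operating on invented data
structures (an ``owners`` set, the pending ``queue`` of conditions and a
``shared_conds`` dict holding the active and inactive shared condition);
it assumes the conditions' underlying mutex is already held, as it would
be inside the real release method::

  def _release_sketch(owners, caller, queue, shared_conds):
    owners.remove(caller)
    if queue:
      oldest = queue[0]
      if oldest is shared_conds["active"]:
        # Swap the two shared conditions so that consecutive shared
        # acquires have to wait behind queued exclusive acquires.
        shared_conds["active"], shared_conds["inactive"] = (
          shared_conds["inactive"], shared_conds["active"])
      # Wake up the oldest waiters; they re-check whether they can now
      # actually acquire the lock.
      oldest.notify_all()
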
243 252

  
244 253
.. image:: design-2.1-lock-release.png
245 254

  
......
247 256
Delete
248 257
******
249 258

  
250
The caller must either hold the lock in exclusive mode already or the lock
251
must be acquired in exclusive mode. Trying to delete a lock while it's held
252
in shared mode must fail.
259
The caller must either hold the lock in exclusive mode already or the
260
lock must be acquired in exclusive mode. Trying to delete a lock while
261
it's held in shared mode must fail.
253 262

  
254
After ensuring the lock is held in exclusive mode, the lock will mark itself
255
as deleted and continue to notify all pending acquires. They will wake up,
256
notice the deleted lock and return an error to the caller.
263
After ensuring the lock is held in exclusive mode, the lock will mark
264
itself as deleted and continue to notify all pending acquires. They will
265
wake up, notice the deleted lock and return an error to the caller.
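
A sketch of the deletion logic under the same assumptions (the
``is_owned``, ``acquire``, ``deleted`` and ``queue`` names are invented
for illustration)::

  def _delete_sketch(lock, timeout=None):
    if not lock.is_owned(shared=0):
      # Not yet held exclusively; acquire it (this fails for shared
      # holders and may time out).
      if not lock.acquire(shared=0, timeout=timeout):
        return False
    lock.deleted = True
    # Notify everybody still waiting; they will notice the deletion and
    # report an error to their callers instead of acquiring the lock.
    for cond in lock.queue:
      cond.notify_all()
    return True
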
257 266

  
258 267

  
259 268
Condition
260 269
^^^^^^^^^
261 270

  
262
Note: This is not necessary for the locking changes above, but it may be a
263
good optimization (pending performance tests).
271
Note: This is not necessary for the locking changes above, but it may be
272
a good optimization (pending performance tests).
264 273

  
265 274
The existing locking code in Ganeti 2.0 uses Python's built-in
266 275
``threading.Condition`` class. Unfortunately ``Condition`` implements
267
timeouts by sleeping 1ms to 20ms between tries to acquire the condition lock
268
in non-blocking mode. This requires unnecessary context switches and
269
contention on the CPython GIL (Global Interpreter Lock).
276
timeouts by sleeping 1ms to 20ms between tries to acquire the condition
277
lock in non-blocking mode. This requires unnecessary context switches
278
and contention on the CPython GIL (Global Interpreter Lock).
270 279

  
271 280
By using POSIX pipes (see ``pipe(2)``) we can use the operating system's
272 281
support for timeouts on file descriptors (see ``select(2)``). A custom
273 282
condition class will have to be written for this.
274 283

  
275 284
On instantiation the class creates a pipe. After each notification the
276
previous pipe is abandoned and re-created (technically the old pipe needs to
277
stay around until all notifications have been delivered).
285
previous pipe is abandoned and re-created (technically the old pipe
286
needs to stay around until all notifications have been delivered).
278 287

  
279 288
All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to
280
wait for notifications, optionally with a timeout. A notification will be
281
signalled to the waiting clients by closing the pipe. If the pipe wasn't
282
closed during the timeout, the waiting function returns to its caller
283
nonetheless.
289
wait for notifications, optionally with a timeout. A notification will
290
be signalled to the waiting clients by closing the pipe. If the pipe
291
wasn't closed during the timeout, the waiting function returns to its
292
caller nonetheless.
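
A simplified sketch of such a pipe-based condition (assumed class name;
it ignores the interaction with an outer mutex and keeps the
bookkeeping of old pipes to a minimum)::

  import os
  import select
  import threading

  class _PipeConditionSketch(object):
    def __init__(self):
      self.__lock = threading.Lock()
      (self.__read_fd, self.__write_fd) = os.pipe()

    def wait(self, timeout=None):
      # Block in select(2) until the pipe is closed by a notification
      # or the timeout (in seconds, None meaning "forever") expires.
      with self.__lock:
        read_fd = self.__read_fd
      select.select([read_fd], [], [], timeout)

    def notifyAll(self):
      with self.__lock:
        # Closing the write end wakes up every waiter currently
        # sleeping in select(2) on the read end.
        os.close(self.__write_fd)
        # The old read end must stay open until all waiters have woken
        # up; this sketch simply leaks it. Future waiters use the new
        # pipe created here.
        (self.__read_fd, self.__write_fd) = os.pipe()
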
284 293

  
285 294

  
286 295
Feature changes
......
291 300

  
292 301
Current State and shortcomings
293 302
++++++++++++++++++++++++++++++
294
In Ganeti 2.0 all nodes are equal, but some are more equal than others. In
295
particular they are divided between "master", "master candidates" and "normal".
296
(Moreover they can be offline or drained, but this is not important for the
297
current discussion). In general the whole configuration is only replicated to
298
master candidates, and some partial information is spread to all nodes via
299
ssconf.
300

  
301
This change was done so that the most frequent Ganeti operations didn't need to
302
contact all nodes, and so clusters could become bigger. If we want more
303
information to be available on all nodes, we need to add more ssconf values,
304
which is counter-balancing the change, or to talk with the master node, which
305
is not designed to happen now, and requires its availability.
306

  
307
Information such as the instance->primary_node mapping will be needed on all
308
nodes, and we also want to make sure services external to the cluster can query
309
this information as well. This information must be available at all times, so
310
we can't query it through RAPI, which would be a single point of failure, as
311
it's only available on the master.
303

  
304
In Ganeti 2.0 all nodes are equal, but some are more equal than others.
305
In particular they are divided into "master", "master candidates" and
306
"normal".  (Moreover they can be offline or drained, but this is not
307
important for the current discussion). In general the whole
308
configuration is only replicated to master candidates, and some partial
309
information is spread to all nodes via ssconf.
310

  
311
This change was done so that the most frequent Ganeti operations didn't
312
need to contact all nodes, and so clusters could become bigger. If we
313
want more information to be available on all nodes, we need to add more
314
ssconf values, which counteracts that change, or to talk with the
315
master node, which is not designed to happen today and requires the
316
master to be available.
317

  
318
Information such as the instance->primary_node mapping will be needed on
319
all nodes, and we also want to make sure services external to the
320
cluster can query this information as well. This information must be
321
available at all times, so we can't query it through RAPI, which would
322
be a single point of failure, as it's only available on the master.
312 323

  
313 324

  
314 325
Proposed changes
315 326
++++++++++++++++
316 327

  
317 328
In order to allow fast and highly available read-only access to some
318
configuration values, we'll create a new ganeti-confd daemon, which will run on
319
master candidates. This daemon will talk via UDP, and authenticate messages
320
using HMAC with a cluster-wide shared key. This key will be generated at
321
cluster init time, and stored on the clusters alongside the ganeti SSL keys,
322
and readable only by root.
323

  
324
An interested client can query a value by making a request to a subset of the
325
cluster master candidates. It will then wait to get a few responses, and use
326
the one with the highest configuration serial number. Since the configuration
327
serial number is increased each time the ganeti config is updated, and the
328
serial number is included in all answers, this can be used to make sure to use
329
the most recent answer, in case some master candidates are stale or in the
330
middle of a configuration update.
329
configuration values, we'll create a new ganeti-confd daemon, which will
330
run on master candidates. This daemon will talk via UDP, and
331
authenticate messages using HMAC with a cluster-wide shared key. This
332
key will be generated at cluster init time, and stored on the cluster
333
alongside the ganeti SSL keys, and readable only by root.
334

  
335
An interested client can query a value by making a request to a subset
336
of the cluster master candidates. It will then wait to get a few
337
responses, and use the one with the highest configuration serial number.
338
Since the configuration serial number is increased each time the ganeti
339
config is updated, and the serial number is included in all answers,
340
this can be used to make sure to use the most recent answer, in case
341
some master candidates are stale or in the middle of a configuration
342
update.
331 343

  
332 344
In order to prevent replay attacks queries will contain the current unix
333 345
timestamp according to the client, and the server will verify that its
334
timestamp is in the same 5 minutes range (this requires synchronized clocks,
335
which is a good idea anyway). Queries will also contain a "salt" which they
336
expect the answers to be sent with, and clients are supposed to accept only
337
answers which contain salt generated by them.
346
timestamp is within the same 5-minute range (this requires synchronized
347
clocks, which is a good idea anyway). Queries will also contain a "salt"
348
which they expect the answers to be sent with, and clients are supposed
349
to accept only answers which contain the salt they generated.
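
An illustrative client-side sketch of putting such a query together,
following the field descriptions below; the JSON envelope, the SHA-1
digest and the helper itself are assumptions rather than the confirmed
wire format::

  import hmac
  import json
  import time
  from hashlib import sha1   # digest choice is an assumption

  def build_confd_query(cluster_hmac_key, query_type, query, rsalt):
    msg = json.dumps({
      "protocol": 1,        # constants.CONFD_PROTOCOL_VERSION
      "type": query_type,   # e.g. the "node role by name" constant
      "query": query,       # the search key (an IP, a node name, ...)
      "rsalt": rsalt,       # salt the answer has to be sent back with
      })
    salt = str(int(time.time()))  # lets servers reject stale queries
    signed = (salt + msg).encode("utf-8")
    signature = hmac.new(cluster_hmac_key.encode("utf-8"), signed,
                         sha1).hexdigest()
    return json.dumps({"msg": msg, "salt": salt, "hmac": signature})
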
338 350

  
339 351
The configuration daemon will be able to answer simple queries such as:
340 352

  
......
364 376

  
365 377
  - 'protocol', integer, is the confd protocol version (initially just
366 378
    constants.CONFD_PROTOCOL_VERSION, with a value of 1)
367
  - 'type', integer, is the query type. For example "node role by name" or
368
    "node primary ip by instance ip". Constants will be provided for the actual
369
    available query types.
370
  - 'query', string, is the search key. For example an ip, or a node name.
371
  - 'rsalt', string, is the required response salt. The client must use it to
372
    recognize which answer it's getting.
373

  
374
- 'salt' must be the current unix timestamp, according to the client. Servers
375
  can refuse messages which have a wrong timing, according to their
376
  configuration and clock.
379
  - 'type', integer, is the query type. For example "node role by name"
380
    or "node primary ip by instance ip". Constants will be provided for
381
    the actual available query types.
382
  - 'query', string, is the search key. For example an ip, or a node
383
    name.
384
  - 'rsalt', string, is the required response salt. The client must use
385
    it to recognize which answer it's getting.
386

  
387
- 'salt' must be the current unix timestamp, according to the client.
388
  Servers can refuse messages which have a wrong timing, according to
389
  their configuration and clock.
377 390
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key
378 391

  
379
If an answer comes back (which is optional, since confd works over UDP) it will
380
be in this format::
392
If an answer comes back (which is optional, since confd works over UDP)
393
it will be in this format::
381 394

  
382 395
  {
383 396
    "msg": "{\"status\": 0,
......
394 407

  
395 408
  - 'protocol', integer, is the confd protocol version (initially just
396 409
    constants.CONFD_PROTOCOL_VERSION, with a value of 1)
397
  - 'status', integer, is the error code. Initially just 0 for 'ok' or '1' for
398
    'error' (in which case answer contains an error detail, rather than an
399
    answer), but in the future it may be expanded to have more meanings (eg: 2,
400
    the answer is compressed)
401
  - 'answer', is the actual answer. Its type and meaning is query specific. For
402
    example for "node primary ip by instance ip" queries it will be a string
403
    containing an IP address, for "node role by name" queries it will be an
404
    integer which encodes the role (master, candidate, drained, offline)
405
    according to constants.
406

  
407
- 'salt' is the requested salt from the query. A client can use it to recognize
408
  what query the answer is answering.
410
  - 'status', integer, is the error code. Initially just 0 for 'ok' or
411
    1 for 'error' (in which case the answer contains an error detail,
412
    rather than an answer), but in the future it may be expanded to have
413
    more meanings (e.g. 2, the answer is compressed)
414
  - 'answer' is the actual answer. Its type and meaning are query
415
    specific. For example for "node primary ip by instance ip" queries
416
    it will be a string containing an IP address, for "node role by
417
    name" queries it will be an integer which encodes the role (master,
418
    candidate, drained, offline) according to constants.
419

  
420
- 'salt' is the requested salt from the query. A client can use it to
421
  recognize what query the answer is answering.
409 422
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key
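
Based on the answer fields described above, a client-side check could
look like this sketch (the JSON envelope, the SHA-1 digest and the
``serial`` field name are assumptions)::

  import hmac
  import json
  from hashlib import sha1   # digest choice is an assumption

  def check_confd_answer(cluster_hmac_key, expected_rsalt, raw_answer):
    envelope = json.loads(raw_answer)
    signed = (envelope["salt"] + envelope["msg"]).encode("utf-8")
    expected = hmac.new(cluster_hmac_key.encode("utf-8"), signed,
                        sha1).hexdigest()
    if envelope["hmac"] != expected:
      return None    # forged or corrupted answer, drop it
    if envelope["salt"] != expected_rsalt:
      return None    # not an answer to our query
    msg = json.loads(envelope["msg"])
    if msg["status"] != 0:
      return None    # server-side error
    return msg

  def pick_most_recent(verified_answers):
    # Use the answer carrying the highest configuration serial number
    # ("serial" is an assumed field name).
    return max(verified_answers, key=lambda msg: msg["serial"])
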
410 423

  
411 424

  
......
414 427

  
415 428
Current State and shortcomings
416 429
++++++++++++++++++++++++++++++
... This diff was truncated because it exceeds the maximum size that can be displayed.
