+
+Locking improvements
+~~~~~~~~~~~~~~~~~~~~
+
+Current State and shortcomings
+++++++++++++++++++++++++++++++
+
+The class ``LockSet`` (see ``lib/locking.py``) is a container for one or
+many ``SharedLock`` instances. It provides an interface to add/remove
+locks and to acquire and subsequently release any number of those locks
+contained in it.
+
+Locks in a ``LockSet`` are always acquired in alphabetic order. Due to
+the way we're using locks for nodes and instances (the single cluster
+lock isn't affected by this issue), this can lead to long delays when
+acquiring locks: an operation needing only a single lock can be stuck
+behind another operation that is trying to acquire multiple locks and
+is itself waiting for yet another operation.
+
+In the following demonstration, assume we have the instance locks
+``inst1``, ``inst2``, ``inst3`` and ``inst4``.
+
+#. Operation A grabs the lock for instance ``inst4``.
+#. Operation B wants to acquire all instance locks in alphabetic order,
+ but it has to wait for ``inst4``.
+#. Operation C tries to lock ``inst1``, but it has to wait until
+ Operation B (which is trying to acquire all locks) releases the lock
+ again.
+#. Operation A finishes and releases the lock on ``inst4``. Operation B can
+ continue and eventually releases all locks.
+#. Operation C can get ``inst1`` lock and finishes.
+
+Technically there's no need for Operation C to wait for Operation A, and
+subsequently Operation B, to finish. Operation B can't continue until
+Operation A is done (it has to wait for ``inst4``), anyway.
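+
+This can be reproduced outside of Ganeti with plain Python locks. The
+following toy script, which is not Ganeti code, merely illustrates the
+ordering problem described above::
+
+  import threading
+  import time
+
+  locks = dict(("inst%s" % i, threading.Lock()) for i in range(1, 5))
+
+  def op_a():
+    # Operation A: holds inst4 for a while
+    locks["inst4"].acquire()
+    time.sleep(2)
+    locks["inst4"].release()
+
+  def op_b():
+    # Operation B: acquires all locks in alphabetic order and therefore
+    # blocks on inst4 while already holding inst1, inst2 and inst3
+    for name in sorted(locks):
+      locks[name].acquire()
+    for name in locks:
+      locks[name].release()
+
+  def op_c():
+    # Operation C: only needs inst1, but has to wait until B releases it
+    locks["inst1"].acquire()
+    locks["inst1"].release()
+
+  for func in (op_a, op_b, op_c):
+    threading.Thread(target=func).start()
+    time.sleep(0.1)  # force the A, B, C ordering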
+
+Proposed changes
+++++++++++++++++
+
+Non-blocking lock acquiring
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Currently, locks for OpCode execution are always acquired in blocking
+mode: the acquiring calls do not return until the lock has successfully
+been acquired (or an error occurred, although we won't cover that case
+here).
+
+It must become possible to acquire ``SharedLock`` and ``LockSet`` in a
+non-blocking way: they must support a timeout and abort trying to
+acquire the lock(s) after the specified amount of time.
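+
+A rough sketch of the intended usage, assuming ``lock`` is a
+``SharedLock`` instance; the parameter names (``shared``, ``timeout``)
+and the boolean return value are assumptions about the final interface,
+not its definitive form::
+
+  def try_shared(lock):
+    # A false return value means the timeout expired and the lock was
+    # not acquired
+    if not lock.acquire(shared=1, timeout=10.0):
+      return False
+    try:
+      pass  # work with the lock held in shared mode
+    finally:
+      lock.release()
+    return True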
+
+Retry acquiring locks
+^^^^^^^^^^^^^^^^^^^^^
+
+To prevent other operations from waiting for a long time, as described
+in the demonstration above, ``LockSet`` must not hold on to locks for a
+prolonged period of time while trying to acquire two or more locks.
+Instead, if it fails to acquire all requested locks within a timeout,
+it should release all locks acquired so far, sleep for some time and
+then retry with an increased timeout.
+
+A good timeout value needs to be determined. In any case, ``LockSet``
+should fall back to acquiring the locks in blocking mode after a few
+unsuccessful attempts to acquire all requested locks.
+
+One proposal for the timeout is to use ``2**tries`` seconds, where
+``tries`` is the number of unsuccessful tries.
+
+In the demonstration before this would allow Operation C to continue
+after Operation B unsuccessfully tried to acquire all locks and released
+all acquired locks (``inst1``, ``inst2`` and ``inst3``) again.
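+
+The following sketch shows how the retry logic could look. The
+interface used here (an ``acquire`` method taking a timeout, returning
+a false value on failure and releasing any partially acquired locks
+itself) is an assumption for illustration only::
+
+  import time
+
+  def acquire_with_retries(lockset, names, max_tries=5):
+    for tries in range(max_tries):
+      timeout = 2 ** tries  # proposed increasing timeout in seconds
+      if lockset.acquire(names, timeout=timeout):
+        return  # got all requested locks
+      # Failed to get all locks in time; the locks acquired so far
+      # were released again, so sleep a bit and then retry
+      time.sleep(1)
+    # After a few unsuccessful attempts, fall back to blocking mode
+    lockset.acquire(names)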
+
+Other solutions discussed
++++++++++++++++++++++++++
+
+There was also some discussion on going one step further and extending
+the job queue (see ``lib/jqueue.py``) to select the next task for a
+worker depending on whether it can acquire the necessary locks. While
+this may reduce the number of necessary worker threads and/or increase
+throughput on large clusters with many jobs, it also brings many
+potential problems with it, such as contention and increased memory
+usage. Since this would be an extension of the changes proposed before,
+it could be implemented at a later point in time; for now we decided to
+stay with the simpler solution.
+
+Implementation details
+++++++++++++++++++++++
+
+``SharedLock`` redesign
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The current design of ``SharedLock`` does not lend itself to supporting
+timeouts when acquiring a lock, and it also has minor fairness issues.
+We plan to address both with a redesign. A proof of concept
+implementation was written and resulted in significantly simpler code.
+
+Currently ``SharedLock`` uses two separate queues for shared and
+exclusive acquires, and waiters get to run in turns. This means that if
+an exclusive acquire is released, the lock will allow shared waiters to
+run, and vice versa. Although it's still fair in the end, there is a
+slight bias towards shared waiters in the current implementation. The
+same implementation with two separate queues cannot support timeouts
+without adding a lot of complexity.
+
+Our proposed redesign changes ``SharedLock`` to have only one single
+queue. There will be one condition (see Condition_ for a note about
+performance) in the queue per exclusive acquire and two for all shared
+acquires (see below for an explanation). The maximum queue length will
+always be ``2 + (number of exclusive acquires waiting)``. The number of
+queue entries for shared acquires can vary from 0 to 2.
+
+The two conditions for shared acquires are a bit special. They will be
+used in turn. When the lock is instantiated, no conditions are in the
+queue. As soon as the first shared acquire arrives (and there are
+holder(s) or waiting acquires; see Acquire_), the active condition is
+added to the queue. Until it becomes the topmost condition in the queue
+and has been notified, any shared acquire is added to this active
+condition. When the active condition is notified, the conditions are
+swapped and further shared acquires are added to the previously inactive
+condition (which has now become the active condition). After all waiters
+on the previously active (now inactive) and now notified condition
+received the notification, it is removed from the queue of pending
+acquires.
+
+This means shared acquires will skip any exclusive acquire in the queue.
+We believe it's better to improve parallelization on operations only
+asking for shared (or read-only) locks. Exclusive operations holding the
+same lock cannot be parallelized.
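+
+To make the description above more concrete, here is a rough skeleton
+of the proposed data layout. All names are illustrative assumptions,
+and ``threading.Condition`` merely stands in for whatever condition
+class ends up being used (see Condition_ below)::
+
+  import collections
+  import threading
+
+  class SharedLockSketch(object):
+    def __init__(self):
+      self.__lock = threading.Lock()
+      # Single queue of pending acquires; every entry is a condition.
+      # Exclusive acquires each get an entry of their own, all shared
+      # acquires share the currently active one of the two shared
+      # conditions, hence the maximum queue length of
+      # 2 + (number of exclusive acquires waiting).
+      self.__pending = collections.deque()
+      # The two conditions used in turn for shared acquires
+      self.__active_shr = threading.Condition(self.__lock)
+      self.__inactive_shr = threading.Condition(self.__lock)
+      # Current holders: a set of shared owners or one exclusive owner
+      self.__shared_owners = set()
+      self.__exclusive_owner = None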
+
+
+Acquire
+*******
+
+For exclusive acquires a new condition is created and appended to the
+queue. Shared acquires are added to the active shared-acquire
+condition; if that condition is not yet on the queue, it is appended.
+
+The next step is to wait for our condition to be at the top of the
+queue (to guarantee fairness). If the timeout expires, we return to the
+caller without acquiring the lock. On every notification we check
+whether the lock has been deleted, in which case an error is returned
+to the caller.
+
+The lock can be acquired if we're on top of the queue (there is no one
+else ahead of us). For an exclusive acquire, there must not be other
+exclusive or shared holders. For a shared acquire, there must not be an
+exclusive holder. If these conditions are all true, the lock is
+acquired and we return to the caller. In any other case we wait again on
+the condition.
+
+If it was the last waiter on a condition, the condition is removed from
+the queue.
+
+Optimization: There's no need to touch the queue if there are no pending
+acquires and no current holders. The caller can have the lock
+immediately.
+
+.. digraph:: "design-2.1-lock-acquire"
+
+ graph[fontsize=8, fontname="Helvetica"]
+ node[fontsize=8, fontname="Helvetica", width="0", height="0"]
+ edge[fontsize=8, fontname="Helvetica"]
+
+ /* Actions */
+ abort[label="Abort\n(couldn't acquire)"]
+ acquire[label="Acquire lock"]
+ add_to_queue[label="Add condition to queue"]
+ wait[label="Wait for notification"]
+ remove_from_queue[label="Remove from queue"]
+
+ /* Conditions */
+ alone[label="Empty queue\nand can acquire?", shape=diamond]
+ have_timeout[label="Do I have\ntimeout?", shape=diamond]
+ top_of_queue_and_can_acquire[
+ label="On top of queue and\ncan acquire lock?",
+ shape=diamond,
+ ]
+
+ /* Lines */
+ alone->acquire[label="Yes"]
+ alone->add_to_queue[label="No"]
+
+ have_timeout->abort[label="Yes"]
+ have_timeout->wait[label="No"]
+
+ top_of_queue_and_can_acquire->acquire[label="Yes"]
+ top_of_queue_and_can_acquire->have_timeout[label="No"]
+
+ add_to_queue->wait
+ wait->top_of_queue_and_can_acquire
+ acquire->remove_from_queue
+
+Release
+*******
+
+First the lock removes the caller from the internal owner list. If there
+are pending acquires in the queue, the first (the oldest) condition is
+notified.
+
+If the first condition was the active condition for shared acquires, the
+inactive condition will be made active. This ensures fairness with
+exclusive locks by forcing consecutive shared acquires to wait in the
+queue.
+
+.. digraph:: "design-2.1-lock-release"
+
+ graph[fontsize=8, fontname="Helvetica"]
+ node[fontsize=8, fontname="Helvetica", width="0", height="0"]
+ edge[fontsize=8, fontname="Helvetica"]
+
+ /* Actions */
+ remove_from_owners[label="Remove from owner list"]
+ notify[label="Notify topmost"]
+ swap_shared[label="Swap shared conditions"]
+ success[label="Success"]
+
+ /* Conditions */
+ have_pending[label="Any pending\nacquires?", shape=diamond]
+ was_active_queue[
+ label="Was active condition\nfor shared acquires?",
+ shape=diamond,
+ ]
+
+ /* Lines */
+ remove_from_owners->have_pending
+
+ have_pending->notify[label="Yes"]
+ have_pending->success[label="No"]
+
+ notify->was_active_queue
+
+ was_active_queue->swap_shared[label="Yes"]
+ was_active_queue->success[label="No"]
+
+ swap_shared->success
+
+
+Delete
+******
+
+The caller must either hold the lock in exclusive mode already or the
+lock must be acquired in exclusive mode. Trying to delete a lock while
+it's held in shared mode must fail.
+
+After ensuring the lock is held in exclusive mode, the lock will mark
+itself as deleted and continue to notify all pending acquires. They will
+wake up, notice the deleted lock and return an error to the caller.
+
+
+Condition
+^^^^^^^^^
+
+Note: This is not necessary for the locking changes above, but it may be
+a good optimization (pending performance tests).
+
+The existing locking code in Ganeti 2.0 uses Python's built-in
+``threading.Condition`` class. Unfortunately ``Condition`` implements
+timeouts by sleeping 1ms to 20ms between tries to acquire the condition
+lock in non-blocking mode. This requires unnecessary context switches
+and contention on the CPython GIL (Global Interpreter Lock).
+
+By using POSIX pipes (see ``pipe(2)``) we can use the operating system's
+support for timeouts on file descriptors (see ``select(2)``). A custom
+condition class will have to be written for this.
+
+On instantiation the class creates a pipe. After each notification the
+previous pipe is abandoned and re-created (technically the old pipe
+needs to stay around until all notifications have been delivered).
+
+All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to
+wait for notifications, optionally with a timeout. A notification will
+be signalled to the waiting clients by closing the pipe. If the pipe
+wasn't closed during the timeout, the waiting function returns to its
+caller nonetheless.
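+
+A minimal sketch of the pipe-based wait/notify mechanism, using made-up
+names and leaving out the locking and the cleanup of old pipes a real
+condition class needs::
+
+  import os
+  import select
+
+  class PipeNotifier(object):
+    def __init__(self):
+      self._read_fd, self._write_fd = os.pipe()
+
+    def wait(self, timeout=None):
+      # select() returns as soon as the write end has been closed
+      # (notification) or the timeout expires, whichever comes first
+      readable, _, _ = select.select([self._read_fd], [], [], timeout)
+      return bool(readable)  # False means the timeout expired
+
+    def notify_all(self):
+      # Closing the write end makes the read end readable (EOF) and
+      # thereby wakes up all current waiters; a fresh pipe is used for
+      # waiters arriving after this notification
+      os.close(self._write_fd)
+      self._read_fd, self._write_fd = os.pipe()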
+
+
+Node daemon availability
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Current State and shortcomings
+++++++++++++++++++++++++++++++
+
+Currently, when a Ganeti node suffers serious system disk damage, the
+migration/failover of an instance may not correctly shut down the
+virtual machine on the broken node, which can result in duplicated
+instances. The ``gnt-node powercycle`` command can be used to force a
+node reboot and thus to avoid duplicated instances. This command relies
+on node daemon availability, though, and can therefore fail if, for
+example, the node daemon has some pages swapped out of RAM.
+
+
+Proposed changes
+++++++++++++++++
+
+The proposed solution forces the node daemon to run exclusively in RAM.
+It uses Python's ctypes module to call
+``mlockall(MCL_CURRENT | MCL_FUTURE)`` on the node daemon process and
+all its children. In addition, another log handler has been implemented
+for the node daemon to redirect messages that cannot be written to the
+logfile to ``/dev/console``.
+
+With these changes the node daemon can successfully run basic tasks,
+such as a powercycle request, even when the system disk is heavily
+damaged and reading from or writing to disk fails constantly.
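+
+A minimal sketch of the ``mlockall`` call through ctypes. The flag
+values are those from Linux's ``<sys/mman.h>``; the actual node daemon
+code may differ::
+
+  import ctypes
+  import ctypes.util
+  import os
+
+  # Flag values from <sys/mman.h> on Linux
+  MCL_CURRENT = 1
+  MCL_FUTURE = 2
+
+  def lock_into_ram():
+    # Lock all current and future pages of the calling process into RAM
+    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
+    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
+      err = ctypes.get_errno()
+      raise OSError(err, os.strerror(err))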
+
+
+New Features
+------------
+
+Automated Ganeti Cluster Merger
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Current situation
++++++++++++++++++
+
+Currently there's no easy way to merge two or more clusters together,
+but in order to optimize resources this is a needed piece. The goal of
+this design doc is to come up with an easy to use solution which allows
+you to merge two or more clusters together.
+
+Initial contact
++++++++++++++++
+
+As the design of Ganeti is based on an autonomous system, Ganeti by
+itself has no way to reach nodes outside of its cluster. To overcome
+this situation we have to prepare the clusters before we can go ahead
+with the actual merge: we have to replace at least the ssh keys on the
+affected nodes before we can run any ``gnt-`` commands.
+
+To make this an automated process we'll ask the user to provide us
+with the root password of every cluster we have to merge. We use the
+password to grab the current ``id_dsa`` key and then rely on that ssh
+key for all further communication until the clusters are fully merged.
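+
+As an illustration, fetching the key could look like the following
+hypothetical helper, here using the third-party paramiko library; the
+key path and file handling are assumptions::
+
+  import paramiko
+
+  def fetch_ssh_key(master, root_password):
+    # Log in with the root password provided by the user and grab the
+    # cluster's id_dsa key for all further communication
+    client = paramiko.SSHClient()
+    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
+    client.connect(master, username="root", password=root_password)
+    try:
+      sftp = client.open_sftp()
+      sftp.get("/root/.ssh/id_dsa", "%s.id_dsa" % master)
+    finally:
+      client.close()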
+
+Cluster merge
++++++++++++++
+
+After initial contact we do the cluster merge:
+
+1. Grab the list of nodes
+2. On all nodes add our own ``id_dsa.pub`` key to ``authorized_keys``
+3. Stop all instances running on the merging cluster
+4. Disable ``ganeti-watcher`` as it tries to restart Ganeti daemons
+5. Stop all Ganeti daemons on all merging nodes
+6. Grab the ``config.data`` from the master of the merging cluster
+7. Stop local ``ganeti-masterd``
+8. Merge the config:
+
+ 1. Open our own cluster ``config.data``
+ 2. Open cluster ``config.data`` of the merging cluster
+ 3. Grab all nodes of the merging cluster
+ 4. Set ``master_candidate`` to false on all merging nodes
+ 5. Add the nodes to our own cluster ``config.data``
+ 6. Grab all the instances on the merging cluster
+ 7. Adjust the port if the instance has drbd layout:
+
+ 1. In ``logical_id`` (index 2)
+ 2. In ``physical_id`` (index 1 and 3)
+
+ 8. Add the instances to our own cluster ``config.data``
+
+9. Start ``ganeti-masterd`` with ``--no-voting`` ``--yes-do-it``
+10. ``gnt-node add --readd`` on all merging nodes
+11. ``gnt-cluster redist-conf``
+12. Restart ``ganeti-masterd`` normally
+13. Enable ``ganeti-watcher`` again
+14. Start all merging instances again
+
+Rollback
+++++++++
+
+Until we actually (re)add any nodes we can abort and roll back the
+merge at any point. After merging the config, though, we have to
+restore the backup copy of ``config.data`` (from another master
+candidate node). For security reasons it's also a good idea to undo the
+``id_dsa.pub`` distribution by going to every affected node and
+removing the ``id_dsa.pub`` key again. Finally we have to keep in mind
+that the Ganeti daemons must be started and the instances brought up
+again.
+
+Verification
+++++++++++++
+
+Last but not least we should verify that the merge was successful. To
+do so we run ``gnt-cluster verify``, which ensures that the cluster
+overall is in a healthy state. Additionally it's possible to compare
+the list of instances/nodes with a list made prior to the merge to
+make sure we didn't lose any data/instance/node.
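+
+For the comparison, a small helper along these lines could be used,
+assuming the node and instance names were saved to plain text files
+(one name per line) before the merge::
+
+  def check_nothing_lost(before_file, after_file):
+    # Compare a pre-merge list of names with a post-merge one and
+    # complain about anything that disappeared during the merge
+    before = set(open(before_file).read().split())
+    after = set(open(after_file).read().split())
+    missing = before - after
+    if missing:
+      raise Exception("Missing after merge: %s"
+                      % ", ".join(sorted(missing)))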
+
+Appendix
+++++++++
+
+cluster-merge.py
+^^^^^^^^^^^^^^^^
+
+Used to merge the cluster config. This is a POC and might differ from
+actual production code.
+
+::
+
+ #!/usr/bin/python
+
+ import sys
+ from ganeti import config
+ from ganeti import constants
+
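+  # Load our own cluster's configuration and, from the path given on
+  # the command line, the configuration of the cluster being merged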
+ c_mine = config.ConfigWriter(offline=True)
+ c_other = config.ConfigWriter(sys.argv[1])
+
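+  # Add the nodes of the merging cluster to our configuration,
+  # clearing their master candidate flag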
+ fake_id = 0
+ for node in c_other.GetNodeList():
+ node_info = c_other.GetNodeInfo(node)
+ node_info.master_candidate = False
+ c_mine.AddNode(node_info, str(fake_id))
+ fake_id += 1
+
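+  # Add the instances of the merging cluster, allocating a new DRBD
+  # port from our own cluster for each DRBD disk (logical_id index 2,
+  # physical_id indices 1 and 3)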
+ for instance in c_other.GetInstanceList():
+ instance_info = c_other.GetInstanceInfo(instance)
+ for dsk in instance_info.disks:
+ if dsk.dev_type in constants.LDS_DRBD:
+ port = c_mine.AllocatePort()
+ logical_id = list(dsk.logical_id)
+ logical_id[2] = port
+ dsk.logical_id = tuple(logical_id)
+ physical_id = list(dsk.physical_id)
+ physical_id[1] = physical_id[3] = port
+ dsk.physical_id = tuple(physical_id)
+ c_mine.AddInstance(instance_info, str(fake_id))
+ fake_id += 1
+
+