=================
Ganeti 2.1 design
=================

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Ganeti 2.1 will add features to help further automation of cluster
operations, further improve scalability to even bigger clusters, and
make it easier to debug the Ganeti core.

Background
==========

Detailed design
===============

As for 2.0 we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (eg. command line, os api, hooks, ...)

Core changes
------------

Storage units modelling
~~~~~~~~~~~~~~~~~~~~~~~

Currently, Ganeti has a good model of the block devices for instances
(e.g. LVM logical volumes, files, DRBD devices, etc.) but none of the
storage pools that are providing the space for these front-end
devices. For example, there are hardcoded inter-node RPC calls for
volume group listing, file storage creation/deletion, etc.

The storage units framework will implement a generic handling for all
kinds of storage backends:

- LVM physical volumes
- LVM volume groups
- File-based storage directories
- any other future storage method

There will be a generic list of methods that each storage unit type
will provide, like:

- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage unit
type, for example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion
- VG consistency fixing

This will allow a much better modeling and unification of the various
RPC calls related to backend storage pools in the future. Ganeti 2.1
is intended to add the basics of the framework, and not necessarily
move all the current VG/file-based operations to it.

Note that while we model both LVM PVs and LVM VGs, the framework will
**not** model any relationship between the different types. In other
words, we model neither inheritance nor stacking, since this is too
complex for our needs. While a ``vgreduce`` operation on a LVM VG
could actually remove a PV from it, this will not be handled at the
framework level, but at individual operation level. The goal is that
this is a lightweight framework, for abstracting the different storage
operations, and not for modelling the storage hierarchy.

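To make the shape of the framework more concrete, a minimal sketch of
such a storage unit abstraction follows; all class and method names
here are illustrative only and are not part of the current code base::

  class _Base(object):
    """Base class for a storage unit type (PV, VG, file directory, ...)."""

    def List(self):
      """Returns a list of all storage units of this type on this node."""
      raise NotImplementedError()

    def GetStatus(self, name):
      """Returns the status of the given storage unit."""
      raise NotImplementedError()

    def Modify(self, name, changes):
      """Applies type-specific changes to a storage unit."""
      raise NotImplementedError()


  class LvmPvStorage(_Base):
    """LVM physical volumes; would wrap pvs/pvchange invocations."""

    def Modify(self, name, changes):
      # e.g. changes = {"allocatable": False} would translate into a
      # "pvchange -x n" call to disable allocations on that PV
      raise NotImplementedError()

A file-based storage class would implement directory creation and
deletion in the same way, and an LVM VG class the consistency fixing.
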
Locking improvements
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The class ``LockSet`` (see ``lib/locking.py``) is a container for one
or many ``SharedLock`` instances. It provides an interface to
add/remove locks and to acquire and subsequently release any number of
those locks contained in it.

Locks in a ``LockSet`` are always acquired in alphabetic order. Due to
the way we're using locks for nodes and instances (the single cluster
lock isn't affected by this issue) this can lead to long delays when
acquiring locks if another operation tries to acquire multiple locks
but has to wait for yet another operation.

In the following demonstration we assume to have the instance locks
``inst1``, ``inst2``, ``inst3`` and ``inst4``.

#. Operation A grabs lock for instance ``inst4``.
#. Operation B wants to acquire all instance locks in alphabetic
   order, but it has to wait for ``inst4``.
#. Operation C tries to lock ``inst1``, but it has to wait until
   Operation B (which is trying to acquire all locks) releases the
   lock again.
#. Operation A finishes and releases lock on ``inst4``. Operation B
   can continue and eventually releases all locks.
#. Operation C can get ``inst1`` lock and finishes.

Technically there's no need for Operation C to wait for Operation A,
and subsequently Operation B, to finish. Operation B can't continue
until Operation A is done (it has to wait for ``inst4``), anyway.

Proposed changes
++++++++++++++++

Non-blocking lock acquiring
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Acquiring locks for OpCode execution is always done in blocking mode.
They won't return until the lock has successfully been acquired (or an
error occurred, although we won't cover that case here).

``SharedLock`` and ``LockSet`` must be able to be acquired in a
non-blocking way. They must support a timeout and abort trying to
acquire the lock(s) after the specified amount of time.

Retry acquiring locks
^^^^^^^^^^^^^^^^^^^^^

To prevent other operations from waiting for a long time, such as
described in the demonstration before, ``LockSet`` must not keep locks
for a prolonged period of time when trying to acquire two or more
locks. Instead it should, with an increasing timeout for acquiring all
locks, release all locks again and sleep some time if it fails to
acquire all requested locks.

A good timeout value needs to be determined. In any case ``LockSet``
should proceed to acquire locks in blocking mode after a few
(unsuccessful) attempts to acquire all requested locks.

One proposal for the timeout is to use ``2**tries`` seconds, where
``tries`` is the number of unsuccessful tries; a sketch of such a
retry loop is shown below, after the discussion of alternative
solutions.

In the demonstration before this would allow Operation C to continue
after Operation B unsuccessfully tried to acquire all locks and
released all acquired locks (``inst1``, ``inst2`` and ``inst3``)
again.

Other solutions discussed
+++++++++++++++++++++++++

There was also some discussion on going one step further and extending
the job queue (see ``lib/jqueue.py``) to select the next task for a
worker depending on whether it can acquire the necessary locks. While
this may reduce the number of necessary worker threads and/or increase
throughput on large clusters with many jobs, it also brings many
potential problems, such as contention and increased memory usage,
with it. As this would be an extension of the changes proposed before
it could be implemented at a later point in time, but we decided to
stay with the simpler solution for now.

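The retry proposal above could look roughly like the following sketch,
assuming the non-blocking acquire support described earlier; the
``timeout`` keyword and the return value convention are assumptions of
this example, not the final ``LockSet`` interface::

  import time

  def AcquireWithRetries(lockset, names, max_tries=5):
    """Sketch of the proposed LockSet retry loop, not the actual code."""
    for tries in range(max_tries):
      # Hypothetical non-blocking call: returns the acquired names, or
      # None if not all locks could be acquired before the timeout; in
      # that case LockSet has already released any partial acquisition.
      if lockset.acquire(names, timeout=2 ** tries) is not None:
        return
      # Sleep a bit so that the operations holding the missing locks
      # (and anybody queued behind us) can make progress.
      time.sleep(0.1)
    # After a few unsuccessful attempts, fall back to blocking mode.
    lockset.acquire(names)
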
Implementation details
++++++++++++++++++++++

``SharedLock`` redesign
^^^^^^^^^^^^^^^^^^^^^^^

The current design of ``SharedLock`` is not good for supporting
timeouts when acquiring a lock and there are also minor fairness
issues in it. We plan to address both with a redesign. A proof of
concept implementation was written and resulted in significantly
simpler code.

Currently ``SharedLock`` uses two separate queues for shared and
exclusive acquires and waiters get to run in turns. This means if an
exclusive acquire is released, the lock will allow shared waiters to
run and vice versa. Although it's still fair in the end there is a
slight bias towards shared waiters in the current implementation. The
same implementation with two separate queues can not support timeouts
without adding a lot of complexity.

Our proposed redesign changes ``SharedLock`` to have only one single
queue. There will be one condition (see Condition_ for a note about
performance) in the queue per exclusive acquire and two for all shared
acquires (see below for an explanation). The maximum queue length will
always be ``2 + (number of exclusive acquires waiting)``. The number
of queue entries for shared acquires can vary from 0 to 2.

The two conditions for shared acquires are a bit special. They will be
used in turn. When the lock is instantiated, no conditions are in the
queue. As soon as the first shared acquire arrives (and there are
holder(s) or waiting acquires; see Acquire_), the active condition is
added to the queue. Until it becomes the topmost condition in the
queue and has been notified, any shared acquire is added to this
active condition. When the active condition is notified, the
conditions are swapped and further shared acquires are added to the
previously inactive condition (which has now become the active
condition). After all waiters on the previously active (now inactive)
and now notified condition received the notification, it is removed
from the queue of pending acquires.

This means shared acquires will skip any exclusive acquire in the
queue. We believe it's better to improve parallelization on operations
only asking for shared (or read-only) locks. Exclusive operations
holding the same lock can not be parallelized.


Acquire
*******

For exclusive acquires a new condition is created and appended to the
queue. Shared acquires are added to the active condition for shared
acquires and if the condition is not yet on the queue, it's appended.

The next step is to wait for our condition to be on the top of the
queue (to guarantee fairness). If the timeout expired, we return to
the caller without acquiring the lock. On every notification we check
whether the lock has been deleted, in which case an error is returned
to the caller.

The lock can be acquired if we're on top of the queue (there is no one
else ahead of us). For an exclusive acquire, there must not be other
exclusive or shared holders. For a shared acquire, there must not be
an exclusive holder. If these conditions are all true, the lock is
acquired and we return to the caller. In any other case we wait again
on the condition.

If it was the last waiter on a condition, the condition is removed
from the queue.

Optimization: There's no need to touch the queue if there are no
pending acquires and no current holders. The caller can have the lock
immediately.

.. image:: design-2.1-lock-acquire.png

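The resulting queue ordering can be illustrated with a small,
self-contained toy model; the real lock keeps condition objects in the
queue and handles the active/inactive swap, which is omitted here::

  import collections

  queue = collections.deque()

  def Enqueue(kind):
    """Toy model of how acquires end up in the redesigned queue."""
    if kind == "exclusive":
      # Every exclusive acquire gets its own entry (condition).
      queue.append(("exclusive", object()))
    elif not any(k == "shared" for k, _ in queue):
      # All waiting shared acquires share one entry; it is only appended
      # if not queued yet, so shared acquires arriving later effectively
      # skip the exclusive acquires already waiting in the queue.
      queue.append(("shared", object()))

  for kind in ["exclusive", "shared", "exclusive", "shared"]:
    Enqueue(kind)

  print([k for k, _ in queue])
  # ['exclusive', 'shared', 'exclusive']: both shared acquires wait on
  # the same (second) entry and will be woken up together
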
Release
*******

First the lock removes the caller from the internal owner list. If
there are pending acquires in the queue, the first (the oldest)
condition is notified.

If the first condition was the active condition for shared acquires,
the inactive condition will be made active. This ensures fairness with
exclusive locks by forcing consecutive shared acquires to wait in the
queue.

.. image:: design-2.1-lock-release.png


Delete
******

The caller must either hold the lock in exclusive mode already or the
lock must be acquired in exclusive mode. Trying to delete a lock while
it's held in shared mode must fail.

After ensuring the lock is held in exclusive mode, the lock will mark
itself as deleted and continue to notify all pending acquires. They
will wake up, notice the deleted lock and return an error to the
caller.


Condition
^^^^^^^^^

Note: This is not necessary for the locking changes above, but it may
be a good optimization (pending performance tests).

The existing locking code in Ganeti 2.0 uses Python's built-in
``threading.Condition`` class. Unfortunately ``Condition`` implements
timeouts by sleeping 1ms to 20ms between tries to acquire the
condition lock in non-blocking mode. This requires unnecessary context
switches and contention on the CPython GIL (Global Interpreter Lock).

By using POSIX pipes (see ``pipe(2)``) we can use the operating
system's support for timeouts on file descriptors (see ``select(2)``).
A custom condition class will have to be written for this.

On instantiation the class creates a pipe. After each notification the
previous pipe is abandoned and re-created (technically the old pipe
needs to stay around until all notifications have been delivered).

All waiting clients of the condition use ``select(2)`` or ``poll(2)``
to wait for notifications, optionally with a timeout. A notification
will be signalled to the waiting clients by closing the pipe. If the
pipe wasn't closed during the timeout, the waiting function returns to
its caller nonetheless.

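A rough sketch of such a pipe-based condition, reduced to the waiting
and notification parts (lock handling, the hand-over to a fresh pipe
and error handling are left out), could look like this; it is not the
final implementation::

  import os
  import select

  class PipeCondition(object):
    """Illustrative pipe-based condition, for the idea only."""

    def __init__(self):
      # One pipe per "generation" of waiters; notifying means closing it.
      self._read_fd, self._write_fd = os.pipe()

    def wait(self, timeout):
      """Waits for a notification or until the timeout (seconds) expires."""
      readable, _, _ = select.select([self._read_fd], [], [], timeout)
      # The read end becomes readable (EOF) once the write end is closed.
      return bool(readable)

    def notifyAll(self):
      """Wakes up all current waiters by closing the pipe."""
      os.close(self._write_fd)
      # A real implementation would now create a fresh pipe for future
      # waiters and keep the old read end around until all waiters have
      # actually been woken up.
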
Feature changes
---------------

Ganeti Confd
~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than
others. In particular they are divided between "master", "master
candidates" and "normal". (Moreover they can be offline or drained,
but this is not important for the current discussion). In general the
whole configuration is only replicated to master candidates, and some
partial information is spread to all nodes via ssconf.

This change was done so that the most frequent Ganeti operations
didn't need to contact all nodes, and so clusters could become bigger.
If we want more information to be available on all nodes, we need to
add more ssconf values, which is counter-balancing the change, or to
talk with the master node, which is not designed to happen now, and
requires its availability.

Information such as the instance->primary_node mapping will be needed
on all nodes, and we also want to make sure services external to the
cluster can query this information as well. This information must be
available at all times, so we can't query it through RAPI, which would
be a single point of failure, as it's only available on the master.


Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which
will run on master candidates. This daemon will talk via UDP, and
authenticate messages using HMAC with a cluster-wide shared key. This
key will be generated at cluster init time, and stored on the cluster
alongside the ganeti SSL keys, and readable only by root.

An interested client can query a value by making a request to a subset
of the cluster master candidates. It will then wait to get a few
responses, and use the one with the highest configuration serial
number. Since the configuration serial number is increased each time
the ganeti config is updated, and the serial number is included in all
answers, this can be used to make sure to use the most recent answer,
in case some master candidates are stale or in the middle of a
configuration update.

In order to prevent replay attacks queries will contain the current
unix timestamp according to the client, and the server will verify
that this timestamp is within a 5-minute range of its own (this
requires synchronized clocks, which is a good idea anyway). Queries
will also contain a "salt" which they expect the answers to be sent
with, and clients are supposed to accept only answers which contain a
salt generated by them.

The configuration daemon will be able to answer simple queries such
as:

- master candidates list
- master node
- offline nodes
- instance list
- instance primary nodes

Wire protocol
^^^^^^^^^^^^^

A confd query will look like this, on the wire::

  {
    "msg": "{\"type\": 1,
             \"rsalt\": \"9aa6ce92-8336-11de-af38-001d093e835f\",
             \"protocol\": 1,
             \"query\": \"node1.example.com\"}\n",
    "salt": "1249637704",
    "hmac": "4a4139b2c3c5921f7e439469a0a45ad200aead0f"
  }

Detailed explanation of the various fields:

- 'msg' contains a JSON-encoded query, its fields are:

  - 'protocol', integer, is the confd protocol version (initially just
    constants.CONFD_PROTOCOL_VERSION, with a value of 1)
  - 'type', integer, is the query type. For example "node role by
    name" or "node primary ip by instance ip". Constants will be
    provided for the actual available query types.
  - 'query', string, is the search key. For example an ip, or a node
    name.
  - 'rsalt', string, is the required response salt. The client must
    use it to recognize which answer it's getting.

- 'salt' must be the current unix timestamp, according to the client.
  Servers can refuse messages which have a wrong timing, according to
  their configuration and clock.
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key

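For illustration, a client could build such a query along these lines;
the use of SHA-1 for the HMAC matches the length of the example
signatures but is an assumption of this sketch, as are the function
name and the inline constants::

  import hmac
  import json
  import time
  from hashlib import sha1

  def BuildConfdQuery(hmac_key, query_type, query, rsalt):
    """Sketch of query packet construction, following the format above."""
    msg = json.dumps({
      "protocol": 1,        # constants.CONFD_PROTOCOL_VERSION
      "type": query_type,   # e.g. the "node role by name" constant
      "query": query,       # the search key (an ip, a node name, ...)
      "rsalt": rsalt,       # salt the answer must be signed with
      })
    salt = str(int(time.time()))  # checked by the server against replays
    return json.dumps({
      "msg": msg,
      "salt": salt,
      "hmac": hmac.new(hmac_key, salt + msg, sha1).hexdigest(),
      })
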
For example for "node primary ip by instance ip" queries + it will be a string containing an IP address, for "node role by + name" queries it will be an integer which encodes the role (master, + candidate, drained, offline) according to constants. + +- 'salt' is the requested salt from the query. A client can use it to + recognize what query the answer is answering. +- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key + + Redistribute Config ~~~~~~~~~~~~~~~~~~~ Current State and shortcomings ++++++++++++++++++++++++++++++ -Currently LURedistributeConfig triggers a copy of the updated configuration -file to all master candidates and of the ssconf files to all nodes. There are -other files which are maintained manually but which are important to keep in -sync. These are: + +Currently LURedistributeConfig triggers a copy of the updated +configuration file to all master candidates and of the ssconf files to +all nodes. There are other files which are maintained manually but which +are important to keep in sync. These are: - rapi SSL key certificate file (rapi.pem) (on master candidates) - rapi user/password file rapi_users (on master candidates) -Furthermore there are some files which are hypervisor specific but we may want -to keep in sync: +Furthermore there are some files which are hypervisor specific but we +may want to keep in sync: -- the xen-hvm hypervisor uses one shared file for all vnc passwords, and copies - the file once, during node add. This design is subject to revision to be able - to have different passwords for different groups of instances via the use of - hypervisor parameters, and to allow xen-hvm and kvm to use an equal system to - provide password-protected vnc sessions. In general, though, it would be - useful if the vnc password files were copied as well, to avoid unwanted vnc - password changes on instance failover/migrate. +- the xen-hvm hypervisor uses one shared file for all vnc passwords, and + copies the file once, during node add. This design is subject to + revision to be able to have different passwords for different groups + of instances via the use of hypervisor parameters, and to allow + xen-hvm and kvm to use an equal system to provide password-protected + vnc sessions. In general, though, it would be useful if the vnc + password files were copied as well, to avoid unwanted vnc password + changes on instance failover/migrate. -Optionally the admin may want to also ship files such as the global xend.conf -file, and the network scripts to all nodes. +Optionally the admin may want to also ship files such as the global +xend.conf file, and the network scripts to all nodes. Proposed changes ++++++++++++++++ -RedistributeConfig will be changed to copy also the rapi files, and to call -every enabled hypervisor asking for a list of additional files to copy. We also -may want to add a global list of files on the cluster object, which will be -propagated as well, or a hook to calculate them. If we implement this feature -there should be a way to specify whether a file must be shipped to all nodes or -just master candidates. +RedistributeConfig will be changed to copy also the rapi files, and to +call every enabled hypervisor asking for a list of additional files to +copy. Users will have the possibility to populate a file containing a +list of files to be distributed; this file will be propagated as well. +Such solution is really simple to implement and it's easily usable by +scripts. 
Redistribute Config
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently LURedistributeConfig triggers a copy of the updated
configuration file to all master candidates and of the ssconf files to
all nodes. There are other files which are maintained manually but
which are important to keep in sync. These are:

- rapi SSL key certificate file (rapi.pem) (on master candidates)
- rapi user/password file rapi_users (on master candidates)

Furthermore there are some files which are hypervisor specific but we
may want to keep in sync:

- the xen-hvm hypervisor uses one shared file for all vnc passwords,
  and copies the file once, during node add. This design is subject to
  revision to be able to have different passwords for different groups
  of instances via the use of hypervisor parameters, and to allow
  xen-hvm and kvm to use an equal system to provide password-protected
  vnc sessions. In general, though, it would be useful if the vnc
  password files were copied as well, to avoid unwanted vnc password
  changes on instance failover/migrate.

Optionally the admin may want to also ship files such as the global
xend.conf file, and the network scripts to all nodes.

Proposed changes
++++++++++++++++

RedistributeConfig will be changed to also copy the rapi files, and to
call every enabled hypervisor asking for a list of additional files to
copy. Users will have the possibility to populate a file containing a
list of files to be distributed; this file will be propagated as well.
Such a solution is really simple to implement and it's easily usable
by scripts.

This code will also be shared (via tasklets or by other means, if
tasklets are not ready for 2.1) with the AddNode and SetNodeParams LUs
(so that the relevant files will be automatically shipped to new
master candidates as they are set).

VNC Console Password
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently just the xen-hvm hypervisor supports setting a password to
connect to the instances' VNC console, and has one common password
stored in a file.

This doesn't allow different passwords for different instances/groups
of instances, and makes it necessary to remember to copy the file
around the cluster when the password changes.

Proposed changes
++++++++++++++++

We'll change the VNC password file to a vnc_password_file hypervisor
parameter. This way it can have a cluster default, but also a
different value for each instance. The VNC enabled hypervisors (xen
and kvm) will publish all the password files in use through the
cluster so that a redistribute-config will ship them to all nodes (see
the Redistribute Config proposed changes above).

The current VNC_PASSWORD_FILE constant will be removed, but its value
will be used as the default HV_VNC_PASSWORD_FILE value, thus retaining
backwards compatibility with 2.0.

The code to export the list of VNC password files from the hypervisors
to RedistributeConfig will be shared between the KVM and xen-hvm
hypervisors.

Disk/Net parameters
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently disks and network interfaces have a few tweakable options
and all the rest is left to a default we chose. We're finding that we
need more and more to tweak some of these parameters, for example to
disable barriers for DRBD devices, or allow striping for the LVM
volumes.

Moreover for many of these parameters it will be nice to have
cluster-wide defaults, and then be able to change them per
disk/interface.

Proposed changes
++++++++++++++++

We will add new cluster level diskparams and netparams, which will
contain all the tweakable parameters. All values which have a sensible
cluster-wide default will go into this new structure while parameters
which have unique values will not.

Example of network parameters:
  - mode: bridge/route
  - link: for mode "bridge" the bridge to connect to, for mode route
    it can contain the routing table, or the destination interface

Example of disk parameters:
  - stripe: lvm stripes
  - stripe_size: lvm stripe size
  - meta_flushes: drbd, enable/disable metadata "barriers"
  - data_flushes: drbd, enable/disable data "barriers"

Some parameters are bound to be disk-type specific (drbd, vs lvm, vs
files) or hypervisor specific (nic models for example), but for now
they will all live in the same structure. Each component is supposed
to validate only the parameters it knows about, and ganeti itself will
make sure that no "globally unknown" parameters are added, and that no
parameters have overridden meanings for different components.

The parameters will be kept, as for the BEPARAMS, in a "default"
category, which will allow us to expand on by creating instance
"classes" in the future. Instance classes is not a feature we plan on
implementing in 2.1, though.

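Purely as an illustration (the parameter names are the examples above;
the exact storage format is not part of this design), the
cluster-level defaults could look similar to the existing beparams
structure::

  # Hypothetical cluster object attributes; only the "default"
  # category exists in 2.1, instance "classes" may be added later.
  diskparams = {
    "default": {
      "stripe": 1,              # LVM striping disabled by default
      "stripe_size": 64,        # LVM stripe size
      "meta_flushes": True,     # DRBD metadata "barriers" enabled
      "data_flushes": True,     # DRBD data "barriers" enabled
      },
    }

  netparams = {
    "default": {
      "mode": "bridge",         # or "route"
      "link": "xen-br0",        # bridge name (or routing table/interface
                                # for mode "route")
      },
    }
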
Non bridged instances support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently each instance NIC must be connected to a bridge, and if the
bridge is not specified the default cluster one is used. This makes it
impossible to use the vif-route xen network scripts, or other
alternative mechanisms that don't need a bridge to work.

Proposed changes
++++++++++++++++

The new "mode" network parameter will distinguish between bridged
interfaces and routed ones.

When mode is "bridge" the "link" parameter will contain the bridge the
instance should be connected to, effectively making things as today.
The value has been migrated from a nic field to a parameter to allow
for an easier manipulation of the cluster default.

When mode is "route" the ip field of the interface will become
mandatory, to allow for a route to be set. In the future we may want
also to accept multiple IPs or IP/mask values for this purpose. We
will evaluate possible meanings of the link parameter to signify a
routing table to be used, which would allow for insulation between
instance groups (as today happens for different bridges).

For now we won't add a parameter to specify which network script gets
called for which instance, so in a mixed cluster the network script
must be able to handle both cases. The default kvm vif script will be
changed to do so. (Xen doesn't have a ganeti provided script, so
nothing will be done for that hypervisor.)

Introducing persistent UUIDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Some objects in the Ganeti configuration are tracked by their name
while also supporting renames. This creates an extra difficulty,
because neither Ganeti nor external management tools can then track
the actual entity, and due to the name change it behaves like a new
one.

Proposed changes part 1
+++++++++++++++++++++++

We will change Ganeti to use UUIDs for entity tracking, but in a
staggered way. In 2.1, we will simply add a "uuid" attribute to each
of the instances, nodes and the cluster itself. This will be reported
on instance creation for instances, and on node add for nodes. It will
of course be available for querying via OpQueryNodes/OpQueryInstances
and the cluster information, and via RAPI as well.

Note that Ganeti will not provide any way to change this attribute.

Upgrading from Ganeti 2.0 will automatically add a "uuid" attribute to
all entities missing it.


Proposed changes part 2
+++++++++++++++++++++++

In the next release (e.g. 2.2), the tracking of objects will change
from the name to the UUID internally, and externally Ganeti will
accept both forms of identification; e.g. an RAPI call would be made
either against ``/2/instances/foo.bar`` or against
``/2/instances/bb3b2e42...``. Since an FQDN must have at least a dot,
and dots are not valid characters in UUIDs, we will not have namespace
issues.

Another change here is that node identification (during cluster
operations/queries like master startup, "am I the master?" and
similar) could be done via UUIDs which is more stable than the current
hostname-based scheme.

Internal tracking refers to the way the configuration is stored; a
DRBD disk of an instance refers to the node name (so that IPs can be
changed easily), but this is still a problem for name changes; thus
these will be changed to point to the node UUID to ease renames.

The advantage of this change (after the second round of changes) is
that node rename becomes trivial, whereas today node rename would
require a complete lock of all instances.

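The automatic addition of missing "uuid" attributes described in part
1 could be sketched as follows; the function name and the dict-shaped
configuration are assumptions of this example, not the actual
cfgupgrade code::

  import uuid

  def FillMissingUuids(config_data):
    """Adds a uuid to the cluster and to any node/instance missing one."""
    objects = [config_data["cluster"]]
    objects.extend(config_data["nodes"].values())
    objects.extend(config_data["instances"].values())
    for obj in objects:
      if not obj.get("uuid"):
        # Generated once at upgrade (or creation) time, never changed
        # afterwards.
        obj["uuid"] = str(uuid.uuid4())
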
Automated disk repairs infrastructure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Replacing defective disks in an automated fashion is quite difficult
with the current version of Ganeti. These changes will introduce
additional functionality and interfaces to simplify automating disk
replacements on a Ganeti node.

Fix node volume group
+++++++++++++++++++++

This is the most difficult addition, as it can lead to data loss if
it's not properly safeguarded.

The operation must be done only when all the other nodes that have
instances in common with the target node are fine, i.e. this is the
only node with problems, and also we have to double-check that all
instances on this node have at least a good copy of the data.

This might mean that we have to enhance the GetMirrorStatus calls, and
introduce a smarter version that can tell us more about the status of
an instance.

Stop allocation on a given PV
+++++++++++++++++++++++++++++

This is somewhat simple. First we need a "list PVs" opcode (and its
associated logical unit) and then a set PV status opcode/LU. These in
combination should allow both checking and changing the disk/PV
status.

Instance disk status
++++++++++++++++++++

This new opcode or opcode change must list the instance-disk-index and
node combinations of the instance together with their status. This
will allow determining what part of the instance is broken (if any).

Repair instance
+++++++++++++++

This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed,
in order to fix the instance status. It only affects primary
instances; secondaries can just be moved away.

Migrate node
++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
migrate`` code and run migrate for all instances on the node.

Evacuate node
+++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
evacuate`` code and run replace-secondary with an iallocator script
for all instances on the node.


External interface changes
--------------------------

OS API
~~~~~~

The OS API of Ganeti 2.0 has been built with extensibility in mind.
Since we pass everything as environment variables it's a lot easier to
send new information to the OSes without breaking backwards
compatibility. This section of the design outlines the proposed
extensions to the API and their implementation.

API Version Compatibility Handling
++++++++++++++++++++++++++++++++++

In 2.1 there will be a new OS API version (eg. 15), which should be
mostly compatible with api 10, except for some newly added variables.
Since it's easy not to pass some variables we'll be able to handle
Ganeti 2.0 OSes by just filtering out the newly added pieces of
information. We will still encourage OSes to declare support for the
new API after checking that the new variables don't provide any
conflict for them, and we will drop api 10 support after ganeti 2.1
has been released.

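As an illustration of this filtering, building the OS environment
could look roughly as follows; the variable and attribute names are
assumptions of this sketch, and INSTANCE_HYPERVISOR refers to the new
variable proposed below::

  def BuildOSEnvironment(instance, os_api_version):
    """Sketch of version-dependent OS environment building."""
    env = {
      "OS_API_VERSION": str(os_api_version),
      "INSTANCE_NAME": instance.name,
      }
    if os_api_version >= 15:
      # Newly added variables are simply not exported when running an
      # older (api 10) OS definition.
      env["INSTANCE_HYPERVISOR"] = instance.hypervisor
      # ...the hypervisor parameters would be exported here as well,
      # as an informational "FYI" only
    return env
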
New Environment variables
+++++++++++++++++++++++++

Some variables have never been added to the OS api but would
definitely be useful for the OSes. We plan to add an
INSTANCE_HYPERVISOR variable to allow the OS to make changes relevant
to the virtualization the instance is going to use. Since this field
is immutable for each instance, the OS can tailor the install without
having to make sure the instance can run under any virtualization
technology.

We also want the OS to know the particular hypervisor parameters, to
be able to customize the install even more. Since the parameters can
change, though, we will pass them only as an "FYI": if an OS ties some
instance functionality to the value of a particular hypervisor
parameter manual changes or a reinstall may be needed to adapt the
instance to the new environment. This is not a regression as of today,
because even if the OSes are left blind about this information,
sometimes they still need to make compromises and cannot satisfy all
possible parameter values.

OS Variants
+++++++++++

Currently we are witnessing some degree of "OS proliferation" just to
change a simple installation behavior. This means that the same OS
gets installed on the cluster multiple times, with different names, to
customize just one installation behavior. Usually such OSes try to
share as much as possible through symlinks, but this still causes
complications on the user side, especially when multiple parameters
must be cross-matched.

For example today if you want to install debian etch, lenny or squeeze
you probably need to install the debootstrap OS multiple times,
changing its configuration file, and calling it debootstrap-etch,
debootstrap-lenny or debootstrap-squeeze. Furthermore if you have for
example a "server" and a "development" environment which installs
different packages/configuration files and must be available for all
installs you'll probably end up with debootstrap-etch-server,
debootstrap-etch-dev, debootstrap-lenny-server,
debootstrap-lenny-dev, etc. Crossing more than two parameters quickly
becomes not manageable.

In order to avoid this we plan to make OSes more customizable, by
allowing each OS to declare a list of variants which can be used to
customize it. The variants list is mandatory and must be written, one
variant per line, in the new "variants.list" file inside the main os
dir. At least one variant must be supported. When choosing the OS
exactly one variant will have to be specified, and will be encoded in
the OS name as ``<os>+<variant>``. As for today it will be possible to
change an instance's OS at creation or install time.

The 2.1 OS list will be the combination of each OS, plus its supported
variants. This will cause the name proliferation to remain, but at
least the internal OS code will be simplified to just parsing the
passed variant, without the need for symlinks or code duplication.

Also we expect the OSes to declare only "interesting" variants, but to
accept some non-declared ones which a user will be able to pass in by
overriding the checks ganeti does. This will be useful for allowing
some variations to be used without polluting the OS list (per-OS
documentation should list all supported variants). If a variant which
is not internally supported is forced through, the OS scripts should
abort.

In the future (post 2.1) we may want to move to full fledged
parameters all orthogonal to each other (for example "architecture"
(i386, amd64), "suite" (lenny, squeeze, ...), etc). (As opposed to the
variant, which is a single parameter, and you need a different variant
for all the set of combinations you want to support.) In this case we
envision the variants to be moved inside of Ganeti and be associated
with lists of parameter->value associations, which will then be passed
to the OS.

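As a hypothetical example of the variants mechanism above, a single
debootstrap OS could ship a "variants.list" containing::

  etch
  lenny
  squeeze

and would then show up in the OS list as ``debootstrap+etch``,
``debootstrap+lenny`` and ``debootstrap+squeeze``; exactly one of
these names has to be chosen at instance creation or reinstall time.
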
IAllocator changes
~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The iallocator interface allows creation of instances without manually
specifying nodes, but instead by specifying plugins which will do the
required computations and produce a valid node list.

However, the interface is quite awkward to use:

- one cannot set a 'default' iallocator script
- one cannot use it to easily test if allocation would succeed
- some new functionality, such as rebalancing clusters and calculating
  capacity estimates, is needed

Proposed changes
++++++++++++++++

There are two areas of improvement proposed:

- improving the use of the current interface
- extending the IAllocator API to cover more automation


Default iallocator names
^^^^^^^^^^^^^^^^^^^^^^^^

The cluster will hold, for each type of iallocator, a (possibly empty)
list of modules that will be used automatically.

If the list is empty, the behaviour will remain the same.

If the list has one entry, then ganeti will behave as if
'--iallocator' was specified on the command line, i.e. use this
allocator by default. If the user however passed nodes, those will be
used in preference.

If the list has multiple entries, they will be tried in order until
one gives a successful answer.

Dry-run allocation
^^^^^^^^^^^^^^^^^^

The create instance LU will get a new 'dry-run' option that will just
simulate the placement, and return the chosen node-lists after running
all the usual checks.

Cluster balancing
^^^^^^^^^^^^^^^^^

Instance additions/removals/moves can create a situation where the
load on the nodes is not spread equally. For this, a new iallocator
mode will be implemented called ``balance`` in which the plugin, given
the current cluster state and a maximum number of operations, will
need to compute the instance relocations needed in order to achieve a
"better" (for whatever the script believes is better) cluster.

Cluster capacity calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this mode, called ``capacity``, given an instance specification and
the current cluster state (similar to the ``allocate`` mode), the
plugin needs to return:

- how many instances can be allocated on the cluster with that
  specification
- on which nodes these will be allocated (in order)

.. vim: set textwidth=72 :