--- /dev/null
+==========================
+ Virtual clusters support
+==========================
+
+
+Introduction
+============
+
+Currently there are two ways to test the Ganeti (including HTools) code
+base:
+
+- unittests, which run as a normal user, use mocks, and test small
+  bits of the code
+- QA/burnin/live-test, which require actual hardware (either physical or
+ virtual) and will build an actual cluster, with one machine to one
+ node correspondence
+
+The difference in time between these two is significant:
+
+- the unittests run in about 1-2 minutes
+- a so-called ‘quick’ QA (without burnin) runs in about an hour, and a
+ full QA could be double that time
+
+On one hand, the unittests have a clear advantage: quick to run, not
+requiring many machines, but on the other hand QA is actually able to
+run end-to-end tests (including HTools, for example).
+
+Ideally, we would have an intermediate step between these two extremes:
+be able to test most, if not all, of Ganeti's functionality but without
+requiring actual hardware, full machine ownership or root access.
+
+
+Current situation
+=================
+
+Ganeti
+------
+
+It is possible, given a manually built ``config.data`` and
+``_autoconf.py``, to run the masterd under the current user as a
+single-node cluster master. However, the node daemon and related
+functionality (cluster initialisation, master failover, etc.) are not
+directly runnable in this model.
+
+Also, masterd only works as the master of a single-node cluster, due
+to our current “hostname” method of identifying nodes, which limits us
+to at most one node daemon per machine, unless we use multiple name
+and IP aliases.
+
+HTools
+------
+
+In HTools the situation is better, since it doesn't have to deal with
+actual machine management: all tools can use a custom LUXI path, RAPI
+data can even be loaded from the filesystem (so the RAPI backend can
+be tested), and both the ‘text’ backend for hbal/hspace and the input
+files for hail are text-based, loaded from the file-system.
+
+Proposed changes
+================
+
+The end-goal is to have full support for “virtual clusters”, i.e. to
+be able to run a “big” cluster (hundreds of virtual nodes and up to
+thousands of virtual instances) on a single, reasonably powerful
+machine, under a single user account and without any special
+privileges.
+
+This would have significant advantages:
+
+- being able to test end-to-end certain changes, without requiring a
+ complicated setup
+- being better able to estimate Ganeti's behaviour and performance as
+  the cluster size grows; this is something that we haven't been able
+  to test reliably yet, and as such we have not yet diagnosed all
+  scaling problems
+- easier integration with external tools (and even with HTools)
+
+``masterd``
+-----------
+
+As described above, ``masterd`` already works reasonably well in a
+virtual setup, as it won't execute external programs and it shouldn't
+directly read files from the local filesystem (or at least no
+virtualisation-related ones, as the master node can be a
+non-vm_capable node).
+
+``noded``
+---------
+
+The node daemon executes many privileged operations, but they can be
+split into a few general categories:
+
++---------------+-----------------------+------------------------------------+
+|Category |Description |Solution |
++===============+=======================+====================================+
+|disk operations|Disk creation and |Use only diskless or file-based |
+| |removal |instances |
++---------------+-----------------------+------------------------------------+
+|disk query |Node disk total/free, |Not supported currently, could use |
+| |used in node listing |file-based |
+| |and htools | |
++---------------+-----------------------+------------------------------------+
+|hypervisor |Instance start, stop |Use the *fake* hypervisor |
+|operations |and query | |
++---------------+-----------------------+------------------------------------+
+|instance |Bridge existence query |Unprivileged operation, can be used |
+|networking | |with an existing bridge at system |
+| | |level or use NIC-less instances |
++---------------+-----------------------+------------------------------------+
+|instance OS    |OS add, OS rename,     |Only used with non-diskless         |
+|operations     |export and import      |instances; could work with custom OS|
+|               |                       |scripts (that just ``dd`` without   |
+|               |                       |mounting filesystems)               |
++---------------+-----------------------+------------------------------------+
+|node networking|IP address management |Not supported; Ganeti will need to |
+| |(master ip), IP query, |work without a master IP. For the IP|
+| |etc. |query operations, the test machine |
+| | |would need externally-configured IPs|
++---------------+-----------------------+------------------------------------+
+|node setup |ssh, /etc/hosts, so on |Can already be disabled from the |
+| | |cluster config |
++---------------+-----------------------+------------------------------------+
+|master failover|start/stop the master |Doable (as long as we use a single |
+| |daemon |user), might get tricky w.r.t. paths|
+| | |to executables |
++---------------+-----------------------+------------------------------------+
+|file upload |Uploading of system |The only issue could be with system |
+| |files, job queue files |files, which are not owned by the |
+| |and ganeti config |current user; internal ganeti files |
+| | |should be working fine |
++---------------+-----------------------+------------------------------------+
+|node oob |Out-of-band commands |Since these are user-defined, we can|
+| | |mock them easily |
++---------------+-----------------------+------------------------------------+
+|node OS |List the existing OSes |No special privileges needed, so |
+|discovery |and their properties |works fine as-is |
++---------------+-----------------------+------------------------------------+
+|hooks |Running hooks for given|No special privileges needed |
+| |operations | |
++---------------+-----------------------+------------------------------------+
+|iallocator |Calling an iallocator |No special privileges needed |
+| |script | |
++---------------+-----------------------+------------------------------------+
+|export/import |Exporting and importing|When exporting/importing file-based |
+| |instances |instances, this should work, as the |
+| | |listening ports are dynamically |
+| | |chosen |
++---------------+-----------------------+------------------------------------+
+|hypervisor |The validation of |As long as the hypervisors don't |
+|validation |hypervisor parameters |call to privileged commands, it |
+| | |should work |
++---------------+-----------------------+------------------------------------+
+|node powercycle|The ability to power |Privileged, so not supported, but |
+| |cycle a node remotely |anyway not very interesting for |
+| | |testing |
++---------------+-----------------------+------------------------------------+
+
+It seems that much of the functionality works as is, or could work with
+small adjustments, even in a non-privileged setup. The bigger problem is
+the actual use of multiple node daemons per machine.
+
+Multiple ``noded`` per machine
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Currently Ganeti identifies nodes simply by their hostname. Since
+changing this method would imply significant changes to how nodes are
+tracked, the proposal is to configure the (single) test machine with
+as many IPs as there are virtual nodes, each IP corresponding to a
+different hostname; this way, no changes are needed to the core RPC
+library. Unfortunately this has the downside of requiring root rights
+for setting up the extra IPs and hostnames.
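+
+As a rough sketch of this setup, the extra names and addresses could
+be generated as below; the naming scheme and the 127.0.1.0/24
+loopback range are purely illustrative assumptions, not Ganeti
+conventions:

```python
# Generate (hostname, IP) pairs for N virtual nodes and render the
# matching /etc/hosts entries. The domain name and the 127.0.1.0/24
# range are illustrative choices, not mandated by Ganeti.

def virtual_node_addresses(count, domain="virt.example.com"):
    """Return (hostname, ip) pairs for `count` virtual nodes."""
    return [("node%d.%s" % (i, domain), "127.0.1.%d" % i)
            for i in range(1, count + 1)]

def hosts_file_lines(pairs):
    """Render /etc/hosts entries for the generated aliases."""
    return ["%s %s" % (ip, name) for (name, ip) in pairs]
```

+Writing the rendered lines into ``/etc/hosts`` (and adding the
+addresses with ``ip addr add``) is exactly the step that needs root.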
+
+An alternative option is to implement per-node IP/port support in
+Ganeti (especially in the RPC layer), which would eliminate the need
+for root rights. We expect that this will get implemented as a second
+step of this design.
+
+The only remaining problem is with sharing the ``localstatedir``
+structure (lib, run, log) amongst the daemons, for which we propose to
+add a command line parameter which can override this path (via injection
+into ``_autoconf.py``). The rationale for this is two-fold:
+
+- having two or more node daemons writing to the same directory might
+ introduce artificial scenarios not existent in real life; currently
+ noded either owns the entire ``/var/lib/ganeti`` directory or shares
+ it with masterd, but never with another noded
+- having separate directories allows cluster verify to check correctly
+ consistency of file upload operations; otherwise, as long as one node
+ daemon wrote a file successfully, the results from all others are
+ “lost”
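+
+A minimal sketch of the resulting per-node directory layout (the
+helper name and base path are hypothetical; the lib/run/log split is
+the one described above):

```python
# Create a separate localstatedir tree (lib, run, log) for each
# virtual node, so that no two node daemons share state on disk.
import os

def make_node_localstatedir(base, node, subdirs=("lib", "run", "log")):
    """Create <base>/<node>/{lib,run,log}; return the created paths."""
    paths = [os.path.join(base, node, sub) for sub in subdirs]
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return paths
```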
+
+
+``rapi``
+--------
+
+The RAPI daemon is not privileged and furthermore we only need one per
+cluster, so it presents no issues.
+
+``confd``
+---------
+
+``confd`` has much the same issues as the node daemon regarding
+multiple daemons per machine, but the per-address binding still works.
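+
+That per-address binding can be demonstrated with plain sockets: on
+Linux the whole 127.0.0.0/8 range routes locally, so two daemons can
+share one port number as long as each binds its own loopback address
+(a sketch, not actual ``confd`` code):

```python
# Two listening sockets sharing one port number, each bound to a
# distinct loopback address -- the scheme the per-node daemons use.
import socket

def bind_pair():
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))        # kernel picks a free port
    port = first.getsockname()[1]
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second.bind(("127.0.0.2", port))    # same port, other address
    return first, second, port
```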
+
+``ganeti-watcher``
+------------------
+
+Since the startup of daemons will be customised with per-IP binds, the
+watcher either has to be modified to not activate the daemons, or the
+start-stop tool has to take this into account. Due to the watcher's
+use of the hostname, it is recommended that the master node be set to
+the machine hostname (also a requirement for the master daemon).
+
+CLI scripts
+-----------
+
+As long as the master node is set to the machine hostname, these should
+work fine.
+
+Cluster initialisation
+----------------------
+
+It is possible that the cluster initialisation procedure turns out to
+be a bit more involved (this has not been tried yet). In any case, we
+can build a ``config.data`` file manually, without having to actually
+run ``gnt-cluster init``.
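+
+A hand-built file could start from a skeleton like the one below; the
+key set shown is a simplified illustration only, since the real
+``config.data`` schema is version-dependent and contains many more
+fields:

```python
# Write a minimal, hand-built config.data-style JSON skeleton.
# The key set is a simplified illustration of the real (much larger,
# version-dependent) schema.
import json

def write_skeleton_config(path, master, nodes):
    """nodes is a list of (name, ip) pairs; master must be among them."""
    config = {
        "version": 2000000,  # placeholder, not a real schema version
        "cluster": {"master_node": master},
        "nodes": dict((name, {"primary_ip": ip}) for name, ip in nodes),
        "instances": {},
    }
    with open(path, "w") as out:
        json.dump(config, out, indent=2)
    return config
```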
+
+Needed tools
+============
+
+With the above investigation results in mind, the only things we need
+are:
+
+- a tool to set up a per-virtual-node tree structure of
+  ``localstatedir`` and to configure the extra IP/hostnames correctly
+- changes to the daemon startup tools to launch the daemons correctly
+  per virtual node
+- changes to ``noded`` to override the ``localstatedir`` path
+- documentation for running such a virtual cluster
+- and possibly small fixes to the node daemon backend functionality,
+  to better separate privileged and non-privileged code
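+
+Put together, the per-node daemon startup could look like the sketch
+below; the ``--localstatedir`` option is the override parameter
+proposed above (it does not exist yet) and ``--bind`` stands in for
+the per-IP binding, so both flags are assumptions:

```python
# Compose the noded start-up command line for one virtual node.
# "--localstatedir" is the parameter this design proposes (it does
# not exist yet); "--bind" stands in for per-address binding.
import os

def noded_command(node_name, node_ip, base_dir):
    return [
        "ganeti-noded",
        "--localstatedir=%s" % os.path.join(base_dir, node_name),
        "--bind=%s" % node_ip,
    ]
```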
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End: