diff --git a/doc/design-2.0.rst b/doc/design-2.0.rst
index 9b40e75..73cfa64 100644
--- a/doc/design-2.0.rst
+++ b/doc/design-2.0.rst
@@ -8,7 +8,7 @@ the 1.2 version.
 The 2.0 version will constitute a rewrite of the 'core' architecture,
 paving the way for additional features in future 2.x versions.
 
-.. contents::
+.. contents:: :depth: 3
 
 Objective
 =========
@@ -332,9 +332,10 @@ and instead have always one logfile per daemon model:
 - rapi-access.log, an additional log file for the RAPI that will be
   in the standard HTTP log format for possible parsing by other tools
 
-Since the `watcher`_ will only submit jobs to the master for startup
-of the instances, its log file will contain less information than
-before, mainly that it will start the instance, but not the results.
+Since the :term:`watcher` will only submit jobs to the master for
+startup of the instances, its log file will contain less information
+than before, mainly that it started the instance, but not the
+results.
 
 Node daemon changes
 +++++++++++++++++++
@@ -799,7 +800,7 @@ The following definitions for instance parameters will be used below:
   a hypervisor parameter (or hypervisor specific parameter) is defined
   as a parameter that is interpreted by the hypervisor support code in
   Ganeti and usually is specific to a particular hypervisor (like the
-  kernel path for `PVM`_ which makes no sense for `HVM`_).
+  kernel path for :term:`PVM` which makes no sense for :term:`HVM`).
 
 :backend parameter:
   a backend parameter is defined as an instance parameter that can be
@@ -841,6 +842,9 @@ Node parameters
 Node-related parameters are very few, and we will continue using the
 same model for these as previously (attributes on the Node object).
 
+There are three new node flags, described in a separate section "node
+flags" below.
+
 Instance parameters
 +++++++++++++++++++
 
@@ -976,6 +980,182 @@ config data while purging the sensitive value.
 E.g. for the drbd shared secrets, we could export these with the
 values replaced by an empty string.
 
+Node flags
+~~~~~~~~~~
+
+Ganeti 2.0 adds three node flags that change the way nodes are handled
+within Ganeti and the related infrastructure (iallocator interaction,
+RAPI data export).
+
+*master candidate* flag
+++++++++++++++++++++++++
+
+Ganeti 2.0 allows more scalability in operation by introducing
+parallelization. However, this exposes a new bottleneck: the
+synchronization and replication of the cluster configuration to all
+nodes in the cluster.
+
+This breaks scalability, as the speed of the replication decreases
+roughly with the number of nodes in the cluster. The goal of the
+master candidate flag is to change this O(n) into O(1) with respect to
+job and configuration data propagation.
+
+Only nodes having this flag set (let's call this set of nodes the
+*candidate pool*) will have jobs and configuration data replicated
+to them.
+
+The cluster will have a new parameter (runtime changeable) called
+``candidate_pool_size`` which represents the number of candidates the
+cluster tries to maintain (preferably automatically).
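+
+As a minimal illustrative sketch of this automatic maintenance (the
+function and attribute names are hypothetical, not the actual Ganeti
+internals; the *offline* and *drained* flags are described below),
+promotion into the pool could look like::
+
+  def MaintainCandidatePool(nodes, candidate_pool_size):
+    """Promote regular nodes until the candidate pool is full."""
+    candidates = [node for node in nodes if node.master_candidate]
+    for node in nodes:
+      if len(candidates) >= candidate_pool_size:
+        break
+      if not (node.master_candidate or node.offline or node.drained):
+        # promotion means that config and job data will from now on
+        # also be replicated to this node
+        node.master_candidate = True
+        candidates.append(node)
+    return candidates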
+
+This will impact the cluster operations as follows:
+
+- jobs and config data will be replicated only to a fixed set of nodes
+- master fail-over will only be possible to a node in the candidate pool
+- cluster verify needs changing to account for these two roles
+- external scripts will no longer have access to the configuration
+  file (this is not recommended anyway)
+
+
+The caveats of this change are:
+
+- if all candidates are lost (completely), the cluster configuration is
+  lost (but it should be backed up external to the cluster anyway)
+
+- failed nodes which are candidates must be dealt with properly, so
+  that we don't lose too many candidates at the same time; this will be
+  reported in cluster verify
+
+- the 'all equal' concept of Ganeti is no longer true
+
+- the partial distribution of config data means that all nodes will
+  have to revert to ssconf files for master info (as in 1.2)
+
+Advantages:
+
+- speed on a simulated cluster of 100+ nodes is greatly enhanced, even
+  for a simple operation; ``gnt-instance remove`` on a diskless
+  instance goes from ~9 seconds to ~2 seconds
+
+- node failure of non-candidates will have less impact on the cluster
+
+The default value for the candidate pool size will be set to 10 but
+this can be changed at cluster creation and modified any time later.
+
+Testing on simulated big clusters with sequential and parallel jobs
+shows that this value (10) is a sweet spot from a performance and load
+point of view.
+
+*offline* flag
+++++++++++++++
+
+In order to better support the situation in which nodes are offline
+(e.g. for repair) without altering the cluster configuration, Ganeti
+needs to be told about, and to properly handle, this node state.
+
+This will result in simpler procedures, and fewer mistakes, when the
+number of node failures is high on an absolute scale (either due to a
+high failure rate or simply due to big clusters).
+
+Nodes having this attribute set will not be contacted for inter-node
+RPC calls, will not be master candidates, and will not be able to host
+instances as primaries.
+
+Setting this attribute on a node:
+
+- will not be allowed if the node is the master
+- will not be allowed if the node has primary instances
+- will cause the node to be demoted from the master candidate role (if
+  it was one), possibly causing another node to be promoted to that
+  role
+
+This attribute will impact the cluster operations as follows:
+
+- querying these nodes for anything will fail instantly in the RPC
+  library, with a specific RPC error (RpcResult.offline == True)
+
+- they will be listed in the Other section of cluster verify
+
+The code is changed in the following ways (a sketch follows this
+list):
+
+- RPC calls were converted to skip such nodes:
+
+  - RpcRunner-instance-based RPC calls are easy to convert
+
+  - static/classmethod RPC calls are harder to convert, and were left
+    alone
+
+- the RPC results were unified so that this new result state (offline)
+  can be differentiated
+
+- master voting still queries nodes in repair, as we need to ensure
+  consistency in case the (wrong) masters have old data, and nodes have
+  come back from repairs
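+
+A minimal sketch of the offline short-circuit described above
+(``RpcResult.offline`` follows the design; ``CallNode`` and
+``_DoNetworkCall`` are illustrative stand-ins, not the actual Ganeti
+code)::
+
+  class RpcResult(object):
+    """Unified result of a single-node RPC call."""
+    def __init__(self, data=None, failed=False, offline=False):
+      self.data = data
+      self.failed = failed
+      self.offline = offline
+
+  def _DoNetworkCall(node, procedure, args):
+    # stand-in for the real node daemon transport
+    raise NotImplementedError
+
+  def CallNode(node, procedure, args):
+    """Dispatch one RPC call, failing instantly for offline nodes."""
+    if node.offline:
+      # no network traffic at all, just the specific offline error
+      return RpcResult(failed=True, offline=True)
+    return _DoNetworkCall(node, procedure, args)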
+
+Caveats:
+
+- some operation semantics are less clear (e.g. what to do on instance
+  start with an offline secondary?); for now, these will just fail as
+  if the flag is not set (but faster)
+- a 2-node cluster with one node offline needs manual startup of the
+  master with a special flag to skip voting (as the master can't get a
+  quorum there)
+
+One of the advantages of implementing this flag is that it will allow
+future automation tools to put nodes into repair and recover them from
+this state automatically, and the code (should/will) handle this much
+better than just timing out. So, future possible improvements (for
+later versions):
+
+- the watcher will detect nodes which fail RPC calls, attempt to ssh
+  to them, and on failure put them offline
+- the watcher will try to ssh to and query the offline nodes, and take
+  them off the repair list if successful
+
+Alternatives considered: the RPC call model in 2.0 is much nicer by
+default (errors are logged in the background, and job/opcode execution
+is clearer), so we could simply not introduce this. However, having
+this state will make both the codepaths (offline vs. temporary
+failure) and the operational model (it's not a node with errors, but
+an offline node) clearer.
+
+
+*drained* flag
+++++++++++++++
+
+Due to parallel execution of jobs in Ganeti 2.0, we could have the
+following situation:
+
+- gnt-node migrate + failover is run
+- gnt-node evacuate is run, which schedules a long-running 6-opcode
+  job for the node
+- partway through, a new job comes in that runs an iallocator script,
+  which finds the above node empty and a very good candidate
+- gnt-node evacuate has finished, but now it has to be run again, to
+  clean up the above instance(s)
+
+In order to prevent this situation, and to be able to get nodes into
+proper offline status easily, a new *drained* flag was added to the
+nodes.
+
+This flag (which actually means "is being, or was, drained, and is
+expected to go offline") will prevent allocations on the node, but
+otherwise all other operations (start/stop instance, query, etc.) will
+work without any restrictions.
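+
+A minimal sketch of the allocation-side effect (again with
+illustrative names, not the actual iallocator code): drained nodes are
+skipped when selecting allocation targets, while every other operation
+ignores the flag::
+
+  def FilterAllocationTargets(nodes):
+    """Return the nodes that may receive newly allocated instances."""
+    return [node for node in nodes
+            if not (node.drained or node.offline)]
+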
+Interaction between flags
++++++++++++++++++++++++++
+
+While these flags are implemented as separate flags, they are
+mutually exclusive and act together with the master node role as a
+single *node status* value. In other words, a node is only in one of
+these roles at a given time. The lack of any of these flags denotes a
+regular node.
+
+The current node status is visible in the ``gnt-cluster verify``
+output, and the individual flags can be examined via separate fields
+in the ``gnt-node list`` output.
+
+These new flags will be exported both in the iallocator input message
+and via RAPI; see the respective man pages for the exact names.
+
 Feature changes
 ---------------
 
@@ -1378,8 +1558,8 @@ DEBUG_LEVEL
 
 These are only the basic variables we are thinking of now, but more
 may come during the implementation and they will be documented in the
-``ganeti-os-api`` man page. All these variables will be available to
-all scripts.
+:manpage:`ganeti-os-api` man page. All these variables will be
+available to all scripts.
 
 Some scripts will need some more information to work. These will have
 per-script variables, such as for example:
@@ -1812,41 +1992,3 @@ option is::
  to set, string
 :$OPTION: cluster default option, string,
 :$VALUE: cluster default option value, string.
-
-Glossary
-========
-
-Since this document is only a delta from the Ganeti 1.2, there are
-some unexplained terms. Here is a non-exhaustive list.
-
-.. _HVM:
-
-HVM
-  hardware virtualization mode, where the virtual machine is oblivious
-  to the fact that's being virtualized and all the hardware is emulated
-
-.. _LU:
-
-LogicalUnit
-  the code associated with an OpCode, i.e. the code that implements the
-  startup of an instance
-
-.. _opcode:
-
-OpCode
-  a data structure encapsulating a basic cluster operation; for example,
-  start instance, add instance, etc.;
-
-.. _PVM:
-
-PVM
-  para-virtualization mode, where the virtual machine knows it's being
-  virtualized and as such there is no need for hardware emulation
-
-.. _watcher:
-
-watcher
-  ``ganeti-watcher`` is a tool that should be run regularly from cron
-  and takes care of restarting failed instances, restarting secondary
-  DRBD devices, etc. For more details, see the man page
-  ``ganeti-watcher(8)``.