Cluster: add nicparams, and update them on upgrade

diff --git a/doc/design-2.0.rst b/doc/design-2.0.rst
index 9b40e75..73cfa64 100644
--- a/doc/design-2.0.rst
+++ b/doc/design-2.0.rst
@@ -8,7 +8,7 @@ the 1.2 version.
  The 2.0 version will constitute a rewrite of the 'core' architecture,
  paving the way for additional features in future 2.x versions.
  
-.. contents::
+.. contents:: :depth: 3
  
  Objective
  =========
@@ -332,9 +332,10 @@ and instead have always one logfile per daemon model:
  - rapi-access.log, an additional log file for the RAPI that will be
    in the standard HTTP log format for possible parsing by other tools
  
-Since the `watcher`_ will only submit jobs to the master for startup
-of the instances, its log file will contain less information than
-before, mainly that it will start the instance, but not the results.
+Since the :term:`watcher` will only submit jobs to the master for
+startup of the instances, its log file will contain less information
+than before, mainly that it will start the instance, but not the
+results.
  
  Node daemon changes
  +++++++++++++++++++
@@ -799,7 +800,7 @@ The following definitions for instance parameters will be used below:
    a hypervisor parameter (or hypervisor specific parameter) is defined
    as a parameter that is interpreted by the hypervisor support code in
    Ganeti and usually is specific to a particular hypervisor (like the
-  kernel path for `PVM`_ which makes no sense for `HVM`_).
+  kernel path for :term:`PVM` which makes no sense for :term:`HVM`).
  
  :backend parameter:
    a backend parameter is defined as an instance parameter that can be
@@ -841,6 +842,9 @@ Node parameters
  Node-related parameters are very few, and we will continue using the
  same model for these as previously (attributes on the Node object).
  
+There are three new node flags, described in a separate section "node
+flags" below.
+
  Instance parameters
  +++++++++++++++++++
  
@@ -976,6 +980,182 @@ config data while purging the sensitive value.
  E.g. for the drbd shared secrets, we could export these with the
  values replaced by an empty string.
  
+Node flags
+~~~~~~~~~~
+
+Ganeti 2.0 adds three node flags that change the way nodes are handled
+within Ganeti and the related infrastructure (iallocator interaction,
+RAPI data export).
+
+*master candidate* flag
++++++++++++++++++++++++
+
+Ganeti 2.0 allows more scalability in operation by introducing
+parallelization. However, this exposes a new bottleneck: the
+synchronization and replication of cluster configuration to all nodes
+in the cluster.
+
+This breaks scalability, as replication slows down roughly in
+proportion to the number of nodes in the cluster. The goal of the
+master candidate flag is to change this O(n) into O(1) with respect to
+job and configuration data propagation.
+
+Only nodes having this flag set (let's call this set of nodes the
+*candidate pool*) will have jobs and configuration data replicated.
+
+The cluster will have a new parameter (runtime changeable) called
+``candidate_pool_size`` which represents the number of candidates the
+cluster tries to maintain (preferably automatically).
+
+This will impact the cluster operations as follows:
+
+- jobs and config data will be replicated only to a fixed set of nodes
+- master fail-over will only be possible to a node in the candidate pool
+- cluster verify needs changing to account for these two roles
+- external scripts will no longer have access to the configuration
+  file (this is not recommended anyway)
+
+
+The caveats of this change are:
+
+- if all candidates are lost (completely), cluster configuration is
+  lost (but it should be backed up external to the cluster anyway)
+
+- failed nodes which are candidates must be dealt with properly, so
+  that we don't lose too many candidates at the same time; this will be
+  reported in cluster verify
+
+- the 'all equal' concept of ganeti is no longer true
+
+- the partial distribution of config data means that all nodes will
+  have to revert to ssconf files for master info (as in 1.2)
+
+Advantages:
+
+- speed on a simulated cluster of 100+ nodes is greatly enhanced, even
+  for a simple operation; ``gnt-instance remove`` on a diskless
+  instance goes from ~9 seconds to ~2 seconds
+
+- node failure of non-candidates will be less impacting on the cluster
+
+The default value for the candidate pool size will be set to 10 but
+this can be changed at cluster creation and modified any time later.
+
+Testing on simulated big clusters with sequential and parallel jobs
+shows that this value (10) is a sweet spot from a performance and
+load point of view.
+
+*offline* flag
+++++++++++++++
+
+In order to better support the situation in which nodes are offline
+(e.g. for repair) without altering the cluster configuration, Ganeti
+needs to be told about this state and needs to handle it properly.
+
+This will result in simpler procedures, and fewer mistakes, when the
+number of node failures is high on an absolute scale (either due to
+a high failure rate or simply big clusters).
+
+Nodes having this attribute set will not be contacted for inter-node
+RPC calls, will not be master candidates, and will not be able to host
+instances as primaries.
+
+Setting this attribute on a node:
+
+- will not be allowed if the node is the master
+- will not be allowed if the node has primary instances
+- will cause the node to be demoted from the master candidate role (if
+  it was), possibly causing another node to be promoted to that role
+
+This attribute will impact the cluster operations as follows:
+
+- querying these nodes for anything will fail instantly in the RPC
+  library, with a specific RPC error (RpcResult.offline == True)
+
+- they will be listed in the Other section of cluster verify
+
+The code is changed in the following ways:
+
+- RPC calls were converted to skip such nodes:
+
+  - RpcRunner-instance-based RPC calls are easy to convert
+
+  - static/classmethod RPC calls are harder to convert, and were left
+    alone
+
+- the RPC results were unified so that this new result state (offline)
+  can be differentiated
+
+- master voting still queries nodes that are in repair, as we need to
+  ensure consistency in case the (wrong) masters have old data, and
+  nodes have come back from repairs
+
+Caveats:
+
+- some operation semantics are less clear (e.g. what to do on instance
+  start with offline secondary?); for now, these will just fail as if the
+  flag is not set (but faster)
+- 2-node cluster with one node offline needs manual startup of the
+  master with a special flag to skip voting (as the master can't get a
+  quorum there)
+
+One of the advantages of implementing this flag is that it will allow
+future automation tools to automatically put the node into
+repairs and recover from this state, and the code (should/will) handle
+this much better than just timing out. So, future possible
+improvements (for later versions):
+
+- watcher will detect nodes which fail RPC calls and attempt to ssh
+  to them; if that fails, it will put them offline
+- watcher will try to ssh and query the offline nodes; if successful,
+  it will take them off the repair list
+
+Alternatives considered: The RPC call model in 2.0 is, by default,
+much nicer - errors are logged in the background, and job/opcode
+execution is clearer, so we could simply not introduce this. However,
+having this state will make both the codepaths clearer (offline
+vs. temporary failure) and the operational model (it's not a node with
+errors, but an offline node).
+
+
+*drained* flag
+++++++++++++++
+
+Due to parallel execution of jobs in Ganeti 2.0, we could have the
+following situation:
+
+- gnt-node migrate + failover is run
+- gnt-node evacuate is run, which schedules a long-running 6-opcode
+  job for the node
+- partway through, a new job comes in that runs an iallocator script,
+  which finds the above node empty and thus a very good candidate
+- gnt-node evacuate has finished, but now it has to be run again, to
+  clean up the instance(s) placed there in the meantime
+
+In order to prevent this situation, and to be able to get nodes into
+proper offline status easily, a new *drained* flag was added to the nodes.
+
+This flag (which actually means "is being, or was drained, and is
+expected to go offline") will prevent allocations on the node, but
+all other operations (start/stop instance, query, etc.) keep
+working without any restrictions.
+
+Interaction between flags
++++++++++++++++++++++++++
+
+While these flags are implemented as separate flags, they are
+mutually exclusive and act together with the master node role
+as a single *node status* value. In other words, a node is only in one
+of these roles at a given time. The lack of any of these flags denotes
+a regular node.
+
+The current node status is visible in the ``gnt-cluster verify``
+output, and the individual flags can be examined via separate flags in
+the ``gnt-node list`` output.
+
+These new flags will be exported in both the iallocator input message
+and via RAPI; see the respective man pages for the exact names.
+
  Feature changes
  ---------------
  
@@ -1378,8 +1558,8 @@ DEBUG_LEVEL
  
  These are only the basic variables we are thinking of now, but more
  may come during the implementation and they will be documented in the
-``ganeti-os-api`` man page. All these variables will be available to
-all scripts.
+:manpage:`ganeti-os-api` man page. All these variables will be
+available to all scripts.
  
  Some scripts will need a few more information to work. These will have
  per-script variables, such as for example:
@@ -1812,41 +1992,3 @@ option is::
    to set, string
  :$OPTION: cluster default option, string,
  :$VALUE: cluster default option value, string.
-
-Glossary
-========
-
-Since this document is only a delta from the Ganeti 1.2, there are
-some unexplained terms. Here is a non-exhaustive list.
-
-.. _HVM:
-
-HVM
-  hardware virtualization mode, where the virtual machine is oblivious
-  to the fact that's being virtualized and all the hardware is emulated
-
-.. _LU:
-
-LogicalUnit
-  the code associated with an OpCode, i.e. the code that implements the
-  startup of an instance
-
-.. _opcode:
-
-OpCode
-  a data structure encapsulating a basic cluster operation; for example,
-  start instance, add instance, etc.;
-
-.. _PVM:
-
-PVM
-  para-virtualization mode, where the virtual machine knows it's being
-  virtualized and as such there is no need for hardware emulation
-
-.. _watcher:
-
-watcher
-  ``ganeti-watcher`` is a tool that should be run regularly from cron
-  and takes care of restarting failed instances, restarting secondary
-  DRBD devices, etc. For more details, see the man page
-  ``ganeti-watcher(8)``.
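
The node-flag rules added in the "Node flags" hunk above can be illustrated with a minimal sketch. It is not part of this commit and is not Ganeti's implementation: the ``Node`` class, its field names, and the helpers ``node_status`` and ``set_offline`` are hypothetical, and clearing the drained flag when a node goes offline is an assumption derived from the flags being described as mutually exclusive.

```python
from dataclasses import dataclass
from enum import Enum


class NodeStatus(Enum):
    MASTER = "master"
    MASTER_CANDIDATE = "master candidate"
    DRAINED = "drained"
    OFFLINE = "offline"
    REGULAR = "regular"


@dataclass
class Node:
    # Hypothetical stand-in for a node object; field names are assumptions.
    name: str
    is_master: bool = False
    master_candidate: bool = False
    drained: bool = False
    offline: bool = False
    primary_instances: int = 0


def node_status(node: Node) -> NodeStatus:
    """Collapse the three flags and the master role into one status value."""
    # The three flags are mutually exclusive, so at most one may be set.
    if node.master_candidate + node.drained + node.offline > 1:
        raise ValueError(f"{node.name}: node flags are mutually exclusive")
    if node.is_master:
        return NodeStatus.MASTER
    if node.offline:
        return NodeStatus.OFFLINE
    if node.drained:
        return NodeStatus.DRAINED
    if node.master_candidate:
        return NodeStatus.MASTER_CANDIDATE
    # No flag set: a regular node.
    return NodeStatus.REGULAR


def set_offline(node: Node) -> bool:
    """Mark a node offline; return True if it was demoted from candidate.

    A True result signals the caller that another node may need to be
    promoted to keep the candidate pool at its configured size.
    """
    if node.is_master:
        raise ValueError(f"{node.name}: the master cannot be set offline")
    if node.primary_instances > 0:
        raise ValueError(f"{node.name}: node still has primary instances")
    was_candidate = node.master_candidate
    node.master_candidate = False
    node.drained = False  # assumption: offline replaces the drained state
    node.offline = True
    return was_candidate
```

For example, calling ``set_offline`` on a master candidate with no primary instances returns ``True`` and the node's status becomes ``NodeStatus.OFFLINE``, while calling it on the master raises ``ValueError``.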