From: Marc Schmitt
Date: Thu, 28 Oct 2010 11:42:36 +0000 (+0200)
Subject: Design Doc: Ganeti Node OOB Management Framework
X-Git-Tag: v2.4.0beta1~306
X-Git-Url: https://code.grnet.gr/git/ganeti-local/commitdiff_plain/1e86ee97d0767189df1f5a5d1a1e64e5343e831e

Design Doc: Ganeti Node OOB Management Framework

Signed-off-by: René Nussbaumer
Reviewed-by: Michael Hanselmann
---

diff --git a/Makefile.am b/Makefile.am
index 56a8500..70444b1 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -208,6 +208,7 @@ docrst = \
 	doc/design-2.1.rst \
 	doc/design-2.2.rst \
 	doc/design-2.3.rst \
+	doc/design-oob.rst \
 	doc/cluster-merge.rst \
 	doc/devnotes.rst \
 	doc/glossary.rst \
diff --git a/doc/design-oob.rst b/doc/design-oob.rst
new file mode 100644
index 0000000..1e78c1a
--- /dev/null
+++ b/doc/design-oob.rst
@@ -0,0 +1,362 @@
+Ganeti Node OOB Management Framework
+====================================
+
+Objective
+---------
+
+Extend Ganeti with Out of Band Cluster Node Management Capabilities.
+
+Background
+----------
+
+Ganeti currently has no support for Out of Band management of the nodes in a
+cluster. It relies on the OS running on the nodes and has therefore limited
+possibilities when the OS is not responding. The command ``gnt-node powercycle``
+can be issued to attempt a reboot of a node that crashed but there are no means
+to power a node off and power it back on. Supporting this is very handy in the
+following situations:
+
+  * **Emergency Power Off**: During emergencies, time is critical and manual
+    tasks just add latency which can be avoided through automation. If a server
+    room overheats, halting the OS on the nodes is not enough. The nodes need
+    to be powered off cleanly to prevent damage to equipment.
+  * **Repairs**: In most cases, repairing a node means that the node has to be
+    powered off.
+  * **Crashes**: Software bugs may crash a node. Having an OS independent way to
+    power-cycle a node helps to recover the node without human intervention.
+
+Overview
+--------
+
+Ganeti will be extended with OOB capabilities by adding a new **Cluster
+Parameter** (``--oob-program``), a new **Node Property** (``--oob-program``), a
+new **Node State (powered)** and support in ``gnt-node`` for invoking an
+**External Helper Command** which executes the actual OOB command (``gnt-node
+ nodename ...``). The supported commands are: ``power on``,
+``power off``, ``power cycle``, ``power status`` and ``health``.
+
+.. note::
+  The new **Node State (powered)** is a **State of Record
+  (SoR)**, not a **State of World (SoW)**. The maximum execution time of the
+  **External Helper Command** will be limited to 60s to prevent the cluster from
+  getting locked for an undefined amount of time.
+
+Detailed Design
+---------------
+
+New ``gnt-cluster`` Parameter
++++++++++++++++++++++++++++++
+
+| Program: ``gnt-cluster``
+| Command: ``modify|init``
+| Parameters: ``--oob-program``
+| Options: ``--oob-program``: executable OOB program (absolute path)
+
+New ``gnt-node`` Property
++++++++++++++++++++++++++
+
+| Program: ``gnt-node``
+| Command: ``modify|add``
+| Parameters: ``--oob-program``
+| Options: ``--oob-program``: executable OOB program (absolute path)
+
+.. note::
+  If ``--oob-program`` is set to ``!`` then the node has no OOB capabilities.
+  Otherwise, the node inherits the node group value or the cluster-wide
+  value, i.e. nodes have to opt out of OOB capabilities.
+
+Addition to ``gnt-cluster verify``
+++++++++++++++++++++++++++++++++++
+
+| Program: ``gnt-cluster``
+| Command: ``verify``
+| Parameter: None
+| Option: None
+| Additional Checks:
+
+  1. existence and execution flag of OOB program on all Master Candidates if
+     the cluster parameter ``--oob-program`` is set or at least one node has
+     the property ``--oob-program`` set. The OOB helper is just invoked on the
+     master
+  2. 
check if node state powered matches actual power state of the machine for
+     those nodes where ``--oob-program`` is set
+
+New Node State
+++++++++++++++
+
+Ganeti supports the following two boolean states related to the nodes:
+
+**drained**
+  The cluster still communicates with drained nodes but excludes them from
+  allocation operations
+
+**offline**
+  if offline, the cluster does not communicate with offline nodes; useful for
+  nodes that are not reachable in order to avoid delays
+
+This list will be extended with the following boolean state:
+
+**powered**
+  if not powered, the cluster does not communicate with nodes that are not
+  powered; if the node property ``--oob-program`` is not set, the state
+  powered is not displayed
+
+Additionally, the meaning of the offline state is modified as follows:
+
+**offline**
+  if offline, the cluster does not communicate with offline nodes (**with the
+  exception of OOB commands for nodes where** ``--oob-program`` **is set**);
+  useful for nodes that are not reachable in order to avoid delays
+
+The corresponding command extensions are:
+
+| Program: ``gnt-node``
+| Command: ``info``
+| Parameter: [ ``nodename`` ... ]
+| Option: None
+
+Additional Output (SoR, omitted if node property ``--oob-program`` is not set):
+powered: ``[True|False]``
+
+| Program: ``gnt-node``
+| Command: ``modify``
+| Parameter: nodename
+| Option: [ ``--powered=yes|no`` ]
+| Reasoning: sometimes you will need to sync the SoR with the SoW manually
+| Caveat: ``--powered`` can only be modified if ``--oob-program`` is set for
+| the node in question
+
+New ``gnt-node`` commands: ``power [on|off|cycle|status]``
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+| Program: ``gnt-node``
+| Command: ``power [on|off|cycle|status]``
+| Parameters: [ ``nodename`` ... ]
+| Options: None
+| Caveats:
+
+  * If no nodenames are passed to ``power [on|off|cycle]``, the user will be
+    prompted with ``"Do you really want to power [on|off|cycle] the following
+    nodes: