Revision e3c39cc3
b/doc/design-oob.rst | ||
---|---|---|
4 | 4 |
Objective |
5 | 5 |
--------- |
6 | 6 |
|
7 |
Extend Ganeti with Out of Band Cluster Node Management Capabilities. |
|
7 |
Extend Ganeti with Out of Band (:term:`OOB`) Cluster Node Management |
|
8 |
Capabilities. |
|
8 | 9 |
|
9 | 10 |
Background |
10 | 11 |
---------- |
11 | 12 |
|
12 |
Ganeti currently has no support for Out of Band management of the nodes in a |
|
13 |
cluster. It relies on the OS running on the nodes and has therefore limited |
|
14 |
possibilities when the OS is not responding. The command ``gnt-node powercycle`` |
|
15 |
can be issued to attempt a reboot of a node that crashed but there are no means |
|
16 |
to power a node off and power it back on. Supporting this is very handy in the |
|
17 |
following situations: |
|
18 |
|
|
19 |
* **Emergency Power Off**: During emergencies, time is critical and manual |
|
20 |
tasks just add latency which can be avoided through automation. If a server |
|
21 |
room overheats, halting the OS on the nodes is not enough. The nodes need |
|
22 |
to be powered off cleanly to prevent damage to equipment. |
|
23 |
* **Repairs**: In most cases, repairing a node means that the node has to be |
|
24 |
powered off. |
|
25 |
* **Crashes**: Software bugs may crash a node. Having an OS independent way to |
|
26 |
power-cycle a node helps to recover the node without human intervention. |
|
13 |
Ganeti currently has no support for Out of Band management of the nodes |
|
14 |
in a cluster. It relies on the OS running on the nodes and has therefore |
|
15 |
limited possibilities when the OS is not responding. The command |
|
16 |
``gnt-node powercycle`` can be issued to attempt a reboot of a node that |
|
17 |
crashed but there are no means to power a node off and power it back |
|
18 |
on. Supporting this is very handy in the following situations: |
|
19 |
|
|
20 |
* **Emergency Power Off**: During emergencies, time is critical and |
|
21 |
manual tasks just add latency which can be avoided through |
|
22 |
automation. If a server room overheats, halting the OS on the nodes |
|
23 |
is not enough. The nodes need to be powered off cleanly to prevent |
|
24 |
damage to equipment. |
|
25 |
* **Repairs**: In most cases, repairing a node means that the node has |
|
26 |
to be powered off. |
|
27 |
* **Crashes**: Software bugs may crash a node. Having an OS |
|
28 |
independent way to power-cycle a node helps to recover the node |
|
29 |
without human intervention. |
|
27 | 30 |
|
28 | 31 |
Overview |
29 | 32 |
-------- |
30 | 33 |
|
31 |
Ganeti will be extended with OOB capabilities through adding a new **Cluster |
|
32 |
Parameter** (``--oob-program``), a new **Node Property** (``--oob-program``), a |
|
33 |
new **Node State (powered)** and support in ``gnt-node`` for invoking an |
|
34 |
**External Helper Command** which executes the actual OOB command (``gnt-node |
|
35 |
<command> nodename ...``). The supported commands are: ``power on``, |
|
36 |
``power off``, ``power cycle``, ``power status`` and ``health``. |
|
34 |
Ganeti will be extended with OOB capabilities through adding a new |
|
35 |
**Cluster Parameter** (``--oob-program``), a new **Node Property** |
|
36 |
(``--oob-program``), a new **Node State (powered)** and support in |
|
37 |
``gnt-node`` for invoking an **External Helper Command** which executes |
|
38 |
the actual OOB command (``gnt-node <command> nodename ...``). The |
|
39 |
supported commands are: ``power on``, ``power off``, ``power cycle``, |
|
40 |
``power status`` and ``health``. |
|
37 | 41 |
|
38 | 42 |
.. note:: |
39 |
The new **Node State (powered)** is a **State of Record |
|
40 |
(SoR)**, not a **State of World (SoW)**. The maximum execution time of the |
|
41 |
**External Helper Command** will be limited to 60s to prevent the cluster from |
|
42 |
getting locked for an undefined amount of time. |
|
43 |
The new **Node State (powered)** is a **State of Record** |
|
44 |
(:term:`SoR`), not a **State of World** (:term:`SoW`). The maximum |
|
45 |
execution time of the **External Helper Command** will be limited to |
|
46 |
60s to prevent the cluster from getting locked for an undefined amount |
|
47 |
of time. |
|
43 | 48 |
|
44 | 49 |
Detailed Design |
45 | 50 |
--------------- |
... | ... | |
64 | 69 |
| ``--groups``: To operate on groups instead of nodes |
65 | 70 |
| ``--all``: To operate on the whole cluster |
66 | 71 |
|
67 |
This is a convenience command to allow easy emergency power off of a whole
|
|
68 |
cluster or part of it. It takes care of all steps needed to get the cluster into
|
|
69 |
a sane state to turn off the nodes. |
|
72 |
This is a convenience command to allow easy emergency power off of a |
|
73 |
whole cluster or part of it. It takes care of all steps needed to get
|
|
74 |
the cluster into a sane state to turn off the nodes.
|
|
70 | 75 |
|
71 |
With ``--on`` it does the reverse and tries to bring the rest of the cluster back
|
|
72 |
to life. |
|
76 |
With ``--on`` it does the reverse and tries to bring the rest of the |
|
77 |
cluster back to life.
|
|
73 | 78 |
|
74 | 79 |
.. note:: |
75 |
The master node is not able to shut itself cleanly down. Therefore, this |
|
76 |
command will not do all the work on single node clusters. On multi node |
|
77 |
clusters the command tries to find another master or if that is not possible |
|
78 |
prepares everything to the point where the user has to shut down the master |
|
79 |
node manually. This also applies to the single node cluster configuration. |
|
80 |
The master node is not able to shut itself cleanly down. Therefore, |
|
81 |
this command will not do all the work on single node clusters. On |
|
82 |
multi node clusters the command tries to find another master or if |
|
83 |
that is not possible prepares everything to the point where the user |
|
84 |
has to shut down the master node manually. This also applies to the |
|
85 |
single node cluster configuration. |
|
80 | 86 |
|
81 | 87 |
New ``gnt-node`` Property |
82 | 88 |
+++++++++++++++++++++++++ |
... | ... | |
87 | 93 |
| Options: ``--oob-program``: executable OOB program (absolute path) |
88 | 94 |
|
89 | 95 |
.. note:: |
90 |
If ``--oob-program`` is set to ``!`` then the node has no OOB capabilities. |
|
91 |
Otherwise, we will inherit the node group or the cluster wide |
|
92 |
value. I.e. the nodes have to opt out from OOB capabilities. |
|
96 |
If ``--oob-program`` is set to ``!`` then the node has no OOB |
|
97 |
capabilities. Otherwise, we will inherit the node group or |
|
98 |
the cluster wide value. I.e. the nodes have to opt out from OOB |
|
99 |
capabilities. |
|
93 | 100 |
|
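The opt-out rule in the note above can be sketched as a small lookup. This is only an illustration: the function name ``resolve_oob_program`` and the ``OOB_DISABLED`` constant are hypothetical, and the sketch assumes that an unset value is represented as ``None`` and that ``!`` found at any level disables OOB (the note only describes the node level).

```python
from typing import Optional

OOB_DISABLED = "!"  # opt-out marker described in the note above


def resolve_oob_program(node_value: Optional[str],
                        group_value: Optional[str],
                        cluster_value: Optional[str]) -> Optional[str]:
    """Return the effective OOB helper path for a node, or None.

    The first value that is set wins, in the order node, node group,
    cluster. Treating "!" as an opt-out on every level is an assumption
    of this sketch.
    """
    for candidate in (node_value, group_value, cluster_value):
        if candidate:
            return None if candidate == OOB_DISABLED else candidate
    return None  # nothing configured anywhere: no OOB capabilities
```

With this lookup a node only loses OOB capabilities when it is explicitly set to ``!`` or when no level defines a program at all.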
94 | 101 |
Addition to ``gnt-cluster verify`` |
95 | 102 |
++++++++++++++++++++++++++++++++++ |
... | ... | |
100 | 107 |
| Option: None |
101 | 108 |
| Additional Checks: |
102 | 109 |
|
103 |
1. existence and execution flag of OOB program on all Master Candidates if
|
|
104 |
the cluster parameter ``--oob-program`` is set or at least one node has
|
|
105 |
the property ``--oob-program`` set. The OOB helper is just invoked on the
|
|
106 |
master |
|
107 |
2. check if node state powered matches actual power state of the machine for
|
|
108 |
those nodes where ``--oob-program`` is set |
|
110 |
1. existence and execution flag of OOB program on all Master |
|
111 |
Candidates if the cluster parameter ``--oob-program`` is set or at
|
|
112 |
least one node has the property ``--oob-program`` set. The OOB
|
|
113 |
helper is just invoked on the master
|
|
114 |
2. check if node state powered matches actual power state of the |
|
115 |
machine for those nodes where ``--oob-program`` is set
|
|
109 | 116 |
|
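The first additional check (existence and execution flag of the OOB program) could look roughly like the following. This is a minimal sketch using only the Python standard library; the function name ``check_oob_program`` and the message wording are hypothetical and not taken from Ganeti. The absolute-path test reflects the ``--oob-program`` option description.

```python
import os


def check_oob_program(path: str) -> list:
    """Return a list of problems found for the OOB helper at `path`.

    An empty list means the helper exists and is executable.
    """
    problems = []
    if not os.path.isabs(path):
        problems.append("OOB program path is not absolute: %s" % path)
    elif not os.path.isfile(path):
        problems.append("OOB program not found: %s" % path)
    elif not os.access(path, os.X_OK):
        problems.append("OOB program is not executable: %s" % path)
    return problems
```

Such a check would have to run on every master candidate, since any of them may become the master that invokes the helper.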
110 | 117 |
New Node State |
111 | 118 |
++++++++++++++ |
... | ... | |
113 | 120 |
Ganeti supports the following two boolean states related to the nodes: |
114 | 121 |
|
115 | 122 |
**drained** |
116 |
The cluster still communicates with drained nodes but excludes them from
|
|
117 |
allocation operations |
|
123 |
The cluster still communicates with drained nodes but excludes them |
|
124 |
from allocation operations
|
|
118 | 125 |
|
119 | 126 |
**offline** |
120 |
if offline, the cluster does not communicate with offline nodes; useful for
|
|
121 |
nodes that are not reachable in order to avoid delays |
|
127 |
if offline, the cluster does not communicate with offline nodes; |
|
128 |
useful for nodes that are not reachable in order to avoid delays
|
|
122 | 129 |
|
123 | 130 |
And will extend this list with the following boolean state: |
124 | 131 |
|
125 | 132 |
**powered** |
126 |
if not powered, the cluster does not communicate with not powered nodes; if
|
|
127 |
the node property ``--oob-program`` is not set, the state powered is not
|
|
128 |
displayed |
|
133 |
if not powered, the cluster does not communicate with not powered |
|
134 |
nodes; if the node property ``--oob-program`` is not set, the state
|
|
135 |
powered is not displayed
|
|
129 | 136 |
|
130 | 137 |
Additionally modify the meaning of the offline state as follows: |
131 | 138 |
|
132 | 139 |
**offline** |
133 |
if offline, the cluster does not communicate with offline nodes (**with the |
|
134 |
exception of OOB commands for nodes where** ``--oob-program`` **is set**); |
|
135 |
useful for nodes that are not reachable in order to avoid delays |
|
140 |
if offline, the cluster does not communicate with offline nodes |
|
141 |
(**with the exception of OOB commands for nodes where** |
|
142 |
``--oob-program`` **is set**); useful for nodes that are not reachable |
|
143 |
in order to avoid delays |
|
136 | 144 |
|
137 | 145 |
The corresponding command extensions are: |
138 | 146 |
|
... | ... | |
141 | 149 |
| Parameter: [ ``nodename`` ... ] |
142 | 150 |
| Option: None |
143 | 151 |
|
144 |
Additional Output (SoR, omitted if node property ``--oob-program`` is not set): |
|
152 |
Additional Output (:term:`SoR`, omitted if node property |
|
153 |
``--oob-program`` is not set): |
|
145 | 154 |
powered: ``[True|False]`` |
146 | 155 |
|
147 | 156 |
| Program: ``gnt-node`` |
148 | 157 |
| Command: ``modify`` |
149 | 158 |
| Parameter: nodename |
150 | 159 |
| Option: [ ``--powered=yes|no`` ] |
151 |
| Reasoning: sometimes you will need to sync the SoR with the SoW manually
|
|
160 |
| Reasoning: sometimes you will need to sync the :term:`SoR` with the :term:`SoW` manually
|
|
152 | 161 |
| Caveat: ``--powered`` can only be modified if ``--oob-program`` is set for |
153 | 162 |
| the node in question |
154 | 163 |
|
... | ... | |
161 | 170 |
| Options: None |
162 | 171 |
| Caveats: |
163 | 172 |
|
164 |
* If no nodenames are passed to ``power [on|off|cycle]``, the user will be |
|
165 |
prompted with ``"Do you really want to power [on|off|cycle] the following |
|
166 |
nodes: <display list of OOB capable nodes in the cluster>? (y/n)"`` |
|
173 |
* If no nodenames are passed to ``power [on|off|cycle]``, the user |
|
174 |
will be prompted with ``"Do you really want to power [on|off|cycle] |
|
175 |
the following nodes: <display list of OOB capable nodes in the |
|
176 |
cluster>? (y/n)"`` |
|
167 | 177 |
* For ``power-status``, nodename is optional; if omitted, we list the |
168 |
power-status of all OOB capable nodes in the cluster (SoW)
|
|
178 |
power-status of all OOB capable nodes in the cluster (:term:`SoW`)
|
|
169 | 179 |
* User should be warned and needs to confirm with yes if s/he tries to |
170 | 180 |
``power [off|cycle]`` a node with running instances. |
171 | 181 |
|
172 | 182 |
Error Handling |
173 | 183 |
^^^^^^^^^^^^^^ |
174 | 184 |
|
175 |
+------------------------------+-----------------------------------------------+
|
|
176 |
| Exception | Error Message |
|
|
177 |
+==============================+===============================================+
|
|
178 |
| OOB program return code != 0 | OOB program execution failed ($ERROR_MSG) |
|
|
179 |
+------------------------------+-----------------------------------------------+
|
|
180 |
| OOB program execution time | OOB program execution timeout exceeded, OOB |
|
|
181 |
| exceeds 60s | program execution aborted |
|
|
182 |
+------------------------------+-----------------------------------------------+
|
|
185 |
+-----------------------------+----------------------------------------------+
|
|
186 |
| Exception | Error Message |
|
|
187 |
+=============================+==============================================+
|
|
188 |
| OOB program return code != 0| OOB program execution failed ($ERROR_MSG) |
|
|
189 |
+-----------------------------+----------------------------------------------+
|
|
190 |
| OOB program execution time | OOB program execution timeout exceeded, OOB |
|
|
191 |
| exceeds 60s | program execution aborted |
|
|
192 |
+-----------------------------+----------------------------------------------+
|
|
183 | 193 |
|
184 | 194 |
Node State Changes |
185 | 195 |
^^^^^^^^^^^^^^^^^^ |
186 | 196 |
|
187 |
+----------------+-----------------+----------------+--------------------------+
|
|
188 |
| State before | Command | State after | Comment |
|
|
189 |
| execution | | execution | |
|
|
190 |
+================+=================+================+==========================+
|
|
191 |
| powered: False | ``power off`` | powered: False | FYI: IPMI will complain |
|
|
192 |
| | | | if you try to power off |
|
|
193 |
| | | | a machine that is already|
|
|
194 |
| | | | powered off |
|
|
195 |
+----------------+-----------------+----------------+--------------------------+
|
|
196 |
| powered: False | ``power cycle`` | powered: False | FYI: IPMI will complain |
|
|
197 |
| | | | if you try to cycle a |
|
|
198 |
| | | | machine that is already |
|
|
199 |
| | | | powered off |
|
|
200 |
+----------------+-----------------+----------------+--------------------------+
|
|
201 |
| powered: False | ``power on`` | powered: True | |
|
|
202 |
+----------------+-----------------+----------------+--------------------------+
|
|
203 |
| powered: True | ``power off`` | powered: False | |
|
|
204 |
+----------------+-----------------+----------------+--------------------------+
|
|
205 |
| powered: True | ``power cycle`` | powered: True | |
|
|
206 |
+----------------+-----------------+----------------+--------------------------+
|
|
207 |
| powered: True | ``power on`` | powered: True | FYI: IPMI will complain |
|
|
208 |
| | | | if you try to power on |
|
|
209 |
| | | | a machine that is already|
|
|
210 |
| | | | powered on |
|
|
211 |
+----------------+-----------------+----------------+--------------------------+
|
|
197 |
+----------------+---------------+----------------+--------------------------+ |
|
198 |
| State before |Command | State after | Comment |
|
|
199 |
| execution | | execution | | |
|
200 |
+================+===============+================+==========================+ |
|
201 |
| powered: False |``power off`` | powered: False | FYI: IPMI will complain |
|
|
202 |
| | | | if you try to power off | |
|
203 |
| | | | a machine that is already| |
|
204 |
| | | | powered off | |
|
205 |
+----------------+---------------+----------------+--------------------------+ |
|
206 |
| powered: False |``power cycle``| powered: False | FYI: IPMI will complain |
|
|
207 |
| | | | if you try to cycle a | |
|
208 |
| | | | machine that is already | |
|
209 |
| | | | powered off | |
|
210 |
+----------------+---------------+----------------+--------------------------+ |
|
211 |
| powered: False |``power on`` | powered: True | |
|
|
212 |
+----------------+---------------+----------------+--------------------------+ |
|
213 |
| powered: True |``power off`` | powered: False | |
|
|
214 |
+----------------+---------------+----------------+--------------------------+ |
|
215 |
| powered: True |``power cycle``| powered: True | |
|
|
216 |
+----------------+---------------+----------------+--------------------------+ |
|
217 |
| powered: True |``power on`` | powered: True | FYI: IPMI will complain |
|
|
218 |
| | | | if you try to power on | |
|
219 |
| | | | a machine that is already| |
|
220 |
| | | | powered on | |
|
221 |
+----------------+---------------+----------------+--------------------------+ |
|
212 | 222 |
|
213 | 223 |
.. note:: |
214 | 224 |
|
215 | 225 |
* If the command fails, the Node State remains unchanged. |
216 | 226 |
* We will not prevent the user from trying to power off a node that is |
217 |
already powered off since the powered state represents the **SoR** only and |
|
218 |
not the **SoW**. This can however create problems when the cluster |
|
219 |
administrator wants to bring the **SoR** in sync with the **SoW** without |
|
220 |
actually having to mess with the node(s). For this case, we allow direct |
|
221 |
modification of the powered state through the gnt-node modify |
|
222 |
``--powered=[yes|no]`` command as long as the node has OOB capabilities |
|
223 |
(i.e. ``--oob-program`` is set). |
|
227 |
already powered off since the powered state represents the |
|
228 |
:term:`SoR` only and not the :term:`SoW`. This can however create |
|
229 |
problems when the cluster administrator wants to bring the |
|
230 |
:term:`SoR` in sync with the :term:`SoW` without actually having to |
|
231 |
mess with the node(s). For this case, we allow direct modification |
|
232 |
of the powered state through the gnt-node modify |
|
233 |
``--powered=[yes|no]`` command as long as the node has OOB |
|
234 |
capabilities (i.e. ``--oob-program`` is set). |
|
224 | 235 |
* All node power state changes will be logged |
225 | 236 |
|
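The state table and the notes above amount to a small transition function on the recorded (SoR) state. The sketch below is illustrative only: ``next_powered_state`` is a hypothetical name, and the commands are passed as the bare strings ``on``, ``off`` and ``cycle``.

```python
def next_powered_state(powered: bool, command: str, succeeded: bool) -> bool:
    """Return the recorded (SoR) powered flag after an OOB power command."""
    if not succeeded:
        return powered  # a failed command leaves the node state unchanged
    if command == "on":
        return True
    if command == "off":
        return False
    if command == "cycle":
        return powered  # per the table, cycling keeps the recorded state
    raise ValueError("unknown power command: %r" % command)
```

Note that the function never refuses a transition such as ``off`` on an already powered-off node, matching the decision not to block such requests because the recorded state may differ from the real one.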
226 |
Node Power Status Listing (SoW)
|
|
227 |
+++++++++++++++++++++++++++++++ |
|
237 |
Node Power Status Listing (:term:`SoW`)
|
|
238 |
+++++++++++++++++++++++++++++++++++++++
|
|
228 | 239 |
|
229 | 240 |
| Program: ``gnt-node`` |
230 | 241 |
| Command: ``power-status`` |
231 | 242 |
| Parameters: [ ``nodename`` ... ] |
232 | 243 |
|
233 |
Example output (represents **SoW**)::
|
|
244 |
Example output (represents :term:`SoW`)::
|
|
234 | 245 |
|
235 | 246 |
gnt-node oob power-status |
236 | 247 |
Node Power Status |
... | ... | |
241 | 252 |
|
242 | 253 |
.. note:: |
243 | 254 |
|
244 |
* We use ``unknown`` in case the Helper Program could not determine the power |
|
245 |
state. |
|
246 |
* If no nodenames are provided, we will list the power state of all nodes |
|
247 |
which are not opted out from OOB management. |
|
248 |
* Only nodes which are not opted out from OOB management will be listed. |
|
249 |
Invoking the command on a node that does not meet this condition will |
|
250 |
result in an error message "Node X does not support OOB commands". |
|
255 |
* We use ``unknown`` in case the Helper Program could not determine |
|
256 |
the power state. |
|
257 |
* If no nodenames are provided, we will list the power state of all |
|
258 |
nodes which are not opted out from OOB management. |
|
259 |
* Only nodes which are not opted out from OOB management will be |
|
260 |
listed. Invoking the command on a node that does not meet this |
|
261 |
condition will result in an error message "Node X does not support |
|
262 |
OOB commands". |
|
251 | 263 |
|
252 |
Node Power Status Listing (SoR)
|
|
253 |
+++++++++++++++++++++++++++++++ |
|
264 |
Node Power Status Listing (:term:`SoR`)
|
|
265 |
+++++++++++++++++++++++++++++++++++++++
|
|
254 | 266 |
|
255 | 267 |
| Program: ``gnt-node`` |
256 | 268 |
| Command: ``info`` |
257 | 269 |
| Parameter: [ ``nodename`` ... ] |
258 | 270 |
| Option: None |
259 | 271 |
|
260 |
Example output (represents **SoR**)::
|
|
272 |
Example output (represents :term:`SoR`)::
|
|
261 | 273 |
|
262 | 274 |
gnt-node info node1.example.com |
263 | 275 |
Node name: node1.example.com |
... | ... | |
278 | 290 |
- inst7.example.com |
279 | 291 |
|
280 | 292 |
.. note:: |
281 |
Only nodes which are not opted out from OOB management will |
|
282 |
report the powered state.
|
|
293 |
Only nodes which are not opted out from OOB management will report the
|
|
294 |
powered state. |
|
283 | 295 |
|
284 | 296 |
New ``gnt-node`` oob subcommand: ``health`` |
285 | 297 |
+++++++++++++++++++++++++++++++++++++++++++ |
... | ... | |
292 | 304 |
|
293 | 305 |
Caveats: |
294 | 306 |
|
295 |
* If no nodename(s) are provided, we will report the health of all nodes in |
|
296 |
the cluster which have ``--oob-program`` set. |
|
297 |
* Only nodes which are not opted out from OOB management will report their |
|
298 |
health. Invoking the command on a node that does not meet this condition |
|
299 |
will result in an error message "Node does not support OOB commands". |
|
307 |
* If no nodename(s) are provided, we will report the health of all |
|
308 |
nodes in the cluster which have ``--oob-program`` set. |
|
309 |
* Only nodes which are not opted out from OOB management will report |
|
310 |
their health. Invoking the command on a node that does not meet this |
|
311 |
condition will result in an error message "Node does not support OOB |
|
312 |
commands". |
|
300 | 313 |
|
301 | 314 |
For error handling see `Error Handling`_ |
302 | 315 |
|
... | ... | |
313 | 326 |
Return Codes |
314 | 327 |
^^^^^^^^^^^^ |
315 | 328 |
|
316 |
+---------------+--------------------------+ |
|
317 |
| Return code | Meaning | |
|
318 |
+===============+==========================+ |
|
319 |
| 0 | Command succeeded | |
|
320 |
+---------------+--------------------------+ |
|
321 |
| 1 | Command failed | |
|
322 |
+---------------+--------------------------+ |
|
323 |
| others | Unsupported/undefined | |
|
324 |
+---------------+--------------------------+ |
|
325 |
|
|
326 |
Error messages are passed from the helper program to Ganeti through StdErr |
|
327 |
(return code == 1). On StdOut, the helper program will send data back to |
|
328 |
Ganeti (return code == 0). The format of the data is JSON. |
|
329 |
|
|
330 |
+------------------+-------------------------------+ |
|
331 |
| Command | Expected output | |
|
332 |
+==================+===============================+ |
|
333 |
| ``power-on`` | None | |
|
334 |
+------------------+-------------------------------+ |
|
335 |
| ``power-off`` | None | |
|
336 |
+------------------+-------------------------------+ |
|
337 |
| ``power-cycle`` | None | |
|
338 |
+------------------+-------------------------------+ |
|
339 |
| ``power-status`` | ``{ "powered": true|false }`` | |
|
340 |
+------------------+-------------------------------+ |
|
341 |
| ``health`` | :: | |
|
342 |
| | | |
|
343 |
| | [[item, status], | |
|
344 |
| | [item, status], | |
|
345 |
| | ...] | |
|
346 |
+------------------+-------------------------------+ |
|
329 |
+-------------+-------------------------+ |
|
330 |
| Return code | Meaning | |
|
331 |
+=============+=========================+ |
|
332 |
| 0 | Command succeeded | |
|
333 |
+-------------+-------------------------+ |
|
334 |
| 1 | Command failed | |
|
335 |
+-------------+-------------------------+ |
|
336 |
| others | Unsupported/undefined | |
|
337 |
+-------------+-------------------------+ |
|
338 |
|
|
339 |
Error messages are passed from the helper program to Ganeti through |
|
340 |
:manpage:`stderr(3)` (return code == 1). On :manpage:`stdout(3)`, the |
|
341 |
helper program will send data back to Ganeti (return code == 0). The |
|
342 |
format of the data is JSON. |
|
343 |
|
|
344 |
+-----------------+------------------------------+ |
|
345 |
| Command | Expected output | |
|
346 |
+=================+==============================+ |
|
347 |
| ``power-on`` | None | |
|
348 |
+-----------------+------------------------------+ |
|
349 |
| ``power-off`` | None | |
|
350 |
+-----------------+------------------------------+ |
|
351 |
| ``power-cycle`` | None | |
|
352 |
+-----------------+------------------------------+ |
|
353 |
| ``power-status``| ``{ "powered": true|false }``| |
|
354 |
+-----------------+------------------------------+ |
|
355 |
| ``health`` | :: | |
|
356 |
| | | |
|
357 |
| | [[item, status], | |
|
358 |
| | [item, status], | |
|
359 |
| | ...] | |
|
360 |
+-----------------+------------------------------+ |
|
347 | 361 |
|
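The return code, output and timeout conventions above can be combined into one wrapper around the helper program. The following is a sketch under stated assumptions, not Ganeti's implementation: ``run_oob_helper`` and ``OobError`` are hypothetical names, and the helper is assumed to be called as ``<program> <command> <nodename>``, which the visible part of this design does not spell out. The error messages are the ones from the Error Handling table.

```python
import json
import subprocess

OOB_TIMEOUT = 60  # seconds, the limit stated in this design


class OobError(Exception):
    """Raised when the OOB helper fails or exceeds its time limit."""


def run_oob_helper(program, command, node, timeout=OOB_TIMEOUT):
    """Run the helper and return its decoded JSON output (None if empty)."""
    try:
        # Assumed argument layout: <program> <command> <nodename>
        proc = subprocess.run([program, command, node],
                              capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        raise OobError("OOB program execution timeout exceeded, "
                       "OOB program execution aborted")
    if proc.returncode != 0:
        # The helper reports its error message on stderr
        raise OobError("OOB program execution failed (%s)"
                       % proc.stderr.strip())
    out = proc.stdout.strip()
    return json.loads(out) if out else None
```

For ``power-status`` the returned value would be a dictionary such as ``{"powered": True}``; for the power commands, which produce no output, it would be ``None``.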
348 | 362 |
Data Format |
349 | 363 |
^^^^^^^^^^^ |
350 | 364 |
|
351 | 365 |
For the health output, the fields are: |
352 | 366 |
|
353 |
+--------+--------------------------------------------------------------------+
|
|
354 |
| Field | Meaning |
|
|
355 |
+========+====================================================================+
|
|
356 |
| item | String identifier of the item we are querying the health of, |
|
|
357 |
| | examples: |
|
|
358 |
| | |
|
|
359 |
| | * Ambient Temp |
|
|
360 |
| | * PS Redundancy |
|
|
361 |
| | * FAN 1 RPM |
|
|
362 |
+--------+--------------------------------------------------------------------+
|
|
363 |
| status | String; Can take one of the following four values: |
|
|
364 |
| | |
|
|
365 |
| | * OK |
|
|
366 |
| | * WARNING |
|
|
367 |
| | * CRITICAL |
|
|
368 |
| | * UNKNOWN |
|
|
369 |
+--------+--------------------------------------------------------------------+
|
|
367 |
+--------+------------------------------------------------------------------+ |
|
368 |
| Field | Meaning | |
|
369 |
+========+==================================================================+ |
|
370 |
| item | String identifier of the item we are querying the health of, | |
|
371 |
| | examples: | |
|
372 |
| | | |
|
373 |
| | * Ambient Temp | |
|
374 |
| | * PS Redundancy | |
|
375 |
| | * FAN 1 RPM | |
|
376 |
+--------+------------------------------------------------------------------+ |
|
377 |
| status | String; Can take one of the following four values: | |
|
378 |
| | | |
|
379 |
| | * OK | |
|
380 |
| | * WARNING | |
|
381 |
| | * CRITICAL | |
|
382 |
| | * UNKNOWN | |
|
383 |
+--------+------------------------------------------------------------------+ |
|
370 | 384 |
|
371 | 385 |
.. note:: |
372 | 386 |
|
373 |
* The item output list is defined by the Helper Program. It is up to the |
|
374 |
author of the Helper Program to decide which items should be monitored and |
|
375 |
what each corresponding return status is. |
|
376 |
* Ganeti will currently not take any actions based on the item status. It |
|
377 |
will however create log entries for items with status WARNING or CRITICAL |
|
378 |
for each run of the ``gnt-node oob health nodename`` command. Automatic |
|
379 |
actions (regular monitoring of the item status) are considered a new service |
|
380 |
and will be treated in a separate design document. |
|
387 |
* The item output list is defined by the Helper Program. It is up to |
|
388 |
the author of the Helper Program to decide which items should be |
|
389 |
monitored and what each corresponding return status is. |
|
390 |
* Ganeti will currently not take any actions based on the item |
|
391 |
status. It will however create log entries for items with status |
|
392 |
WARNING or CRITICAL for each run of the ``gnt-node oob health |
|
393 |
nodename`` command. Automatic actions (regular monitoring of the |
|
394 |
item status) are considered a new service and will be treated in a |
|
395 |
separate design document. |
|
381 | 396 |
|
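The health data format and the logging rule from the note above might be handled as sketched below. The names ``parse_health`` and ``log_health_problems`` and the logger name ``oob`` are hypothetical. Rejecting statuses outside the four listed values is an assumption of this sketch, since the design does not say how invalid output is treated.

```python
import json
import logging

VALID_STATUSES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


def parse_health(raw: str):
    """Decode helper output into a list of (item, status) tuples."""
    result = []
    for entry in json.loads(raw):
        item, status = entry
        if status not in VALID_STATUSES:
            raise ValueError("invalid health status %r for item %r"
                             % (status, item))
        result.append((item, status))
    return result


def log_health_problems(node, items, logger=logging.getLogger("oob")):
    """Log WARNING/CRITICAL items and return the items that were logged."""
    flagged = [(i, s) for i, s in items if s in ("WARNING", "CRITICAL")]
    for item, status in flagged:
        logger.warning("node %s: health item %s is %s", node, item, status)
    return flagged
```

Only logging happens here; no action is taken on the item status, in line with the note above.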
382 | 397 |
Logging |
383 | 398 |
------- |
384 | 399 |
|
385 |
The ``gnt-node power-[on|off]`` (power state changes) commands will create log
|
|
386 |
entries following current Ganeti logging practices. In addition, health items
|
|
387 |
with status WARNING or CRITICAL will be logged for each run of ``gnt-node
|
|
388 |
health``. |
|
400 |
The ``gnt-node power-[on|off]`` (power state changes) commands will |
|
401 |
create log entries following current Ganeti logging practices. In
|
|
402 |
addition, health items with status WARNING or CRITICAL will be logged
|
|
403 |
for each run of ``gnt-node health``.
|
|
389 | 404 |
|
390 | 405 |
.. vim: set textwidth=72 : |
391 | 406 |
.. Local Variables: |