Revision aa355c79 doc/design-2.1.rst
b/doc/design-2.1.rst | ||
---|---|---|
292 | 292 |
caller nonetheless. |
293 | 293 |
|
294 | 294 |
|
295 |
Node daemon availability |
|
296 |
~~~~~~~~~~~~~~~~~~~~~~~~ |
|
297 |
|
|
298 |
Current State and shortcomings |
|
299 |
++++++++++++++++++++++++++++++ |
|
300 |
|
|
301 |
Currently, when a Ganeti node suffers serious system disk damage, the |
|
302 |
migration/failover of an instance may not correctly shut down the virtual |
|
303 |
machine on the broken node, causing instance duplication. The ``gnt-node |
|
304 |
powercycle`` command can be used to force a node reboot and thus to |
|
305 |
avoid duplicated instances. This command relies on node daemon |
|
306 |
availability, though, and thus can fail if the node daemon has some |
|
307 |
pages swapped out of RAM, for example. |
|
308 |
|
|
309 |
|
|
310 |
Proposed changes |
|
311 |
++++++++++++++++ |
|
312 |
|
|
313 |
The proposed solution forces the node daemon to run exclusively in RAM. It |
|
314 |
uses Python ctypes to call ``mlockall(MCL_CURRENT | MCL_FUTURE)`` on |
|
315 |
the node daemon process and all its children. In addition, another log |
|
316 |
handler has been implemented for the node daemon; it redirects to |
|
317 |
``/dev/console`` any messages that cannot be written to the logfile. |
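
The ``mlockall`` call described above could look roughly like the following. This is a minimal sketch, not the actual Ganeti code: the function name ``lock_process_memory`` and its optional ``libc`` parameter are hypothetical, and the ``MCL_*`` values are the ones from ``<sys/mman.h>`` on Linux, which is assumed to be the target platform.

```python
import ctypes
import ctypes.util
import logging
import os

# Flag values from <sys/mman.h> on Linux (assumption: other platforms
# may use different values).
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_process_memory(libc=None):
    """Try to lock all current and future pages of this process in RAM.

    The optional libc argument (hypothetical, for illustration) lets a
    caller supply an already loaded C library object.

    Returns True on success and False if libc cannot be loaded or the
    mlockall call fails, for example because of RLIMIT_MEMLOCK.
    """
    if libc is None:
        libc_name = ctypes.util.find_library("c")
        if libc_name is None:
            logging.error("Cannot find the C library, memory not locked")
            return False
        try:
            # use_errno=True makes the C errno available via get_errno().
            libc = ctypes.CDLL(libc_name, use_errno=True)
        except OSError:
            logging.error("Cannot load the C library, memory not locked")
            return False

    # mlockall returns 0 on success and -1 on error.
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        logging.error("mlockall failed: %s", os.strerror(ctypes.get_errno()))
        return False
    return True
```

A daemon would call ``lock_process_memory()`` once, early in its startup. Whether a failure should abort startup or only be logged is a design choice this sketch leaves open.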
|
318 |
|
|
319 |
With these changes, the node daemon can successfully run basic tasks such as |
|
320 |
a powercycle request even when the system disk is heavily damaged and |
|
321 |
reading/writing to disk fails constantly. |
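
One way to keep log messages visible when writes to the logfile fail is a handler that falls back to a console device. The sketch below only illustrates the idea and is not the handler shipped with Ganeti: the class name ``ConsoleFallbackFileHandler`` and the ``console_path`` parameter are hypothetical. It relies on the standard ``logging`` behaviour that ``handleError`` is called when ``emit`` raises an exception.

```python
import logging


class ConsoleFallbackFileHandler(logging.FileHandler):
    """File handler that writes to a console device if the logfile fails.

    The class name and the console_path argument are hypothetical.
    """

    def __init__(self, filename, console_path="/dev/console"):
        logging.FileHandler.__init__(self, filename)
        self._console_path = console_path

    def handleError(self, record):
        # Invoked by the logging machinery when emit() raises, for
        # example because writing to the logfile returns an I/O error.
        try:
            with open(self._console_path, "a") as console:
                console.write(self.format(record) + "\n")
        except Exception:
            # The console is not writable either: use the default
            # error handling of the base class as a last resort.
            logging.FileHandler.handleError(self, record)
```

Such a handler would be attached with ``logging.getLogger().addHandler(...)`` in place of a plain ``FileHandler``. Writing to ``/dev/console`` normally requires root privileges.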
|
322 |
|
|
323 |
|
|
295 | 324 |
Feature changes |
296 | 325 |
--------------- |
297 | 326 |
|