
Revision d2cd6944

ID d2cd6944029153fec2888b65740b8989c65dd16b

Added by Iustin Pop about 13 years ago

RPC: mark jobqueue functions as URGENT

Recently, we've seen more and more cases of a specific breakage
pattern in Ganeti: master candidates which are semi-alive (as in, they
respond to ping, they can complete a TCP/SSL handshake, but otherwise
the root filesystem is broken) cause lots of confusion within masterd.

My analysis shows that waiting up to 5 minutes for a reply from such a
broken master candidate is too long, and this long wait breaks other
timeouts (e.g. the Luxi timeout), making standard recovery from this
situation very hard. It's much easier to kill the master daemon,
manually edit the config file to mark the node as regular, then restart
the master daemon.

The proposal is therefore to reduce the timeout for the job queue
functions to TMO_URGENT (1 minute), which should be more balanced
between a working but overloaded node and a broken node.

Signed-off-by: Iustin Pop <>
Reviewed-by: Michael Hanselmann <>
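
The timeout change described in the message above could be sketched roughly as follows. This is only an illustration, not the actual Ganeti code: the decorator `rpc_timeout`, the constant name `TMO_FAST`, the function `call_jobqueue_update`, and its `read_timeout` parameter are all hypothetical. Only the name `TMO_URGENT` and the values of 1 minute and 5 minutes come from the commit message.

```python
import functools

# Values taken from the commit message: the previous wait was up to
# 5 minutes, and TMO_URGENT is 1 minute. The name TMO_FAST is an assumption.
TMO_URGENT = 60
TMO_FAST = 5 * 60


def rpc_timeout(secs):
    """Hypothetical decorator tagging an RPC wrapper with a timeout in seconds."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Pass the timeout to the wrapped call so it can bound its wait.
            return fn(*args, read_timeout=secs, **kwargs)
        wrapper.timeout = secs
        return wrapper
    return decorator


# Before the change this hypothetical call would have used TMO_FAST.
@rpc_timeout(TMO_URGENT)
def call_jobqueue_update(node_list, file_name, content, read_timeout=None):
    """Stand-in for a job queue RPC; reports the timeout applied per node."""
    return {node: read_timeout for node in node_list}
```

With a shorter per-call limit, a semi-alive master candidate holds up a job queue update for at most one minute instead of five, which leaves more room inside the other timeouts mentioned in the message.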
