Revision d385a174

ID	d385a1744c144052eaade85c38dd7106d9abf371

Added by Iustin Pop about 13 years ago

Increase the lock timeouts before we block-acquire

Falling back to a blocking acquire after the current timeout has been
observed to cause problems on real clusters via the following mechanism:

- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
  instance
- the watcher starts and submits its query instances opcode which
  wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
  after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
  cannot start until the watcher has finished, even though there's no
  actual operation on that instance
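
The mechanism above can be illustrated with a minimal sketch. It is not Ganeti's locking code: it uses plain exclusive `threading.Lock` objects instead of shared/exclusive locks, and the function name `acquire_all` and its return values are hypothetical. It only shows the general pattern of timed attempts followed by a blocking fallback, and why the fallback holds on to the locks it already has.

```python
import threading


def acquire_all(locks, attempt_timeouts):
    """Try timed acquires of every lock; block as a last resort.

    Returns "timed" if one of the timed attempts got all locks, or
    "blocking" if the fallback path was needed. (Hypothetical helper.)
    """
    for timeout in attempt_timeouts:
        held = []
        for lock in locks:
            if lock.acquire(timeout=timeout):
                held.append(lock)
            else:
                # Back off completely so other jobs can make progress.
                for other in reversed(held):
                    other.release()
                break
        else:
            # The inner loop never hit "break": every lock is held.
            return "timed"

    # Fallback: blocking acquire. Every lock obtained here stays held
    # while we wait for the busy one, which is what stalls other jobs.
    for lock in locks:
        lock.acquire()
    return "blocking"
```

During the timed attempts a failed try releases everything, so other jobs are only delayed briefly. Once the fallback is reached, the already-acquired locks are kept for as long as the busy lock stays taken.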

To alleviate this problem, we simply increase the maximum time that
lock acquires keep retrying before they fall back to either a blocking
acquire or a priority increase. The timeout is computed such that we
wait ~10 hours (instead of one) for this to happen, which should be
long enough to cover the maximum lifetime of a reasonable opcode on a
healthy cluster. The timeout also means that priority increases will
happen every half hour.

We also increase the max wait interval to 15 seconds, otherwise we'd
have too many retries with the increased interval.
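
As a rough illustration of how the total timeout and the per-attempt cap interact, here is a minimal sketch. The 10-hour total and the 15-second cap come from the message above; the 1-second minimum wait, the growth factor and the name `attempt_timeouts` are assumptions made for the example and are not taken from the actual change in mcpu.py.

```python
# From the commit message: ~10 hours in total, 15 seconds per attempt at most.
TOTAL_TIMEOUT = 10 * 3600.0
MAX_WAIT = 15.0
# Assumed for illustration only.
MIN_WAIT = 1.0
GROWTH = 1.5


def attempt_timeouts(total=TOTAL_TIMEOUT, max_wait=MAX_WAIT,
                     min_wait=MIN_WAIT, growth=GROWTH):
    """Return per-attempt timeouts whose sum is at least `total`.

    Each timeout grows by `growth` over the previous one and is clamped
    to the range [min_wait, max_wait].
    """
    timeouts = [min_wait]
    elapsed = min_wait
    while elapsed < total:
        nxt = min(max(timeouts[-1] * growth, min_wait), max_wait)
        timeouts.append(nxt)
        elapsed += nxt
    return timeouts


if __name__ == "__main__":
    schedule = attempt_timeouts()
    print(len(schedule), "attempts before falling back")
    # A smaller (hypothetical) cap needs more retries for the same total.
    print(len(attempt_timeouts(max_wait=5.0)), "attempts with a 5 s cap")
```

Once the timeouts reach the cap, the number of retries is roughly the total divided by the cap (36000 / 15 = 2400 for the values above), which is why a larger total calls for a larger cap.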

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Files

lib
- constants.py
- mcpu.py
test
- ganeti.mcpu_unittest.py
