code.grnet.gr Git - ganeti-local/commit

author	Iustin Pop <iustin@google.com>
	Wed, 3 Jun 2009 12:20:03 +0000 (14:20 +0200)
committer	Iustin Pop <iustin@google.com>
	Thu, 4 Jun 2009 10:06:40 +0000 (12:06 +0200)
commit	fbafd7a864cc1c47587f6c4746589d07847b61ae
tree	39a606c9836a99378c527d37fae47b9e554a123b
parent	a97da6b7fa44603bf4649785b1c43bbbaa1ce522

Wait for a while in failed resyncs

This patch is an attempt at fixing some very rare occurrences of messages like:
- "There are some degraded disks for this instance", or:
- "Cannot resync disks on node node3.example.com: [True, 100]"

What I believe happens is that DRBD has finished syncing, but not all
fields are updated yet in the 'Connected' state; maybe it's in WFBitmap[ST],
or in some other transient state we don't handle well.

The patch will change the _WaitForSync method to recheck up to a
hardcoded number of times if we're finished syncing but we're degraded
(using the same condition as the 'break' clause of the loop).
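
A minimal sketch of the retry idea described above, not the actual
_WaitForSync code. The function name, the poll_status callback, and the
retry count and delays are all hypothetical and chosen for illustration
only:

```python
import time


def wait_for_sync(poll_status, degraded_retries=10, poll_delay=1.0,
                  sleep=time.sleep):
    """Wait until syncing is done, rechecking a few times if degraded.

    poll_status is a hypothetical callback returning (done, degraded).
    Returns True if the disks ended up healthy, False if still degraded.
    """
    retries_left = degraded_retries
    while True:
        done, degraded = poll_status()
        if not done:
            # Sync still in progress: keep polling as before.
            sleep(poll_delay)
            continue
        if degraded and retries_left > 0:
            # Done but degraded: this may be a transient state, so
            # recheck a bounded number of times before giving up.
            retries_left -= 1
            sleep(poll_delay)
            continue
        break
    return not degraded
```

With this shape, a transient degraded reading clears on a later poll,
while a genuinely degraded disk costs at most degraded_retries extra
polls before the caller sees the failure.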

The downside of this change is that a normal, really-degraded state due
to network or disk failure will incur an extra delay before the operation
aborts. For this reason, I'm happy to choose other values for the retry
count.

A better, long-term fix is to handle more DRBD states correctly (see the
bdev.DRBD8Status class).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>