root / lib @ cd42d0ad

# Date Author Comment
cd42d0ad 01/21/2009 12:03 pm Guido Trotter

Implement the new live migration backend functions

MigrationInfo, AcceptInstance and AbortMigration are implemented as
hypervisor specific functions, and by default they do nothing (as
they're not always necessary).

This patch also converts hv_base.MigrateInstance docstring to epydoc,...

38e250ba 01/21/2009 11:55 am Guido Trotter

KVM: save and remove the KVM runtime

At instance startup time we save the kvm runtime, and at stop time we
delete it. This patch also includes a function to load the kvm runtime,
which is not used yet.

Reviewed-by: iustinp

ee5f20b0 01/21/2009 11:55 am Guido Trotter

KVM: split KVM runtime generation and startup

Previously we generated the kvm command line and then just ran it.
With this patch we split the generation from the time it is run,
allowing us to save it and replay it at reboot.

We must take special care about instance nics:...

6906a9d8 01/21/2009 11:54 am Guido Trotter

Add calls in the intra-node migration protocol

Currently the hypervisor is expected to do all the migration from the
source side. With this patch we also add the option of passing some
information to the target side, and starting some operation there.

As a bonus, a function to clean up any started operation is included....

89f28b76 01/21/2009 10:33 am Iustin Pop

Update the objects.Disk formatting method

With the addition of minors, this needs to show them too.

Reviewed-by: ultrotter

a1d79fc6 01/20/2009 08:12 pm Guido Trotter

KVM: add a _CONF_DIR

Currently we keep pid files and control files. In the conf dir we'll
also keep the data to start the instance anew, and the network
interface scripts. These will then be copied to a separate area (since
_CONF_DIR could be mounted 'noexec') and used to start the instance....

c4fbefc8 01/20/2009 08:12 pm Guido Trotter

KVM: Remove sockets after shutdown

Abstract the monitor and serial socket naming in two functions, and
reuse them to clean up the files after shutdown.

Reviewed-by: iustinp

c4469f75 01/20/2009 08:11 pm Guido Trotter

KVM: fix class docstring

Reviewed-by: iustinp

fdf7f055 01/20/2009 08:11 pm Guido Trotter

Xen: use epydoc in MigrateInstance docstring

Reviewed-by: iustinp

920aae98 01/20/2009 07:50 pm Guido Trotter

ShutdownInstance: report hypervisor error

When StopInstance raises a HypervisorError, report it in the logged
message to ease debugging.

Reviewed-by: iustinp

55224070 01/20/2009 07:50 pm Guido Trotter

ConfigObject docstring, close an open parenthesis

Reviewed-by: iustinp

7577196d 01/20/2009 07:50 pm Guido Trotter

Fix a typo in luxi's docstring

Reviewed-by: iustinp

d21d09d6 01/20/2009 07:19 pm Iustin Pop

Update the logging output of job processing

(this is related to the master daemon log)

Currently it's not possible to follow (in the non-debug runs) the
logical execution thread of jobs. This is due to the fact that we don't
log the thread name (so we lose the association of log messages to jobs)...

ae59efea 01/20/2009 06:47 pm Michael Hanselmann

.gitignore: Don't exclude whole /autotools/ dir, but only files

This way newly added files will not be excluded by default. Also fixes
a small whitespace error in utils.py.

Reviewed-by: iustinp

96841384 01/20/2009 06:26 pm Iustin Pop

Convert RenameInstance to (status, data)

This allows the rename failures to show the output of OS scripts.

Reviewed-by: ultrotter

32388e6d 01/20/2009 04:20 pm Iustin Pop

Fix adding of disks to an instance

The ConfigWriter.AllocateDRBDMinor requires the instance name, not the
instance object. LUSetInstanceParms wrongly passes the instance
object, which can cause breakage.

The patch also adds asserts to check for this mismatch in ConfigWriter....

6d2e83d5 01/20/2009 04:20 pm Iustin Pop

Make cluster-verify check the drbd minors space

This patch adds support for verification of drbd minors space in cluster
verify: minors which belong to running instances and should be online
but are not, and minors which do not belong to any instance but are in...

2f907a8c 01/20/2009 04:20 pm Iustin Pop

Fix a couple of epydoc warnings

Reviewed-by: ultrotter

767d52d3 01/20/2009 01:18 pm Iustin Pop

DRBD: check for in-use minor during Create

In order to prevent errors with old, in-use DRBD minors, we check and
abort at create time if our minor is already in use. For this we need to
also modify DRBD8Status to be able to parse cs:Unconfigured devices....

f65f63ef 01/20/2009 01:18 pm Iustin Pop

Add a TailFile function

This patch adds a tail file function, to be used for parsing and returning in
the job log OS installation failures.

Reviewed-by: ultrotter
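
A tail helper of the kind described above can be sketched as follows. This is a minimal Python 3 illustration, not the TailFile code from the commit; the name tail_file, the default of 20 lines and the return type are assumptions.

```python
from collections import deque


def tail_file(path, lines=20):
    """Return the last `lines` lines of a text file as a list of strings.

    Hypothetical helper: the name, default and return type are assumptions,
    not taken from the commit.
    """
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent `lines` entries while
        # the file is read once from start to end.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

A caller could attach the returned lines to a job log entry when an OS installation script fails.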

1492cca7 01/20/2009 12:12 pm Iustin Pop

Some small fixes in cmdlib

Reviewed-by: ultrotter

20e01edd 01/20/2009 12:11 pm Iustin Pop

Convert AddOSToInstance to (status, data)

This allows the install and reinstall instance to return (hopefully)
relevant log files from the OS create scripts.

Reviewed-by: ultrotter

dd279568 01/20/2009 12:11 pm Iustin Pop

Convert the start instance rpc to (status, data)

This will record the failure cause in starting up the instance in the
job log (and thus to the user).

Reviewed-by: ultrotter

7d81697f 01/19/2009 07:22 pm Iustin Pop

Fix handling of failures in create instance disks

Commit 2302 only modified _CreateBlockDevOnPrimary to the new style
result, but _CreateBlockDevOnSecondary was forgotten. After the merger
of the two functions, _CreateBlockDevOnSecondary was taken as template...

c5e489f7 01/19/2009 04:35 pm Iustin Pop

Move the default MAC prefix to the constants file

Instead of having the default live in the gnt-cluster script, we move it
to the constants file. The patch also fixes a typo on constants.py.

Reviewed-by: ultrotter

6b12959c 01/19/2009 04:33 pm Iustin Pop

Use instance.all_nodes instead of hand-building it

This patch replaces a few obvious uses of [instance.primary_node] +
list(instance.secondary_nodes) (or similar usage) with the new
instance.all_nodes.

Reviewed-by: ultrotter

99c7b2a1 01/19/2009 04:32 pm Iustin Pop

Fix non-drbd instance creation

Commit 2294 introduced a new instance.all_nodes property, which
unfortunately is working incorrectly for non-drbd instances.

This patch fixes it by making sure the primary node is always added to
the set, even before recursing over (any potential) children....

7c5abcae 01/19/2009 01:10 pm Iustin Pop

Small simplification in MapLVsByNode

We don't need to pre-create the node entries in lvmap, since they will
be created at recursion time.

Reviewed-by: ultrotter

de12473a 01/19/2009 01:10 pm Iustin Pop

Split the block device creation in two parts

Some callers of _CreateBlockDev need recursive behaviour, but not all.
The replace secondary first creates (manually) new LVs to ensure storage
is there, and then it creates the new DRBD. At this point, we need a...

428958aa 01/19/2009 01:10 pm Iustin Pop

Combine the two _CreateBlockDevOnXXX functions

Since only two boolean parameters differ between these two functions, we
combine them so as to have less code duplication. This will be needed in
the future as we will need to split off the recursive part.
...

dab69e97 01/19/2009 01:10 pm Iustin Pop

Switch call_blockdev_create call to (status, data)

This allows errors to be visible at the user level instead of just node
daemon logs.

Reviewed-by: ultrotter

796cab27 01/19/2009 01:10 pm Iustin Pop

Small change in the instance disk creation path

For future propagation of error messages from backend to cmdlib and to
the job log, just having True/False return from the disk creation
function is not enough.

This patch converts these functions (_CreateDisks, _CreateBlockDevOnXXX)...

6c626518 01/19/2009 01:10 pm Iustin Pop

Block device creation cleanup

Currently when creating LVM-based instances, we always get the
extremely-confusing message "ERROR Can't find LV /dev/xenvg/..." which
is actually expected. This behaviour was introduced before we had
UUID-style LV names, since at that point it was not unexpected to have...

e6c1ff2f 01/19/2009 12:43 pm Iustin Pop

Use the same root for both _data and _meta LVs

Currently we use a different UUID for the _data and _meta volumes of a
DRBD disk. This is confusing as it's hard to associate the two in the
output of “lvs” or “gnt-node volumes”.

The patch changes this so that they use the same prefix....

998c712c 01/16/2009 06:24 pm Iustin Pop

Fix LUExportInstance

Due to deficiencies in our block device implementation, we must
call SetDiskID on disks before passing them to remote nodes. Since in
export/import, we don't touch the disks themselves, this was not needed
before in this function....

cfcc5c6d 01/16/2009 01:02 pm Iustin Pop

Instance: add a new all_nodes property

Since we often need the list of all nodes of an instance, we add a new
"all_nodes" property that returns all nodes of the instance, and we
switch secondary_nodes to a simpler implementation based on this new
function....

aeb83a2b 01/16/2009 12:43 pm Iustin Pop

Fix gnt-backup export with short names

We need to pass the fully-qualified node to _CheckNodeOnline, not the short
one.

Reviewed-by: imsnah

25e7b43f 01/15/2009 12:00 pm Iustin Pop

Some docstring updates

This patch rewraps some comments to shorter lengths, changes
double-quotes to single-quotes inside triple-quoted docstrings for
better editor handling.

It also fixes some epydoc errors, namely invalid cross-references (after
method rename), documentation for nonexistent (removed) parameters, etc....

14d57a8b 01/15/2009 12:00 pm Iustin Pop

ganeti-noded: reduce log noise

The source port/addr is currently logged three times for each
connection, and this is unnecessary. We change two log entries to debug,
since they are useful for precise timing, and we keep only one at INFO
level.

Reviewed-by: imsnah

53c776b5 01/13/2009 05:21 pm Iustin Pop

Forward port the live migration from 1.2 branch

This is a forward port via copy (not a cherry-pick of individual patches)
of the latest code on the 1.2 branch related to the migration.

The changes compared to 1.2 are the fact that we don't need the
IdentifyDisks step anymore (the drbd rpc calls are independent now), and...

a2d59d8b 01/13/2009 05:20 pm Iustin Pop

Port replace disk/change node to the new DRBD RPCs

In replace disks to new secondary, since Attach (and therefore
call_blockdev_find) is not modifying the devices anymore, we need to
switch this LU to the new call_drbd_disconnect_net and
call_drbd_attach_net functions....

6b93ec9d 01/13/2009 05:20 pm Iustin Pop

Forward-port DrbdNetReconfig

This is a modified forward-port of DrbdNetReconfig and its associated
RPCs. In Ganeti 2.0, these functions will be used for two things:
- live migration (as in 1.2)
- and for other network reconfiguration tasks, since DRBD8.Attach()...

f96e3c4f 01/13/2009 05:20 pm Iustin Pop

backend: rename AttachOrAssemble to Assemble

Since now the Assemble function is different than Attach, we rename this
backend function to show that the intent is to fully assemble the device
(and it's always allowed to modify the device).

Reviewed-by: ultrotter

2d0c8319 01/13/2009 05:20 pm Iustin Pop

drbd: change the semantics of Attach vs. Assemble

Currently, both the Attach and Assemble methods for DRBD8 devices will use and
alter the device state. This is suboptimal, and it has been worked
around in 1.2 via a special cache in the node daemon so that we don't...

f87548b5 01/13/2009 05:20 pm Iustin Pop

bdev: Do not call Assemble() on children

The caller of dev.Assemble() (backend._RecursiveAssembleBD) is doing an
explicit recursion over all the children of the device, with better
error reporting. As such, we don't need this repeated assembly inside
the base BlockDev class....

ea33068f 01/13/2009 04:43 pm Iustin Pop

Fix modification of instance memory

... as found by the QA script - bug was introduced by me in commit 2117.

Reviewed-by: imsnah

24b0d752 01/13/2009 03:16 pm Iustin Pop

Increase resync speed to 60MB/s

This is a forward-port of commit 2219 on the 1.2 branch.

Reviewed-by: ultrotter

4040a784 01/12/2009 06:06 pm Iustin Pop

Skip offline nodes in gnt-cluster commands

This patch makes gnt-cluster copyfile and command skip the offline
nodes.

Reviewed-by: ultrotter, imsnah

4cfb9426 01/12/2009 02:42 pm Iustin Pop

Fix some errors in instance modify --disk remove

The RpcResult introduction still left some bugs (after multiple patches):
- we don't correctly check the result type
- rename a variable to prevent a conflict

Reviewed-by: imsnah

f57c76e4 01/12/2009 12:27 pm Iustin Pop

Fix an error handling case in instance info

The checking for invalid instance names in LUQueryInstanceData is broken
since commit 1642.

Reviewed-by: imsnah

afee0879 01/12/2009 11:14 am Iustin Pop

Introduce a very simple LU to force config updates

This LU can be used to force a push of the config in case it's needed,
for example after an upgrade to update the ssconf_release_version file.

Reviewed-by: imsnah

8a113c7a 01/09/2009 06:24 pm Iustin Pop

Add a new ssconf file with the ganeti version

The patch adds a new ssconf file containing the ganeti version.

Reviewed-by: imsnah

7d585316 01/09/2009 05:34 pm Iustin Pop

Work around a DRBD sync speed race condition

This is a modified forward-port of commit 1544 on the 1.2 branch:

When DRBD is doing its dance to establish a connection with its
peer, it also sends the synchronization speed over the wire. In
some cases setting the sync speed only after setting up both...

cfacfd6e 01/09/2009 04:58 pm Iustin Pop

burnin: use the new replace_disks constants

This patch updates burnin to the latest replace disks constant, and
changes the constants' values to be more accurate.

Reviewed-by: imsnah

94a02bb5 01/09/2009 04:26 pm Iustin Pop

Fix gnt-os for offline nodes

We shouldn't query offline nodes in gnt-os. This patch adds a utility
function to ConfigWriter that returns the names of online nodes and uses
it in LUDiagnoseOS to query only the good nodes.

Reviewed-by: imsnah

186ec53c 01/09/2009 02:52 pm Iustin Pop

Silence warning on node list for offline nodes

The warning in node list is meant for nodes that return wrong
information, but for offline nodes this case is normal.

Reviewed-by: imsnah

7d88772a 01/09/2009 02:52 pm Iustin Pop

Rework the daemonization sequence

The current fork+close fds sequence has deficiencies which are hard to
work around:
- logging can start logging before we fork (e.g. if we need to emit
messages related to master checking), and thus use FDs which we...

7e9366f7 01/09/2009 02:22 pm Iustin Pop

Cleanup replace-disks modes and options

In 1.2, due to the md+drbd7 legacy, we had a complex choice of replace
modes, and the new drbd8 modes were forced into this syntax, with some
complicated rules of transition from one mode to another (if REPLACE_ALL...

82e37788 01/08/2009 06:39 pm Iustin Pop

Fix cluster verify/node net test for offline nodes

For offline nodes, we shouldn't add them to the NV_NODELIST and
NV_NODENETTEST tests since they most likely won't succeed.

The patch makes gnt-cluster verify happy again in such cases.

Reviewed-by: imsnah

3247bbac 01/08/2009 06:05 pm Iustin Pop

rpc: Add a method for easy check of remote results

The patch adds a new method to the rpc.RpcResult class called
"RemoteFailMsg" which is useful for the RPC calls which return a
(status, payload) style result.

Reviewed-by: imsnah
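
To illustrate the (status, payload) convention mentioned above, here is a minimal Python 3 sketch of a failure-message check. It is not the RpcResult.RemoteFailMsg implementation; the function name, the message texts and the handling of malformed results are assumptions.

```python
def remote_fail_msg(result):
    """Return None on success, otherwise a string describing the failure.

    Hypothetical stand-in for a check on a (status, payload) result; the
    exact behaviour of the real method is not stated in the commit message.
    """
    # Assumption: anything that is not a two-element sequence is malformed.
    if not isinstance(result, (tuple, list)) or len(result) != 2:
        return "Invalid result from node: %r" % (result,)
    status, payload = result
    if status:
        return None
    # Assumption: on failure the payload carries the error message.
    return str(payload) if payload else "Unknown failure (no message from node)"
```

A caller can then write `msg = remote_fail_msg(result)` and raise or log only when `msg` is not None.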

56e7640c 01/08/2009 04:16 pm Iustin Pop

Add an instance_migratable rpc call

This is a forward-port of commit 1194 on the 1.2 branch:

This call will check whether an instance is up on its primary, and that
it has been started with symlinks. We currently have no on-secondary
checks, nor any hypervisor specific call....

cf8df3f3 01/08/2009 02:03 pm Iustin Pop

bdev: forward-port ReAttachNet/DisconnectNet

This is a plain copy of the 1.2 ReAttachNet and DisconnectNet methods on
the DRBD8 device, with the logger to logging module changes and the
ReAttachNet method renamed to AttachNet.

These methods are not used anywhere right now, but will be used for...

5282084b 01/07/2009 07:02 pm Iustin Pop

backend: Remove symlinks by disk name

This is a modified forward-port of commit 1184 on the 1.2 branch:

backend: Remove symlinks by disk name, not using a wildcard
Reviewed-by: ultrotter

The changes to the original patch are related to the docstring style and...

b2e7666a 01/07/2009 07:02 pm Iustin Pop

Pass instance name to rpc call blockdev_close

This is an extract of commit 1166 on the 1.2 branch (Add a rpc call for
drbd network reconfiguration), but only the blockdev_close part.

The patch changes the blockdev_close call to take the instance so that...

03dfa658 01/07/2009 07:02 pm Iustin Pop

Fix the _RemoveBlockDevLinks() function

This is a forward-port of commit 1163 on the 1.2 branch:
This fixes the removal of the instance symlinks (probably breakage from
the glob changes).

Reviewed-by: imsnah

3c9c571d 01/07/2009 07:01 pm Iustin Pop

Remove instance's symlinks

This is a forward-port of commits 1150 and 1151 on the 1.2 branch:
Add _RemoveBlockDevLinks auxiliary function, called when an instance
fails to start and when it is shut down.

Reviewed-by: iustinp

and:
Fix cut&paste error when removing symlinks...

ec596c24 01/07/2009 07:01 pm Iustin Pop

Catch BlockDeviceError when starting instance

This is a forward-port of commit 1149 on the 1.2 branch:
_GatherAndLinkBlockDevs used to raise the errors.BlockDeviceError
exception when it failed to create a block device, and with this patch
set it does so also when it fails to create a symlink to it....

9332fd8a 01/07/2009 07:01 pm Iustin Pop

Create symlinks to instances' block devices

This is a forward-port of commit 1148 on the 1.2 branch:
Change the _GatherBlockDevs private function, called only one time by
StartInstance, to _GatherAndLinkBlockDevs, and make it transform the
device returned even more by calling the new _SimlinkBlockDev auxiliary...

069cfbf1 01/07/2009 07:01 pm Iustin Pop

Simplify hypervisor block_devices structure

This is a partial forward-port of commit 1136 on the 1.2 branch:

The hypervisor doesn't need to be passed the whole block device
structure, so we'll just give it the block device name on the local
node, and the name as seen by the instance. This will make it easier to...

2b17c3c4 01/07/2009 04:38 pm Iustin Pop

_AssembleInstanceDisks: fix rpcresult handling

Commit 2117 changed _AssembleInstanceDisks to correctly parse the
failure status of the new RpcResult structure, but it didn't fix the
storing of only the result payload. Since RpcResult is not JSON
serializable, LUActivateInstanceDisks is failing....

e09fdcfa 01/06/2009 11:57 am Iustin Pop

Fix some pylint-detected issues

Two bad indentation cases and a missing variable.

Reviewed-by: imsnah

5b099da9 12/19/2008 09:31 pm Michael Hanselmann

ganeti.bootstrap: Set permissions on newly uploaded files

Reviewed-by: amishchenko

699777f2 12/19/2008 09:31 pm Michael Hanselmann

ganeti.cmdlib: Check remote API certificate on "gnt-cluster verify"

Reviewed-by: amishchenko

2438c157 12/19/2008 09:30 pm Michael Hanselmann

ganeti.bootstrap: Upload remote API certificate to new nodes

Reviewed-by: amishchenko

5557b04c 12/19/2008 09:30 pm Michael Hanselmann

ganeti.bootstrap: Prepare for remote API certificate

Reviewed-by: amishchenko

c4415fd5 12/19/2008 09:30 pm Michael Hanselmann

ganeti.bootstrap: Write SSL key to temporary file and set permissions

Previously, we set the permissions only after writing the key. This
gave other users on the system a small window during which they could
read the key.

Reviewed-by: amishchenko
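
The idea of restricting permissions before any key material is written can be sketched as below. This is a minimal Python 3 illustration for Unix systems, not the ganeti.bootstrap code; the helper name write_private_file and its parameters are assumptions.

```python
import os
import tempfile


def write_private_file(path, data, mode=0o600):
    """Write `data` (bytes) to `path` without ever exposing it to other users.

    Hypothetical helper: the temporary file is created in the target
    directory, restricted, filled, and only then renamed into place.
    """
    directory = os.path.dirname(path) or "."
    # mkstemp creates the file readable and writable only by its owner.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            # Set the final mode explicitly before writing any data.
            os.fchmod(handle.fileno(), mode)
            handle.write(data)
        # Rename within one directory, so readers see either no file or
        # the complete, already-protected one.
        os.rename(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise
```

Because the file never exists with wider permissions, the window described in the commit message does not occur in this sketch.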

61a08fa3 12/19/2008 09:30 pm Michael Hanselmann

ganeti.bootstrap: Generate SSL certificate for remote API

Reviewed-by: amishchenko

40a97d80 12/19/2008 09:29 pm Michael Hanselmann

ganeti.bootstrap: Move SSL certificate generation into separate function

Reviewed-by: amishchenko

b5b67ef9 12/19/2008 02:58 pm Michael Hanselmann

ganeti-rapi: Implement HTTP authentication

Passwords are stored in "$localstatedir/lib/ganeti/rapi_users". User
options specify the access permissions of a user (see docstring for
ganeti.http.ReadPasswordFile), for which only "write" is supported
to grant write access. Every other user has read-only access....

e6e94655 12/19/2008 02:57 pm Michael Hanselmann

ganeti.http: Function to read password file

Lines in the password file are of the following format:

<username> <password> [options]

Fields are separated by whitespace. Username and password are
mandatory, options are optional and separated by comma (",")....
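
A parser for the line format described above might look like the following. This is a Python 3 sketch and not ganeti.http.ReadPasswordFile itself; the function name, the return structure and the skipping of blank or incomplete lines are assumptions.

```python
def parse_password_file(text):
    """Parse '<username> <password> [options]' lines into a dictionary.

    Returns {username: (password, [options])}. Hypothetical helper; the
    real function's name and return type may differ.
    """
    users = {}
    for line in text.splitlines():
        fields = line.split()  # fields are separated by whitespace
        if len(fields) < 2:
            # Assumption: blank or incomplete lines are ignored.
            continue
        name, password = fields[0], fields[1]
        options = []
        for field in fields[2:]:
            # Options are separated by commas.
            options.extend(opt for opt in field.split(",") if opt)
        users[name] = (password, options)
    return users
```

For example, the line `alice secret write` yields the entry `{"alice": ("secret", ["write"])}`.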

68fa9caf 12/19/2008 02:57 pm Michael Hanselmann

ganeti.http: Add support for private data in HTTP requests

Reviewed-by: amishchenko

be500c29 12/19/2008 02:57 pm Michael Hanselmann

ganeti.http: Add support for basic HTTP authentication

As per RFC2617.

Reviewed-by: amishchenko

f8bd7df3 12/19/2008 02:57 pm Michael Hanselmann

ganeti.http: Prepare authentication for HTTP server

The authentication class will override PreHandleRequest.

Reviewed-by: amishchenko

dd875d32 12/18/2008 06:39 pm Michael Hanselmann

Job queue: Allow more than one file rename per RPC call

Reviewed-by: ultrotter

d7fd1f28 12/18/2008 06:38 pm Michael Hanselmann

ganeti.jqueue: Group job archivals to reduce number of RPC calls

Reducing the actual number of RPC calls will come in another patch.

Reviewed-by: ultrotter

f8ad5591 12/18/2008 06:38 pm Michael Hanselmann

Prevent RPC timeout on auto-archiving jobs

With a large job queue, auto-archiving jobs can take a very long time,
causing timeouts on the luxi RPC layer. With this change, auto-
archive returns after half of the RPC timeout has passed. The user
will see how many jobs are left unchecked....
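
The time-bounded behaviour described above can be sketched as a loop that stops once half of the RPC timeout has elapsed. This is a Python 3 illustration only; the function name, the archive_one callback and the injectable clock are assumptions, not the jqueue code.

```python
import time


def archive_until_deadline(jobs, archive_one, rpc_timeout, clock=time.time):
    """Archive jobs until half of `rpc_timeout` has passed.

    Returns (archived, unchecked): how many jobs were archived and how many
    were not looked at because the deadline was reached. All names here are
    hypothetical.
    """
    deadline = clock() + rpc_timeout / 2.0
    archived = 0
    for index, job in enumerate(jobs):
        if clock() > deadline:
            # Stop early so the caller still gets an answer in time.
            return archived, len(jobs) - index
        if archive_one(job):
            archived += 1
    return archived, 0
```

The second element of the result corresponds to the "jobs left unchecked" count reported to the user.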

78d12585 12/18/2008 06:38 pm Michael Hanselmann

jqueue: When auto-archiving jobs, calculate job status only once

This is done by passing the job object to _ArchiveJobUnlocked instead
of only the job ID. Also return whether job was actually archived.

Reviewed-by: ultrotter

58b22b6e 12/18/2008 06:23 pm Michael Hanselmann

Use subdirectories for job queue archive

As it turned out, having many files in a single directory can be
very painful. With this patch, only 10'000 files are stored in a
directory for the job queue archive. With 10'000 directories, this
allows for up to 100 million jobs to be archived without having large...
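
One way to map a job to its archive subdirectory is integer division of the job ID by the per-directory limit. The Python 3 sketch below assumes numeric job IDs; the function names and the "job-<id>" file naming are illustrative assumptions, not taken from the patch.

```python
import os

# Figure taken from the commit message: 10'000 files per directory.
JOBS_PER_ARCHIVE_DIRECTORY = 10000


def archive_directory(job_id):
    """Return the archive subdirectory name for a numeric job ID."""
    return str(int(job_id) // JOBS_PER_ARCHIVE_DIRECTORY)


def archived_job_path(archive_root, job_id):
    """Return the full path of an archived job file (hypothetical naming)."""
    return os.path.join(archive_root, archive_directory(job_id),
                        "job-%d" % int(job_id))
```

With 10'000 such directories of 10'000 files each, the scheme covers 100 million jobs, matching the figure in the commit message.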

6e797216 12/18/2008 06:23 pm Michael Hanselmann

Add rename function automatically creating directories if needed

Unfortunately, os.makedirs in Python 2.4 is not safe against multiple
processes creating the same directory tree at the same time. This is
only fixed in Python 2.5 and up. Adding more checks in our code doesn't...
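
A rename that creates missing target directories on demand can be sketched as follows. This is a Python 3 illustration, not the function added by the commit: it relies on `exist_ok=True` (available since Python 3.2) to tolerate another process creating the same directory concurrently, which is not an option on the Python 2.4 the commit targets. The name rename_with_mkdir and the default mode are assumptions.

```python
import errno
import os


def rename_with_mkdir(old, new, mkdir_mode=0o750):
    """Rename `old` to `new`, creating the target directory if needed.

    Hypothetical helper; name and default mode are assumptions.
    """
    try:
        os.rename(old, new)
    except OSError as err:
        if err.errno != errno.ENOENT:
            raise
        # ENOENT most likely means the target directory is missing.
        target_dir = os.path.dirname(new)
        if target_dir:
            # exist_ok makes a concurrent creation by another process harmless.
            os.makedirs(target_dir, mode=mkdir_mode, exist_ok=True)
        # Retry once; a missing source still raises here.
        os.rename(old, new)
```

Trying the rename first keeps the common case, where the directory already exists, to a single system call.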

aea0ed67 12/18/2008 06:21 pm Michael Hanselmann

ganeti.http: Don't pass poller object around

They're cheap to instantiate, and this change makes the code
a bit simpler.

Reviewed-by: ultrotter

79589f25 12/18/2008 03:45 pm Michael Hanselmann

Rename http.HttpInternalError to HttpInternalServerError

All other exceptions are named after the error name in RFC2616 (HTTP/1.1).

Reviewed-by: amishchenko

b3660886 12/18/2008 03:45 pm Michael Hanselmann

ganeti.http: Add more constants and errors

Reviewed-by: amishchenko

45eac583 12/18/2008 03:45 pm Michael Hanselmann

ganeti.http: Ignore ENOTCONN when shutting down the connection

Reviewed-by: amishchenko

a8e01e9f 12/18/2008 03:44 pm Michael Hanselmann

Implement support for additional headers with HTTP errors

Reviewed-by: amishchenko

f30ca1e6 12/17/2008 04:30 pm Michael Hanselmann

Add simple unittests for ganeti.http

More complex unittests will need some refactoring in the HTTP code.

Reviewed-by: amishchenko

e38220e4 12/17/2008 04:09 pm Michael Hanselmann

ganeti.bootstrap: Whitespace fix

Reviewed-by: iustinp

f87b405e 12/17/2008 03:18 pm Michael Hanselmann

Add job queue size limit

A job queue with too many jobs can increase memory usage and/or make
the master daemon slow. The current limit is just an arbitrary number.
A "soft" limit for automatic job archival is prepared.

Reviewed-by: iustinp

7167159a 12/17/2008 01:24 pm Michael Hanselmann

utils.KillProcess: Use waitpid() to wait for child processes

Sometimes the proc filesystem doesn't reflect the current status of
a process. By calling waitpid(), we make sure to get the current
information, at least for child processes. The timeout is still...

513e896d 12/16/2008 06:24 pm Guido Trotter

LUConnectConsole: fix primary_node online check

The primary node is part of the instance, not of the opcode.

Reviewed-by: iustinp

bf988c29 12/16/2008 06:24 pm Guido Trotter

_RunCmdPipe: handle EINTR in poller.poll()

poll() can be interrupted. Rather than failing, we retry until it
returns.

Reviewed-by: iustinp
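
The retry-on-EINTR pattern can be sketched with a small wrapper. This is a Python 3 illustration, not the _RunCmdPipe change; the helper name is an assumption. Note that since Python 3.5 the standard library retries interrupted system calls such as poll() automatically (PEP 475), so an explicit loop like this mainly matters on older interpreters.

```python
import errno


def retry_on_eintr(func, *args):
    """Call func(*args), repeating the call while it fails with EINTR.

    Hypothetical helper; any other error is re-raised unchanged.
    """
    while True:
        try:
            return func(*args)
        except OSError as err:
            if err.errno != errno.EINTR:
                raise
            # Interrupted by a signal: simply try again.
```

Usage would be along the lines of `events = retry_on_eintr(poller.poll, timeout)`, where `poller` is a `select.poll()` object.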