Statistics
| Branch: | Tag: | Revision:

root / lib / jqueue.py @ 35e994e9

History | View | Annotate | Download (37.9 kB)

# Date Author Comment
0f6be82a 02/12/2009 08:13 pm Iustin Pop

job queue: log the opcode error too

Currently we only log "Error in opcode ...", but we don't log the error itself.
This is not good for debugging.

Reviewed-by: ultrotter

df0fb067 01/28/2009 04:46 pm Iustin Pop

Fix some issues related to job cancelling

This patch fixes two issues with the cancel mechanism:
- cancelled jobs show as such, and not in error state (we mark them as
OP_STATUS_CANCELED and not OP_STATUS_ERROR)
- queued jobs which are cancelled don't raise errors in the master (we...

5278185a 01/27/2009 05:41 pm Iustin Pop

Fix single-job archiving (gnt-job archive)

This is a simply typo from the conversion to multi-job archiving.

Reviewed-by: imsnah

d21d09d6 01/20/2009 07:19 pm Iustin Pop

Update the logging output of job processing

(this is related to the master daemon log)

Currently it's not possible to follow (in the non-debug runs) the
logical execution thread of jobs. This is due to the fact that we don't
log the thread name (so we lose the association of log messages to jobs)...

25e7b43f 01/15/2009 12:00 pm Iustin Pop

Some docstring updates

This patch rewraps some comments to shorter lengths, changes
double-quotes to single-quotes inside triple-quoted docstrings for
better editor handling.

It also fixes some epydoc errors, namely invalid crossreferences (after
method rename), documentation for inexistent (removed) parameters, etc....

dd875d32 12/18/2008 06:39 pm Michael Hanselmann

Job queue: Allow more than one file rename per RPC call

Reviewed-by: ultrotter

d7fd1f28 12/18/2008 06:38 pm Michael Hanselmann

ganeti.jqueue: Group job archivals to reduce number of RPC calls

Reducing the actual number of RPC calls will come in another patch.

Reviewed-by: ultrotter

f8ad5591 12/18/2008 06:38 pm Michael Hanselmann

Prevent RPC timeout on auto-archiving jobs

With a large job queue, auto-archiving jobs can take a very long time,
causing timeouts on the luxi RPC layer. With this change, auto-
archive returns after half of the RPC timeout has passed. The user
will see how many jobs are left unchecked....

78d12585 12/18/2008 06:38 pm Michael Hanselmann

jqueue: When auto-archiving jobs, calculate job status only once

This is done by passing the job object to _ArchiveJobUnlocked instead
of only the job ID. Also return whether job was actually archived.

Reviewed-by: ultrotter

58b22b6e 12/18/2008 06:23 pm Michael Hanselmann

Use subdirectories for job queue archive

As it turned out, having many files in a single directory can be
very painful. With this patch, only 10'000 files are stored in a
directory for the job queue archive. With 10'000 directries, this
allows for up to 100 million jobs be archived without having large...

f87b405e 12/17/2008 03:18 pm Michael Hanselmann

Add job queue size limit

A job queue with too many jobs can increase memory usage and/or make
the master daemon slow. The current limit is just an arbitrary number.
A "soft" limit for automatic job archival is prepared.

Reviewed-by: iustinp

9728ae5d 12/14/2008 02:02 pm Iustin Pop

cleanup: exceptions should derive from Exception

Reviewed-by: amishchenko

c41eea6e 12/11/2008 07:13 pm Iustin Pop

Fix epydoc format warnings

This patch should fix all outstanding epydoc parsing errors; as such, we
switch epydoc into verbose mode so that any new errors will be visible.

Reviewed-by: imsnah

59303563 12/02/2008 07:05 am Iustin Pop

Restrict job propagation to master candidates only

This patch restricts the job propagation to master candidates only, by
not registering non-candidates in the job queue node lists.

Note that we do intentionally purge the job queue if a node is toggled
to non-master status....

b7cb9024 11/28/2008 12:29 pm Michael Hanselmann

jqueue: Always print message for 100% when inspecting queue

Reviewed-by: iustinp

fbf0262f 11/28/2008 12:28 pm Michael Hanselmann

jqueue: Allow jobs waiting for locks to be canceled

- Add new "canceling" status
- Notify clients when job is canceled
- Give a return value from CancelJob
- Handle it in the client library

Reviewed-by: iustinp

33987705 11/27/2008 05:13 am Iustin Pop

jqueue: fix a bug in an error path

Dictionaries raise KeyError, and not ValueError when invalid keys are
passes to del.

Reviewed-by: imsnah

711b5124 11/26/2008 06:16 pm Michael Hanselmann

jqueue: Log progress and load jobs one by one

By logging more information, a user can see how far it is in inspecting
the queue. This can be useful with a large number of jobs. Also, instead
of loading all jobs in one go, load only the list of job IDs and then...

16714921 11/26/2008 06:15 pm Michael Hanselmann

jqueue: Shutdown workerpool in case of a problem

Reviewed-by: ultrotter

a3811745 11/12/2008 02:51 pm Michael Hanselmann

jqueue: Always use rpc.RpcRunner

"from ganeti.rpc import RpcRunner" does not conform to the style guide.

Reviewed-by: iustinp

ea03467c 10/28/2008 01:21 am Iustin Pop

Documentation updates for jqueue.py

Reviewed-by: imsnah

63712a09 10/28/2008 01:20 am Iustin Pop

Yet another bug found while reviewing docs

The newer_than variable can be either None or an int, and we normalize
it to an integer previously and save it in the 'serial' variable, which
should be used instead.

Reviewed-by: imsnah

99aabbed 10/20/2008 05:47 pm Iustin Pop

Convert the job queue rpcs to address-based

The two main multi-node job queue RPC calls (jobqueue_update,
jobqueue_rename) are converted to address-based calls, in order to speed
up queue changes. For this, we need to change the _nodes attribute on
the jobqueue to be a dict {name: ip}, instead of a set....

94ed59a5 10/16/2008 03:08 pm Iustin Pop

Fix job queue behaviour when loading jobs

Currently, if loading a job fails, the job queue code raises an
exception and prevents the proper processing of the jobs in the queue.
We change this so that unparseable jobs are instead archived (if not
already)....

3ccafd0e 10/16/2008 11:37 am Iustin Pop

Add an interface for the drain flag changes/query

This adds the set/reset in the jqueue and luxi modules, and a way to
query it in OpQueryConfigValues, and also the comand line interface for
it:
$ gnt-cluster queue info
The drain flag is unset
$ gnt-cluster queue drain...

686d7433 10/15/2008 01:52 pm Iustin Pop

Implement the job queue drain flag

We add a (per-node) queue drain flag that blocks new job submission.
There is not yet an interface to add/remove the flag (will come in next
patches).

Reviewed-by: imsnah

72737a7f 10/10/2008 12:55 pm Iustin Pop

Convert rpc module to RpcRunner

This big patch changes the call model used in internode-rpc from
standalong function calls in the rpc module to via a RpcRunner class,
that holds all the methods. This can be used in the future to enable
smarter processing in the RPC layer itself (some quick examples are not...

e92376d7 10/07/2008 11:03 am Iustin Pop

Implement job 'waiting' status

Background: when we have multiple jobs in the queue (more than just a
few), many of the jobs (up to the number of threads) will be in state
'running', although many of them could be actually blocked, waiting for
some locks. This is not good, as one cannot easily see what is...

07cd723a 10/06/2008 07:42 pm Iustin Pop

Implement job auto-archiving

This patch adds a new luxi call that implements auto-archiving of jobs
older than a certain age (or -1 for all completed jobs), and the gnt-job
command that makes use of this (with 'all' for -1).

Reviewed-by: imsnah

1daae384 10/06/2008 04:29 pm Iustin Pop

Increase the number of threads to 25

Since our locks are not gathered nicely, we can have jobs that are
actually blocking on locks (parallel burnin shows this), so at least we
need to increase the number of threads above the usual number of jobs we
could have in a such a case....

c56ec146 09/30/2008 03:04 pm Iustin Pop

Enhance the job-related timestamps

This patch adds start, stop, and received timestamp for jobs (and allows
querying of them), and allows querying of the opcode timestamps.

Reviewed-by: imsnah

5b23c34c 09/29/2008 06:38 pm Iustin Pop

Add opcode execution log in job info

This patch adds the job execution log in “gnt-job info” and also allows
its selection in “gnt-job list” (however here it's not very useful as
it's not easy to parse). It does this by adding a new field in the query
job call, named ‘oplog’....

60dd1473 09/29/2008 04:09 pm Iustin Pop

Implement job summary in gnt-job list

It is not currently possibly to show a summary of the job in the output
of “gnt-job list”. The closes is listing the whole opcode(s), but that
is too verbose. Also, the default output (id, status) is not very
useful, unless one looks for (and knows about) an exact job ID....

3b87986e 09/29/2008 04:08 pm Iustin Pop

Nicely sort the job list

Unless we decide to change the job identifiers to integer, we should at
least sort the list returned by _GetJobIDsUnlocked.

Reviewed-by: imsnah

e74798c1 09/10/2008 08:46 pm Michael Hanselmann

jqueue: Add common RPC error handling function

We didn't decide yet what exactly it should do with failed nodes.

Reviewed-by: ultrotter

5c735209 08/29/2008 04:42 pm Iustin Pop

Make WaitForJobChanges deal with long jobs

This patch alters the WaitForJobChanges luxi-RPC call to have a
configurable timeout, so that the call behaves nicely with long jobs
that have no update.

We do this by adding a timeout parameter in the RPC call, and returning...

5685c1a5 08/27/2008 05:52 pm Michael Hanselmann

jqueue: Replace normal cache dict with weakref dict

A job should only exist once in memory. After the cache is cleaned,
there can still be references to a job somewhere else. If there
are multiple instances, one can get updated while a function is
waiting for changes on another instance. By using...

70552c46 08/27/2008 05:52 pm Michael Hanselmann

jqueue: Keep timestamp of opcode start and end

Reviewed-by: ultrotter

65548ed5 08/27/2008 05:48 pm Michael Hanselmann

jqueue: Reset run_op_idx after job is done

It can be confusing otherwise.

Reviewed-by: ultrotter

6c5a7090 08/27/2008 11:34 am Michael Hanselmann

Make sure that client programs get all messages

This is a large patch, but I can't figure out how to split it without
breaking stuff. The old way of getting messages by always getting the
last one didn't bring all messages to the client if they were added...

dfe57c22 08/11/2008 07:27 pm Michael Hanselmann

Add RPC call to wait for job changes

This way clients can react faster to status or message changes and
don't have to poll anymore.

Reviewed-by: ultrotter

d5e317ba 08/11/2008 07:27 pm Michael Hanselmann

jqueue: Change log message time format

See the comment in the patch.

Reviewed-by: ultrotter

abc1f2ce 08/08/2008 02:21 pm Michael Hanselmann

jqueue: Move archived jobs on all nodes

Otherwise one might have archived jobs back in the list after a master
failover.

Reviewed-by: iustinp

5d6fb8eb 08/08/2008 01:03 pm Michael Hanselmann

jstore: Change to not always require a lock

This way we can do locking when both noded and masterd are running
on the same machine, the latter holding an exclusive lock on the
queue.

Reviewed-by: iustinp

9f774ee8 08/08/2008 01:01 pm Michael Hanselmann

jqueue: Use new job queue RPC functions

Reviewed-by: iustinp

d2e03a33 08/06/2008 04:36 pm Michael Hanselmann

jqueue: Implement {Add,Remove}Node

These functions will be used to notify the queue about newly added
or removed nodes.

Reviewed-by: iustinp

4c848b18 08/06/2008 04:35 pm Michael Hanselmann

jqueue: Don't pass the list of nodes to SubmitJob anymore

The job queue now maintains its own list and is updated when
nodes are added or removed from the cluster.

Reviewed-by: iustinp

8e00939c 08/06/2008 04:35 pm Michael Hanselmann

Maintain node list in job queue

The code makes sure not to include the master in the list.

Reviewed-by: iustinp

23752136 08/05/2008 01:33 pm Michael Hanselmann

jqueue: Replicate jobs to all nodes

Newly added nodes are not yet taken care of. Queue locking on
non-master nodes is not yet correct.

Reviewed-by: iustinp

04ab05ce 08/04/2008 03:27 pm Michael Hanselmann

jqueue: Use new jstore module

Reviewed-by: iustinp

db37da70 07/31/2008 06:03 pm Michael Hanselmann

jqueue: Move assert into decorator

This reduces code duplication. A later patch will modify the job queue
a bit more and will need a change of this assert. The assertion is
also removed from all class-internal functions.

Reviewed-by: iustinp

5bdce580 07/31/2008 05:33 pm Michael Hanselmann

jqueue: Store context in job queue instead of worker pool

The job queue will need to access to configuration, which is provided
through the context object, to get a list of nodes.

Reviewed-by: iustinp

38206f3c 07/30/2008 05:04 pm Iustin Pop

Fix pylint-detected issues

This is mostly:
- whitespace fix (space at EOL in some files, not all, broken
indentation, etc)
- variable names overriding others (one is a real bug in there)
- too-long-lines
- cleanup of most unused imports (not all)...

85f03e0d 07/30/2008 01:02 pm Michael Hanselmann

Rewrite job queue

We found several issues in the old job queue implementation. It had race
conditions, deadlocks and other deficiencies.

Short summary:
- _QueuedOpCode and _QueuedJob are now more or less data structures with a few
utility functions. __Setup is gone....

8090e19f 07/29/2008 05:07 pm Michael Hanselmann

jqueue: Fix error logging

The passed parameters were not correct.

Reviewed-by: iustinp, ultrotter

188c5e0a 07/28/2008 01:13 pm Michael Hanselmann

Implement job canceling on server side

Locking is not completeley right due to a deadlock when the job calls
UpdateJob after changing its status.

Reviewed-by: ultrotter

4cb1d919 07/28/2008 12:16 pm Michael Hanselmann

Add “canceled” status for opcodes

Reviewed-by: ultrotter

fae737ac 07/25/2008 03:47 pm Michael Hanselmann

Move code extracting job ID into function

It might come in handy at some point and makes the code a bit easier
to read.

Reviewed-by: iustinp

c609f802 07/24/2008 02:32 pm Michael Hanselmann

Implement job archiving on the server side

So far no error reporting to the client is done. Clients don't get
noticed if a job doesn't exist or couldn't be archived because of
its current status.

The internal cache is always cleaned when the preconditions didn't...

0cb94105 07/24/2008 02:32 pm Michael Hanselmann

Add directory for archived jobs

Reviewed-by: iustinp

ce594241 07/23/2008 07:56 pm Michael Hanselmann

Move code formatting job ID into a base class

A later patch will add a memory based job storage class, hence this
code is going into a separate class. It also changes the number format
to always use at least 10 digits, allowing up to 9'999'999'999 jobs to...

21cc1fbd 07/23/2008 04:30 pm Michael Hanselmann

Rename JobStorage to DiskJobStorage

Reviewed-by: iustinp

205d71fd 07/23/2008 03:25 pm Michael Hanselmann

Fix logging with string job IDs

The job ID is now a string, hence logging must use %s instead of %d.

Reviewed-by: iustinp

3be9a705 07/23/2008 02:34 pm Michael Hanselmann

Make job ID a string

The docstring says that _NewSerialUnlocked returns “a string
representing the job identifier”. Until now it returned an
integer and this patch changes it.

Reviewed-by: iustinp

c3f0a12f 07/23/2008 01:06 pm Iustin Pop

Distribute the queue serial file after each update

This patch adds distribution of the queue serial file after each write
to it (but before a new job is created and written with that ID, and
before a response is returned, so we should be safe from crashes in...

c4beba1c 07/23/2008 01:06 pm Iustin Pop

Make the job storage init reuse a serial file

This will be needed for master failover. If we don't have a valid queue
directory, we need to reinitialize it, but we should keep the existing
serial number.

As such, we abstract the reading of the serial and if we find a valid...

57f8615f 07/22/2008 05:05 pm Michael Hanselmann

Make argument to CleanCacheUnlocked mandatory

Not passing the argument means it has the value None. Iterating None
doesn't work:
>>> "123" in None
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: iterable argument required...

bac5ffc3 07/17/2008 03:51 pm Oleksiy Mishchenko

Implement jobs resource in RAPI

Reviewed-by: imsnah

f0d874fe 07/15/2008 01:49 pm Iustin Pop

Sort the job list in _GetJobIDsUnlocked

Since the IDs are integers, we can simply sort them.

Reviewed-by: imsnah

f1048938 07/14/2008 04:15 pm Iustin Pop

First version of user feedback fixes

This patch contains a raw version for fixing feedback_fn.

The new mechanism works as follows:
- instead of a per-Processor feedback_fn, there's one for each
ExecOpCode, so that feedback for different opcodes go via possibly...

ac0930b9 07/14/2008 02:27 pm Iustin Pop

Cache some jobs in memory

This patch adds a caching mechanisms to the JobStorage. Note that is
does not make the memory cache authoritative.

The algorithm is:
- all jobs loaded from disks are entered in the cache
- all new jobs are entered in the cache...

8a70e415 07/14/2008 02:12 pm Iustin Pop

Fix JobStorage._GetJobIDsUnlocked

The job ID returned must be an integer (and the regex enforces that),
but we didn't convert it manually.

Reviewed-by: imsnah

911a495b 07/14/2008 01:08 pm Iustin Pop

Change JobStorage to work with ids not filenames

Currently some of the functions in JobStorage work with filenames (which
is an implementation detail and should only be used when dealing with
the storage) and not with job IDs. We need to change this in order to...

f1da30e6 07/11/2008 07:17 pm Michael Hanselmann

Add experimental persistency to job queue

It's not perfect and it's not finished, but it's a start.

- Serial number is read only once, but written on each update
- Jobs are kept only on disk (caching will be implemented)

Reviewed-by: iustinp

af30b2fd 07/11/2008 01:25 pm Michael Hanselmann

Make "gnt-job list" work again

"gnt-job list" was broken after my recent changes in the RPC
between clients and the master. This patch makes it work again.

Reviewed-by: iustinp

307149a8 07/10/2008 03:29 pm Iustin Pop

Switch _QueuedOpCode to have their own lock

Right now, the queued opcode doesn't have a lock, and instead relies on
the parent QueuedJob's lock.

This is not good for logging feedback, so it's better to have a lock for
each queuedopcode.

Reviewed-by: ultrotter

7996a135 07/10/2008 03:17 pm Iustin Pop

Add a simple decorator for instance methods

This is just a simple, hardcoded decorator for object methods needing
synchronization on the _lock instance attribute.

Reviewed-by: ultrotter

c8549bfd 07/10/2008 12:22 pm Michael Hanselmann

jqueue: Log more information when running opcodes

Reviewed-by: iustinp

2467e0d3 07/09/2008 01:34 pm Michael Hanselmann

Remove old job queue code

Reviewed-by: iustinp

e2715f69 07/09/2008 01:33 pm Michael Hanselmann

Add very simple job queue

Reviewed-by: iustinp

5d3a153a 06/13/2008 01:14 pm Guido Trotter

Fix a typo in jqueue.py

s/result/op_result/ (this code was never used, so this wasn't caught)

Reviewed-by: iustinp

35049ff2 04/10/2008 06:36 pm Iustin Pop

Add per-opcode results to job processing

This patch changes the definition of a job and introduces per-opcode
results.

First, the result and status fields of a job are condensed into a single
'status' attribute. Then, we introduce an opcode status and one result...

283439c9 04/07/2008 02:18 pm Iustin Pop

Implement selective job query

This patch implements query-ing of only selected jobs instead of all.

Reviewed-by: ultrotter

7a1ecaed 04/04/2008 03:44 pm Iustin Pop

Add a simple gnt-job script

This patch adds a very basic gnt-job script that allows job querying.
This goes on top of the previous master daemon patches.

Currently, because of the not-changed cmd lock, you can't query the jobs
as long as a job is running - you have to rm the cmd lock and then you...

498ae1cc 04/01/2008 04:04 pm Iustin Pop

A dumb queue implementation

This patch adds a very dumb in-memory only queue implementation.

Reviewed-by: imsnah