Fix some issues related to job cancelling
This patch fixes two issues with the cancel mechanism:
- cancelled jobs show as such, and not in error state (we mark them as OP_STATUS_CANCELED and not OP_STATUS_ERROR)
- queued jobs which are cancelled don't raise errors in the master (we...
Fix single-job archiving (gnt-job archive)
This is simply a typo from the conversion to multi-job archiving.
Reviewed-by: imsnah
Update the logging output of job processing
(this is related to the master daemon log)
Currently it's not possible to follow (in the non-debug runs) the logical execution thread of jobs. This is due to the fact that we don't log the thread name (so we lose the association of log messages to jobs)...
Some docstring updates
This patch rewraps some comments to shorter lengths and changes double quotes to single quotes inside triple-quoted docstrings for better editor handling.
It also fixes some epydoc errors, namely invalid cross-references (after method rename), documentation for nonexistent (removed) parameters, etc....
Job queue: Allow more than one file rename per RPC call
Reviewed-by: ultrotter
ganeti.jqueue: Group job archivals to reduce number of RPC calls
Reducing the actual number of RPC calls will come in another patch.
Prevent RPC timeout on auto-archiving jobs
With a large job queue, auto-archiving jobs can take a very long time, causing timeouts on the luxi RPC layer. With this change, auto-archive returns after half of the RPC timeout has passed. The user will see how many jobs are left unchecked....
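The time-boxing idea described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, its parameters, and the injectable clock are hypothetical and are not taken from the patch.

```python
import time


def archive_until_deadline(jobs, archive_fn, rpc_timeout, clock=time.time):
    """Archive jobs until half of rpc_timeout has elapsed.

    Hypothetical helper: returns (archived_count, unchecked_count) so the
    caller can report how many jobs were left unchecked.
    """
    deadline = clock() + rpc_timeout / 2.0
    remaining = list(jobs)
    archived = 0
    while remaining:
        if clock() > deadline:
            # Stop early so the RPC call can return before it times out.
            break
        job = remaining.pop(0)
        if archive_fn(job):
            archived += 1
    return archived, len(remaining)
```

Passing the clock as a parameter is a design choice made here only to keep the sketch easy to test.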
jqueue: When auto-archiving jobs, calculate job status only once
This is done by passing the job object to _ArchiveJobUnlocked instead of only the job ID. Also return whether job was actually archived.
Use subdirectories for job queue archive
As it turned out, having many files in a single directory can be very painful. With this patch, only 10'000 files are stored in a directory for the job queue archive. With 10'000 directories, this allows for up to 100 million jobs to be archived without having large...
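A minimal sketch of the bucketing arithmetic, assuming the subdirectory is chosen by integer division of the job ID by 10'000. The constant and function names are hypothetical; the actual patch may derive the directory differently.

```python
JOBS_PER_ARCHIVE_DIRECTORY = 10000  # assumed name for the per-directory limit


def archive_directory(job_id):
    """Return the (hypothetical) archive subdirectory name for a job ID."""
    return str(int(job_id) // JOBS_PER_ARCHIVE_DIRECTORY)
```

Under this assumption, 10'000 directories holding 10'000 files each give the 100 million archived jobs mentioned above.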
Add job queue size limit
A job queue with too many jobs can increase memory usage and/or make the master daemon slow. The current limit is just an arbitrary number. A "soft" limit for automatic job archival is prepared.
Reviewed-by: iustinp
cleanup: exceptions should derive from Exception
Reviewed-by: amishchenko
Fix epydoc format warnings
This patch should fix all outstanding epydoc parsing errors; as such, we switch epydoc into verbose mode so that any new errors will be visible.
Restrict job propagation to master candidates only
This patch restricts the job propagation to master candidates only, by not registering non-candidates in the job queue node lists.
Note that we do intentionally purge the job queue if a node is toggled to non-master status....
jqueue: Always print message for 100% when inspecting queue
jqueue: Allow jobs waiting for locks to be canceled
- Add new "canceling" status
- Notify clients when job is canceled
- Give a return value from CancelJob
- Handle it in the client library
jqueue: fix a bug in an error path
Dictionaries raise KeyError, and not ValueError, when invalid keys are passed to del.
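The behaviour this fix relies on can be shown with a short example. The helper and the sample node data are hypothetical and only illustrate which exception an error path around del must catch.

```python
def remove_node(nodes, name):
    """Remove name from the nodes dict; return True if it was present."""
    try:
        del nodes[name]
    except KeyError:
        # del on a missing dictionary key raises KeyError, not ValueError.
        return False
    return True
```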
jqueue: Log progress and load jobs one by one
By logging more information, a user can see how far the inspection of the queue has progressed. This can be useful with a large number of jobs. Also, instead of loading all jobs in one go, load only the list of job IDs and then...
jqueue: Shutdown workerpool in case of a problem
jqueue: Always use rpc.RpcRunner
"from ganeti.rpc import RpcRunner" does not conform to the style guide.
Documentation updates for jqueue.py
Yet another bug found while reviewing docs
The newer_than variable can be either None or an int, and we normalize it to an integer previously and save it in the 'serial' variable, which should be used instead.
Convert the job queue rpcs to address-based
The two main multi-node job queue RPC calls (jobqueue_update, jobqueue_rename) are converted to address-based calls, in order to speed up queue changes. For this, we need to change the _nodes attribute on the jobqueue to be a dict {name: ip}, instead of a set....
Fix job queue behaviour when loading jobs
Currently, if loading a job fails, the job queue code raises an exception and prevents the proper processing of the jobs in the queue. We change this so that unparseable jobs are instead archived (if not already)....
Add an interface for the drain flag changes/query
This adds the set/reset in the jqueue and luxi modules, and a way to query it in OpQueryConfigValues, and also the command line interface for it:
$ gnt-cluster queue info
The drain flag is unset
$ gnt-cluster queue drain...
Implement the job queue drain flag
We add a (per-node) queue drain flag that blocks new job submission. There is not yet an interface to add/remove the flag (will come in next patches).
Convert rpc module to RpcRunner
This big patch changes the call model used in internode-rpc from standalone function calls in the rpc module to calls via an RpcRunner class, which holds all the methods. This can be used in the future to enable smarter processing in the RPC layer itself (some quick examples are not...
Implement job 'waiting' status
Background: when we have multiple jobs in the queue (more than just a few), many of the jobs (up to the number of threads) will be in state 'running', although many of them could be actually blocked, waiting for some locks. This is not good, as one cannot easily see what is...
Implement job auto-archiving
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1).
Increase the number of threads to 25
Since our locks are not gathered nicely, we can have jobs that are actually blocking on locks (parallel burnin shows this), so at least we need to increase the number of threads above the usual number of jobs we could have in such a case....
Enhance the job-related timestamps
This patch adds start, stop, and received timestamps for jobs (and allows querying of them), and allows querying of the opcode timestamps.
Add opcode execution log in job info
This patch adds the job execution log in “gnt-job info” and also allows its selection in “gnt-job list” (however here it's not very useful as it's not easy to parse). It does this by adding a new field in the query job call, named ‘oplog’....
Implement job summary in gnt-job list
It is not currently possible to show a summary of the job in the output of “gnt-job list”. The closest is listing the whole opcode(s), but that is too verbose. Also, the default output (id, status) is not very useful, unless one looks for (and knows about) an exact job ID....
Nicely sort the job list
Unless we decide to change the job identifiers to integers, we should at least sort the list returned by _GetJobIDsUnlocked.
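One way to sort string identifiers in numeric order is sketched below. It assumes the job IDs are strings of decimal digits; the helper name is hypothetical, and the patch itself may use a different sorting routine.

```python
def sort_job_ids(job_ids):
    """Sort string job IDs by numeric value (assumes decimal-digit IDs)."""
    return sorted(job_ids, key=int)
```

A plain string sort would place "10" before "9", which is why the numeric key is used here.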
jqueue: Add common RPC error handling function
We haven't decided yet what exactly it should do with failed nodes.
Make WaitForJobChanges deal with long jobs
This patch alters the WaitForJobChanges luxi-RPC call to have a configurable timeout, so that the call behaves nicely with long jobs that have no update.
We do this by adding a timeout parameter in the RPC call, and returning...
jqueue: Replace normal cache dict with weakref dict
A job should only exist once in memory. After the cache is cleaned, there can still be references to a job somewhere else. If there are multiple instances, one can get updated while a function is waiting for changes on another instance. By using...
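A rough sketch of the idea using the standard library's weakref.WeakValueDictionary. The Job and JobCache classes are hypothetical stand-ins and do not reproduce the real job queue code.

```python
import weakref


class Job(object):
    """Hypothetical stand-in for a queued job object."""

    def __init__(self, job_id):
        self.id = job_id


class JobCache(object):
    """Cache that never yields two live instances of the same job."""

    def __init__(self):
        # Entries vanish automatically once no other reference to the job
        # exists, so a job still in use elsewhere is never loaded twice.
        self._memcache = weakref.WeakValueDictionary()

    def get(self, job_id):
        job = self._memcache.get(job_id)
        if job is None:
            job = Job(job_id)  # stands in for loading the job from disk
            self._memcache[job_id] = job
        return job
```

As long as any caller holds a reference to a job, a lookup returns that same instance instead of creating a second copy.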
jqueue: Keep timestamp of opcode start and end
jqueue: Reset run_op_idx after job is done
It can be confusing otherwise.
Make sure that client programs get all messages
This is a large patch, but I can't figure out how to split it without breaking stuff. The old way of getting messages by always getting the last one didn't bring all messages to the client if they were added...
Add RPC call to wait for job changes
This way clients can react faster to status or message changes and don't have to poll anymore.
jqueue: Change log message time format
See the comment in the patch.
jqueue: Move archived jobs on all nodes
Otherwise one might have archived jobs back in the list after a master failover.
jstore: Change to not always require a lock
This way we can do locking when both noded and masterd are running on the same machine, the latter holding an exclusive lock on the queue.
jqueue: Use new job queue RPC functions
jqueue: Implement {Add,Remove}Node
These functions will be used to notify the queue about newly added or removed nodes.
jqueue: Don't pass the list of nodes to SubmitJob anymore
The job queue now maintains its own list and is updated when nodes are added or removed from the cluster.
Maintain node list in job queue
The code makes sure not to include the master in the list.
jqueue: Replicate jobs to all nodes
Newly added nodes are not yet taken care of. Queue locking on non-master nodes is not yet correct.
jqueue: Use new jstore module
jqueue: Move assert into decorator
This reduces code duplication. A later patch will modify the job queue a bit more and will need a change of this assert. The assertion is also removed from all class-internal functions.
jqueue: Store context in job queue instead of worker pool
The job queue will need access to the configuration, which is provided through the context object, to get a list of nodes.
Fix pylint-detected issues
This is mostly:
- whitespace fix (space at EOL in some files, not all, broken indentation, etc)
- variable names overriding others (one is a real bug in there)
- too-long-lines
- cleanup of most unused imports (not all)...
Rewrite job queue
We found several issues in the old job queue implementation. It had race conditions, deadlocks and other deficiencies.
Short summary:
- _QueuedOpCode and _QueuedJob are now more or less data structures with a few utility functions. __Setup is gone....
jqueue: Fix error logging
The passed parameters were not correct.
Reviewed-by: iustinp, ultrotter
Implement job canceling on server side
Locking is not completely right due to a deadlock when the job calls UpdateJob after changing its status.
Add “canceled” status for opcodes
Move code extracting job ID into function
It might come in handy at some point and makes the code a bit easier to read.
Implement job archiving on the server side
So far no error reporting to the client is done. Clients don't get notified if a job doesn't exist or couldn't be archived because of its current status.
The internal cache is always cleaned when the preconditions didn't...
Add directory for archived jobs
Move code formatting job ID into a base class
A later patch will add a memory based job storage class, hence this code is going into a separate class. It also changes the number format to always use at least 10 digits, allowing up to 9'999'999'999 jobs to...
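A minimal sketch of a fixed-width number format, assuming "at least 10 digits" means zero-padding on the left. The function name is hypothetical and the patch may format IDs differently.

```python
def format_job_id(serial):
    """Format a serial number with at least 10 digits (assumed zero-padded)."""
    return "%010d" % serial
```

With such padding, the largest ID that still fits in 10 digits is 9'999'999'999; larger numbers simply use more digits.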
Rename JobStorage to DiskJobStorage
Fix logging with string job IDs
The job ID is now a string, hence logging must use %s instead of %d.
Make job ID a string
The docstring says that _NewSerialUnlocked returns “a string representing the job identifier”. Until now it returned an integer and this patch changes it.
Distribute the queue serial file after each update
This patch adds distribution of the queue serial file after each write to it (but before a new job is created and written with that ID, and before a response is returned, so we should be safe from crashes in...
Make the job storage init reuse a serial file
This will be needed for master failover. If we don't have a valid queue directory, we need to reinitialize it, but we should keep the existing serial number.
As such, we abstract the reading of the serial and if we find a valid...
Make argument to CleanCacheUnlocked mandatory
Not passing the argument means it has the value None. Iterating None doesn't work:
>>> "123" in None
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: iterable argument required...
Implement jobs resource in RAPI
Sort the job list in _GetJobIDsUnlocked
Since the IDs are integers, we can simply sort them.
First version of user feedback fixes
This patch contains a raw version for fixing feedback_fn.
The new mechanism works as follows:
- instead of a per-Processor feedback_fn, there's one for each ExecOpCode, so that feedback for different opcodes go via possibly...
Cache some jobs in memory
This patch adds a caching mechanism to the JobStorage. Note that it does not make the memory cache authoritative.
The algorithm is:
- all jobs loaded from disks are entered in the cache
- all new jobs are entered in the cache...
Fix JobStorage._GetJobIDsUnlocked
The job ID returned must be an integer (and the regex enforces that), but we didn't convert it manually.
Change JobStorage to work with ids not filenames
Currently some of the functions in JobStorage work with filenames (which is an implementation detail and should only be used when dealing with the storage) and not with job IDs. We need to change this in order to...
Add experimental persistency to job queue
It's not perfect and it's not finished, but it's a start.
- Serial number is read only once, but written on each update
- Jobs are kept only on disk (caching will be implemented)
Make "gnt-job list" work again
"gnt-job list" was broken after my recent changes in the RPC between clients and the master. This patch makes it work again.
Switch _QueuedOpCode to have their own lock
Right now, the queued opcode doesn't have a lock, and instead relies on the parent QueuedJob's lock.
This is not good for logging feedback, so it's better to have a lock for each queued opcode.
Add a simple decorator for instance methods
This is just a simple, hardcoded decorator for object methods needing synchronization on the _lock instance attribute.
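A minimal sketch of what such a decorator could look like. Only the _lock attribute name comes from the commit message; the decorator name and the Counter class are hypothetical, and the original code may be written differently.

```python
import functools
import threading


def locked(fn):
    """Run the wrapped method while holding the instance's _lock."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        with self._lock:
            return fn(self, *args, **kwargs)
    return wrapper


class Counter(object):
    """Hypothetical class used only to demonstrate the decorator."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    @locked
    def increment(self):
        self.value += 1
        return self.value
```

The lock is released when the method returns or raises, because the with statement handles both cases.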
jqueue: Log more information when running opcodes
Remove old job queue code
Add very simple job queue
Fix a typo in jqueue.py
s/result/op_result/ (this code was never used, so this wasn't caught)
Add per-opcode results to job processing
This patch changes the definition of a job and introduces per-opcode results.
First, the result and status fields of a job are condensed into a single 'status' attribute. Then, we introduce an opcode status and one result...
Implement selective job query
This patch implements querying of only selected jobs instead of all.
Add a simple gnt-job script
This patch adds a very basic gnt-job script that allows job querying. This goes on top of the previous master daemon patches.
Currently, because of the not-changed cmd lock, you can't query the jobs as long as a job is running - you have to rm the cmd lock and then you...
A dumb queue implementation
This patch adds a very dumb in-memory only queue implementation.