jqueue: remove the _big_jqueue_lock module global
By using ssynchronized in the new way, we can remove the module-global _big_jqueue_lock and revert back to an internal _lock inside the jqueue.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Share the jqueue lock on job-local changes
We can share the jqueue lock when we do per-job updates. These only conflict with updates/checks on the same job from another thread (e.g. CancelJob, ArchiveJob, which keep the lock unshared, since they are less frequent)....
_OpExecCallbacks abstract _AppendFeedback
Move some code to a decorated function rather than explicitly acquiring/releasing the lock in AppendFeedback.
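A minimal sketch of the decorator pattern this entry describes, assuming a plain `threading.Lock` stored in a `_lock` attribute. The names `synchronized`, `FeedbackLog` and `append_feedback` are hypothetical; this is not Ganeti's `ssynchronized` or `_OpExecCallbacks`.

```python
import functools
import threading


def synchronized(lock_attr):
    """Return a decorator that holds self.<lock_attr> while the method runs."""
    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            lock = getattr(self, lock_attr)
            with lock:  # released even if the method raises
                return method(self, *args, **kwargs)
        return wrapper
    return decorator


class FeedbackLog:
    """Hypothetical stand-in for an object that collects feedback messages."""

    def __init__(self):
        self._lock = threading.Lock()
        self.entries = []

    @synchronized("_lock")
    def append_feedback(self, message):
        # No explicit acquire()/release() pair is needed in the method body.
        self.entries.append(message)
        return len(self.entries)
```

The locking policy sits in one place (the decorator), so the method body contains only the work that must happen under the lock.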
jqueue: convert to a SharedLock()
Remove the jqueue _lock member and convert to a _big_jqueue_lock sharedlock. This allows smooth transition from the old single lock to a more granular approach.
MarkUnfinishedOps: update job file on disk
Every time we call MarkUnfinishedOps we do it in a try/finally block that updates the job file. With this patch we move the try/finally inside. CancelJobUnlocked is removed, because it just becomes a wrapper over MarkUnfinishedOps with two constant values....
Remove spurious empty line
Remove job object condition
We don't need it anymore, since nobody waits on it.
Parallelize WaitForJobChanges
As for QueryJobs we rely on file updates rather than condition notification to acquire job changes. In order to do that we use the pyinotify module to watch files. This might make the client a bit slower (pending planned improvements, such as subscription-based...
Update the job file on feedback
This is needed to convert waitforjobchanges to use inotify and the on-disk version and decouple it from the job queue lock. No replication to remote nodes is done, to keep the operation fast.
Signed-off-by: Guido Trotter <ultrotter@google.com>...
Don't lock on QueryJobs, by using the disk version
We move from querying the in-memory version to loading all jobs from the disk. Since the jobs are written/deleted on disk in an atomic manner, we don't need to lock at all. Also, since we're just looking at the...
Add JobQueue.SafeLoadJobFromDisk
This will be used to read a job file without having to deal with exceptions from _LoadJobFromDisk.
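A minimal sketch of the two ideas in these entries: writing a job file atomically so that lock-free readers never see a partial file, and a "safe" loader that turns read or parse failures into `None`. The function names and the JSON format are assumptions made for illustration; this is not the actual `SafeLoadJobFromDisk` code.

```python
import json
import os
import tempfile


def write_job_file(path, data):
    """Write data as JSON via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
        os.replace(tmp_path, path)  # readers see the old or the new file
    except BaseException:
        # Do not leave the temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def safe_load_job_file(path):
    """Return the parsed job, or None if the file is missing or unreadable."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (OSError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return None
```

Callers then only need to check for `None` instead of wrapping every read in their own exception handling.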
jqueue._LoadJobFromDisk: remove safety archival
Currently _LoadJobFromDisk archives job files it finds corrupted. Since we want to use it to load files without holding locks, this could cause a conflict: we just move the feature to _LoadJobUnlocked which is always...
jqueue.AddManyJobs: use AddManyTasks
Rather than adding the jobs to the worker pool one at a time, we add them all together, which is slightly faster, and ensures they don't get started while we loop.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
jqueue: make replication on job update optional
Sometimes it's useful to write to the local filesystem, but immediate replication to all master candidates is not needed.
The _WriteAndReplicateFileUnlocked function gets renamed to _UpdateJobQueueFile, as calling "write and replicate, but don't...
s/queue._GetJobInfoUnlocked/job.GetInfo/
The job queue currently has a static _GetJobInfoUnlocked method. Change it to be a normal method of _QueuedJob, which makes more sense.
Abstract loading job file from disk
Move the work from _LoadJobUnlocked to _LoadJobFileFromDisk, which can then be used in other contexts as well. Also, if we fail to deserialize the job, archive it as well (before we archived it only if we failed to create the related object, but kept it there if deserialization failed....
jqueue: simplify removal from _nodes
Somewhere we do try/del/except and somewhere just pop. Using pop everywhere saves lines of code.
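The two removal styles compared here can be sketched as follows, assuming a dict-like container of nodes. The function names and sample addresses are illustrative and are not taken from jqueue.py.

```python
def remove_node_verbose(nodes, name):
    """Remove name using the try/del/except form."""
    try:
        del nodes[name]
    except KeyError:
        pass  # the node was not registered; nothing to do


def remove_node_pop(nodes, name):
    """Same effect in one line: pop with a default never raises KeyError."""
    nodes.pop(name, None)
```

Both functions leave the dictionary unchanged when the key is absent, so the shorter form is a drop-in replacement.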
ListVisibleFiles: do not sort output
Among all users, turns out just one may need the output to be sorted. All the others can cope without.
Cache a few bits of status in jqueue
Currently each time we submit a job we check the job queue size, and the drained file. With this change we keep these pieces of information in memory and don't read them from the filesystem each time.
Significant changes include:...
Fix a TODO in _QueuedJob
Rather than raising Exception use GenericError and explain a bit better what happened.
Remove unused parameter from function
This also removes the relevant pylint disable. No point in keeping unused parameters around: if/when we need them it's easy to add them back.
Optimize _GetJobIDsUnlocked
Currently we sort the list of job queue files twice (once in utils.ListVisibleFiles with sort and then later with NiceSort). We apply the _RE_JOB_FILE regular expression twice (once in _ListJobFiles and once in _ExtractJobID). This simplifies the code a little, and a couple...
jqueue: Rename _queue_lock to _queue_filelock
The name clarifies the difference between this and the internal lock. Also explain a bit better what it is.
Remove the job queue drain rpc call
This call was introduced but never used. In two years. Since it's just creating/removing a file it can also be done in simpler ways, without a special rpc call, if/when we need it again. In the meantime, let's give it to history....
Add a new opcode timestamp field
Since the current start_timestamp opcode attribute refers to the initial start time, before locks are acquired, it's not useful to determine the actual execution order of two opcodes/jobs competing for the same lock.
This patch adds a new field, exec_timestamp, that is updated when the...
Switch more code to PathJoin
This should remove most of the remaining constructs which can be replaced by PathJoin.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Switch from os.path.join to utils.PathJoin
This passes a full burnin with lots of instances, and should be safe as we mostly join a known root (various constants) to a run-time variable.
jqueue: Don't return negative number for unchecked jobs when archiving
When the queue was empty, the calculation for unchecked jobs while archiving would return -1. ``last_touched`` is set to 0, the job ID list (``all_job_ids``) is empty. Calculating ``len(all_job_ids) -...
Improve logging for workerpool tasks by providing repr
Before it would log something like “starting task (<ganeti.http.client._HttpClientPendingRequest object at 0x2aaaad176790>,)”, which isn't really useful for debugging. Now it'll log “[…]<ganeti.http.client._HttpClientPendingRequest...
workerpool: Simplify log messages
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
workerpool: Make worker ID alphanumeric
Having a proper name instead of just a number makes debugging easier.
Further pylint disables, mostly for Unused args
Many of our functions have to follow a given API, and thus we have to keep a given signature, but pylint doesn't understand this. Therefore, we silence this warning.
The patch does a few other cleanups.
Signed-off-by: Iustin Pop <iustin@google.com>...
jqueue/_CheckRpcResult: log the whole operation
Currently only the rpc call, but not its description (which also shows the argument) is logged. We change this to log failmsg too, and this also silences a warning.
Convert to static methods (where appropriate)
Many methods are simple pure functions, and not depending on the object state. We convert these to staticmethods.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Olivier Tharan <olive@google.com>
Add targeted pylint disables
This patch should have only:
- pylint disables
- docstring changes
- whitespace changes
Remove quotes from CommaJoin and convert to it
This patch removes the quotes from CommaJoin and converts most of the callers (that I could find) to it. Since CommaJoin does str(i) for i in param, we can remove these, thus simplifying slightly a few calls....
Processor: support a unique execution id
When the processor is executing a job, it can export the execution id to its callers. This is not supported for Queries, as they're not executed in a job.
Fix pylint 'E' (error) codes
This patch adds some silences and tweaks the code slightly so that “pylint --rcfile pylintrc -e ganeti” doesn't give any errors.
The biggest change is in jqueue.py, the move of _RequireOpenQueue out of the JobQueue class. Since that is actually a function and not a method...
jqueue: Convert to utils.Retry
Code and docstring style fixes
Found using pylint and epydoc.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Remove RpcResult.RemoteFailMsg completely
jqueue: Remove unused run_op_index attribute
Export new lock_status field to gnt-job
Keep lock status with every job
This can be useful for debugging locking problems.
Move OpCode processor callbacks into separate class
There are two major arguments for this:
- There will be more callbacks (e.g. for lock debugging) and extending the parameter list is a lot of work.
- In the jqueue module this allows us to keep per-job or per-opcode variables in...
Merge commit 'origin/next' into branch-2.1
Optimise multi-job submit
Currently, on multi-job submits we simply iterate over the single-job-submit function. This means we grab a new serial, write and replicate (and wait for the remote nodes to ack) the serial file, and only then create the job file; this is repeated N times, once for each...
Use ReadFile/WriteFile in more places
This survived QA, burnin and unittests.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Luca Bigliardi <shammash@google.com>
Encode the actual exception raised by LU execution
Currently, the actual exception raised during an LU execution (one of OpPrereqError, OpExecError, HooksError, etc.) is lost because the jqueue.py code simply sets that to a str(err), and the code in cli.py...
Merge branch 'next' into branch-2.1
jqueue: Fix error when WaitForJobChange gets invalid ID
When JobQueue.WaitForJobChange gets an invalid or no longer existing job ID it tries to return job_info and log_entries, both of which aren't defined yet.
Signed-off-by: Michael Hanselmann <hansmi@google.com>...
jqueue: Update message for cancelling running job
Conflicts: lib/cli.py: trivial extra empty line
job queue: fix loss of finalized opcode result
Currently, unclean master daemon shutdown overwrites all of a job's opcode status and result with error/None. This is incorrect, since any already finished opcode(s) should have their status and result preserved, and only not-yet-processed opcodes should be marked as...
Add a luxi call for multi-job submit
As a workaround for the job submit timeouts that we have, this patch adds a new luxi call for multi-job submit; the advantage is that all the jobs are added in the queue and only afterwards can the workers start processing them....
job queue: fix interrupted job processing
If a job with more than one opcode is being processed, and the master daemon crashes between two opcodes, we have the first N opcodes marked successful, and the rest marked as queued. This means that the overall...
Fix an error path in job queue worker's RunTask
In case the job fails, we try to set the job's run_op_idx to -1. However, this is a wrong variable, which wasn't detected until the slots addition. The correct variable is run_op_index.
Add slots on objects in jqueue
Adding slots to _QueuedOpCode decreases memory usage (of these objects) by roughly four times. It is a lesser change for _QueuedJobs.
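A small illustration of what `__slots__` changes, using hypothetical classes rather than the real `_QueuedOpCode`. A slotted class has no per-instance `__dict__`, which is where the memory saving comes from, and it rejects assignments to attribute names that were not declared.

```python
class OpCodeWithDict:
    """Regular class: every instance carries its own __dict__."""

    def __init__(self, status, result):
        self.status = status
        self.result = result


class OpCodeWithSlots:
    """Slotted class: the attribute set is fixed, no per-instance __dict__."""

    __slots__ = ["status", "result"]

    def __init__(self, status, result):
        self.status = status
        self.result = result


plain = OpCodeWithDict("queued", None)
slotted = OpCodeWithSlots("queued", None)

# A misspelled attribute silently creates a new entry on a regular object...
plain.reslt = "typo goes unnoticed"

# ...but raises AttributeError on a slotted one.
try:
    slotted.reslt = "typo"
except AttributeError as err:
    print("rejected:", err)
```

The second behaviour is the kind of check that exposes a misspelled attribute name such as the run_op_idx/run_op_index mix-up mentioned in the entry above.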
Fix some typos
Convert the jobqueue rpc to new style result
This patch converts the job queue rpc calls to the new style result. It's done in a single patch as there are helper functions (in both jqueue and backend) that are used by multiple rpcs and need synchronized change....
job queue: log the opcode error too
Currently we only log "Error in opcode ...", but we don't log the error itself. This is not good for debugging.
Reviewed-by: ultrotter
Fix some issues related to job cancelling
This patch fixes two issues with the cancel mechanism:
- cancelled jobs show as such, and not in error state (we mark them as OP_STATUS_CANCELED and not OP_STATUS_ERROR)
- queued jobs which are cancelled don't raise errors in the master (we...
Fix single-job archiving (gnt-job archive)
This is simply a typo from the conversion to multi-job archiving.
Reviewed-by: imsnah
Update the logging output of job processing
(this is related to the master daemon log)
Currently it's not possible to follow (in the non-debug runs) the logical execution thread of jobs. This is due to the fact that we don't log the thread name (so we lose the association of log messages to jobs)...
Some docstring updates
This patch rewraps some comments to shorter lengths, changes double-quotes to single-quotes inside triple-quoted docstrings for better editor handling.
It also fixes some epydoc errors, namely invalid crossreferences (after method rename), documentation for nonexistent (removed) parameters, etc....
Job queue: Allow more than one file rename per RPC call
ganeti.jqueue: Group job archivals to reduce number of RPC calls
Reducing the actual number of RPC calls will come in another patch.
Prevent RPC timeout on auto-archiving jobs
With a large job queue, auto-archiving jobs can take a very long time, causing timeouts on the luxi RPC layer. With this change, auto-archive returns after half of the RPC timeout has passed. The user will see how many jobs are left unchecked....
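A minimal sketch of a time-bounded archive loop of the kind described here: stop once half of the RPC timeout has elapsed and report how many jobs were left unchecked. The function name, its parameters and the injectable `clock` are assumptions made for illustration, not the jqueue.py implementation.

```python
import time


def archive_until_deadline(job_ids, archive_one, rpc_timeout,
                           clock=time.monotonic):
    """Archive jobs until half of rpc_timeout (in seconds) has elapsed.

    job_ids is a list, archive_one(job_id) returns True if the job was
    archived. Returns a tuple (archived_count, unchecked_count).
    """
    deadline = clock() + rpc_timeout / 2.0
    archived = 0
    checked = 0
    for job_id in job_ids:
        if clock() >= deadline:
            break  # leave the remaining jobs for a later call
        checked += 1
        if archive_one(job_id):
            archived += 1
    # Counting checked jobs directly keeps the result non-negative,
    # including for an empty queue.
    return archived, len(job_ids) - checked
```

A caller can simply invoke the function again while the unchecked count is non-zero, so no single call exceeds the RPC timeout.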
jqueue: When auto-archiving jobs, calculate job status only once
This is done by passing the job object to _ArchiveJobUnlocked instead of only the job ID. Also return whether job was actually archived.
Use subdirectories for job queue archive
As it turned out, having many files in a single directory can be very painful. With this patch, only 10'000 files are stored in a directory for the job queue archive. With 10'000 directories, this allows for up to 100 million jobs to be archived without having large...
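One way to realise such a layout is to derive the subdirectory from the job ID by integer division. This is a sketch under that assumption; the mapping, the function names and the `job-<id>` file name are illustrative, since the entry only states the 10'000-files-per-directory limit.

```python
import os

JOBS_PER_ARCHIVE_DIRECTORY = 10000


def archive_directory(job_id):
    """Return the subdirectory name holding the given job ID."""
    return str(int(job_id) // JOBS_PER_ARCHIVE_DIRECTORY)


def archived_job_path(archive_root, job_id):
    """Return the full path of an archived job file (hypothetical naming)."""
    return os.path.join(archive_root, archive_directory(job_id),
                        "job-%s" % job_id)
```

With this mapping, jobs 0 to 9999 land in directory `0`, jobs 10000 to 19999 in directory `1`, and so on; 10'000 directories of 10'000 files each give the 100 million jobs mentioned above.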
Add job queue size limit
A job queue with too many jobs can increase memory usage and/or make the master daemon slow. The current limit is just an arbitrary number. A "soft" limit for automatic job archival is prepared.
Reviewed-by: iustinp
cleanup: exceptions should derive from Exception
Reviewed-by: amishchenko
Fix epydoc format warnings
This patch should fix all outstanding epydoc parsing errors; as such, we switch epydoc into verbose mode so that any new errors will be visible.
Restrict job propagation to master candidates only
This patch restricts the job propagation to master candidates only, by not registering non-candidates in the job queue node lists.
Note that we do intentionally purge the job queue if a node is toggled to non-master status....
jqueue: Always print message for 100% when inspecting queue
jqueue: Allow jobs waiting for locks to be canceled
- Add new "canceling" status
- Notify clients when job is canceled
- Give a return value from CancelJob
- Handle it in the client library
jqueue: fix a bug in an error path
Dictionaries raise KeyError, and not ValueError, when invalid keys are passed to del.
jqueue: Log progress and load jobs one by one
By logging more information, a user can see how far it is in inspecting the queue. This can be useful with a large number of jobs. Also, instead of loading all jobs in one go, load only the list of job IDs and then...
jqueue: Shutdown workerpool in case of a problem
jqueue: Always use rpc.RpcRunner
"from ganeti.rpc import RpcRunner" does not conform to the style guide.
Documentation updates for jqueue.py
Yet another bug found while reviewing docs
The newer_than variable can be either None or an int, and we normalize it to an integer previously and save it in the 'serial' variable, which should be used instead.
Convert the job queue rpcs to address-based
The two main multi-node job queue RPC calls (jobqueue_update, jobqueue_rename) are converted to address-based calls, in order to speed up queue changes. For this, we need to change the _nodes attribute on the jobqueue to be a dict {name: ip}, instead of a set....
Fix job queue behaviour when loading jobs
Currently, if loading a job fails, the job queue code raises an exception and prevents the proper processing of the jobs in the queue. We change this so that unparseable jobs are instead archived (if not already)....
Add an interface for the drain flag changes/query
This adds the set/reset in the jqueue and luxi modules, and a way to query it in OpQueryConfigValues, and also the command line interface for it:
$ gnt-cluster queue info
The drain flag is unset
$ gnt-cluster queue drain...
Implement the job queue drain flag
We add a (per-node) queue drain flag that blocks new job submission. There is not yet an interface to add/remove the flag (will come in next patches).
Convert rpc module to RpcRunner
This big patch changes the call model used in internode-rpc from standalone function calls in the rpc module to calls via an RpcRunner class that holds all the methods. This can be used in the future to enable smarter processing in the RPC layer itself (some quick examples are not...
Implement job 'waiting' status
Background: when we have multiple jobs in the queue (more than just a few), many of the jobs (up to the number of threads) will be in state 'running', although many of them could be actually blocked, waiting for some locks. This is not good, as one cannot easily see what is...
Implement job auto-archiving
This patch adds a new luxi call that implements auto-archiving of jobs older than a certain age (or -1 for all completed jobs), and the gnt-job command that makes use of this (with 'all' for -1).
Increase the number of threads to 25
Since our locks are not gathered nicely, we can have jobs that are actually blocking on locks (parallel burnin shows this), so at least we need to increase the number of threads above the usual number of jobs we could have in such a case....
Enhance the job-related timestamps
This patch adds start, stop, and received timestamp for jobs (and allows querying of them), and allows querying of the opcode timestamps.
Add opcode execution log in job info
This patch adds the job execution log in “gnt-job info” and also allows its selection in “gnt-job list” (however here it's not very useful as it's not easy to parse). It does this by adding a new field in the query job call, named ‘oplog’....
Implement job summary in gnt-job list
It is not currently possible to show a summary of the job in the output of “gnt-job list”. The closest is listing the whole opcode(s), but that is too verbose. Also, the default output (id, status) is not very useful, unless one looks for (and knows about) an exact job ID....
Nicely sort the job list
Unless we decide to change the job identifiers to integer, we should at least sort the list returned by _GetJobIDsUnlocked.
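To illustrate why string job identifiers need a "nice" sort, here is a minimal natural-sort key. It is a generic sketch with a hypothetical `natural_key` helper, not the code of Ganeti's `NiceSort`.

```python
import re

_DIGITS = re.compile(r"(\d+)")


def natural_key(text):
    """Split text into non-digit and digit runs; digit runs compare as ints."""
    parts = _DIGITS.split(text)
    # With one capturing group, odd positions hold the digit runs.
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]


job_ids = ["10", "9", "100", "1"]
print(sorted(job_ids))                   # ['1', '10', '100', '9']
print(sorted(job_ids, key=natural_key))  # ['1', '9', '10', '100']
```

A plain string sort compares character by character, so "10" sorts before "9"; the key function restores numeric order without converting the identifiers themselves to integers.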
jqueue: Add common RPC error handling function
We didn't decide yet what exactly it should do with failed nodes.
Make WaitForJobChanges deal with long jobs
This patch alters the WaitForJobChanges luxi-RPC call to have a configurable timeout, so that the call behaves nicely with long jobs that have no update.
We do this by adding a timeout parameter in the RPC call, and returning...
jqueue: Replace normal cache dict with weakref dict
A job should only exist once in memory. After the cache is cleaned, there can still be references to a job somewhere else. If there are multiple instances, one can get updated while a function is waiting for changes on another instance. By using...
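A minimal sketch of a weak-reference cache, using the standard library's `weakref.WeakValueDictionary`. The `Job` and `JobCache` classes are hypothetical stand-ins, not the jqueue.py classes.

```python
import gc
import weakref


class Job:
    """Hypothetical job object; instances must support weak references."""

    def __init__(self, job_id):
        self.id = job_id


class JobCache:
    """Hand out a single shared Job instance per ID while it is in use."""

    def __init__(self):
        self._memcache = weakref.WeakValueDictionary()

    def get(self, job_id):
        job = self._memcache.get(job_id)
        if job is None:
            job = Job(job_id)  # stand-in for loading the job from disk
            self._memcache[job_id] = job
        return job

    def __len__(self):
        return len(self._memcache)
```

As long as any caller still holds a reference, `get` returns that same object, so two code paths cannot end up with diverging copies of one job. Once the last strong reference is dropped, the entry disappears on its own and no explicit cache cleaning is needed.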
jqueue: Keep timestamp of opcode start and end