noded: Add RPC function to rename job queue files
This will be used to archive jobs.
Reviewed-by: iustinp
backend: Add function to check whether file is in queue dir
Another function will need to check whether its parameters are job queue files.
Two small style fixes
jstore: Change to not always require a lock
This way we can do locking when both noded and masterd are running on the same machine, the latter holding an exclusive lock on the queue.
Log only unexpected errors in utils.FileLock
Otherwise users might be confused by errors in log files.
Disallow uploading job queue files through upload_file
The job queue is now updated through its own RPC functions.
jqueue: Use new job queue RPC functions
Add job queue RPC functions
jobqueue_update: Uploads a job queue file's content to a node. The most common operation is to upload something that we already have in a string. Unlike the upload_file function, the file is not read again when distributing changes, but content has to be passed...
Move function cleaning directory to module level
JobQueuePurge() will be used by an RPC function.
Fix cli.PollJob
feedback_fn wasn't passed to it.
Implement {Add,Readd,Remove}Node in GanetiContext
By doing this we have a central place which coordinates what needs to be done when adding or removing nodes. Another patch will add calls into the job queue.
Two log messages are moved to config.py.
When removing a node, node_leave_cluster is now called after it has...
jqueue: Implement {Add,Remove}Node
These functions will be used to notify the queue about newly added or removed nodes.
jqueue: Don't pass the list of nodes to SubmitJob anymore
The job queue now maintains its own list and is updated when nodes are added to or removed from the cluster.
Maintain node list in job queue
The code makes sure not to include the master in the list.
Clean job queue directories when leaving cluster
Old job files shouldn't be left on nodes removed from a cluster.
Implement query for nodes
Implement query for instances
Queries don't create jobs and are more efficient. Log messages are not yet stored anywhere.
jqueue: Replicate jobs to all nodes
Newly added nodes are not yet taken care of. Queue locking on non-master nodes is not yet correct.
jqueue: Use new jstore module
jstore: Add queue helper functions
This will be used to move common code out of jqueue.
Implement job submission for scripts
This patch adds the infrastructure for executing a job in the background, instead of in the foreground, via a new “--submit” option. With this option the job ID is printed and the script exits immediately.
The patch also converts gnt-node list to this model (yes, this will be a...
jqueue: Move assert into decorator
This reduces code duplication. A later patch will modify the job queue a bit more and will need a change to this assert. The assertion is also removed from all class-internal functions.
Split cli.SubmitOpCode in two parts
The current SubmitOpCode function is not flexible enough to be used for submitters that don't want to wait for the job to finish.
The patch splits this in two, a SendJob function and a PollJob one, and the old SubmitOpCode becomes a wrapper. Note that the new SendJob takes...
Allow job queue files to be uploaded through ganeti-noded
This is needed for job queue replication.
Add FileLock utility class
This class is a wrapper around fcntl.flock and abstracts opening and closing the lock file. It will be used for the job queue.
(The patch also removes a duplicate import of tempfile into the unittest)
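A minimal sketch of such a wrapper follows. It assumes a path-based constructor and the method names Exclusive, Shared, Unlock and Close, which are chosen here for illustration and are not taken from the patch. Only the use of fcntl.flock and the fact that the class opens and closes the lock file itself come from the description above.

```python
import fcntl
import os


class FileLock:
    """Hypothetical sketch: owns a lock file descriptor and wraps fcntl.flock."""

    def __init__(self, path):
        # Open (creating if needed) the lock file; flock operates on a descriptor
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)

    def _Lock(self, flag, blocking):
        if not blocking:
            flag |= fcntl.LOCK_NB
        try:
            fcntl.flock(self._fd, flag)
        except BlockingIOError:
            # Raised only with LOCK_NB when a conflicting lock is already held
            return False
        return True

    def Exclusive(self, blocking=True):
        return self._Lock(fcntl.LOCK_EX, blocking)

    def Shared(self, blocking=True):
        return self._Lock(fcntl.LOCK_SH, blocking)

    def Unlock(self):
        fcntl.flock(self._fd, fcntl.LOCK_UN)

    def Close(self):
        if self._fd is not None:
            os.close(self._fd)  # closing the descriptor also releases its lock
            self._fd = None
```

flock locks belong to the open file description. On Linux, two separate opens of the same file conflict even within one process, so the non-blocking mode lets a caller test for a held lock without hanging.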
jqueue: Store context in job queue instead of worker pool
The job queue will need access to the configuration, which is provided through the context object, to get a list of nodes.
RAPI: Implement DELETE for tags
Reviewed-by: imsnah
First write operation (add tag) for Ganeti RAPI
Add instance tag handling and improved error logging. Oh, and adapt instance listing for RAPI2!
Fix cluster destroy
With the recent startup/shutdown changes (and with the master daemon in place), the cluster destroy needs some fixing.
This patch moves the finalization of the destroy out of cmdlib into bootstrap, so we can nicely shut down the rapi and master daemons....
Xen: remove two end-of-line semicolons
It's python, isn't it?
Fix cluster init
With the recent changes, I forgot the extra parameter to this rpc call. Also, the rpc call needs to be done after we set up the config data, for the master daemon to be able to start, so we move it after all other init steps.
Reviewed-by: ultrotter
Make gnt-* commands fail nicely on non-masters
This patch adds a check that we are on the master after failing to connect to the socket, and nicely logs the master name.
Parallelize LUFailoverInstance
ChainOpCode is still BGL-only
Prevent mistakes with an assert.
Fix a misuse of exc_info in logging.info
This is my fault, sorry.
Fix pylint-detected issues
This is mostly:
 - whitespace fixes (space at EOL in some files, not all, broken indentation, etc.)
 - variable names overriding others (one is a real bug in there)
 - too-long lines
 - cleanup of most unused imports (not all)...
Fix some errors detected by pylint
Unify SetupDaemon/SetupLogging
The 'old-style' info, error, debug logs do not make much sense. This patch unifies the SetupLogging and SetupDaemon functions. As a result, all the commands log to a 'commands.log' file.
The patch also changes the log setup to keep going if there's an error...
Simplify the log constants and add another one
The patch changes the log constants by moving the slash to the end of the log dir instead of the beginning of each log file name.
It also adds a new LOG_COMMANDS constant (to be used in a next patch)....
Parallelize {Startup,Shutdown,Reboot}Instance
Parallelize LUReinstallInstance
self.recalculate_locks[locking.LEVEL_NODE] could have any value and everything would work anyway. We'll use the string 'replace' by convention because in the future we might want an 'append' mode.
LogicalUnit._LockInstancesNodes helper function
This function is used to lock instances' primary and secondary nodes after locking the instances themselves.
Make sharing locks possible
LUs can declare which locks they need by populating the self.needed_locks dictionary, but those locks are always acquired as exclusive. Make it possible to acquire shared locks as well, by declaring a particular level as shared in the self.share_locks...
Add LogicalUnit.DeclareLocks
This additional LogicalUnit function is optional to implement, but lets you change your locking needs for one level just before locking it, after the previous levels have already been locked. It is useful, for example, to calculate what nodes to lock after locking an instance....
LURenameInstance, add/remove relevant locks
LURenameInstance forgot to remove the old lock name and add the new one, making it impossible for parallel LUs to act on the instance (without a master daemon restart). This also fixes burnin+rename with the parallelization of {Start,Stop}Instance....
Rewrite job queue
We found several issues in the old job queue implementation. It had race conditions, deadlocks and other deficiencies.
Short summary:
- _QueuedOpCode and _QueuedJob are now more or less data structures with a few utility functions. __Setup is gone....
workerpool: Log when waiting for a thread
Rework master startup/shutdown/failover
This (big) patch reworks the master startup/shutdown and fixes the master failover.
What does the patch do?
For master start/stop:
 - removes the old ganeti-master script and its associated man page
 - moves the ip start/stop directly into the backend.(Start|Stop)Master...
Expose utils.DaemonPidFileName
Since we need to compute this from outside utils.py, we change this to a public function.
Implement checking for the master role in rapi
This patch moves the CheckMaster function from ganeti-masterd to ssconf (the most logical place; it cannot go in utils since we would have recursive imports between ssconf and utils) and changes ganeti-rapi to also call...
Add a new parameter to backend.(Start|Stop)Master
This patch adds a new parameter, unused for now, to the start and stop master operations in backend. The idea behind it is that we need to be able to control whether the IP (de)activation is coupled with daemon...
Log thread name when debug output is enabled
jqueue: Fix error logging
The passed parameters were not correct.
Reviewed-by: iustinp, ultrotter
Fix constants typo
Use constants for the pid file stems
Add a KillProcess function
We cannot depend on all environments having a start-stop-daemon or similar tool. We instead implement a KillProcess function that behaves similarly to “start-stop-daemon --retry”.
Note that the attached unittest can hang in the foreground if the child...
Change IsPidFileAlive into ReadPidFile
We already have a function to test if a PID is alive, so it makes more sense to use function composition than to force both steps into one call (since we need to read PIDs from files in other places too). Now IsProcessAlive returns False for PIDs <= 0, since this is the error return from ReadPidFile....
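The composition can be sketched as follows. The entry only says that ReadPidFile's error return is a PID <= 0. Using 0 as that value, and these exact bodies, are assumptions for illustration.

```python
import os


def ReadPidFile(path):
    """Return the PID stored in path, or 0 if the file is missing or malformed."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return 0  # assumed error value; IsProcessAlive treats it as dead


def IsProcessAlive(pid):
    """Return True for a live PID; PIDs <= 0 (ReadPidFile errors) are dead."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # existence check only, no signal is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True
```

The former IsPidFileAlive(path) then becomes `IsProcessAlive(ReadPidFile(path))`, and ReadPidFile stays reusable wherever only the PID is needed.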
Move ganeti-rapi core code to daemon
All other daemons have their main code in themselves and not in a module. This patch does the same for ganeti-rapi by moving the code from lib/rapi/RESTHTTPServer.py to daemons/ganeti-rapi.
Replace httperror module with ganeti.http
The generic HTTP server doesn't know about httperror-based exceptions and would treat them as unknown exceptions, thereby not doing the right thing with HTTP errors.
Implement job canceling on server side
Locking is not completely right due to a deadlock when the job calls UpdateJob after changing its status.
Fix exception class name in utils.WritePidFile
Add “canceled” status for opcodes
Move code extracting job ID into function
It might come in handy at some point and makes the code a bit easier to read.
Convert set to a list in LUGetTags
The set triggers an exception on a list-tags command and on RAPI calls for tags, since it is not serializable by JSON.
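The failure mode can be reproduced with the standard library's json module (the tag values are made up). Sorting before converting is a choice made here for deterministic output, since set iteration order is arbitrary. The entry itself only says the set is converted to a list.

```python
import json

tags = {"web", "production"}  # example tag set

try:
    json.dumps(tags)  # json has no encoding for set objects
except TypeError as err:
    print("cannot serialize:", err)

# A list is serializable; sorted() also fixes the order
print(json.dumps(sorted(tags)))  # prints ["production", "web"]
```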
Switch RAPI to ganeti.http module
Implement job archiving on the server side
So far no error reporting to the client is done. Clients aren't notified if a job doesn't exist or couldn't be archived because of its current status.
The internal cache is always cleaned when the preconditions didn't...
Add directory for archived jobs
Move code formatting job ID into a base class
A later patch will add a memory-based job storage class, hence this code is going into a separate class. It also changes the number format to always use at least 10 digits, allowing up to 9'999'999'999 jobs to...
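The format change can be illustrated with a `%010d` format string. The helper name FormatJobID is hypothetical, and the entry only states "at least 10 digits". Zero-padding to a fixed width also means that, up to 10 digits, string order and numeric order agree. Larger serials simply produce longer strings.

```python
def FormatJobID(serial):
    """Hypothetical helper: zero-pad the serial number to at least 10 digits."""
    return "%010d" % serial


ids = [FormatJobID(n) for n in (7, 42, 1000)]
print(ids)  # prints ['0000000007', '0000000042', '0000001000']
```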
Add utils.{Write,Remove}PidFile
WritePidFile is a helper function that writes the current PID to a pidfile within the ganeti run directory. RemovePidFile tries to delete it.
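Both helpers are sketched below under stated assumptions. The run directory is passed in as a parameter, and the `<name>.pid` layout and the `_PidFilePath` helper are hypothetical. Refusing to overwrite an existing file via O_EXCL is also a choice made here, not something the entry states. RemovePidFile ignores errors, matching "tries to delete it".

```python
import os


def _PidFilePath(run_dir, name):
    """Hypothetical layout: <run_dir>/<name>.pid."""
    return os.path.join(run_dir, "%s.pid" % name)


def WritePidFile(run_dir, name):
    """Write the current PID to the daemon's pidfile and return its path."""
    path = _PidFilePath(run_dir, name)
    # O_EXCL makes creation fail with FileExistsError if the file already exists
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write("%d\n" % os.getpid())
    return path


def RemovePidFile(run_dir, name):
    """Try to delete the pidfile; any OS error (e.g. a missing file) is ignored."""
    try:
        os.unlink(_PidFilePath(run_dir, name))
    except OSError:
        pass
```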
Add utils.IsPidFileAlive function
This helper function reads a PID from a file and checks whether it refers to a live process.
Invert nodes/instances locking order
An implementation mistake from the original design caused nodes to be locked before instances, rather than after. This patch inverts the level numbering, also changing the relevant unittests and the recursive locking function's starting point....
Generalization of bulk output mapping
Rename JobStorage to DiskJobStorage
Fix logging with string job IDs
The job ID is now a string, hence logging must use %s instead of %d.
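The reason for the change can be shown with plain %-formatting, which is what logging applies to its arguments (the job ID value is an example). With `%d` and a string argument, formatting raises TypeError. Inside logging, that error surfaces as a "--- Logging error ---" report on stderr instead of the intended message.

```python
import logging

job_id = "42"  # job IDs are now strings (example value)

# %d requires a number, so a string argument cannot be formatted
try:
    "Processing job %d" % job_id
    d_works = True
except TypeError:
    d_works = False

# %s calls str() on its argument, so both string and integer IDs work
message = "Processing job %s" % job_id

logging.basicConfig(level=logging.INFO)
logging.info("Processing job %s", job_id)
```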
Simplify rapi.baserlib.MapFields()
We can use zip to simplify this function. Actually, at this point I'm not sure if it needs to be a separate function at all.
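A sketch of the zip-based version follows. It assumes MapFields takes a list of field names and a parallel list of values and returns a dict; the actual signature is not shown in this entry. Note that zip stops at the shorter input, so surplus names or values are silently dropped.

```python
def MapFields(names, data):
    """Pair each field name with the value at the same position."""
    return dict(zip(names, data))


print(MapFields(["name", "size"], ["instance1", 512]))
# prints {'name': 'instance1', 'size': 512}
```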
Make job ID a string
The docstring says that _NewSerialUnlocked returns “a string representing the job identifier”. Until now it returned an integer; this patch changes that.
Distribute the queue serial file after each update
This patch adds distribution of the queue serial file after each write to it (but before a new job is created and written with that ID, and before a response is returned, so we should be safe from crashes in...
Make the job storage init reuse a serial file
This will be needed for master failover. If we don't have a valid queue directory, we need to reinitialize it, but we should keep the existing serial number.
As such, we abstract the reading of the serial and if we find a valid...
Move BDEV_CACHE_DIR to RUN_GANETI_DIR/bdev-cache
This was a TODO for 2.0
Convert SetInstanceParams to concurrency
Grab a lock for the instance we're working on, and update its params.
Use Update in SetInstanceParams
When we set the instance params we're not adding a new instance, but just updating an existing one, so why use AddInstance?
Convert LUConnectConsole to concurrency
For ConnectConsole we just need to lock the instance we're connecting to. We make a few RPCs to its primary node, but node daemons can now handle multiple queries, and nodes cannot be removed while they have instances on them anyway. Note that since we return the ssh command, and...
Add _ExpandAndLockInstance auxiliary function.
LUs that take an instance name as input and need to expand and lock it can use this function to simplify their ExpandNames call. Possibly an _ExpandAndLockNode will follow as well.
Convert two (simple) LUs to be concurrent
LUQueryClusterInfo and LUDumpClusterConfig can be made concurrent and don't need to acquire any locks. In fact they don't interact with the cluster at all, but just with its configuration, which is thread-safe by...
Add missing empty line
Two top-level definitions were separated by only one empty line. Fixing this.
Put the proper RAPI baserlib
Make argument to CleanCacheUnlocked mandatory
Not passing the argument means it has the value None. Iterating None doesn't work:
  >>> "123" in None
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  TypeError: iterable argument required...
Split RAPI resources to pieces
Split conditions in worker pool
This patch splits the single threading.Condition object used in the worker pool for synchronization into three.
- worker_to_pool: Notified if a worker wants to notify the pool
- pool_to_worker: Notified if the pool wants to notify a single...
Add signal handler class
This signal handler class abstracts some of the code previously used in other places. It also uninstalls its handler when Reset() is called or the object is destroyed, thereby restoring the previous behaviour.
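A minimal sketch of such a class follows. The class name, the constructor argument (a list of signal numbers) and the `called` attribute are assumptions for illustration. Only Reset() and the restore-on-destruction behaviour come from the description. It relies on `signal.signal` returning the handler it replaces, and must be used from the main thread, as Python requires.

```python
import signal


class SignalHandler:
    """Hypothetical sketch: record delivery of the given signals until Reset()."""

    def __init__(self, signums):
        self.called = False
        self._previous = {}
        for signum in signums:
            previous = signal.signal(signum, self._HandleSignal)
            # signal.signal returns None if the old handler was not set from
            # Python; fall back to SIG_DFL in that case
            self._previous[signum] = signal.SIG_DFL if previous is None else previous

    def __del__(self):
        self.Reset()

    def Reset(self):
        """Restore the previous handlers; safe to call more than once."""
        while self._previous:
            signum, previous = self._previous.popitem()
            signal.signal(signum, previous)

    def _HandleSignal(self, signum, frame):
        self.called = True
```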
Implement jobs resource in RAPI
Breathe life into RAPI for trunk
Documentation updates
Rename BaseJO to BaseOpCode
Since we no longer have a job definition object, we rename this class to BaseOpCode. It's still useful (and not merged with OpCode) since it holds all the 'pure' logic (no custom field handling, etc.), whereas OpCode holds opcode-specific data (OP_ID handling, etc.)....
Sort the job list in _GetJobIDsUnlocked
Since the IDs are integers, we can simply sort them.
Further fixes to enable RAPI startup
Note that since RAPI itself doesn't use luxi.Client yet, nothing works,but at least it can startup now.
Add forgotten RAPI constant
This was forgotten during the forward-porting of RAPI.
Improve cli.SubmitOpCode
Currently, the feedback_fn argument to SubmitOpCode is no longer used. We still need it in burnin, so we re-enable it by making the code call that function with the msg argument if feedback_fn is callable. The patch also modifies burnin to accept the new argument format (msg is not...
First version of user feedback fixes
This patch contains a raw version for fixing feedback_fn.
The new mechanism works as follows:
 - instead of a per-Processor feedback_fn, there's one for each ExecOpCode, so that feedback for different opcodes goes via possibly...
Cache some jobs in memory
This patch adds a caching mechanism to the JobStorage. Note that it does not make the memory cache authoritative.
The algorithm is:
 - all jobs loaded from disk are entered in the cache
 - all new jobs are entered in the cache...