workerpool: Log when waiting for a thread
Reviewed-by: iustinp
Rework master startup/shutdown/failover
This (big) patch reworks the master startup/shutdown and fixes the master failover.
What does the patch do?
For master start/stop:
  - remove the old ganeti-master script and its associated man page
  - moves the ip start/stop directly into the backend.(Start|Stop)Master...
Expose utils.DaemonPidFileName
Since we need to compute this from outside utils.py, we change this to a public function.
Reviewed-by: ultrotter
Implement checking for the master role in rapi
This patch moves the CheckMaster function from ganeti-masterd to ssconf (most logical place, it cannot go in utils since we would have recursive imports between ssconf and utils) and changes ganeti-rapi to also call...
Add a new parameter to backend.(Start|Stop)Master
This patch adds a new, unused for now, parameter to the start and stop master operations in backend. The idea behind it is that we need to be able to control whether the IP (de)activation is coupled with daemon...
Log thread name when debug output is enabled
jqueue: Fix error logging
The passed parameters were not correct.
Reviewed-by: iustinp, ultrotter
Fix constants typo
Reviewed-by: imsnah
Use constants for the pid file stems
Add a KillProcess function
We cannot depend on all environments to have a start-stop-daemon or similar tool. We instead implement a KillProcess function that behaves similarly to “start-stop-daemon --retry”.
Note that the attached unittest can hang in foreground if the child...
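A minimal sketch of the "signal, wait, then escalate" behaviour described above. The names is_alive and kill_process_sketch, the default timeout and the polling interval are assumptions for illustration, not the signature of the actual KillProcess function.

```python
import os
import signal
import time


def is_alive(pid):
    """Return True if a process with this PID still exists."""
    if pid <= 0:
        return False
    try:
        # If the process is our own child and has exited, reap it so that
        # it does not linger as a zombie and look "alive" forever.
        os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        pass  # not our child; nothing to reap
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def kill_process_sketch(pid, signum=signal.SIGTERM, timeout=30.0, poll=0.1):
    """Send signum, wait up to timeout seconds, then fall back to SIGKILL."""
    try:
        os.kill(pid, signum)
    except ProcessLookupError:
        return  # already gone
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)  # the process ignored the first signal
    except ProcessLookupError:
        pass
```

Reaping the child inside the liveness check is a design choice of this sketch: without it, a terminated but unreaped child would keep the wait loop busy until the timeout.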
Change IsPidFileAlive into ReadPidFile
We already have a function to test if a PID is alive, so it makes more sense to use function composition than to force calling (since we need to read PIDs from files in other places too). Now IsProcessAlive returns False for PIDs <= 0, since this is the error return from ReadPidFile....
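The composition described above could look roughly like this. It is a sketch only: read_pid_file and is_process_alive are illustrative names, and the liveness check via signal 0 is an assumption, not necessarily how utils.IsProcessAlive is implemented.

```python
import os


def read_pid_file(path):
    """Return the PID stored in path, or 0 if it cannot be read or parsed."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return 0


def is_process_alive(pid):
    """Return False for PIDs <= 0, otherwise probe the process."""
    if pid <= 0:
        return False  # covers the error return of read_pid_file
    try:
        os.kill(pid, 0)  # signal 0 performs the check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_pid_file_alive(path):
    """The old combined check, expressed as a composition of the two."""
    return is_process_alive(read_pid_file(path))
```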
Move ganeti-rapi core code to daemon
All other daemons have their main code in themselves and not in a module. This patch does the same to ganeti-rapi by moving the code from lib/rapi/RESTHTTPServer.py to daemons/ganeti-rapi.
Replace httperror module with ganeti.http
The generic HTTP server doesn't know about httperror based exceptions and would treat them as unknown exceptions, thereby not doing the right thing with HTTP errors.
Implement job canceling on server side
Locking is not completely right due to a deadlock when the job calls UpdateJob after changing its status.
Fix exception class name in utils.WritePidFile
Add “canceled” status for opcodes
Move code extracting job ID into function
It might come in handy at some point and makes the code a bit easier to read.
Convert set to a list in LUGetTags
The set triggers an exception on a list-tags command and on RAPI calls for tags, since it is not serializable by JSON.
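The failure is easy to reproduce with the standard json module (used here as a stand-in for the serializer in the original code). The helper name tags_to_json is hypothetical, and sorting the tags is an addition of this sketch to get stable output.

```python
import json


def tags_to_json(tags):
    """Serialize a collection of tags; sets must become lists first."""
    return json.dumps(sorted(tags))


def set_is_json_serializable():
    """Show that a bare set cannot be encoded as JSON."""
    try:
        json.dumps({"web", "db"})
    except TypeError:
        return False
    return True
```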
Switch RAPI to ganeti.http module
Implement job archiving on the server side
So far no error reporting to the client is done. Clients don't get notified if a job doesn't exist or couldn't be archived because of its current status.
The internal cache is always cleaned when the preconditions didn't...
Add directory for archived jobs
Move code formatting job ID into a base class
A later patch will add a memory based job storage class, hence this code is going into a separate class. It also changes the number format to always use at least 10 digits, allowing up to 9'999'999'999 jobs to...
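A sketch of such a format, assuming the "at least 10 digits" are obtained by zero-padding (the message does not say how). format_job_id is an illustrative name, not the method in the patch.

```python
def format_job_id(serial):
    """Render a job serial number with at least 10 digits (zero-padded)."""
    if serial < 0:
        raise ValueError("job serial must not be negative")
    return "%010d" % serial
```

Fixed-width IDs have the side effect that they sort the same way as strings and as numbers, as long as they stay within 10 digits.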
Add utils.{Write,Remove}PidFile
WritePidFile is a helper function that writes the current pid in a pidfile within the ganeti run directory. RemovePidFile tries to delete it.
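In outline, the two helpers might look like the following. This is a sketch: the function names, the run_dir parameter and the "<name>.pid" file naming are assumptions, not the actual utils API.

```python
import os


def write_pid_file(run_dir, name):
    """Write the current PID to <run_dir>/<name>.pid and return the path."""
    path = os.path.join(run_dir, "%s.pid" % name)
    with open(path, "w") as handle:
        handle.write("%d\n" % os.getpid())
    return path


def remove_pid_file(path):
    """Try to delete the pid file; failures are deliberately ignored."""
    try:
        os.remove(path)
    except OSError:
        pass
```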
Add utils.IsPidFileAlive function
This helper function reads a pid from a file containing it and checks whether it refers to a live process.
Invert nodes/instances locking order
An implementation mistake from the original design caused nodes to be locked before instances, rather than after. This patch inverts the level numbering, changing also the relevant unittests and the recursive locking function starting point....
Generalization of bulk output mapping
Rename JobStorage to DiskJobStorage
Fix logging with string job IDs
The job ID is now a string, hence logging must use %s instead of %d.
Simplify rapi.baserlib.MapFields()
We can use zip for simplifying this function. Actually, at this point I'm not sure if it needs to be a separate function at all.
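For illustration, a zip-based mapping of field names to values can be written as below. The signature is an assumption about what MapFields does, and the length check is an addition of this sketch.

```python
def map_fields(names, data):
    """Pair each field name with the value at the same position."""
    if len(names) != len(data):
        raise ValueError("names and data must have the same length")
    return dict(zip(names, data))
```

With a body this short, inlining dict(zip(names, data)) at the call sites is indeed an option, which is the doubt the message raises.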
Make job ID a string
The docstring says that _NewSerialUnlocked returns “a string representing the job identifier”. Until now it returned an integer and this patch changes it.
Distribute the queue serial file after each update
This patch adds distribution of the queue serial file after each write to it (but before a new job is created and written with that ID, and before a response is returned, so we should be safe from crashes in...
Make the job storage init reuse a serial file
This will be needed for master failover. If we don't have a valid queue directory, we need to reinitialize it, but we should keep the existing serial number.
As such, we abstract the reading of the serial and if we find a valid...
Move BDEV_CACHE_DIR to RUN_GANETI_DIR/bdev-cache
This was a TODO for 2.0
Convert SetInstanceParams to concurrency
Grab a lock for the instance we're working on, and update its params.
Use Update in SetInstanceParams
When we set the instance params we're not adding a new instance, but just updating an existing one, so why use AddInstance?
Convert LUConnectConsole to concurrency
For ConnectConsole we just need to lock the instance we're connecting to. We make a few rpcs to its primary node, but node daemons can now handle multiple queries and nodes cannot be removed while they have instances on them anyway. Note that since we return the ssh command, and...
Add _ExpandAndLockInstance auxiliary function.
LUs that take an instance name as input and need to expand its name and lock it can use it to simplify their ExpandNames call. Possibly, an _ExpandAndLockNode will come as well.
Convert two (simple) LUs to be concurrent
LUQueryClusterInfo and LUDumpClusterConfig can be made concurrent and don't need to acquire any locks. In fact they don't interact with the cluster at all, but just with its configuration, which is thread-safe by...
Add missing empty line
Two top level definitions were separated only by one empty line. Fixing this.
Put the proper RAPI baserlib
Make argument to CleanCacheUnlocked mandatory
Not passing the argument means it has the value None. Iterating None doesn't work:
  >>> "123" in None
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  TypeError: iterable argument required...
Split RAPI resources to pieces
Split conditions in worker pool
This patch splits the single threading.Condition object used in the worker pool for synchronization into three.
- worker_to_pool: Notified if a worker wants to notify the pool
- pool_to_worker: Notified if the pool wants to notify a single...
Add signal handler class
This signal handler class abstracts some of the code previously used in other places. It also uninstalls its handler when Reset() is called or the class is destructed, thereby restoring the previous behaviour.
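A rough sketch of such a class, under the assumption that it only needs to record that a signal arrived. The class and attribute names are illustrative. Instead of relying on destruction, this version offers a context manager, because the signal module keeps a reference to the installed bound method and would keep the object alive.

```python
import signal


class SignalHandlerSketch:
    """Install a flag-setting handler and restore the old one on reset()."""

    def __init__(self, signums):
        self.called = False
        self._previous = {}
        for signum in signums:
            # signal.signal() returns the handler that was active before.
            self._previous[signum] = signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        self.called = True

    def reset(self):
        """Restore the handlers that were active before installation."""
        for signum, previous in list(self._previous.items()):
            if previous is None:
                # The old handler was not installed from Python and cannot
                # be re-installed; fall back to the default action.
                previous = signal.SIG_DFL
            signal.signal(signum, previous)
            del self._previous[signum]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.reset()
        return False
```

Handlers can only be installed from the main thread, so the class has to be instantiated there.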
Implement jobs resource in RAPI
Breathe life into RAPI for trunk
Documentation updates
Rename BaseJO to BaseOpCode
Since we don't have for now a job definition object anymore, we rename this class to BaseOpCode. It's still useful (and not merged with OpCode) since it holds all the 'pure' logic (no custom field handling, etc.) whereas OpCode holds opcode specific data (OP_ID handling, etc)....
Sort the job list in _GetJobIDsUnlocked
Since the IDs are integers, we can simply sort them.
Further fixes to enable RAPI startup
Note that since RAPI itself doesn't use luxi.Client yet, nothing works, but at least it can start up now.
Add forgotten RAPI constant
This was forgotten during the forward-porting of RAPI.
Improve cli.SubmitOpCode
Currently, the feedback_fn argument to SubmitOpCode is no longer used. We still need it in burnin, so we re-enable it by making the code call that function with the msg argument in case feedback_fn is callable. The patch also modifies burnin to accept the new argument format (msg is not...
First version of user feedback fixes
This patch contains a raw version for fixing feedback_fn.
The new mechanism works as follows:
  - instead of a per-Processor feedback_fn, there's one for each ExecOpCode, so that feedback for different opcodes go via possibly...
Cache some jobs in memory
This patch adds a caching mechanism to the JobStorage. Note that it does not make the memory cache authoritative.
The algorithm is:
  - all jobs loaded from disks are entered in the cache
  - all new jobs are entered in the cache...
Fix JobStorage._GetJobIDsUnlocked
The job ID returned must be an integer (and the regex enforces that), but we didn't convert it manually.
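A sketch of the kind of fix described: the regex guarantees digits, but the match is still a string until it is converted. The "job-<number>" file naming and the function name are assumptions for illustration. Sorting the result works numerically once the IDs are integers.

```python
import re

# Hypothetical file naming; the real pattern lives in the job queue code.
_JOB_FILE_RE = re.compile(r"^job-(\d+)$")


def get_job_ids(filenames):
    """Extract integer job IDs from file names, in ascending order."""
    ids = []
    for name in filenames:
        match = _JOB_FILE_RE.match(name)
        if match:
            # The capture group is a str; convert it explicitly.
            ids.append(int(match.group(1)))
    ids.sort()
    return ids
```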
Change JobStorage to work with ids not filenames
Currently some of the functions in JobStorage work with filenames (which is an implementation detail and should only be used when dealing with the storage) and not with job IDs. We need to change this in order to...
Add experimental persistency to job queue
It's not perfect and it's not finished, but it's a start.
- Serial number is read only once, but written on each update
- Jobs are kept only on disk (caching will be implemented)
Convert backend.py to the logging module
The patch also switches some of the exception logs to use logging.exception (and therefore the log message will have a different format).
(Note that this might not be a good choice in all cases, though)
Add PID to all logs
This patch (for trunk) adds the PID to all daemon logs.
Fix backend.NodeVolumes handling of LVM output
This is the same fix as for GetVolumeList.
I've checked manually and all other places that call lvm commands are already checking the output validity in terms of correct number of fields.
Fix backend.GetVolumeList handling of LVM output
Sometimes ‘lvs’ can spit error messages on stdout, even when one wants to parse the output:
...Inconsistent metadata copies found - updating to use version 2776...
So we need to validate the output to guard against such cases....
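One way to validate such output is to check the number of fields on every line before using it. This is a sketch: the separator, the expected field count and the function name are assumptions, and whether the real fix skips bad lines or reports them is not stated in the message.

```python
def parse_lvs_output(output, sep="|", num_fields=3):
    """Return one tuple per well-formed line, skipping everything else."""
    volumes = []
    for line in output.splitlines():
        fields = line.strip().split(sep)
        if len(fields) != num_fields:
            # Stray diagnostics on stdout do not have the expected shape.
            continue
        volumes.append(tuple(fields))
    return volumes
```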
Add generic HTTP server classes
Some of the code is adopted from the 1.2 branch (lib/rapi/RESTHTTPServer.py). This code can be used as a base for the various HTTP servers in Ganeti.
Make "gnt-job list" work again
"gnt-job list" was broken after my recent changes in the RPCbetween clients and the master. This patch makes it work again.
Initial copy of RAPI filebase to the trunk
Move watcher's LockFile function to utils
Switch _QueuedOpCode to have their own lock
Right now, the queued opcode doesn't have a lock, and instead relies on the parent QueuedJob's lock.
This is not good for logging feedback, so it's better to have a lock for each queued opcode.
Add a simple decorator for instance methods
This is just a simple, hardcoded decorator for object methods needing synchronization on the _lock instance attribute.
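Such a decorator can be sketched in a few lines. The name locked and the Counter class are illustrative; only the reliance on a _lock instance attribute comes from the description above.

```python
import functools
import threading


def locked(method):
    """Run the decorated method while holding self._lock."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._lock:
            return method(self, *args, **kwargs)
    return wrapper


class Counter:
    """Tiny example class providing the required _lock attribute."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    @locked
    def increment(self):
        self.value += 1
```

Because the attribute name is hardcoded, the decorator works only on classes that create self._lock in their constructor. With a plain Lock, a decorated method must not call another decorated method on the same object.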
jqueue: Log more information when running opcodes
Fix double-logging in daemons
Currently, in debug mode, both the logfile handler and the stderr handler will log debug messages. Since the stderr is redirected to the same logfile (to catch non-logged errors), it means log entries are doubled.
The patch adds an extra parameter to the logger.SetupDaemon() function...
Move the master socket in the ganeti run dir
... as it was intended from the beginning, but by mistake left in the top run dir.
Reduce duplicate Attach() calls in bdev
Currently, the 'public' functions of bdev (FindDevice and AttachOrAssemble) will call the Attach() method right after class instantiation.
But the constructor itself calls this function, and therefore we have duplicate Attach() calls (which are not cheap at all)....
Convert bdev.py to the logging module
This does not enhance in any way the messages; it just switches to the new module.
Convert utils.py to the logging module
The patch also logs all commands executed from RunCmd when we are at debug level.
Remove the old locking functions
This removes (hopefully) all traces of the old locking functions and uses.
Remove old job queue code
Change masterd/client RPC protocol
- Introduce abstraction class on client side
- Use constants for method names
- Adapt legacy function SubmitOpCode to use it
Make luxi RPC more flexible
- Use constants for dict entries
- Handle exceptions on server side
- Rename client function to CallMethod to match server side naming
Add very simple job queue
Convert LUTestDelay to concurrent usage
In order to do so:
  - We set REQ_BGL to False
  - We implement ExpandNames
That's it, really.
Processor: Acquire locks before executing an LU
If we're running in a "new style" LU we may need some locks, as required by the ExpandNames function, to be able to run. We'll walk up the lock levels present in the needed_locks dictionary and acquire them, then run...
LogicalUnit: add ExpandNames function
New concurrent LUs will need to call ExpandNames so that any names passed in by the user are canonicalized, and can be used by hooks, locking and other parts of the code. This was done in CheckPrereq before, but it's now split out, as it's needed for locking, which in...
Processor: Move LU execution to its own method
This makes the try...finally code simpler, and helps adding a more complex locking structure before the actual execution. It also fixes a concurrency bug caused by the fact that write_count was read before acquiring the BGL, and thus spurious config update hooks run could have...
constants: Add job and opcode status strings
workerpool: Don't notify if there was no task
Workers have to notify their pool if they finished a task to make the WorkerPool.Quiesce function work. This is done in the finally: clause to notify even in case of an exception. However, before we notified on each run, even if there was no task, thereby creating...
Add a top level RUN_GANETI_DIR constant
This patch creates a base RUN_GANETI_DIR and then moves the other run dir constants to use that (even if just setting BDEV_CACHE_DIR as equal to it, rather than putting it deeper, for now).
Also we create a constant list of all the subdirs we need in RUN_DIR to...
symlinks: Add DISK_LINKS_DIR constant
The DISK_LINKS_DIR points to the RUN_DIR/ganeti/instance-disks directory, which will contain symlinks to the instances' disks. These provide a stable name across all nodes for them, and permit live-migration to happen....
luxi: Use serializer module instead of simplejson
serializer.DumpJson: Control indentation by parameter
If the simplejson module supports indentation, it's always used. There are cases where we might not want to use it or enable it only for debugging purposes, such as in RPC.
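The idea can be illustrated with the standard json module (a stand-in for simplejson here, and one that always supports indentation). The function name, parameter name and default are assumptions, not the actual serializer.DumpJson signature.

```python
import json


def dump_json(data, indent=True):
    """Serialize data, pretty-printed only when indent is requested."""
    if indent:
        return json.dumps(data, indent=2)
    # Compact form for machine-to-machine traffic such as RPC.
    return json.dumps(data)
```

Callers that exchange many small messages can pass indent=False to avoid the extra whitespace, while files meant for human inspection keep the indented form.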
Add a missing import to cmdlib
cmdlib uses some constants from locking (ie. locking levels) but doesn't import it. This patch fixes the issue.
Fix an error accessing the cfg
Since the context is passed to LogicalUnit, rather than the cfg, we can only access the cfg as self.cfg, self.context.cfg, or context.cfg (in the constructor). cfg is not valid anymore.
Add and remove instance/node locks
Whenever we add an instance or node to the cluster (i.e. to the config) and whenever we remove them, we should add/remove locks as well. In the future we may want to optimize this so that the configwriter does it, or it's handled at the context level, but till we're adding/removing...
Pass context to LUs
Rather than passing a ConfigWriter to the LUs we'll pass the whole context, from which a ConfigWriter can be extracted, but we can also access the GanetiLockManager. This also fixes the places where a FakeLU is created.
Fix a typo in LUTestDelay docstring
Locking: remove LEVEL_CONFIG lockset
Since the ConfigWriter now handles its own locking it's not necessary to have a specific level for the config in the Locking Manager anymore. This patch thus removes it, and all the unittest calls that used it, or depended on it being present....
ConfigWriter: synchronize access
Since we share the ConfigWriter we need somehow to make sure that accessing it is properly synchronized. We'll do it using the locking.ssynchronized decorator and a module-private shared lock.
This patch also renames a few functions, which were called inside the...
Locking: add ssynchronized decorator
This patch creates a new decorator function ssynchronized in the locking library, which takes as input a SharedLock, and synchronizes access to the decorated functions using it. The usual SharedLock semantics apply, so it's possible to call more than one synchronized function at the same...
ConfigWriter: remove _ReleaseLock
Remove empty function _ReleaseLock and all its calls. Since we only have one configwriter per cluster the locking needs to cover all the data in the object, and not just the file contents. Locking in ConfigWriter will be handled using the ganeti locking library....
Add generic worker pool implementation
Reuse the luxi client in cli.SubmitOpCode
By a mistake, we don't reuse the luxi client. As such, we open and close the connection at each poll cycle and spam the server logs.
Add custom logging setup for daemons
It's better for daemons if:
  - they log only to one log file
  - the log level is included
  - for debug runs, the filename/line number is included
This patch moves the custom formatter from the watcher to the logging...
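A formatter along these lines can be built with the standard logging module. This is a sketch only: the function name, the exact layout of the format string and the inclusion of the PID are assumptions, not the code moved from the watcher.

```python
import logging


def build_daemon_formatter(program, debug=False):
    """Build a formatter with level name and PID; add location in debug."""
    # Escape '%' so that the program name cannot break the format string.
    fmt = "%(asctime)s: " + program.replace("%", "%%")
    fmt += " pid=%(process)d %(levelname)s"
    if debug:
        fmt += " %(module)s:%(lineno)s"
    fmt += " %(message)s"
    return logging.Formatter(fmt)
```

Attaching a single handler with this formatter to the root logger keeps everything in one log file, while the debug flag controls whether the source location is added.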