RAPI: read the group UUID from the server
This depends on future support from Ganeti (2.4+).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Balazs Lecz <leczb@google.com>
IAlloc: read group uuid from the input message
This makes the code incompatible with JSON files from Ganeti pre-2.4.
Text: read/save the node group UUID
Compatibility with old text files is kept by using the default UUID if the file (or even some of its records) has no UUID.
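The fallback described above can be sketched roughly as follows. This is only an illustration: the `|`-separated two-field record, the `NodeRec` type, `parseNode`, and the value of `defaultUuid` are all assumptions made for the example, not the actual Text.hs format or API.

```haskell
-- Hypothetical placeholder value; the real default UUID is not shown in the log.
defaultUuid :: String
defaultUuid = "00000000-0000-0000-0000-000000000000"

-- Split a string on a separator character, keeping empty fields.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- Simplified node record (assumption): only a name and a group UUID.
data NodeRec = NodeRec { nodeName :: String, nodeGroup :: String }
  deriving (Show, Eq)

-- Old-style records carry no UUID field, so the default is substituted.
parseNode :: String -> Either String NodeRec
parseNode line = case splitOn '|' line of
  [name]       -> Right (NodeRec name defaultUuid)
  [name, uuid] -> Right (NodeRec name uuid)
  fields       -> Left ("Invalid node record with " ++ show (length fields) ++ " fields")
```

Because the decision is made per record, a file mixing old and new records would still load under this scheme.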
Luxi: read the node uuid from the cluster
This makes the code incompatible with Ganeti pre-2.4.
Node: add the node group's UUID
This is not used anywhere yet, and the backends are all just adding the default UUID, not the real one.
The patch also allows displaying the group UUID in the node list.
Utils: add a default UUID
This will be used as a placeholder for the cases when we need a UUID (any UUID), but we don't have one handy.
Merge branch 'devel-0.2' into master
Improve the standard deviation computation
This does just two passes, instead of three, over the list. This reduces the overall runtime well enough (~25%) in some tests, but it's not reproducible using profiling, so I don't know how much the function itself is being sped-up....
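A minimal sketch of a two-pass computation, assuming a population standard deviation over a list of `Double`: the first pass accumulates sum and length together, the second sums the squared deviations. The name `stdDev` and the exact formula are assumptions for illustration, not the code from this commit.

```haskell
import Data.List (foldl')

-- Accumulate the running sum and element count in a single fold step.
step :: (Double, Double) -> Double -> (Double, Double)
step (s, n) x =
  let s' = s + x
      n' = n + 1
  in s' `seq` n' `seq` (s', n')

-- Population standard deviation in two passes over the list.
stdDev :: [Double] -> Double
stdDev [] = 0
stdDev xs =
  let (total, count) = foldl' step (0, 0) xs            -- pass 1: sum and length
      mean  = total / count
      sqSum = foldl' (\acc x -> acc + (x - mean) ^ (2 :: Int)) 0 xs  -- pass 2
  in sqrt (sqSum / count)
```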
hbal: change handling of signal
Currently, hbal does a one-two signal handling, where the first signal causes graceful termination, and the second one an immediate one (either SIGINT or SIGTERM can be used, interchangeably). However, this poses a timing problem: if two programs want to send a graceful termination...
Simu loader: move the loading to non-IO code
While we don't actually have IO code in the Simu loader, we do have the same interface. So we move the code again to a separate parseData function which is exported.
Luxi loader: split parsing from loading
Rapi loader: split parsing from loading
The change is similar to the text loader change.
Text loader: split parsing from loadData
This change, which will be followed by similar changes in the other loaders, splits the parsing of the data from the actual loading from disk. Since the parsing doesn't usually involve IO actions, we will be able to better test the parsing. The loading becomes a smaller part of...
Ignore nodes which are not vm_capable
This breaks compatibility with Ganeti pre-2.3.
Merge branch 'devel-0.2'
Fix tag exclusion weight
Currently, the tag exclusion metric has a weight of one, which means there might be cases where we won't move instances around because it upsets the cluster metrics. However, we do want to make a higher effort for cleaning up tag collisions, so we increase the weight to an...
Force UTF-8 locale for pandoc invocation
Pandoc 1.5.x uses the locale information to parse its input files (only 1.5 does this; earlier and later versions always use UTF-8). Hence we need to enforce a UTF-8 locale for proper parsing of input files.
Move from hand-written man pages to RST/pandoc
This simplifies the maintenance of the man pages, and unifies the rst-to-* converter to pandoc.
Add design for htools/Ganeti 2.3 sync
This is a work in progress, will be modified along with the progress of Ganeti 2.3.
Update NEWS file for 0.2.7 release
Fix some warnings in unittests
Add a hack for normalized CPU values in hspace
Currently, the key metrics/tiered spec computations show the virtual cpu count. However, since we do have a maximum ratio Vcpu/Pcpu, we can also show the “normalized” cpu count, i.e. the equivalent physical cpu count...
Improve the error message for tiered alloc option
hbal: implement user-friendly termination requests
Currently, hbal will abort immediately when requested (^C, or SIGINT, etc.). This is not nice, since then the already started jobs need to be tracked manually.
This patch adds a signal handler for SIGINT and SIGTERM, which will, the...
Document the gain options in hbal's manpage
Use the mingain options in the balancing algorithm
Also adds them in hbal.
Add new CLI options for min gain during balancing
Recent hbal seems to run many steps for small improvements (< 1e-3), so we should stop early in this case.
We add a new option (-g), that will be used for the minimum gain during balancing. This check will only become active when the cluster score is...
Makefile: make the rst2html converter more strict
This will make the automated builds flag any problems.
Add some more debugging functions
These are just variations of the standard debug, but are provided for simpler code, since laziness is something causing non-computation of debug statements.
Fix ReplaceSecondary moves for offline nodes
The addition of a new secondary on a node is doing two memory tests:
- in strict mode, reject if we get into N+1 failure
- reject if the new instance memory is greater than the free memory (not available memory) on the node...
Update NEWS file
Update man pages for the new -S option
hspace: mark new instances as running
Otherwise the saved cluster state and the in-memory one are wrong.
Implement cluster state saving in hspace
This also uncovered a few issues with the allocation model (instances not being marked up, etc.).
Compared to hbal, hspace will generate either one or two files (for both the standard and the tiered allocation mode), depending on the input...
Change iterateAlloc to return the instance list
The Cluster.iterateAlloc and tieredAlloc functions are changed to also return the updated instance list, since it is needed to have a “full” cluster view.
Implement cluster state saving in hbal
Also move the LUXI execution (-X) to the end, after all the output messages are printed. No good in waiting for the messages for a long while, especially as they are not up-to-date stats after the job execution, just an estimation of what the state will be.
Abstract the cluster serialization from hscan.hs
This is currently hardcoded in an internal function in hscan.hs, and we move it to Text.hs for later use.
Add a new option --save-cluster
This option will in the future be used to serialize the cluster state in hbal and hspace after the rebalance/allocation steps.
Add unittest for Node text serialization
This checks that the Node text serialization and deserialization operations are idempotent when combined with each other.
Switch unittest to custom hostnames
Currently, the hostnames are almost fully arbitrary chars, which breaks the assumption that nodes/instances will be normal DNS hostnames.
This patch adds some custom generators for these hostnames, that will allow better testing of text loader serialization/deserialization.
Move text serialization functions to Text.hs
Currently these are in hscan, and cannot be reused easily.
Fix a couple of typos in the manpages
Again, thanks to lintian.
hail: fix error message for failed multi-evac
Currently we show the instance index, but this makes no sense outside the current running program. Instead, we show the instance name.
Update NEWS file for the 0.2.6 release
NEWS: Add double blank lines before headers
This looks better for text-only viewing…
hscan: return exit code 2 for RAPI failures
If some clusters failed during RAPI collection, exit with exit code 2 so that tests can detect this failure.
More enhancements to live-test.sh
Fix another haddock issue
Remove an obsolete function and add Utils tests
Extend the live-test
The (recently-enabled) live test coverage stats found a few low-hanging fruits in the tests we do…
Use --union for hpc sum
… which fixes the issue noted in the previous commit (almost a brown paper bag change).
Preliminary support for coverage during live-test
While this doesn't work correctly yet (hpc sum seems to only take common modules, not the sum of modules?), it prepares for gathering coverage data during live-test (as an alternative to unittest coverage data).
Add some more imports to QC.hs
This is needed so that in the coverage report we list all modules, even the ones we don't test at all, such that we get the complete results.
Change the meaning of the N+1 fail metric
Currently, this metric tracks the nodes failing the N+1 check. While this helps (in some cases) to evacuate such nodes, it's not a good metric since it will rarely change during a step (only at the last instance moving away). Therefore we replace it with the count of...
Introduce per-metric weights
Currently all metrics have the same weight (we just sum them together). However, for the hard constraints (N+1 failures, offline nodes, etc.) we should handle the metrics differently based on their meaning. For example, an instance living on a primary offline node is worse than an...
Allow balancing moves to introduce N+1 errors
This patch switches the applyMove function to the extended versions of Node.addPri and addSec, and passes the override flag based on the state of the node that we're moving away from.
Introduce a relaxed add instance mode
In case an instance is living on an offline node, it doesn't make sense to refuse moving it because that would create N+1 failures; failing N+1 is still much better than not running at all. Similarly, if the secondary node of an instance is offline, meaning the instance doesn't...
Remove obsolete Container.maxNameLen
This was only used in one place (hbal), and is made obsolete by the change to the dual name/alias structure.
hbal: print short names in steps list
This was a regression from the name handling changes, as we started using the original names for the solution list (which is not designed for parsing/feeding back into ganeti).
Remove an obsolete function
printSolution is no longer used, as we print the solution iteratively now.
Allow '+' in node list fields
When the field list is prefixed with a plus sign, this will extend the default field list, instead of replacing it entirely.
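The plus-prefix behaviour can be illustrated with a short sketch. The contents of `defaultFields` and the function name `selectFields` are hypothetical; only the rule "leading `+` extends, anything else replaces" comes from the commit message.

```haskell
-- Hypothetical default field list; the real defaults are not given in the log.
defaultFields :: [String]
defaultFields = ["name", "pcnt", "scnt"]

-- Split a comma-separated list, dropping empty entries.
splitFields :: String -> [String]
splitFields = words . map (\c -> if c == ',' then ' ' else c)

-- A leading '+' extends the defaults; otherwise the list replaces them.
selectFields :: String -> [String]
selectFields ('+' : extra) = defaultFields ++ splitFields extra
selectFields spec          = splitFields spec
```

With these assumptions, `selectFields "+peermap"` keeps the three default columns and appends `peermap`, while `selectFields "name,index"` shows only those two.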
Update the node list fields
This patch renames the pri/sec to pcnt/scnt, and adds the real primary and secondary instance lists, the peermap and the index of a node as selectable options.
Cleanup a node's peer map when possible
If the last secondary instance of a peer is deleted (detected by the new peer memory value being equal to zero), then the pair (pdx, 0) should be deleted completely. This is not an optimization per se, but rather cleanup...
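As a rough sketch of this cleanup, assume the peer map is a plain association list from peer index to reserved memory (the real data structure is not shown in the log, and `setPeerMem` is a name invented for the example). Writing a zero removes the entry instead of storing `(pdx, 0)`.

```haskell
-- Simplified peer map (assumption): peer node index -> memory reserved for it.
type PeerMap = [(Int, Int)]

-- Update the memory for one peer; a zero value drops the entry entirely.
setPeerMem :: Int -> Int -> PeerMap -> PeerMap
setPeerMem pdx mem pm
  | mem == 0  = others
  | otherwise = (pdx, mem) : others
  where
    others = filter ((/= pdx) . fst) pm
```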
Fix handling of offline options and short names
This needs to be abstracted in a separate function, but in the meantime we fix the issue in both places.
Signed-off-by: Iustin Pop <iustin@google.com>
Fix another haddock special-char issue
Remove JOB_STATUS_GONE and add unittests
… for the serialization/deserialization of the job and opcode status.
Job status 'gone' was not actually used. It can be reintroduced if needed.
Add opcode status constants/type
This mirrors, again, the Ganeti constants, and is added for future use.
Rename the job status constants
The rename is done such that we match Ganeti's own constants.
Optimise the Luxi.recvMsg function
Since the current buffer cannot contain (during network reads) an EOM, we should look for the EOM only in the newly-received string. While this shouldn't make much difference, in some tests it cuts the recvMsg total time by around half....
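The idea can be sketched in pure code by modelling the network reads as a list of chunks. This is not the actual Luxi.recvMsg: the signature, the chunk list, and the choice of `'\3'` as the end-of-message marker are assumptions for the illustration. The point is that the already-buffered chunks are never rescanned; only each new chunk is searched.

```haskell
-- End-of-message marker (assumed here to be the ETX character).
eom :: Char
eom = '\3'

-- Consume chunks until one contains the EOM marker.
-- Returns the complete message and the leftover bytes of the last chunk,
-- or Nothing if the chunks run out before an EOM is seen.
recvMsg :: [String] -> Maybe (String, String)
recvMsg = go []
  where
    go _ [] = Nothing
    go acc (chunk : rest) =
      -- Only the new chunk is scanned; 'acc' is known to hold no EOM.
      case break (== eom) chunk of
        (_, [])             -> go (chunk : acc) rest
        (before, _ : after) -> Just (concat (reverse (before : acc)), after)
```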
Complete the client Luxi implementation
All current Luxi calls are supported after this patch. A bug in ArchiveJob is also fixed (Ganeti's job IDs are strings).
Add support for more LUXI calls
While not all are directly useful, having them will open some possibilities (e.g. polling for job changes in hbal's -X mode, and auto-archiving the jobs once they are successful).
Fix some lint errors in the unit tests
Change the Luxi operations structure
Currently, we define the LuxiOp type as a simple enumeration, and leave the arguments structure to the users of the Ganeti.Luxi module. This is suboptimal for a couple of reasons: first, we decouple the operation type from operation arguments, and that means we don't use the type...
Fix a warning in Loader tests
Incomplete pattern match…
Add a few Loader tests
These are not comprehensive, but at least we have a start.
Modify the test runner to show test exceptions
QuickCheck's batch driver (at least v1) doesn't show the test aborts, but simply discards the specific exception and increases the abort count. This makes it hard to debug the tests, so we modify our own test...
Reduce the warnings during the unittests
Since the unittests are not 'clean' from the p.o.v. of type declarations, and cannot be made clean in all respects (e.g. orphan instances), we silence some warnings for the test target, to have a cleaner output.
Improve the test driver
The tests are moved to a separate data structure, and we can select a subset of tests to run.
Introduce OpCode unittests
Introduce support for optional keys in JObjects
Some keys are optional in the Ganeti opcodes (e.g. ‘node’ in the OpReplaceDisks), and as such we need to transform them into a Maybe value, instead of failing.
The patch reworks a bit fromObj and adds maybeFromObj which parses such...
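A simplified sketch of the distinction between a required and an optional key follows. It uses an association list of raw strings as a stand-in for a JSON object and `Read` for decoding, so the types and signatures here are assumptions and differ from the project's actual fromObj and maybeFromObj.

```haskell
-- Simplified stand-in for a JSON object (assumption): key -> raw value text.
type JObject = [(String, String)]

-- Decode a raw value, naming the key in the error message.
parseValue :: Read a => String -> String -> Either String a
parseValue key raw = case reads raw of
  [(v, "")] -> Right v
  _         -> Left ("cannot parse value '" ++ raw ++ "' for key '" ++ key ++ "'")

-- Required key: a missing key is an error.
fromObj :: Read a => String -> JObject -> Either String a
fromObj key obj = case lookup key obj of
  Nothing  -> Left ("key '" ++ key ++ "' not found")
  Just raw -> parseValue key raw

-- Optional key: a missing key yields Nothing; a present but bad value still fails.
maybeFromObj :: Read a => String -> JObject -> Either String (Maybe a)
maybeFromObj key obj = case lookup key obj of
  Nothing  -> Right Nothing
  Just raw -> fmap Just (parseValue key raw)
```

Keeping the parse failure as an error in `maybeFromObj` means that only absence is treated as optional, not malformed data.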
Replace fromJResult with annotateJResult
This patch replaces all old uses of fromJResult with the annotated version, and removes the non-annotated version. All JSON parsing points should now have annotated errors.
Add annotations to loadJSArray
This allows, for example, the RAPI backend to detail which information (instance or node data) fails to parse.
Change fromObj error messages
Currently fromObj doesn't detail what we're trying to read, which can lead to cryptic messages: "Cannot read Int". The patch changes this function to annotate the error messages with the key/value we're trying to convert, by using a new version of fromJResult....
A few more small Node unit-tests
Add more unittests
Instance, Node and Text modules have improved coverage.
Add more unit tests for allocation/balance
The patch adds some simple unit-tests for both the allocation function (we can allocate small instances on an empty cluster, we can allocate in tiered mode starting from any size) and the balancing functions (one...
Move two functions from hspace to Cluster.hs
This is done so we can test a longer pipeline.
Make CStats instance of show
This helps debugging via ghci.
Clarify options related to name passing
After the name patches, we can pass in either the short or the full name, so update the hbal man page accordingly.
Another haddock fix…
Accept both full and short names in CLI
This patch introduces some new functionality in the base Element type and in Container which supports searching for all 'known' names of an element, such that both short and full names are accepted for various options like '-O' and '--excluded-instances'.
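A minimal sketch of such a lookup, assuming each element carries a full name and a short alias. The `Element` record, `allNames`, and `findByName` below are illustrative names, not the project's Element class or Container API.

```haskell
import Data.List (find)

-- Simplified element (assumption): a full name plus a shortened alias.
data Element = Element { name :: String, alias :: String }
  deriving (Show, Eq)

-- Every name under which the element may be referenced.
allNames :: Element -> [String]
allNames e = [name e, alias e]

-- Find the first element known under the given name, full or short.
findByName :: String -> [Element] -> Maybe Element
findByName n = find ((n `elem`) . allNames)
```

Under these assumptions, an option such as '-O' could resolve either `node1` or `node1.example.com` to the same element.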
Stop modifying names for internal computations
Currently the name used internally is modified and holds the shortened name of the nodes/instances. This has caused issues before, since we always have to strip the suffix from input data and reapply it if we...
Add a new node/instance field
This new field ('alias') will hold the shortened/beautified display name. When resetting the name, the alias is reset too, and there's a new function to update only the alias.
Change some test constants
First, we reduce the max size of the disks, since Int on 32bits will overflow for big simulated clusters. This is a real issue, that will need fixing in real life, but for now we just "silence" this test.
Second, we increase the amount of time a test is allowed to run,...
Fix some haddock comments
Add more unit tests
This increases the overall coverage by 5%-10% (depending on coverage type). Some modules are still not unittested at all, as HUnit is a better choice for them.
Shuffle some constants around
… and export more functions. This will help with unit testing.
Remove the noLimit values and always use limits
This patch moves away from allowing no-limits for disk/cpu ratios, and always uses a real limit. For disk, it's simple since we use 0, which means no reservations for disks. For CPU, we set an (arbitrary) limit of 64 v/p,...
hspace: change handling of N+1 bad clusters
Currently we just print a fake result and exit early. This is bad, since it doesn't use the same codepaths for all the result printing, and has already led to a bug where hspace looks like it is completely ignoring the...
Fix hspace's KM metrics
We returned the KM_POOL_* metrics as the final state, not as the delta between the final and the initial state.
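The fix amounts to reporting a difference rather than an absolute value. A small sketch, in which the `PoolStats` record and its three fields are hypothetical stand-ins for whatever the KM_POOL_* metrics actually contain:

```haskell
-- Hypothetical pool statistics; the real KM_POOL_* fields are not listed in the log.
data PoolStats = PoolStats
  { poolMem  :: Int
  , poolDisk :: Int
  , poolCpu  :: Int
  } deriving (Show, Eq)

-- Report the change between the initial and the final state,
-- rather than the final state itself.
poolDelta :: PoolStats -> PoolStats -> PoolStats
poolDelta initial final = PoolStats
  { poolMem  = poolMem final  - poolMem initial
  , poolDisk = poolDisk final - poolDisk initial
  , poolCpu  = poolCpu final  - poolCpu initial
  }
```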