Add some more debugging functions
These are just variations of the standard debug, but are provided for simpler code, since laziness sometimes causes debug statements not to be evaluated.
Fix ReplaceSecondary moves for offline nodes
The addition of a new secondary on a node is doing two memory tests:
- in strict mode, reject if we get into N+1 failure
- reject if the new instance memory is greater than the free memory (not available memory) on the node...
Change iterateAlloc to return the instance list
The Cluster.iterateAlloc and tieredAlloc functions are changed to also return the updated instance list, since it is needed to have a “full” cluster view.
Abstract the cluster serialization from hscan.hs
This is currently hardcoded in an internal function in hscan.hs, and we move it to Text.hs for later use.
Add a new option --save-cluster
This option will in the future be used to serialize the cluster state in hbal and hspace after the rebalance/allocation steps.
Add unittest for Node text serialization
This checks that the Node text serialization and deserialization operations are idempotent when combined with each other.
Switch unittest to custom hostnames
Currently, the hostnames are almost fully arbitrary chars, which breaks the assumption that nodes/instances will be normal DNS hostnames.
This patch adds some custom generators for these hostnames, that will allow better testing of text loader serialization/deserialization.
Move text serialization functions to Text.hs
Currently these are in hscan, and cannot be reused easily.
hail: fix error message for failed multi-evac
Currently we show the instance index, but this makes no sense outside the current running program. Instead, we show the instance name.
Fix another haddock issue
Remove an obsolete function and add Utils tests
Add some more imports to QC.hs
This is needed so that in the coverage report we list all modules, even the ones we don't test at all, such that we get the complete results.
Change the meaning of the N+1 fail metric
Currently, this metric tracks the nodes failing the N+1 check. While this helps (in some cases) to evacuate such nodes, it's not a good metric since it will rarely change during a step (only at the last instance moving away). Therefore we replace it with the count of...
Introduce per-metric weights
Currently all metrics have the same weight (we just sum them together). However, for the hard constraints (N+1 failures, offline nodes, etc.) we should handle the metrics differently based on their meaning. For example, an instance living on a primary offline node is worse than an...
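A minimal sketch of moving from a plain sum to a weighted sum. The function names and the weights used in the usage note are hypothetical; the text above does not give the real values or the real metric order.

```haskell
-- | Old behaviour as described above: all metrics count the same.
plainScore :: [Double] -> Double
plainScore = sum

-- | Weighted score: each metric is multiplied by its own weight
-- before summing. Weights and metrics must be in the same order.
weightedScore :: [Double] -> [Double] -> Double
weightedScore weights metrics = sum (zipWith (*) weights metrics)
```

For instance, `weightedScore [1, 16] [0.5, 2]` lets the second (hard-constraint) metric dominate the result, where `plainScore` would treat both alike.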
Allow balancing moves to introduce N+1 errors
This patch switches the applyMove function to the extended versions of Node.addPri and addSec, and passes the override flag based on the state of the node that we're moving away from.
Introduce a relaxed add instance mode
In case an instance is living on an offline node, it doesn't make sense to refuse moving it because that would create N+1 failures; failing N+1 is still much better than not running at all. Similarly, if the secondary node of an instance is offline, meaning the instance doesn't...
Remove obsolete Container.maxNameLen
This was only used in one place (hbal), and was made obsolete by the change to the dual name/alias structure.
hbal: print short names in steps list
This was a regression from the name handling changes, as we started using the original names for the solution list (which is not designed for parsing/feeding back into ganeti).
Remove an obsolete function
printSolution is no longer used, as we print the solution iteratively now.
Allow '+' in node list fields
When the field list is prefixed with a plus sign, this will extend the default field list, instead of replacing it entirely.
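The '+' handling can be sketched as below. This is an illustration only: `defaultFields`, `splitFields` and `selectFields` are hypothetical names, the default list contents are assumed, and a comma-separated field syntax is assumed as well.

```haskell
-- | Default field list (hypothetical contents).
defaultFields :: [String]
defaultFields = ["name", "pcnt", "scnt"]

-- | Split a comma-separated string into field names.
splitFields :: String -> [String]
splitFields s = case break (== ',') s of
  (f, [])       -> [f]
  (f, _ : rest) -> f : splitFields rest

-- | A leading '+' extends the defaults; otherwise the given
-- list replaces them entirely.
selectFields :: String -> [String]
selectFields ('+' : extra) = defaultFields ++ splitFields extra
selectFields fields        = splitFields fields
```

With these assumptions, `selectFields "+ptags"` keeps the defaults and appends `ptags`, while `selectFields "ptags"` selects only that field.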
Update the node list fields
This patch renames the pri/sec to pcnt/scnt, and adds the real primary and secondary instance lists, the peermap and the index of a node as selectable options.
Cleanup a node's peer map when possible
If the last secondary instance of a peer is deleted (detected by the new peer memory value being equal to zero), then the pair (pdx, 0) should be deleted completely. This is not an optimization per se, but rather cleanup...
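A small sketch of the cleanup rule, assuming the peer map is an association list from peer index to reserved memory. The `PeerMap` representation and the `updatePeer` name are assumptions made for this illustration, not the actual Node.hs code.

```haskell
-- | Peer index -> memory reserved for that peer's instances
-- (assumed representation).
type PeerMap = [(Int, Int)]

-- | Store the new memory value for a peer. A value of zero means the
-- last secondary of that peer is gone, so the entry is dropped
-- instead of keeping a (pdx, 0) pair around.
updatePeer :: Int -> Int -> PeerMap -> PeerMap
updatePeer pdx mem pm =
  let rest = filter ((/= pdx) . fst) pm
  in if mem == 0 then rest else (pdx, mem) : rest
```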
Fix another haddock special-char issue
Remove JOB_STATUS_GONE and add unittests
… for the serialization/deserialization of the job and opcode status.
Job status 'gone' was not actually used. It can be reintroduced if needed.
Fix some lint errors in the unit tests
Change the Luxi operations structure
Currently, we define the LuxiOp type as a simple enumeration, and leave the arguments structure to the users of the Ganeti.Luxi module. This is suboptimal for a couple of reasons: first, we decouple the operation type from operation arguments, and that means we don't use the type...
Fix a warning in Loader tests
Incomplete pattern match…
Add a few Loader tests
These are not comprehensive, but at least we have a start.
Reduce the warnings during the unittests
Since the unittests are not 'clean' from the p.o.v. of type declarations, and cannot be made clean in all respects (e.g. orphan instances), we silence some warnings for the test target, to have a cleaner output.
Introduce OpCode unittests
Introduce support for optional keys in JObjects
Some keys are optional in the Ganeti opcodes (e.g. ‘node’ in the OpReplaceDisks), and as such we need to transform them into a Maybe value, instead of failing.
The patch reworks fromObj a bit and adds maybeFromObj which parses such...
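A simplified sketch of the difference between a mandatory and an optional key. It only borrows the names fromObj and maybeFromObj; the signatures, the string-valued `JObject` stand-in and the `readInt` helper are assumptions made so the example runs without a JSON library.

```haskell
-- | Stand-in for a parsed JSON object (assumption: values kept as strings).
type JObject = [(String, String)]

-- | Hypothetical helper: read an Int, naming the key on failure.
readInt :: String -> String -> Either String Int
readInt k v = case reads v of
  [(n, "")] -> Right n
  _         -> Left ("key '" ++ k ++ "': cannot read Int from " ++ show v)

-- | Mandatory key: a missing key is an error.
fromObj :: String -> JObject -> Either String Int
fromObj k o = case lookup k o of
  Nothing -> Left ("key '" ++ k ++ "' not found")
  Just v  -> readInt k v

-- | Optional key: a missing key yields Nothing instead of failing,
-- but a present key that does not parse is still an error.
maybeFromObj :: String -> JObject -> Either String (Maybe Int)
maybeFromObj k o = case lookup k o of
  Nothing -> Right Nothing
  Just v  -> fmap Just (readInt k v)
```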
Replace fromJResult with annotateJResult
This patch replaces all old uses of fromJResult with the annotated version, and removes the non-annotated version. All JSON parsing points should now have annotated errors.
Add annotations to loadJSArray
This allows, for example, the RAPI backend to detail which information (instance or node data) fails to parse.
Change fromObj error messages
Currently fromObj doesn't detail what we're trying to read, which can lead to cryptic messages: "Cannot read Int". The patch changes this function to annotate the error messages with the key/value we're trying to convert, by using a new version of fromJResult....
A few more small Node unit-tests
Add more unittests
Instance, Node and Text modules have improved coverage.
Add more unit tests for allocation/balance
The patch adds some simple unit-tests for both the allocation function (we can allocate small instances on an empty cluster, we can allocate in tiered mode starting from any size) and the balancing functions (one...
Move two functions from hspace to Cluster.hs
This is done so we can test a longer pipeline.
Make CStats instance of show
This helps debugging via ghci.
Another haddock fix…
Accept both full and short names in CLI
This patch introduces some new functionality in the base Element type and in Container which supports searching for all 'known' names of an element, such that both short and full names are accepted for various options like '-O' and '--excluded-instances'.
Stop modifying names for internal computations
Currently the name used internally is modified and holds the shortened name of the nodes/instances. This has caused issues before, since we always have to strip the suffix from input data and reapply it if we...
Add a new node/instance field
This new field ('alias') will hold the shortened/beautified display name. When resetting the name, the alias is reset too, and there's a new function to update only the alias.
Change some test constants
First, we reduce the max size of the disks, since Int on 32bits will overflow for big simulated clusters. This is a real issue, that will need fixing in real life, but for now we just "silence" this test.
Second, we increase the amount of time a test is allowed to run,...
Fix some haddock comments
Add more unit tests
This increases the overall coverage by 5%-10% (depending on coverage type). Some modules are still not unittested at all, as HUnit is a better choice for them.
Shuffle some constants around
… and export more functions. This will help with unit testing.
Remove the noLimit values and always use limits
This patch moves away from allowing no-limits for disk/cpu ratios, and always uses a real limit. For disk, it's simple since we use 0, which means no reservations for disks. For CPU, we set an (arbitrary) limit of 64 v/p,...
Fix hspace's KM metrics
We returned the KM_POOL_* metrics as the final state, not as the delta between the final and the initial state.
Fix Node hiCpu computation
In case we're not enabling limits, let's restrict this to -1, instead of -1 times the number of pcpus.
Add a new function to compute allocation deltas
Given two cluster states, the new function can answer the following questions:
- how much resources currently allocated
- how much resources finally allocated (delta from above is how much we can actually allocate on the cluster)...
Introduce total vcpu tracking in CStats
We add a new field that tracks the available virtual cpus (expressed as node cpus times the vcpu ratio).
Merge branch 'master' into next
Fix iallocator crash when no solutions exist
Commit 5436576 added an un-guarded `head' call, which crashes with “Prelude.head: empty list” when no results exist for the per-instance allocation/relocation calls.
This patch fixes this, and also adds another check for an unguarded...
Fix IAllocator multi-evacuate message
Since Ganeti passes full host names (not common-suffix-stripped), we need to remove the suffix from the evac_nodes keys too. In case one node is not part of the cluster, it will lead to a wrong error message, but for now it fixes the problem.
Fix a haddock comment issue
For some versions of haddock, this can create problems.
Abstract instance running states into a list
This replaces some manual checks in a few places in the code with a single list defined once.
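As an illustration of the idea (not the actual code), the scattered comparisons can be expressed against one shared list. The name `runningStates`, the `isRunning` helper and the state strings below are assumptions.

```haskell
-- | States that count as "running" (hypothetical values),
-- defined in exactly one place.
runningStates :: [String]
runningStates = ["running", "ERROR_up"]

-- | Membership test used instead of repeating the comparisons
-- at each call site.
isRunning :: String -> Bool
isRunning state = state `elem` runningStates
```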
A number of small fixes from hlint
Fix unused-do-binds for ghc 6.12
GHC 6.12 has some new warnings, which are valid in most cases except (IMHO) printf usage.
Fix unused imports for ghc 6.12
GHC 6.12 has become more picky about unused imports, so we need to remove/tighten some of them.
hscan: implement LUXI backend scanning
This allows hscan to work also with NO_CURL (but only for the local machine, of course).
Loader: abort for unknown to-be-excluded instances
balance function: use the movable flag directly
Instead of deciding based on secondary node, use the new flag.
Update the loader pipeline to set the movable flag
This updates the movable flag on instances if they have only one node (we don't rely on OpMoveInstance) or if they are set so via the command line options.
This doesn't yet enable the use of the new flag.
Add a 'movable' flag on instances
This will be used instead of checking for no secondary and for simplifying 'do not touch' instances.
Add an option for excluding instances from moves
Implement IAllocator node evacuate request
This patch adds the new request loading/execution (trivial), but the actual response formatting becomes more difficult as now the response type differs by request.
Signed-off-by: Iustin Pop <iustin@google.com>
Add a tryEvac function
This will be used by the node evacuate IAllocator request type.
Move a type declaration to Node.hs
We'll need AllocElement in both Cluster and IAlloc in the future, so we move it to Node.hs which is imported by both.
Change an internal type from Maybe to list
In preparation for multiple responses, we change from Maybe to List (both used in the container sense).
This allows us to keep the same workflow for all kinds of requests.
IAllocator: move some keys into per-request data
Since not all structures will have these keys in the future, we move them into per-structure keys.
Implement evacuation mode in hbal
This mode restricts the list of instances to be moved to the instances living on the offline (and drained) nodes.
Add an evac mode CLI option
Reorder options in CLI.hs
This should be no code change, just reordering of the options.
Fix secondary node selection for existing N+1
In case a secondary node is already N+1 failed, currently the node selection will accept a node that cannot start (at all) the new instance as valid. This is wrong, so we add a new simple check to prevent the...
Rewrite the node add checks for simpler layout
This will make it clearer than many if…then choices.
Move instance relocation test upper in the chain
Currently we test each instance for relocation in checkMove; however, it is a little more clear if we pass only the relocatable instances to checkMove. The patch also slightly rewrites (indentation/style) the...
Split the balancing function in two parts
Currently in the balancing function we do two things:
- decide whether to do a new balancing round or not
- actually compute the balancing round
This is not nice, as the two parts are conceptually separate, so this...
Fixing a typo in option description
Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Switch the text file format to single-file
This patch changes from the two separate files to a single file, with sections separated by a blank line. Currently only the node and instance data is accepted, later the cluster tags will be read too via this format....
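A rough sketch of reading such a single file: split the lines on blank lines, then expect a node section followed by an instance section. `splitSections` and `parseFile` are hypothetical names, the section order is an assumption, and the per-line node/instance syntax is deliberately left unparsed.

```haskell
-- | Split a list of lines into sections separated by blank lines.
splitSections :: [String] -> [[String]]
splitSections ls = case break null ls of
  (sec, [])       -> [sec]
  (sec, _ : rest) -> sec : splitSections rest

-- | Return the raw (node lines, instance lines), or an error if the
-- file does not contain exactly two sections.
parseFile :: String -> Either String ([String], [String])
parseFile contents = case splitSections (lines contents) of
  [nodes, instances] -> Right (nodes, instances)
  secs -> Left ("expected 2 sections, got " ++ show (length secs))
```

Accepting a third section later (for the cluster tags) would only need one more pattern in `parseFile`.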
Change the signatures of the text loader slightly
This is in preparation for the text format changes.
Convert n1_score metric from % to count
This increases the priority of fixing N+1 failures compared to balancing metrics.
Metric: count of primary instances/offline nodes
This helps with evacuation/failover of instances on 2-node clusters with one node offline.
Offline instance metric: change from % to count
Currently we use the offline instance percentage (with range [0, 1]), but this is not good, since we want the evacuation of such instances to have a high priority; therefore we change this to a count of offline...
Use the oper_ram field if available
For the RAPI and LUXI backends, we can get the actual memory usage (if instances are running) via the oper_ram, whereas backend/memory only tells what the instance will use at the next boot.
Not using oper_ram means that the node model is flawed and we consider...
rapi, luxi: treat drained nodes as offline
Commit e97f211 changed the iallocator backend to handle drained nodes as offline. This commit completes that change by making the rapi and luxi backend do the same (the text backend ignores any '?' values which are...
Fix typo breaking LUXI backend
This really shows the need for actual dist-time full testing (not unittests).
Fix unittests after instance tags addition
Configure exclusion tags via the cluster tags
This patch adds reading of the exclusion tags from the cluster tags: any tags starting with htools:iextags: will convert their suffix into an exclusion tags prefix. In other words, "htools:iextags:service" will...
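The tag filtering can be sketched as follows. Only the "htools:iextags:" prefix comes from the description above; the function and constant names are hypothetical, and any further processing of the extracted suffix is not shown.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)

-- | Marker that identifies exclusion-tag definitions among cluster tags.
iextagsPrefix :: String
iextagsPrefix = "htools:iextags:"

-- | Keep only the cluster tags carrying the marker, and return their
-- suffixes; all other cluster tags are ignored.
exclusionPrefixes :: [String] -> [String]
exclusionPrefixes = mapMaybe (stripPrefix iextagsPrefix)
```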
Read cluster tags in the IAllocator backend
Read cluster tags in the LUXI backend
Read cluster tags in the RAPI backend
This also shows them in hbal in verbose mode.
Introduce support for reading the cluster tags
While these are not actually populated from the backends, and all the programs ignore them, this patch contains the changes in the function types required.
Collapse the statistical functions into one
This allows us to get rid of two duplicate list length computations, with a minor speedup.
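One way such a collapsed function might look: a single standard-deviation function over Double that computes the sum and the length in one pass and reuses the length. This is a sketch under assumptions; the name `stdDev` and the choice of population standard deviation are not taken from the source.

```haskell
import Data.List (foldl')

-- | Population standard deviation, monomorphic on Double.
-- The sum and the length are computed in one fold, and the length
-- is reused for both the mean and the variance.
-- Note: an empty list yields NaN (0 / 0).
stdDev :: [Double] -> Double
stdDev xs =
  let (total, count) = foldl' (\(s, n) x -> (s + x, n + 1)) (0, 0) xs
      mean = total / count
      var = sum (map (\x -> (x - mean) ^ (2 :: Int)) xs) / count
  in sqrt var
```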
Specialize the math functions
The statistics functions are currently defined as polymorphic with a Floating constraint. Changing this to monomorphic on Double type makes them stricter and much more performant (~70% speedup). This is a cheap way to recoup some of the losses incurred by the recent proliferation of...
Use conflicting primaries count in cluster score
This small patch adds the number of conflicting primaries in the cluster score. This is different from the other non-CV metrics where we usually compute the percentage of failing instances (for that metric); but for a...
Node: add function for conflicting primary count
Add a new node list field
This patch adds a new node list field (ptags), showing the primary instance tags.