Change iterateAlloc to return the instance list
The Cluster.iterateAlloc and tieredAlloc functions are changed to also return the updated instance list, since it is needed to have a “full” cluster view.
hail: fix error message for failed multi-evac
Currently we show the instance index, but this makes no sense outside the current running program. Instead, we show the instance name.
Change the meaning of the N+1 fail metric
Currently, this metric tracks the nodes failing the N+1 check. While this helps (in some cases) to evacuate such nodes, it's not a good metric since it will rarely change during a step (only at the last instance moving away). Therefore we replace it with the count of...
Introduce per-metric weights
Currently all metrics have the same weight (we just sum them together). However, for the hard constraints (N+1 failures, offline nodes, etc.) we should handle the metrics differently based on their meaning. For example, an instance living on a primary offline node is worse than an...
Allow balancing moves to introduce N+1 errors
This patch switches the applyMove function to the extended versions of Node.addPri and addSec, and passes the override flag based on the state of the node that we're moving away from.
hbal: print short names in steps list
This was a regression from the name handling changes, as we started using the original names for the solution list (which is not designed for parsing/feeding back into ganeti).
Remove an obsolete function
printSolution is no longer used, as we print the solution iteratively now.
Allow '+' in node list fields
When the field list is prefixed with a plus sign, this will extend the default field list, instead of replacing it entirely.
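The '+' prefix rule can be illustrated with a minimal sketch. The names `defaultFields`, `splitOnComma` and `selectFields`, the sample field names, and the comma-separated format are all assumptions made for this example; they are not taken from the patch itself.

```haskell
-- | Hypothetical default field list (illustrative names only).
defaultFields :: [String]
defaultFields = ["name", "status", "pcpu", "vcpu"]

-- | Split a string on commas (minimal helper, no escaping support).
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (w, [])       -> [w]
  (w, _ : rest) -> w : splitOnComma rest

-- | Resolve a user-supplied field specification:
-- a leading '+' extends the defaults, anything else replaces them.
selectFields :: String -> [String]
selectFields ('+' : extra) = defaultFields ++ fields extra
selectFields spec          = fields spec

-- | Parse a comma-separated list, dropping empty entries.
fields :: String -> [String]
fields = filter (not . null) . splitOnComma
```

With these assumed defaults, `selectFields "+load"` keeps the four default fields and appends `load`, while `selectFields "name,load"` yields only the two listed fields.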
Add more unit tests for allocation/balance
The patch adds some simple unit-tests for both the allocation function (we can allocate small instances on an empty cluster, we can allocate in tiered mode starting from any size) and the balancing functions (one...
Move two functions from hspace to Cluster.hs
This is done so we can test a longer pipeline.
Make CStats instance of show
This helps debugging via ghci.
Stop modifying names for internal computations
Currently the name used internally is modified and holds the shortened name of the nodes/instances. This has caused issues before, since we always have to strip the suffix from input data and reapply it if we...
Remove the noLimit values and always use limits
This patch moves away from allowing no-limits for disk/cpu ratios, and always uses a real limit. For disk, it's simple since we use 0, which means no reservations for disks. For CPU, we set an (arbitrary) limit of 64 v/p,...
Fix hspace's KM metrics
We returned the KM_POOL_* metrics as the final state, not as the delta between the final and the initial state.
Add a new function to compute allocation deltas
Given two cluster states, the new function can answer the following questions:
- how much resources currently allocated
- how much resources finally allocated (delta from above is how much we can actually allocate on the cluster)...
Introduce total vcpu tracking in CStats
We add a new field that tracks the available virtual cpus (expressed as node cpus times the vcpu ratio).
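The arithmetic behind this field can be sketched as follows. The function names and the representation of a node as a (cpus, ratio) pair are assumptions for this example, not the actual CStats code.

```haskell
-- | Available virtual CPUs for one node: physical CPUs times the vcpu ratio.
availVcpus :: Int -> Double -> Double
availVcpus nodeCpus vcpuRatio = fromIntegral nodeCpus * vcpuRatio

-- | Cluster-wide total over (node cpus, vcpu ratio) pairs.
totalVcpus :: [(Int, Double)] -> Double
totalVcpus = sum . map (uncurry availVcpus)
```

For example, a node with 4 cpus at a ratio of 2.0 plus a node with 8 cpus at a ratio of 1.5 gives 8 + 12 = 20 available virtual cpus.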
A number of small fixes from hlint
balance function: use the movable flag directly
Instead of deciding based on secondary node, use the new flag.
Add a tryEvac function
This will be used by the node evacuate IAllocator request type.
Signed-off-by: Iustin Pop <iustin@google.com>
Move a type declaration to Node.hs
We'll need AllocElement in both Cluster and IAlloc in the future, so we move it to Node.hs which is imported by both.
Change an internal type from Maybe to list
In preparation for multiple responses, we change from Maybe to List (both used in the container sense).
This allows us to keep the same workflow for all kinds of requests.
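As a rough illustration of treating both Maybe and list as containers, the sketch below converts a "zero or one" result into a "zero or more" one so that a single consumer handles every case. The functions `bestOld`, `bestNew` and `describe` are hypothetical and exist only for this example.

```haskell
import Data.Maybe (maybeToList)

-- | Old style: at most one solution (Nothing when there is none).
bestOld :: [Int] -> Maybe Int
bestOld [] = Nothing
bestOld xs = Just (minimum xs)

-- | New style: the same answer as a list with zero or one element,
-- leaving room for multiple responses later.
bestNew :: [Int] -> [Int]
bestNew = maybeToList . bestOld

-- | One consumer that handles zero, one or many solutions uniformly.
describe :: [Int] -> String
describe [] = "no solution"
describe xs = "solutions: " ++ unwords (map show xs)
```

Here `describe (bestNew [3, 1, 2])` produces `"solutions: 1"`, and the same `describe` keeps working unchanged if a later producer returns several elements.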
Implement evacuation mode in hbal
This mode restricts the list of instances to be moved to the instances living on the offline (and drained) nodes.
Move instance relocation test upper in the chain
Currently we test each instance for relocation in checkMove; however, it is a little more clear if we pass only the relocatable instances to checkMove. The patch also slightly rewrites (indentation/style) the...
Split the balancing function in two parts
Currently in the balancing function we do two things:
- take the decision whether to do a new balancing round or not
- and actually computing the balancing round
This is not nice, as the two parts are conceptually separate, so this...
Convert n1_score metric from % to count
This increases the priority of fixing N+1 failures compared to balancing metrics.
Metric: count of primary instances/offline nodes
This helps with evacuation/failover of instances on 2-node clusters with one node offline.
Offline instance metric: change from % to count
Currently we use the offline instance percentage (with range [0, 1]), but this is not good, since we want the evacuation of such instances to have a high priority; therefore we change this to a count of offline...
Use conflicting primaries count in cluster score
This small patch adds the number of conflicting primaries in the cluster score. This is different from the other non-CV metrics where we usually compute the percentage of failing instances (for that metric); but for a...
Allow overriding the field list in -p
The print nodes option can now accept an optional field list to customise the output. This is ugly, since the field names do not match the header names, but it is at least barely customisable (at runtime).
Move more node-listing functionality in Node.hs
This will prepare for the runtime-selectable field list.
Add a few comments in the scoring function
Expand the --print-instances output
This adds run status, resource parameters and load parameters for instances.
Simplify the cstats initializer
Since all values are initialized to zero, the exact ordering is not important and thus we can use the positional mode for simpler code.
The patch also adds docstrings to the cstats functions.
Simplify Cluster.computeMoves
Since we now have an actual type for describing the instance moves (IMove), it's simpler to convert this into the move description/move commands, rather than re-computing the move based on initial and final nodes. This makes the shell commands computation and over-Luxi command...
Remove obsolete export
The ‘Placement’ type has been moved to Types.hs but we kept exporting it from Cluster, which is not needed.
Generalise the node/instance listing
This patch introduces a generic formatTable function (based on, and similar to the Ganeti one, but different and more FP in style) and changes the node and instance listing to it.
The node list (due to the many variables) is still a little bit hackish...
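A generic table formatter in this spirit could look like the sketch below. The signature (rows of cells plus one "numeric" flag per column) and the padding rules are assumptions chosen for the example; the actual function may differ. The input is assumed to be rectangular, with one flag per column.

```haskell
import Data.List (transpose)

-- | Pad every cell to its column's width.
-- A True flag marks a numeric column, which is right-aligned;
-- other columns are left-aligned.
formatTable :: [[String]] -> [Bool] -> [[String]]
formatTable rows numeric =
  transpose (zipWith fmtCol numeric (transpose rows))
  where
    fmtCol isNum col =
      let width = maximum (map length col)
          pad s = replicate (width - length s) ' '
      in map (\s -> if isNum then pad s ++ s else s ++ pad s) col
```

Working column-wise via `transpose` keeps the width computation and the alignment decision in one small function, so node and instance listings can share it by passing different rows and flags.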
Fix instance listing for non-redundant case
Start using the utilisation scores in balancing
This enables the per-node load/total available capacity scores to be used in balancing. Note that the total available capacity is currently fixed at zero and cannot be changed by the user.
Show the load on nodes in node lists
The strange printf usage is due to some limitation (it seems) in ghc for very long argument lists. The whole printout should be rewritten later.
Allow displaying the instance map in hbal
This is similar to --print-nodes, but with much fewer fields.
Style change: cluster CStats camel-casing
This is again the cs_x to csX name change.
Style change: node and instance attributes
This changes from a_b to aB in all node and instance attributes, to match the standard Haskell style. Also attributes that should have been camel-cased but weren't were changed (e.g. plist → pList, pnode → pNode).
Modify the internals of the detailed CV scores
Before we used a tuple; since we'll need more metrics in the future, it's simpler to transform this into a list of doubles, whose elements are handled homogeneously by all the code that needs them.
Change iMoveToJob to properly create migrates
The current Cluster.iMoveToJob always creates failovers, which is not what we want. This patch simply uses the original instance's status to select between these two (this is not optimal by the way, since the status...
Extend the MoveJob type to hold the instance index
This will be needed in order to generate the proper instance move commands.
Store the instance move in the MoveJobs
This will automatically sort our Ganeti jobs into the independent jobsets, and then we can submit them separately.
Move some more type definitions to Types.hs
Add a function converting Placements into Jobs
This converts from htools-specific Placements into Ganeti standard OpCodes, which will later allow execution via Luxi.
Record the move being performed in a Placement
This will allow a more descriptive output later in the solution list, as opposed to trying to reconstruct the move from the node indices.
The patch also documents the Placement members.
hbal: Implement grouping of moves into jobsets
Since moving two instances between different node-quadruples (inst X: A, B → C, D and inst Y: E, F → G, H) can be parallelised by Ganeti, it makes sense to split the operation list into jobsets whose execution...
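One way to picture the grouping is the greedy sketch below: a move joins the current jobset when it shares no node with the moves already in it, and otherwise opens a new jobset. The `Move` shape and the `splitJobs` name are assumptions for this example, and the grouping rule is a plausible reading of the description rather than the patch's exact algorithm.

```haskell
import Data.List (intersect)

-- | A move: instance name plus all nodes it touches (hypothetical shape).
type Move = (String, [String])

-- | Group moves into jobsets, preserving the original order.
-- The accumulator holds jobsets newest-first, each stored in reverse,
-- so both levels are reversed at the end.
splitJobs :: [Move] -> [[Move]]
splitJobs = map reverse . reverse . foldl step []
  where
    step [] m = [[m]]
    step (cur : done) m
      | null (snd m `intersect` concatMap snd cur) = (m : cur) : done
      | otherwise = [m] : cur : done
```

With moves X over nodes A to D, Y over nodes E to H, and a third move Z touching A and E, X and Y land in the same jobset while Z starts a second one.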
Turn on, and fix, more warnings
The Makefile was intended to be -Wall and not simply -W, but I missed that. This enables more warnings and also enables -Werror (except for the tests).
Split the balancing algorithm in two parts
Currently the computation, recursing part and the IO part (progress updates) of the balancing main function (iterateDepth) are all in the same function, which makes it hard to test. This patch moves the decision/computation part (whether to proceed one more round, whether we...
Implement support for 'cheap' moves only
This patch adds support for cheap (failover/migrate) operations only in the balancing algorithm and in the hbal command line options.
This allows a very quick balancing (compared to allowing replace-disks) which can be useful as a scheduled operation.
Use migrate or failover based on instance state
While we can't guarantee that the instance will be in the same state by the time the migrate/failover command will be run, we can at least try to do the right thing assuming no other changes to the cluster state....
Fix a few hlint errors
Fix a haddock issue
hspace: fix failure handling of tryAlloc results
Currently hspace doesn't handle failures from tryAlloc correctly; this patch changes the iterateDepth function in hspace to return a Result (…) so that errors can be propagated correctly.
The patch also changes one output key to be more clear and a typo in...
Change the tryAlloc/tryReloc workflow
Currently, the tryAlloc and tryReloc functions return a list with all the results, both failures and successes. This is fine for hail, which does one round of allocations, but is not so good for hspace, which does iterative rounds; since at each (successful) step we only take the best...
Simplify the Cluster.tryAlloc structures
Currently the tryAlloc function calls the allocateOnSingle/allocateOnPair and then builds a new tuple with those functions' results plus the new node list. This is however suboptimal in two respects:
- the new nodes added are the 'old' versions of the respective nodes,...
Slight change to the internal allocation results
Currently the Cluster.AllocSolution type is defined as a list of ‘(OpResult Node.list, …)’ and the results for applyMove are defined as ‘(OpResult Node.List, …)’. Both of these mean that the failure/success indication is hidden in the first elements of this tuple, which makes it...
hspace: move instance count and score into CStats
Currently the instance count and cluster score are separated from the other initial/final phase stats, even though they are very similar. This patch moves computation of these two into totalResources/CStats and...
Export more stats in hspace
This patch changes Cluster.totalResources to compute more resources and prints them in hspace.
Fix score calculation to work with empty clusters
Currently the cluster score calculation includes an offline instance percentage, expressed as “offline inst / (offline + online inst)”, which results in NaN for empty clusters. This patch changes the calculation...
This patch changes the function Cluster.computeMoves to use guards and a couple of subexpressions in order to greatly simplify it.
Fix hlint-generated warnings
This big patch cleans up the code per hlint indications. Many removals of extra parentheses, replacements of concat . map with concatMap, extra dollar signs, eta reductions, etc. were performed.
The code still compiles and passes a couple of manual tests on sample...
Introduce a new type for allocation results
Currently the allocation/move operations workflow return ‘Maybe a’, which is very convenient but loses all details about the failure mode.
This patch introduces a new data type which encodes the specific failure...
Remove hn1 and related code
hn1 was deprecated for a while and this patch removes it altogether. The support code in Cluster.hs is also removed.
Fix totalResources avail disk computation
This uses the newly-added Node.availDisk to compute the actual available disk correctly, and display the total allocatable disk in hspace.
Add a new type for cluster statistics
Currently totalResources returns a 5-tuple of integers. This is not easy to handle, as each change on the return type means that each caller must be updated.
This patch adds a new type for cluster stats and uses that instead as...
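The benefit of a named record over a bare tuple can be sketched as below. The field names, the reduced set of three fields, and the per-node input shape are all hypothetical and chosen only for this example; they are not the actual CStats definition.

```haskell
-- | Hypothetical cluster statistics record (illustrative fields only).
data CStats = CStats
  { csFmem  :: Int  -- ^ total free memory
  , csFdsk  :: Int  -- ^ total free disk
  , csNinst :: Int  -- ^ total instance count
  } deriving (Show, Eq)

-- | All counters start at zero.
emptyCStats :: CStats
emptyCStats = CStats 0 0 0

-- | Fold one node's (free memory, free disk, instance count) into the stats.
updateCStats :: CStats -> (Int, Int, Int) -> CStats
updateCStats cs (fmem, fdsk, ninst) =
  cs { csFmem  = csFmem cs + fmem
     , csFdsk  = csFdsk cs + fdsk
     , csNinst = csNinst cs + ninst
     }

-- | Accumulate statistics over all nodes.
totalResources :: [(Int, Int, Int)] -> CStats
totalResources = foldl updateCStats emptyCStats
```

Callers read only the fields they need, for instance `csFmem (totalResources nodes)`, so adding another field to the record later does not force every call site to change the way a wider tuple would.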
Add display of more stats in hspace
This patch changes Cluster.totalResources to compute more details about the cluster status, and enhances hspace to display more of these.
Fix a haddock/docstring issue
Fix the various monomorphism warnings
In a few places (e.g. tryRead or any printf call) it's a little bit hard to add the correct type signatures, but in the end it is possible to fix these warnings (which can bite one in subtle cases).
Small changes to the node list output
This is just some cleanup of the node list output, adding pcpu/vcpu counters, and making the display slightly nicer.
Add cpu ratio to cluster calculation
Add cpu-count-related attributes to nodes
This patch adds cpu-count related attributes to nodes:
- total cpus
- cpus in use
- ratio of virtual:physical cpus
We also set correctly the cpu values at load time, but we don't do anything yet while moving instances around. The cpu ratio is shown in...
Fix the ReplacePrimary instance move
During a replace-primary instance move, on the real cluster the instance is temporarily started on the secondary, and as such we must check that the secondary node can hold it for this duration. Currently the code does not, and depending on cluster scoring it will put instances on such...
Rework the tryAlloc/tryReloc functions
Currently tryAlloc/tryReloc do not return the new instance, as this is not needed for IAllocator alloc/reloc requests. However, for computing the space, the new instance is useful, so we modify these functions to return this information too....
Add copyright/license information
This doc-patch adds copyright and license information to (hopefully) all needed files.
Small whitespace change
Move some alloc functions from hail into Cluster
These are generic enough to be used from multiple places, so they belong better in Cluster.hs than in the hail source.
Cleanup an old function
Also replace a type with its synonym.
Lots of documentation updates
This patch does only doc build changes, doc changes and function move around (for more logical documentation). It should have no impact at all on the code.
Remove an unused type synonym
Add type synonyms for the node/instance indices
This is a first step towards full datatype renaming. That requires more changes, so at first we only want to document clearly what is a node index, what is an instance index, and what is a plain Int.
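As a small illustration of how such synonyms document intent, consider the sketch below. The synonym names `Ndx` and `Idx` and the helper `primaryOf` are assumptions for this example.

```haskell
-- | Node index (assumed synonym name).
type Ndx = Int

-- | Instance index (assumed synonym name).
type Idx = Int

-- | Look up the primary node of an instance.
-- The signature now states which Int is which.
primaryOf :: [(Idx, Ndx)] -> Idx -> Maybe Ndx
primaryOf = flip lookup
```

Since a type synonym is only an alias, the compiler still accepts a node index where an instance index is expected; only a newtype would turn that mistake into a type error.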
Change the module import hierarchy
This patch makes the Types module a base module, and Node/Instance ones import it, from the previous (opposite) situation. This will allow in the future to use newtypes for the index and name types.
hail: Implement non-mirrored instance allocation
This patch implements non-mirrored instance allocation, by allocating as secondary node “noSecondary”.
Implement hail allocate (for 2-node requests)
This patch implements allocate for two-node requests. One-node requests can be done as soon as we have a valid allocateOn function for single nodes.
Working implementation of relocate
This patch completes the implementation of hail relocate. It maps all valid destination nodes through a ReplaceSecondary IMove, filters out the failed relocations, computes the resulting scores and picks the lowest one.
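The map, filter, score and pick-lowest pipeline described here can be sketched generically. The function `bestRelocation` and its parameters are hypothetical: `tryMove` stands in for applying a ReplaceSecondary move to one candidate node, and `score` for the cluster scoring function.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- | Try the move on every candidate node, keep only the successes,
-- score each resulting cluster state and return the lowest-scoring one.
bestRelocation :: (node -> Maybe cluster)  -- ^ attempt the move (Nothing = failed)
               -> (cluster -> Double)      -- ^ cluster score, lower is better
               -> [node]                   -- ^ candidate destination nodes
               -> Maybe (node, Double)
bestRelocation tryMove score candidates =
  case [ (n, score c) | n <- candidates, Just c <- [tryMove n] ] of
    []      -> Nothing
    results -> Just (minimumBy (comparing snd) results)
```

Returning `Nothing` when every candidate fails keeps the "no valid relocation" case explicit instead of calling `minimumBy` on an empty list.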
Remove most uses of ktn/kti
This patch removes all uses of ktn/kti from the past-loader stages.
Remove some extraneous uses of ktn/kti
Since we have Node/Instance.name, we can now simplify a few constructs.
Move checkData from Cluster to Loader
This moves the remaining loading function to Loader (together with its associated support functions).
More code reorganizations
This new big patch does a couple of more cleanups in the loading of data chapter:
- introduce a Types module that holds most types (except the base Node/Instance/etc.) so that multiple other modules can use these (instead of only Cluster and its users)...
Rework the loader model
This big patch changes the loader model from “string data as common format” to actual object structures as common format.
The text loading functions move from Cluster.hs to a new Text.hs module, some common functions are moved to a new Loader.hs module, and the...
Experimental support for non-redundant instances
This patch adds experimental support to hbal for non-redundant instances (i.e. instances with only one node). They are currently handled as non-moveable, and as such the algorithm simply ignores them.
Support needs to be added when reading from RAPI via hscan, and...
Small doc addition
Introduce nice errors on invalid input fields
This patch switches from plain read to a wrapper over readsPrec that returns better error messages than the built-in 'Prelude: no parse'.
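A wrapper of this kind might look like the sketch below. The `tryRead` signature, the use of `Either String` for errors and the message wording are assumptions for this example; it relies on the standard `reads`, which is `readsPrec` at the lowest precedence.

```haskell
-- | Parse a value, naming the offending field in the error message
-- instead of failing with a generic parse error.
tryRead :: Read a => String -> String -> Either String a
tryRead field s =
  case reads s of
    -- Exactly one parse that consumed the whole input.
    [(v, "")] -> Right v
    -- No parse, an ambiguous parse, or leftover characters.
    _ -> Left ("cannot parse value " ++ show s ++ " for field '" ++ field ++ "'")
```

For example, `tryRead "memory" "128" :: Either String Int` gives `Right 128`, while the input `"abc"` gives a `Left` that names both the field and the bad value.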
Split node/instance parsing into functions
This allows easy checking for valid format of the input data (row-wise).
Add initial validation checks in Cluster.loadData
This patch converts loadTabular and loadData to a monadic form, thus allowing meaningful error messages from the node/instance load routines.
Convert Cluster.loadData to Result return
This patch changes Cluster.loadData to return a Result, instead of directly the values; this will allow us to return meaningful error values (e.g. when an instance lives on an unknown node) rather than simply abort. Currently the result is always an Ok, the actual signalling of...
Don't consider offline nodes as N+1 failed
This is just a cosmetic (I hope) change; the nodes shouldn't be used anyway, and we only correct the display message.