X-Git-Url: https://code.grnet.gr/git/ganeti-local/blobdiff_plain/bbd1d27390f9e3a664de09aa2e951ecf110e54a9..db4d9a9bdcf4e2500660ede126c955e2eee73b6c:/NEWS

diff --git a/NEWS b/NEWS
index e4acdb3..e014cf3 100644
--- a/NEWS
+++ b/NEWS
@@ -1,22 +1,406 @@
-Version 0.0.5:
-  - a few small improvements for hbal (possibly undone by later
-    changes), hbal is now quite faster
-  - fix documentation building
-  - allow hbal to work on non N+1 compliant clusters, but without
-    guarantees that the end cluster will be compliant; in any case, this
-    should give a smaller number of nodes that are not compliant if the
-    cluster state permits it
-  - strip common domain suffix from nodes and instances, so that output
-    is shorter and hopefully clearer
-
-Version 0.0.4:
-  - better balancing algorithm in hbal
-  - implemented an RAPI collector, now the cluster data can be gathered
-    automatically via RAPI and doesn't need manual export of node and
-    instance list
-
-Version 0.0.3:
-  - initial release of the hbal, a cluster rebalancing tool
-  - input data format changed due to hbal requirements
-
-Previous version was initial announcement.
+Ganeti-htools release notes
+===========================
+
+
+Version 0.2.7 (Thu, 07 Oct 2010)
+--------------------------------
+
+Bug fixes:
+
+- fixed the error message for hail multi-evacuation mode
+- improve evacuation mode for offline secondary nodes (ignore available
+  memory)
+
+New features:
+
+- add a new option ``-S`` to hbal and hspace that saves the cluster
+  state at the end of the processing in the text format used by the
+  ``-t`` option, for later re-processing
+- two new options to hbal, -g and --min-gain-limit, that should help
+  in limiting the number of balancing steps with a low gain in the final
+  stages
+- hbal, when executing jobs, will now wait for the current jobs to
+  finish at the first stop (e.g. ^C); if the user wants immediate exit,
+  another signal should be sent
+- added “normalized” physical CPU units in hspace output (NPU), which
+  represents units of physical CPUs free/used, based on the max-cpu
+  ratio
+
+
+Version 0.2.6 (Mon, 26 Jul 2010)
+--------------------------------
+
+Exactly three months since the last release. Many internal changes, plus
+a couple of important changes in the balancing algorithm.
+
+First, the balancing may now introduce N+1 errors, if this solves other,
+more critical problems. For the moment, this means that moving instances
+away from offline nodes is allowed even if it creates N+1 errors, and
+that means evacuation can be done in more cases.
+
+Second, the scoring for N+1 has changed. In previous versions, it simply
+counted the number of failing N+1 nodes, which means moving an instance
+away from an N+1 failed node (but without the node 'clearing' the N+1
+status) was not reflected in the cluster score. As such, the balancing
+algorithm managed to clear N+1 errors only sometimes, since usually it
+takes more than one move for this, and the first prerequisite move was
+not 'rewarded' appropriately and thus it was not selected. Now, it is
+possible to fix many more error cases than before: on a simulated 40
+node cluster full of instances (symmetrically allocated on all nodes),
+around five nodes can be evacuated before N+1 errors can be solved,
+whereas 0.2.5 could evacuate at best one node.
+
+There were some other internal changes to the scoring algorithm, such
+that now the metrics have associated weights, and they are not all of
+the same importance anymore. As of now, the only change is that offline
+instances have a higher weight, which should favour proper node
+evacuations.
+
+Among the other changes:
+
+- fixed the hspace KM_POOL_* metrics, which were returned as the final
+  state and not as the delta between the initial and final states
+- fixed hspace handling of N+1 failing clusters: before, it used to
+  generate a 'fake' response, and the structure of this response was not
+  always in sync with the real responses, leading to missing items;
+  currently it proceeds correctly through the code (skipping the
+  computation), and uses the same display mechanisms as the normal case
+- fixed hscan exit code for RAPI failures: previously it finished with
+  success even if all the clusters failed, which was creating issues
+  with the live-test script; now it exits with exit code 2 for RAPI
+  failures (unfortunately this is still not optimal as LUXI failures
+  will use exit code 1, the same as the command line)
+- changed the limit values for CPU/disk, which previously were used
+  optionally, whereas now they are always used; the default cpu ratio
+  limit is now 64 VCPUs per PCPU
+- changed the internal handling of the short name vs. original
+  (Ganeti-provided) name; now internally we always use the full name,
+  and only in display routines we show the shortened (called 'alias')
+  name; as a result, the -O and --excluded-instances options now accept
+  both the full name and the shortened name
+- changed internal handling of JSON conversions and errors, such that
+  now we show a better context for failure messages, which should help
+  with diagnosing the malformed message
+- changed the names for a few node fields, and added some more nodes;
+  this is most likely to help with debugging, and not with regular
+  operation though
+- changed the node fields option to allow the '+' prefix to mean 'extend
+  the default fields list' rather than start from fresh (similar to
+  Ganeti's implementation)
+- a few internal changes related to the LUXI protocol implementation,
+  which should make it safer against potential bugs, one
+  optimization that should help with large messages, and some patches
+  in preparation for potential expansion of the LUXI backend functionality
+
+And finally, many improvements on unittests and the live-test
+script. Test coverage is much enhanced, and the test infrastructure has
+better error reporting; this should lead down-the-road to better code
+and fewer bugs…
+
+
+Version 0.2.5 (Mon, 26 Apr 2010)
+--------------------------------
+
+Some internal cleanup plus a few user-visible changes:
+
+- new option for marking instances as 'do-not-move' during rebalancing
+- allow ``hscan`` to scan the local cluster via Luxi
+- add more metrics to ``hspace`` which show the delta between original
+  state and final state better (only valid for tiered allocation)
+
+
+Version 0.2.4 (Mon, 22 Feb 2010)
+--------------------------------
+
+Two improvements for node evacuation:
+
+- hbal takes a new parameter ``--evac-mode`` that restricts the
+  instances to be moved to the ones on offline/drained nodes, which
+  should reduce the work done
+- hail supports the new ``multi-evacuate`` mode of the IAllocator
+  protocol, that will be released in a minor release on the Ganeti 2.1
+  branch
+
+
+Version 0.2.3 (Thu, 4 Feb 2010)
+--------------------------------
+
+A small release:
+
+- Fixes selection of secondary node: previously, if the cluster had
+  many N+1 failures, an N+1 failed node could be selected as secondary
+  even if it did not have enough memory to allow the instance to be
+  migrated/failed over to it; this is bad for automated tools, since
+  we can get the cluster in an unhealthy state
+- Switch the text backend to a single input file, that is generated
+  now by hscan and shouldn't be generated manually via
+  gnt-node/instance list anymore; this allows richer information to be
+  kept in the file, and slightly simplifies the internals of the text
+  backend
+
+
+Version 0.2.2 (Tue, 29 Dec 2009)
+--------------------------------
+
+Small release, 0.2.1 was broken and thus this was released earlier:
+
+- Release 0.2.1 broke the LUXI backend due to a typo, fixed
+- Added a live-test script that should catch errors like the above one
+  in the future (needs a working, non-empty cluster)
+- Changed RAPI and LUXI backends to treat drained nodes as offline,
+  similar to the IAllocator backend change in 0.2.0 (which was wrongly
+  marked as affecting all backends)
+- Changed the metrics for offline instances and N1 score from percent to
+  count, in order to increase the priority of evacuations
+- Added a new metric (offline primary instances) which should fix the
+  evacuation of an offline node in a 2-node cluster
+
+
+Version 0.2.1 (Wed, 2 Dec 2009)
+--------------------------------
+
+- Added instance exclusion defined via instance tags
+- Fixed the output of hspace to be again parseable from the shell
+
+
+Version 0.2.0 (Tue, 10 Nov 2009)
+--------------------------------
+
+A significant release, with a few new major features:
+
+- Added direct execution of the hbal solution when using the Luxi
+  backend; the steps for each instance move are submitted as a single
+  job, and the different jobs are submitted as groups in order to
+  parallelise the execution of moves
+- Added support for balancing based on dynamic utilisation data for
+  instances, fed in via a text file; by default, all instances are
+  considered equal and this change also improves the equalisation of
+  secondary instances per node
+- Added support for tiered capacity calculation in hspace, where we
+  start from a maximum instance spec and decrease the spec when we run
+  out of resources; this should give a better measure of available
+  capacity on 'fragmented' clusters; this is done separately from the
+  current fixed-mode computation
+
+Also there have been many minor improvements:
+
+- Added option for showing instances (“--print-instances”), similar to
+  the print nodes option
+- Added support for customising the node list via an argument to the
+  print nodes option in the form of a comma-separated list of field
+  names; currently the field names are not documented, expecting further
+  changes in a next release
+- Enhanced the error reporting in the Luxi and Rapi backends
+- Changed the handling of drained nodes, now being treated the same as
+  offline nodes, for Ganeti 2.0.4+ compatibility
+- A number of internal changes, simplifying code and merging some
+  disparate functions
+- Simplify the build system in relation to creation of archives
+
+
+Version 0.1.8 (Tue, 29 Sep 2009)
+--------------------------------
+
+- Brown-paper-bag release fixing haddock issues
+
+
+Version 0.1.7 (Mon, 28 Sep 2009)
+--------------------------------
+
+- Fixed a bug in the Luxi backend for big responses
+- Fixed test suite exit code in presence of test failures
+- Changed the migrate operation to run failover instead for instances
+  which were marked as not running in the input data (this could have
+  been changed since then, but it's better than today's always migrate)
+- Added support for 'cheap' moves only (only migrate/failover) in
+  balancing
+- Added support for building without curl (thus no RAPI backend)
+
+
+Version 0.1.6 (Wed, 19 Aug 2009)
+--------------------------------
+
+- Added support for Luxi (the native Ganeti protocol)
+- Added support for simulated clusters (for hspace only)
+- Added timeouts for the RAPI backend
+- Fixed a few inconsistencies in the command line handling
+- Fixed handling of errors while loading data
+- The 'network' is a new dependency due to the Luxi addition
+
+
+Version 0.1.5 (Thu, 09 Jul 2009)
+--------------------------------
+
+- Removed obsolete hn1 program; this allowed removal of a lot of
+  supporting code
+- Lots of changes in hspace: the output now is a shell fragment in order
+  for scripts to source it or parse it more easily; added failure
+  reasons; optimised to use less memory for large clusters
+- Optimized the scoring algorithm (used by all tools) so that now
+  computations should be faster
+
+
+Version 0.1.4 (Tue, 16 Jun 2009)
+--------------------------------
+
+- Added CPU count/ratio of virtual-to-physical CPUs to the cluster
+  scoring methods; this means that now the balancer, the iallocator
+  plugin and so on will try to keep the VCPU-to-PCPU ratio equal across
+  the cluster
+- Fixed some hscan bugs
+- Fixed the way iallocator reads the total disk size (was broken and it
+  was always falling back to summing the disk sizes)
+- Internals: fixed most compile-time warnings
+
+
+Version 0.1.3 (Fri, 05 Jun 2009)
+--------------------------------
+
+- Fix a bug in the ReplacePrimary instance moves, affecting most of the
+  tools
+
+
+Version 0.1.2 (Tue, 02 Jun 2009)
+--------------------------------
+
+- Add a new program, “hspace”, which computes the free space on a
+  cluster (based on a given instance spec)
+- Improvements in API docs and partially in the user docs
+- Started adding unittests
+
+
+Version 0.1.1 (Tue, 26 May 2009)
+--------------------------------
+
+- Add a new program, “hail”, which is an iallocator plugin and can
+  allocate/relocate instances
+- Experimental support for non-mirrored instances (hail supports them,
+  hbal should no longer abort when it finds such instances and simply
+  ignore them)
+- The RAPI port and/or scheme can be overridden now, and even “file://”
+  schemes can be used if the message body has been saved under the
+  appropriate name
+- Lots of code reorganization, esp. rewritten loading pipeline
+- Better data checking and better error messages in case validation
+  fails; tools now consider nodes with error in input data (‘?’ returned
+  by ganeti) as offline
+- Small enhancement to the makefile for simpler packaging
+
+
+Version 0.1.0 (Tue, 19 May 2009)
+--------------------------------
+
+- Drop compatibility with Ganeti 1.2
+- Add a new minimum score option (with a very low default), should help
+  with very good clusters (but is still not optimal)
+- Add a --quiet option to hbal
+- Add support for reading offline nodes directly from the cluster
+
+
+Version 0.0.8 (Tue, 21 Apr 2009)
+--------------------------------
+
+- hbal: prevent mismatches in wrong node names being passed to -O, by
+  aborting in this case
+- add the ability to write the commands (-C) to a script via (-C),
+  so that it can be later executed directly; this has also changed the
+  commands to include the necessary -f flags to skip confirmations
+- add checks for extra argument in hbal and hn1, so that unintended
+  errors are caught
+- raise the accepted “missing” memory limit to 512MB, to cover usual Xen
+  reservations
+
+
+Version 0.0.7 (Mon, 23 Mar 2009)
+--------------------------------
+
+- added support for offline nodes, which are not used as targets for
+  instance relocation and if they hold instances the hbal algorithm will
+  attempt to relocate these away
+- added support for offline instances, which now will no longer skew the
+  free memory estimation of nodes; the algorithm will no longer create
+  conditions for N+1 failures when such instances are later started
+- implemented a complete model of node resources, in order to prevent an
+  unintended re-occurrence of cases like the offline instance where we
+  miscalculate some node resource; this now gives a warning in case the
+  node reported free disk or free memory deviates by more than a set
+  amount from the expected value
+- a new tool *hscan* that can generate the input text-file for the other
+  tools by collection via RAPI
+- some small changes to the build system to make it more friendly; also
+  included the generated documentation in the source archive
+
+
+Version 0.0.6 (Mon, 16 Mar 2009)
+--------------------------------
+
+- re-factored the hbal algorithm to make it stable in the sense that it
+  gives the same solution when restarted from the middle; barring
+  rounding of disk/memory and incomplete reporting from Ganeti (for
+  1.2), it should be now feasible to rely on its output without
+  generating moves ad infinitum
+- the hbal algorithm now uses two more variables: the node N+1 failures
+  and the amount of reserved memory; the first of which tries to ‘fix’
+  the N+1 status, the latter tries to distribute secondaries more
+  equally
+- the hbal algorithm now uses two more moves at each step:
+  replace+failover and failover+replace (besides the original failover,
+  replace, and failover+replace+failover)
+- slightly changed the build system to embed GIT version/tags into the
+  binaries so that we know for a binary from which tree it was done,
+  either via ‘--version’ or via “strings hbal|grep version”
+- changed the solution list and in general the hbal output to be more
+  clear by default, and changed “gnt-instance failover” to “gnt-instance
+  migrate”
+- added man pages for the two binaries
+
+
+Version 0.0.5 (Mon, 09 Mar 2009)
+--------------------------------
+
+- a few small improvements for hbal (possibly undone by later changes),
+  hbal is now quite a bit faster
+- fix documentation building
+- allow hbal to work on non N+1 compliant clusters, but without
+  guarantees that the end cluster will be compliant; in any case, this
+  should give a smaller number of nodes that are not compliant if the
+  cluster state permits it
+- strip common domain suffix from nodes and instances, so that output is
+  shorter and hopefully clearer
+
+
+Version 0.0.4 (Sun, 15 Feb 2009)
+--------------------------------
+
+- better balancing algorithm in hbal
+- implemented an RAPI collector, now the cluster data can be gathered
+  automatically via RAPI and doesn't need manual export of node and
+  instance list
+
+
+Version 0.0.3 (Wed, 28 Jan 2009)
+--------------------------------
+
+- initial release of hbal, a cluster rebalancing tool
+- input data format changed due to hbal requirements
+
+
+Version 0.0.2 (Tue, 06 Jan 2009)
+--------------------------------
+
+- fix handling of some common cases (cluster N+1 compliant from the
+  start, too big depth given, failure to compute solution)
+- add option to print the needed command list for reaching the proposed
+  solution
+
+
+Version 0.0.1 (Tue, 06 Jan 2009)
+--------------------------------
+
+- initial release of hn1 tool
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End:
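
The '+' prefix behaviour of the node fields option, described in the
0.2.6 notes in this diff, can be sketched in a few lines. This is only
an illustration, written in Python rather than the Haskell that htools
uses; the function name ``resolve_fields`` and the default field names
are hypothetical and are not taken from the htools source.

```python
# Hypothetical default field list; the real htools defaults are not
# given in the release notes.
DEFAULT_FIELDS = ["name", "pmem", "pdsk"]


def resolve_fields(arg, defaults=DEFAULT_FIELDS):
    """Turn a comma-separated field argument into the list of fields.

    A leading '+' means "extend the default list"; without it the
    argument replaces the defaults entirely.
    """
    extend = arg.startswith("+")
    raw = arg[1:] if extend else arg
    # Split on commas and drop empty entries such as a trailing comma.
    requested = [name.strip() for name in raw.split(",") if name.strip()]
    if extend:
        # Defaults come first; requested names already in the defaults
        # are skipped so they are not listed twice.
        return list(defaults) + [n for n in requested if n not in defaults]
    return requested
```

With these assumed defaults, ``resolve_fields("+fmem")`` keeps the three
default fields and appends ``fmem``, while ``resolve_fields("name,fmem")``
starts from scratch and returns only the two named fields.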