   1 Ganeti-htools release notes
   2 ===========================
   3
   4
   5 Version 0.3.0 (Fri, 04 Feb 2011)
   6 --------------------------------
   7
   8 A significant release that breaks compatibility with Ganeti versions
   9 below 2.4 due to the node group changes. Only the RAPI backend can talk
  10 to older clusters, but it is recommended to use this version only with
  11 Ganeti 2.4.
  12
  13 All commands are now multi-group aware (but to various degrees), so
  14 allocation, balancing and capacity calculation respect the group layout
  15 and will not create “broken” instances by using nodes from different
  16 groups.
  17
  18 For a regular, single-group cluster, no changes should be directly
  19 visible to the users. A multi-group cluster however will change some
  20 things slightly:
  21
  22 - hbal will require a target group to operate on (no cluster-wide
  23   balancing yet)
  24 - evacuation of (DRBD) instances from a node will be restricted to nodes
  25   in the same group, as inter-group moves are not implemented yet
  26 - capacity, while showing correct data, will not give per-group details
  27   yet
  28
  29 There are other changes in this release:
  30
  31 - fixed a long-standing bug in hscan related to node memory data
  32 - changed the text backend format, which unfortunately invalidates old
  33   files
  34 - error handling improvements, so that invalid input data reports better
  35   where the error is
  36 - the simulation backend changes its syntax, now it takes the allocation
  37   policy too, and can generate multiple groups
  38 - (internal) man page generation moved to pandoc from hand-written,
  39   which is helpful as it can also generate HTML versions
  40 - the balancing algorithm has been changed to work in parallel, if the
  41   code is linked against the multi-threaded runtime; this gives a very
  42   good speedup (~80% on 4 cores, ~60-70% on 12 cores)
  43
  44 Version 0.2.8 (Thu, 23 Dec 2010)
  45 --------------------------------
  46
  47 A bug fix release:
  48
  49 - fixed balancing function for big clusters, which will improve corner
  50   cases where hbal didn't see any solution even though the cluster was
  51   obviously not well balanced
  52 - fixed exit code of hbal in case of (Luxi) job errors
  53 - changed the signal handling in hbal in order to make hbal control
  54   easier: instead of synchronising on the count of signals, make SIGINT
  55   cause graceful termination, and SIGTERM an immediate one
  56 - increased the tag exclusion weight so that it has greater importance
  57   during the balancing
  58 - slight improvement to the speed of balancing via algorithm tweaks
  59
  60
  61 Version 0.2.7 (Thu, 07 Oct 2010)
  62 --------------------------------
  63
  64 Bug fixes:
  65
  66 - fixed the error message for hail multi-evacuation mode
  67 - improve evacuation mode for offline secondary nodes (ignore available
  68   memory)
  69
  70 New features:
  71
  72 - add a new option ``-S`` to hbal and hspace that saves the cluster
  73   state at the end of the processing in the text format used by the
  74   ``-t`` option, for later re-processing
  75 - two new options to hbal, ``-g`` and ``--min-gain-limit``, that should
  76   help in limiting the number of balancing steps with a low gain in the
  77   final stages
  78 - hbal, when executing jobs, will now wait for the current jobs to
  79   finish at the first stop (e.g. ^C); if the user wants immediate exit,
  80   another signal should be sent
  81 - added “normalized” physical CPU units in hspace output (NPU), which
  82   represents units of physical CPUs free/used, based on the max-cpu
  83   ratio
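
As a rough illustration of the NPU figure, the sketch below assumes
that the conversion simply divides a virtual CPU count by the max-cpu
ratio. That formula, the function name and the sample numbers are all
assumptions made for illustration; they are not taken from the hspace
source.

```python
def normalized_cpu_units(vcpus, max_cpu_ratio):
    """Convert a VCPU count into physical-CPU-equivalent units.

    Assumed formula: VCPUs divided by the allowed VCPU-to-PCPU ratio.
    """
    if max_cpu_ratio <= 0:
        raise ValueError("max_cpu_ratio must be positive")
    return vcpus / max_cpu_ratio


# Hypothetical node: 8 physical CPUs, 20 VCPUs allocated, max-cpu ratio 4.
used_npu = normalized_cpu_units(20, 4.0)
free_npu = 8 - used_npu
```

Under these assumptions the node has used 5 normalized units and has 3
left, regardless of how the 20 VCPUs are spread over instances.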
  84
  85
  86 Version 0.2.6 (Mon, 26 Jul 2010)
  87 --------------------------------
  88
  89 Exactly three months since the last release. Many internal changes, plus
  90 a couple of important changes in the balancing algorithm.
  91
  92 First, the balancing may now introduce N+1 errors, if this solves other,
  93 more critical problems. For the moment, this means that moving instances
  94 away from offline nodes is allowed even if it creates N+1 errors, and
  95 that means evacuation can be done in more cases.
  96
  97 Second, the scoring for N+1 has changed. In previous versions, it simply
  98 counted the number of failing N+1 nodes, which means moving an instance
  99 away from an N+1 failed node (but without the node 'clearing' the N+1
 100 status) was not reflected in the cluster score. As such, the balancing
 101 algorithm managed to clear N+1 errors only sometimes, since usually it
 102 takes more than one move for this, and the first prerequisite move was
 103 not 'rewarded' appropriately and thus it was not selected. Now, it is
 104 possible to fix many more error cases than before: on a simulated 40
 105 node cluster full of instances (symmetrically allocated on all nodes),
 106 around five nodes can be evacuated before N+1 errors can be solved,
 107 whereas 0.2.5 could evacuate at best one node.
 108
 109 There were some other internal changes to the scoring algorithm, such
 110 that now the metrics have associated weights, and they are not all of
 111 the same importance anymore. As of now, the only change is that offline
 112 instances have a higher weight, which should favour proper node
 113 evacuations.
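
The idea of weighted metrics can be sketched as follows. This is only
an illustration: it assumes the score is a plain weighted sum, and the
metric names and weight values are hypothetical rather than the ones
used by the tools.

```python
def cluster_score(metrics, weights):
    """Weighted sum of metric values.

    Metrics without an explicit weight count with weight 1.0
    (an assumption made for this sketch).
    """
    return sum(value * weights.get(name, 1.0)
               for name, value in metrics.items())


# Hypothetical metric values; lower scores mean a better cluster.
metrics = {"mem_dev": 0.25, "disk_dev": 0.5, "offline_instances": 2.0}
# Only the offline-instances metric gets a higher weight.
weights = {"offline_instances": 4.0}
score = cluster_score(metrics, weights)
```

With a higher weight on the offline-instances metric, a move that
takes an instance off an offline node lowers the score more than a
move that only improves memory or disk balance by a similar amount.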
 114
 115 Among the other changes:
 116
 117 - fixed the hspace KM_POOL_* metrics, which were returned as the final
 118   state and not as the delta between the initial and final states
 119 - fixed hspace handling of N+1 failing clusters: before, it used to
 120   generate a 'fake' response, and the structure of this response was not
 121   always in sync with the real responses, leading to missing items;
 122   currently it proceeds correctly through the code (skipping the
 123   computation), and uses the same display mechanisms as the normal case
 124 - fixed hscan exit code for RAPI failures: previously it finished with
 125   success even if all the clusters failed, which was creating issues
 126   with the live-test script; now it exits with exit code 2 for RAPI
 127   failures (unfortunately this is still not optimal as LUXI failures
 128   will use exit code 1, the same as the command line)
 129 - changed the limit values for CPU/disk, which previously were used
 130   optionally, whereas now they are always used; the default cpu ratio
 131   limit is now 64 VCPUs per PCPU
 132 - changed the internal handling of the short name vs. original
 133   (Ganeti-provided) name; now internally we always use the full name,
 134   and only in display routines we show the shortened (called 'alias')
 135   name; as a result, the -O and --excluded-instances options now accept
 136   both the full name and the shortened name
 137 - changed internal handling of JSON conversions and errors, such that
 138   now we show a better context for failure messages, which should help
 139   with diagnosing the malformed message
 140 - changed the names for a few node fields, and added some more fields;
 141   this is most likely to help with debugging, and not with regular
 142   operation though
 143 - changed the node fields option to allow the '+' prefix to mean 'extend
 144   the default fields list' rather than start from fresh (similar to
 145   Ganeti's implementation)
 146 - a few internal changes related to the LUXI protocol implementation,
 147   which should make it safer against potential bugs, one
 148   optimization that should help with large messages, and some patches
 149   in preparation for potential expansion of the LUXI backend functionality
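
The '+' prefix behaviour for the node fields option described above
can be sketched like this. The default field names and the function
name are hypothetical, and the real option parsing may differ in
details such as how empty entries are handled.

```python
# Hypothetical default field list, for illustration only.
DEFAULT_FIELDS = ["name", "status", "free_mem"]


def resolve_fields(arg, defaults=DEFAULT_FIELDS):
    """Turn a comma-separated field argument into a list of fields.

    A leading '+' extends the defaults; otherwise the argument
    replaces them entirely.
    """
    if arg.startswith("+"):
        extra = [f for f in arg[1:].split(",") if f]
        return list(defaults) + extra
    return [f for f in arg.split(",") if f]
```

For example, ``resolve_fields("+pcpu,vcpu")`` keeps the three defaults
and appends the two extra fields, while ``resolve_fields("name,pcpu")``
returns only the two fields given.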
 150
 151 And finally, many improvements on unittests and the live-test
 152 script. Test coverage is much enhanced, and the test infrastructure has
 153 better error reporting; this should lead down-the-road to better code
 154 and fewer bugs…
 155
 156
 157 Version 0.2.5 (Mon, 26 Apr 2010)
 158 --------------------------------
 159
 160 Some internal cleanup plus a few user-visible changes:
 161
 162 - new option for marking instances as 'do-not-move' during rebalancing
 163 - allow ``hscan`` to scan the local cluster via Luxi
 164 - add more metrics to ``hspace`` which show the delta between original
 165   state and final state better (only valid for tiered allocation)
 166
 167
 168 Version 0.2.4 (Mon, 22 Feb 2010)
 169 --------------------------------
 170
 171 Two improvements for node evacuation:
 172
 173 - hbal takes a new parameter ``--evac-mode`` that restricts the
 174   instances to be moved to the ones on offline/drained nodes, which
 175   should reduce the work done
 176 - hail supports the new ``multi-evacuate`` mode of the IAllocator
 177   protocol, that will be released in a minor release on the Ganeti 2.1
 178   branch
 179
 180
 181 Version 0.2.3 (Thu,  4 Feb 2010)
 182 --------------------------------
 183
 184 A small release:
 185
 186 - Fixes selection of secondary node: previously, if the cluster had
 187   many N+1 failures, an N+1 failed node could be selected as secondary
 188   even if it did not have enough memory to allow the instance to be
 189   migrated/failed over to it; this is bad for automated tools, since
 190   we can get the cluster in an unhealthy state
 191 - Switch the text backend to a single input file, that is generated
 192   now by hscan and shouldn't be generated manually via
 193   gnt-node/instance list anymore; this allows richer information to be
 194   kept in the file, and slightly simplifies the internals of the text
 195   backend
 196
 197
 198 Version 0.2.2 (Tue, 29 Dec 2009)
 199 --------------------------------
 200
 201 Small release, 0.2.1 was broken and thus this was released earlier:
 202
 203 - Release 0.2.1 broke the LUXI backend due to a typo, fixed
 204 - Added a live-test script that should catch errors like the above one
 205   in the future (needs a working, non-empty cluster)
 206 - Changed RAPI and LUXI backends to treat drained nodes as offline,
 207   similar to the IAllocator backend change in 0.2.0 (which was wrongly
 208   marked as affecting all backends)
 209 - Changed the metrics for offline instances and N1 score from percent to
 210   count, in order to increase the priority of evacuations
 211 - Added a new metric (offline primary instances) which should fix the
 212   evacuation of an offline node in a 2-node cluster
 213
 214
 215 Version 0.2.1 (Wed,  2 Dec 2009)
 216 --------------------------------
 217
 218 - Added instance exclusion defined via instance tags
 219 - Fixed the output of hspace to be again parseable from the shell
 220
 221
 222 Version 0.2.0 (Tue, 10 Nov 2009)
 223 --------------------------------
 224
 225 A significant release, with a few new major features:
 226
 227 - Added direct execution of the hbal solution when using the Luxi
 228   backend; the steps for each instance move are submitted as a single
 229   job, and the different jobs are submitted as groups in order to
 230   parallelise the execution of moves
 231 - Added support for balancing based on dynamic utilisation data for
 232   instances, fed in via a text file; by default, all instances are
 233   considered equal and this change also improves the equalisation of
 234   secondary instances per node
 235 - Added support for tiered capacity calculation in hspace, where we
 236   start from a maximum instance spec and decrease the spec when we run
 237   out of resources; this should give a better measure of available
 238   capacity on 'fragmented' clusters; this is done separately from the
 239   current fixed-mode computation
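
The tiered capacity idea can be illustrated with a deliberately
simplified model: a single pool of free memory and an instance spec
that consists of a memory size only. The function name, the fixed step
and the numbers are assumptions for this sketch; hspace itself works
on whole clusters and on several resources at once.

```python
def tiered_capacity(free_mem, max_spec, min_spec, step):
    """Count how many instances fit, trying the largest spec first.

    Returns a dict mapping spec size to the number of instances
    placed at that size.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    placed = {}
    spec = max_spec
    while spec >= min_spec:
        if free_mem >= spec:
            # The current spec still fits: place one more instance.
            free_mem -= spec
            placed[spec] = placed.get(spec, 0) + 1
        else:
            # Out of resources for this spec: fall back to a smaller one.
            spec -= step
    return placed


result = tiered_capacity(free_mem=2500, max_spec=1024, min_spec=256, step=256)
```

Here two instances fit at the 1024 tier and one more at the 256 tier,
whereas a fixed 1024 spec would report only two instances and leave
the remaining fragment unaccounted for.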
 240
 241 Also there have been many minor improvements:
 242
 243 - Added option for showing instances (“--print-instances”), similar to
 244   the print nodes option
 245 - Added support for customising the node list via an argument to the
 246   print nodes option in the form of a comma-separated list of field
 247   names; currently the field names are not documented, expecting further
 248   changes in a future release
 249 - Enhanced the error reporting in the Luxi and Rapi backends
 250 - Changed the handling of drained nodes, now being treated the same as
 251   offline nodes, for Ganeti 2.0.4+ compatibility
 252 - A number of internal changes, simplifying code and merging some
 253   disparate functions
 254 - Simplify the build system in relation to creation of archives
 255
 256
 257 Version 0.1.8 (Tue, 29 Sep 2009)
 258 --------------------------------
 259
 260 - Brown-paper-bag release fixing haddock issues
 261
 262
 263 Version 0.1.7 (Mon, 28 Sep 2009)
 264 --------------------------------
 265
 266 - Fixed a bug in the Luxi backend for big responses
 267 - Fixed test suite exit code in presence of test failures
 268 - Changed the migrate operation to run failover instead for instances
 269   which were marked as not running in the input data (this could have
 270   been changed since then, but it's better than today's always migrate)
 271 - Added support for 'cheap' moves only (only migrate/failover) in
 272   balancing
 273 - Added support for building without curl (thus no RAPI backend)
 274
 275
 276 Version 0.1.6 (Wed, 19 Aug 2009)
 277 --------------------------------
 278
 279 - Added support for Luxi (the native Ganeti protocol)
 280 - Added support for simulated clusters (for hspace only)
 281 - Added timeouts for the RAPI backend
 282 - Fixed a few inconsistencies in the command line handling
 283 - Fixed handling of errors while loading data
 284 - The 'network' is a new dependency due to the Luxi addition
 285
 286
 287 Version 0.1.5 (Thu, 09 Jul 2009)
 288 --------------------------------
 289
 290 - Removed obsolete hn1 program; this allowed removal of a lot of
 291   supporting code
 292 - Lots of changes in hspace: the output is now a shell fragment so that
 293   scripts can source or parse it more easily; added failure reasons;
 294   optimised to use less memory for large clusters
 295 - Optimized the scoring algorithm (used by all tools) so that now
 296   computations should be faster
 297
 298
 299 Version 0.1.4 (Tue, 16 Jun 2009)
 300 --------------------------------
 301
 302 - Added CPU count/ratio of virtual-to-physical CPUs to the cluster
 303   scoring methods; this means that now the balancer, the iallocator
 304   plugin and so on will try to keep the VCPU-to-PCPU ratio equal across
 305   the cluster
 306 - Fixed some hscan bugs
 307 - Fixed the way iallocator reads the total disk size (was broken and it
 308   was always falling back to summing the disk sizes)
 309 - Internals: fixed most compile-time warnings
 310
 311
 312 Version 0.1.3 (Fri, 05 Jun 2009)
 313 --------------------------------
 314
 315 - Fix a bug in the ReplacePrimary instance moves, affecting most of the
 316   tools
 317
 318
 319 Version 0.1.2 (Tue, 02 Jun 2009)
 320 --------------------------------
 321
 322 - Add a new program, “hspace”, which computes the free space on a
 323   cluster (based on a given instance spec)
 324 - Improvements in API docs and partially in the user docs
 325 - Started adding unittests
 326
 327
 328 Version 0.1.1 (Tue, 26 May 2009)
 329 --------------------------------
 330
 331 - Add a new program, “hail”, which is an iallocator plugin and can
 332   allocate/relocate instances
 333 - Experimental support for non-mirrored instances (hail supports them,
 334   hbal should no longer abort when it finds such instances and simply
 335   ignore them)
 336 - The RAPI port and/or scheme can be overridden now, and even “file://”
 337   schemes can be used if the message body has been saved under the
 338   appropriate name
 339 - Lots of code reorganization, esp. rewritten loading pipeline
 340 - Better data checking and better error messages in case validation
 341   fails; tools now consider nodes with errors in input data (‘?’ returned
 342   by ganeti) as offline
 343 - Small enhancement to the makefile for simpler packaging
 344
 345
 346 Version 0.1.0 (Tue, 19 May 2009)
 347 --------------------------------
 348
 349 - Drop compatibility with Ganeti 1.2
 350 - Add a new minimum score option (with a very low default), should help
 351   with very good clusters (but is still not optimal)
 352 - Add a --quiet option to hbal
 353 - Add support for reading offline nodes directly from the cluster
 354
 355
 356 Version 0.0.8 (Tue, 21 Apr 2009)
 357 --------------------------------
 358
 359 - hbal: prevent mismatches in wrong node names being passed to -O, by
 360   aborting in this case
 361 - add the ability to write the commands (-C) to a script via (-C<file>),
 362   so that it can be later executed directly; this has also changed the
 363   commands to include the necessary -f flags to skip confirmations
 364 - add checks for extra arguments in hbal and hn1, so that unintended
 365   errors are caught
 366 - raise the accepted “missing” memory limit to 512MB, to cover usual Xen
 367   reservations
 368
 369
 370 Version 0.0.7 (Mon, 23 Mar 2009)
 371 --------------------------------
 372
 373 - added support for offline nodes, which are not used as targets for
 374   instance relocation and if they hold instances the hbal algorithm will
 375   attempt to relocate these away
 376 - added support for offline instances, which now will no longer skew the
 377   free memory estimation of nodes; the algorithm will no longer create
 378   conditions for N+1 failures when such instances are later started
 379 - implemented a complete model of node resources, in order to prevent an
 380   unintended re-occurrence of cases like the offline instance where we
 381   miscalculate some node resource; this now gives a warning in case the
 382   node reported free disk or free memory deviates by more than a set
 383   amount from the expected value
 384 - a new tool *hscan* that can generate the input text-file for the other
 385   tools by collection via RAPI
 386 - some small changes to the build system to make it more friendly; also
 387   included the generated documentation in the source archive
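
The node resource check mentioned in the list above can be sketched
for memory as follows. The accounting used here (total memory minus
node overhead minus instance memory), the parameter names and the
message text are assumptions for illustration, not the model the tools
implement.

```python
def check_free_memory(total, node_overhead, instance_mem,
                      reported_free, tolerance):
    """Compare reported free memory with the value the model expects.

    Returns a warning string when the deviation exceeds the
    tolerance, or None when the node looks consistent.
    All values are in the same unit (for example MiB).
    """
    expected_free = total - node_overhead - sum(instance_mem)
    deviation = abs(expected_free - reported_free)
    if deviation > tolerance:
        return ("free memory deviates by %d from the expected %d"
                % (deviation, expected_free))
    return None
```

A node that reports 1500 free where 1536 is expected passes with a
tolerance of 100, while a node that reports 1000 triggers the warning.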
 388
 389
 390 Version 0.0.6 (Mon, 16 Mar 2009)
 391 --------------------------------
 392
 393 - re-factored the hbal algorithm to make it stable in the sense that it
 394   gives the same solution when restarted from the middle; barring
 395   rounding of disk/memory and incomplete reporting from Ganeti (for
 396   1.2), it should be now feasible to rely on its output without
 397   generating moves ad infinitum
 398 - the hbal algorithm now uses two more variables: the node N+1 failures
 399   and the amount of reserved memory; the first of which tries to ‘fix’
 400   the N+1 status, the latter tries to distribute secondaries more
 401   equally
 402 - the hbal algorithm now uses two more moves at each step:
 403   replace+failover and failover+replace (besides the original failover,
 404   replace, and failover+replace+failover)
 405 - slightly changed the build system to embed GIT version/tags into the
 406   binaries so that we know for a binary from which tree it was done,
 407   either via ‘--version’ or via “strings hbal|grep version”
 408 - changed the solution list and in general the hbal output to be more
 409   clear by default, and changed “gnt-instance failover” to “gnt-instance
 410   migrate”
 411 - added man pages for the two binaries
 412
 413
 414 Version 0.0.5 (Mon, 09 Mar 2009)
 415 --------------------------------
 416
 417 - a few small improvements for hbal (possibly undone by later changes),
 418   hbal is now quite a bit faster
 419 - fix documentation building
 420 - allow hbal to work on non N+1 compliant clusters, but without
 421   guarantees that the end cluster will be compliant; in any case, this
 422   should give a smaller number of nodes that are not compliant if the
 423   cluster state permits it
 424 - strip common domain suffix from nodes and instances, so that output is
 425   shorter and hopefully clearer
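
Stripping a common domain suffix can be sketched as below. The sketch
assumes the comparison is done on whole dot-separated labels and that
the first label of each name is always kept; the function name is
hypothetical and the tools may compute the suffix differently.

```python
def strip_common_suffix(names):
    """Remove the trailing domain labels shared by all names."""
    if len(names) < 2:
        return list(names)
    split = [n.split(".") for n in names]
    # Never strip the first label, so every name stays non-empty.
    max_common = min(len(parts) for parts in split) - 1
    common = 0
    while (common < max_common
           and len({tuple(p[-(common + 1):]) for p in split}) == 1):
        common += 1
    if common == 0:
        return list(names)
    return [".".join(p[:-common]) for p in split]
```

With this approach ``node1.example.com`` and ``node2.example.com``
become ``node1`` and ``node2``, while names that share only the last
label keep their distinguishing middle part.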
 426
 427
 428 Version 0.0.4 (Sun, 15 Feb 2009)
 429 --------------------------------
 430
 431 - better balancing algorithm in hbal
 432 - implemented an RAPI collector, now the cluster data can be gathered
 433   automatically via RAPI and doesn't need manual export of node and
 434   instance list
 435
 436
 437 Version 0.0.3 (Wed, 28 Jan 2009)
 438 --------------------------------
 439
 440 - initial release of hbal, a cluster rebalancing tool
 441 - input data format changed due to hbal requirements
 442
 443
 444 Version 0.0.2 (Tue, 06 Jan 2009)
 445 --------------------------------
 446
 447 - fix handling of some common cases (cluster N+1 compliant from the
 448   start, too big depth given, failure to compute solution)
 449 - add option to print the needed command list for reaching the proposed
 450   solution
 451
 452
 453 Version 0.0.1 (Tue, 06 Jan 2009)
 454 --------------------------------
 455
 456 - initial release of hn1 tool
 457
 458 .. vim: set textwidth=72 :
 459 .. Local Variables:
 460 .. mode: rst
 461 .. fill-column: 72
 462 .. End: