Ganeti-htools release notes
===========================


**Note**: After version 0.3.1, the htools sources have been integrated
into the ganeti core repository, and released together with the ganeti
releases. Thus this NEWS file is obsolete.

Version 0.3.1 (Fri, 11 Mar 2011)
--------------------------------

Minor bugfix release:

- Fixed source archive generation: the hscolour.css file was an invalid
  symlink, and the man pages were not correctly timestamped (leading to
  unneeded build-time rebuilds)
- Improved the Luxi backend to show which attribute fails parsing
- Small improvements to the man pages, and also ship the HTML version of
  man pages in the source archive


Version 0.3.0 (Fri, 04 Feb 2011)
--------------------------------

A significant release that breaks compatibility with Ganeti versions
below 2.4 due to the node group changes. Only the RAPI backend can talk
to older clusters, but it is recommended to use this version only with
Ganeti 2.4.

All commands are now multi-group aware (but to various degrees), so
allocation, balancing and capacity calculation respect the group layout
and will not create “broken” instances by using nodes from different
groups.

For a regular, single-group cluster, no changes should be directly
visible to the users. A multi-group cluster however will change some
things slightly:

- hbal will require a target group to operate on (no cluster-wide
  balancing yet)
- evacuation of (DRBD) instances from a node will be restricted to nodes
  in the same group, as inter-group moves are not implemented yet
- capacity, while showing correct data, will not give per-group details
  yet

There are other changes in this release:

- fixed a long-standing bug in hscan related to node memory data
- changed the text backend format, which unfortunately invalidates old
  files
- error handling improvements, so that invalid input data reports better
  where the error is
- the simulation backend changed its syntax: it now takes the allocation
  policy too, and can generate multiple groups
- (internal) man page generation moved to pandoc from hand-written,
  which is helpful as it can also generate HTML versions
- the balancing algorithm has been changed to work in parallel, if the
  code is linked against the multi-threaded runtime; this gives a very
  good speedup (~80% on 4 cores, ~60-70% on 12 cores)

Version 0.2.8 (Thu, 23 Dec 2010)
--------------------------------

A bug fix release:

- fixed balancing function for big clusters, which will improve corner
  cases where hbal didn't see any solution even though the cluster was
  obviously not well balanced
- fixed exit code of hbal in case of (Luxi) job errors
- changed the signal handling in hbal in order to make hbal control
  easier: instead of synchronising on the count of signals, make SIGINT
  cause graceful termination, and SIGTERM an immediate one
- increased the tag exclusion weight so that it has greater importance
  during the balancing
- slight improvement to the speed of balancing via algorithm tweaks


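The signal policy from the 0.2.8 notes (SIGINT asks for a graceful
stop, SIGTERM for an immediate one) can be pictured as a small state
machine. The following is only an illustrative Python sketch and not
the hbal code (htools itself is written in Haskell); ``StopPolicy`` and
``run_jobs`` are hypothetical names introduced for the example.

```python
import signal


class StopPolicy:
    """Tracks the requested shutdown mode (illustrative, not hbal's code)."""

    RUN, GRACEFUL, IMMEDIATE = "run", "graceful", "immediate"

    def __init__(self):
        self.mode = self.RUN

    def on_signal(self, signum, _frame=None):
        # SIGTERM always wins; SIGINT only upgrades RUN to GRACEFUL.
        if signum == signal.SIGTERM:
            self.mode = self.IMMEDIATE
        elif signum == signal.SIGINT and self.mode == self.RUN:
            self.mode = self.GRACEFUL

    def install(self):
        # Register the handlers; this must be called from the main thread.
        signal.signal(signal.SIGINT, self.on_signal)
        signal.signal(signal.SIGTERM, self.on_signal)


def run_jobs(job_sets, policy, execute):
    """Run job sets in order, checking the policy between sets.

    After a graceful stop request the current job set still finishes and
    no new one is started. A real tool would also stop waiting on the
    running jobs for IMMEDIATE; this sketch only checks between sets.
    """
    done = []
    for job_set in job_sets:
        if policy.mode != StopPolicy.RUN:
            break
        execute(job_set)
        done.append(job_set)
    return done
```

The mode is decided by the kind of signal rather than by how many
signals arrived, which is the change the entry above describes.
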
Version 0.2.7 (Thu, 07 Oct 2010)
--------------------------------

Bug fixes:

- fixed the error message for hail multi-evacuation mode
- improved evacuation mode for offline secondary nodes (ignore available
  memory)

New features:

- added a new option ``-S`` to hbal and hspace that saves the cluster
  state at the end of the processing in the text format used by the
  ``-t`` option, for later re-processing
- two new options for hbal, ``-g`` and ``--min-gain-limit``, that should
  help in limiting the number of balancing steps with a low gain in the
  final stages
- hbal, when executing jobs, will now wait for the current jobs to
  finish at the first stop (e.g. ^C); if the user wants immediate exit,
  another signal should be sent
- added “normalized” physical CPU units in hspace output (NPU), which
  represents units of physical CPUs free/used, based on the max-cpu
  ratio


Version 0.2.6 (Mon, 26 Jul 2010)
--------------------------------

Exactly three months since the last release. Many internal changes, plus
a couple of important changes in the balancing algorithm.

First, the balancing may now introduce N+1 errors, if this solves other,
more critical problems. For the moment, this means that moving instances
away from offline nodes is allowed even if it creates N+1 errors, and
that means evacuation can be done in more cases.

Second, the scoring for N+1 has changed. In previous versions, it simply
counted the number of failing N+1 nodes, which means moving an instance
away from an N+1 failed node (but without the node 'clearing' the N+1
status) was not reflected in the cluster score. As such, the balancing
algorithm managed to clear N+1 errors only sometimes, since usually it
takes more than one move for this, and the first prerequisite move was
not 'rewarded' appropriately and thus it was not selected. Now, it is
possible to fix many more error cases than before: on a simulated 40
node cluster full of instances (symmetrically allocated on all nodes),
around five nodes can be evacuated before N+1 errors can no longer be
solved, whereas 0.2.5 could evacuate at best one node.

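To see why a pure count does not reward the first of several necessary
moves, it helps to compare it with a metric that measures how far each
node is from passing. The sketch below is purely illustrative: the
``shortfall`` metric, the ``Node`` model and all numbers are assumptions
made for this example, not the scoring that htools actually uses.

```python
from dataclasses import dataclass


@dataclass
class Node:
    free_mem: int    # memory currently free on the node (MiB)
    needed_mem: int  # memory the node must keep free to pass N+1 (MiB)


def failing_count(nodes):
    """Count-style metric: number of nodes that fail the N+1 check."""
    return sum(1 for n in nodes if n.free_mem < n.needed_mem)


def shortfall(nodes):
    """Hypothetical graded metric: total missing memory over all nodes."""
    return sum(max(0, n.needed_mem - n.free_mem) for n in nodes)


before = [Node(free_mem=1024, needed_mem=4096),
          Node(free_mem=8192, needed_mem=2048)]
# Move one 1024 MiB instance away from the failing node: the node still
# fails N+1, but it is closer to passing.
after = [Node(free_mem=2048, needed_mem=4096),
         Node(free_mem=7168, needed_mem=2048)]
```

With these numbers ``failing_count`` is 1 both before and after the
move, so a balancer driven by it sees no gain, while ``shortfall`` drops
from 3072 to 2048 and the prerequisite move registers as progress.
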
There were some other internal changes to the scoring algorithm, such
that now the metrics have associated weights, and they are not all of
the same importance anymore. As of now, the only change is that offline
instances have a higher weight, which should favour proper node
evacuations.

Among the other changes:

- fixed the hspace KM_POOL_* metrics, which were returned as the final
  state and not as the delta between the initial and final states
- fixed hspace handling of N+1 failing clusters: before, it used to
  generate a 'fake' response, and the structure of this response was not
  always in sync with the real responses, leading to missing items;
  currently it proceeds correctly through the code (skipping the
  computation), and uses the same display mechanisms as the normal case
- fixed hscan exit code for RAPI failures: previously it finished with
  success even if all the clusters failed, which was creating issues
  with the live-test script; now it exits with exit code 2 for RAPI
  failures (unfortunately this is still not optimal as LUXI failures
  will use exit code 1, the same as command line errors)
- changed the limit values for CPU/disk, which previously were used
  optionally, whereas now they are always used; the default cpu ratio
  limit is now 64 VCPUs per PCPU
- changed the internal handling of the short name vs. original
  (Ganeti-provided) name; now internally we always use the full name,
  and only in display routines we show the shortened (called 'alias')
  name; as a result, the -O and --excluded-instances options now accept
  both the full name and the shortened name
- changed internal handling of JSON conversions and errors, such that
  now we show a better context for failure messages, which should help
  with diagnosing the malformed message
- changed the names for a few node fields, and added some more nodes;
  this is most likely to help with debugging, and not with regular
  operation though
- changed the node fields option to allow the '+' prefix to mean 'extend
  the default fields list' rather than start from fresh (similar to
  Ganeti's implementation)
- a few internal changes related to the LUXI protocol implementation,
  which should make it safer against potential bugs, one optimization
  that should help with large messages, and some patches in preparation
  for potential expansion of the LUXI backend functionality

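The '+' prefix rule for the node fields option mentioned in the list
above can be sketched as follows. This is a minimal Python illustration
under stated assumptions: ``select_fields`` is a hypothetical helper and
the names in ``DEFAULT_FIELDS`` are made up, since the real default
field list is not given in these notes.

```python
DEFAULT_FIELDS = ["name", "pmem", "pdsk"]  # hypothetical default list


def select_fields(spec, defaults=DEFAULT_FIELDS):
    """Return the field list for a comma-separated field spec.

    A leading '+' extends the defaults; otherwise the spec replaces
    them. An empty spec yields the defaults unchanged.
    """
    if not spec:
        return list(defaults)
    extend = spec.startswith("+")
    body = spec[1:] if extend else spec
    names = [field for field in body.split(",") if field]
    return list(defaults) + names if extend else names
```

For example, ``select_fields("+score")`` keeps the three defaults and
appends ``score``, while ``select_fields("name,score")`` starts from
scratch with just those two fields.
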
And finally, many improvements on unittests and the live-test
script. Test coverage is much enhanced, and the test infrastructure has
better error reporting; this should lead down the road to better code
and fewer bugs…


Version 0.2.5 (Mon, 26 Apr 2010)
--------------------------------

Some internal cleanup plus a few user-visible changes:

- new option for marking instances as 'do-not-move' during rebalancing
- allow ``hscan`` to scan the local cluster via Luxi
- add more metrics to ``hspace`` which show the delta between original
  state and final state better (only valid for tiered allocation)


Version 0.2.4 (Mon, 22 Feb 2010)
--------------------------------

Two improvements for node evacuation:

- hbal takes a new parameter ``--evac-mode`` that restricts the
  instances to be moved to the ones on offline/drained nodes, which
  should reduce the work done
- hail supports the new ``multi-evacuate`` mode of the IAllocator
  protocol, that will be released in a minor release on the Ganeti 2.1
  branch


Version 0.2.3 (Thu,  4 Feb 2010)
--------------------------------

A small release:

- Fixes selection of secondary node: previously, if the cluster had
  many N+1 failures, an N+1 failed node could be selected as secondary
  even if it did not have enough memory to allow the instance to be
  migrated/failed over to it; this is bad for automated tools, since
  we can get the cluster in an unhealthy state
- Switch the text backend to a single input file, that is generated
  now by hscan and shouldn't be generated manually via
  gnt-node/instance list anymore; this allows richer information to be
  kept in the file, and slightly simplifies the internals of the text
  backend


Version 0.2.2 (Tue, 29 Dec 2009)
--------------------------------

Small release, 0.2.1 was broken and thus this was released earlier:

- Release 0.2.1 broke the LUXI backend due to a typo, fixed
- Added a live-test script that should catch errors like the above one
  in the future (needs a working, non-empty cluster)
- Changed RAPI and LUXI backends to treat drained nodes as offline,
  similar to the IAllocator backend change in 0.2.0 (which was wrongly
  marked as affecting all backends)
- Changed the metrics for offline instances and N1 score from percent to
  count, in order to increase the priority of evacuations
- Added a new metric (offline primary instances) which should fix the
  evacuation of an offline node in a 2-node cluster


Version 0.2.1 (Wed,  2 Dec 2009)
--------------------------------

- Added instance exclusion defined via instance tags
- Fixed the output of hspace to be again parseable from the shell


Version 0.2.0 (Tue, 10 Nov 2009)
--------------------------------

A significant release, with a few new major features:

- Added direct execution of the hbal solution when using the Luxi
  backend; the steps for each instance move are submitted as a single
  job, and the different jobs are submitted as groups in order to
  parallelise the execution of moves
- Added support for balancing based on dynamic utilisation data for
  instances, fed in via a text file; by default, all instances are
  considered equal and this change also improves the equalisation of
  secondary instances per node
- Added support for tiered capacity calculation in hspace, where we
  start from a maximum instance spec and decrease the spec when we run
  out of resources; this should give a better measure of available
  capacity on 'fragmented' clusters; this is done separately from the
  current fixed-mode computation

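The tiered capacity idea from the list above (start from a maximum
instance spec and decrease it when resources run out) can be sketched
in a few lines. This is a simplified Python illustration and not the
hspace algorithm: it models memory only, places instances greedily on
the node with the most free memory, and shrinks the spec by a fixed
``step``; all of these choices, and the name ``tiered_capacity``, are
assumptions made for the example.

```python
def tiered_capacity(free_mem, max_spec, min_spec, step):
    """Count allocatable instances per spec size, largest spec first.

    free_mem is a list of free memory per node (MiB); it is copied, not
    modified. Returns (spec, count) pairs for specs that fit at least
    once.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    free = list(free_mem)
    result = []
    spec = max_spec
    while spec >= min_spec:
        count = 0
        while True:
            # Greedy placement: pick the node with the most free memory.
            best = max(range(len(free)), key=lambda i: free[i],
                       default=None)
            if best is None or free[best] < spec:
                break
            free[best] -= spec
            count += 1
        if count:
            result.append((spec, count))
        spec -= step  # out of room for this spec: try a smaller one
    return result
```

On a 'fragmented' pair of nodes with 4096 and 3072 MiB free,
``tiered_capacity([4096, 3072], 2048, 512, 512)`` places three
instances of 2048 MiB and then still finds room for one of 1024 MiB,
which a fixed 2048 MiB spec would not report.
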
Also there have been many minor improvements:

- Added option for showing instances (“--print-instances”), similar to
  the print nodes option
- Added support for customising the node list via an argument to the
  print nodes option in the form of a comma-separated list of field
  names; currently the field names are not documented, as further
  changes are expected in a future release
- Enhanced the error reporting in the Luxi and Rapi backends
- Changed the handling of drained nodes, now being treated the same as
  offline nodes, for Ganeti 2.0.4+ compatibility
- A number of internal changes, simplifying code and merging some
  disparate functions
- Simplify the build system in relation to creation of archives


Version 0.1.8 (Tue, 29 Sep 2009)
--------------------------------

- Brown-paper-bag release fixing haddock issues


Version 0.1.7 (Mon, 28 Sep 2009)
--------------------------------

- Fixed a bug in the Luxi backend for big responses
- Fixed test suite exit code in presence of test failures
- Changed the migrate operation to run a failover instead for instances
  which were marked as not running in the input data (their state could
  have changed since then, but this is better than the previous
  behaviour of always migrating)
- Added support for 'cheap' moves only (only migrate/failover) in
  balancing
- Added support for building without curl (thus no RAPI backend)


Version 0.1.6 (Wed, 19 Aug 2009)
--------------------------------

- Added support for Luxi (the native Ganeti protocol)
- Added support for simulated clusters (for hspace only)
- Added timeouts for the RAPI backend
- Fixed a few inconsistencies in the command line handling
- Fixed handling of errors while loading data
- The 'network' library is a new dependency due to the Luxi addition


Version 0.1.5 (Thu, 09 Jul 2009)
--------------------------------

- Removed obsolete hn1 program; this allowed removal of a lot of
  supporting code
- Lots of changes in hspace: the output is now a shell fragment so that
  scripts can source it or parse it more easily; added failure reasons;
  optimised to use less memory for large clusters
- Optimized the scoring algorithm (used by all tools) so that now
  computations should be faster


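Because the hspace output became a shell fragment in 0.1.5, a script
can either source it or read it as ``KEY=VALUE`` lines. The sketch
below shows one way to do the latter in Python; ``parse_shell_fragment``
is a hypothetical helper, it only understands plain assignments, and
the key names used in the usage note are invented for illustration
rather than taken from real hspace output.

```python
import shlex


def parse_shell_fragment(text):
    """Parse simple KEY=VALUE shell assignments into a dict of strings."""
    result = {}
    for line in text.splitlines():
        # shlex resolves shell quoting and drops '#' comments.
        for token in shlex.split(line, comments=True):
            key, sep, value = token.partition("=")
            if sep and key.isidentifier():
                result[key] = value
    return result
```

Given the text ``EX_OK=1`` followed by ``EX_REASON='no memory'`` on the
next line, the function returns a dictionary mapping ``EX_OK`` to ``1``
and ``EX_REASON`` to ``no memory``. All values stay strings, as they
would in the shell.
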
Version 0.1.4 (Tue, 16 Jun 2009)
--------------------------------

- Added CPU count/ratio of virtual-to-physical CPUs to the cluster
  scoring methods; this means that now the balancer, the iallocator
  plugin and so on will try to keep the VCPU-to-PCPU ratio equal across
  the cluster
- Fixed some hscan bugs
- Fixed the way iallocator reads the total disk size (was broken and it
  was always falling back to summing the disk sizes)
- Internals: fixed most compile-time warnings


Version 0.1.3 (Fri, 05 Jun 2009)
--------------------------------

- Fix a bug in the ReplacePrimary instance moves, affecting most of the
  tools


Version 0.1.2 (Tue, 02 Jun 2009)
--------------------------------

- Add a new program, “hspace”, which computes the free space on a
  cluster (based on a given instance spec)
- Improvements in API docs and partially in the user docs
- Started adding unittests


Version 0.1.1 (Tue, 26 May 2009)
--------------------------------

- Add a new program, “hail”, which is an iallocator plugin and can
  allocate/relocate instances
- Experimental support for non-mirrored instances (hail supports them,
  hbal should no longer abort when it finds such instances and should
  simply ignore them)
- The RAPI port and/or scheme can be overridden now, and even “file://”
  schemes can be used if the message body has been saved under the
  appropriate name
- Lots of code reorganization, esp. rewritten loading pipeline
- Better data checking and better error messages in case validation
  fails; tools now consider nodes with errors in input data (‘?’
  returned by ganeti) as offline
- Small enhancement to the makefile for simpler packaging


Version 0.1.0 (Tue, 19 May 2009)
--------------------------------

- Drop compatibility with Ganeti 1.2
- Add a new minimum score option (with a very low default), should help
  with very good clusters (but is still not optimal)
- Add a --quiet option to hbal
- Add support for reading offline nodes directly from the cluster


Version 0.0.8 (Tue, 21 Apr 2009)
--------------------------------

- hbal: abort when wrong node names are passed to -O, in order to
  prevent mismatches
- add the ability to write the commands (-C) to a script via (-C<file>),
  so that it can be later executed directly; this has also changed the
  commands to include the necessary -f flags to skip confirmations
- add checks for extra arguments in hbal and hn1, so that unintended
  errors are caught
- raise the accepted “missing” memory limit to 512MB, to cover usual Xen
  reservations


Version 0.0.7 (Mon, 23 Mar 2009)
--------------------------------

- added support for offline nodes, which are not used as targets for
  instance relocation and if they hold instances the hbal algorithm will
  attempt to relocate these away
- added support for offline instances, which now will no longer skew the
  free memory estimation of nodes; the algorithm will no longer create
  conditions for N+1 failures when such instances are later started
- implemented a complete model of node resources, in order to prevent an
  unintended re-occurrence of cases like the offline instance where we
  miscalculate some node resource; this now gives a warning in case the
  node reported free disk or free memory deviates by more than a set
  amount from the expected value
- a new tool *hscan* that can generate the input text-file for the other
  tools by collecting data via RAPI
- some small changes to the build system to make it more friendly; also
  included the generated documentation in the source archive


Version 0.0.6 (Mon, 16 Mar 2009)
--------------------------------

- re-factored the hbal algorithm to make it stable in the sense that it
  gives the same solution when restarted from the middle; barring
  rounding of disk/memory and incomplete reporting from Ganeti (for
  1.2), it should be now feasible to rely on its output without
  generating moves ad infinitum
- the hbal algorithm now uses two more variables: the node N+1 failures
  and the amount of reserved memory; the first of which tries to ‘fix’
  the N+1 status, the latter tries to distribute secondaries more
  equally
- the hbal algorithm now uses two more moves at each step:
  replace+failover and failover+replace (besides the original failover,
  replace, and failover+replace+failover)
- slightly changed the build system to embed GIT version/tags into the
  binaries so that we know from which tree a binary was built, either
  via ‘--version’ or via “strings hbal|grep version”
- changed the solution list and in general the hbal output to be more
  clear by default, and changed “gnt-instance failover” to “gnt-instance
  migrate”
- added man pages for the two binaries


Version 0.0.5 (Mon, 09 Mar 2009)
--------------------------------

- a few small improvements for hbal (possibly undone by later changes),
  hbal is now quite a bit faster
- fix documentation building
- allow hbal to work on non N+1 compliant clusters, but without
  guarantees that the end cluster will be compliant; in any case, this
  should give a smaller number of nodes that are not compliant if the
  cluster state permits it
- strip common domain suffix from nodes and instances, so that output is
  shorter and hopefully clearer


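Stripping a common domain suffix, as introduced in 0.0.5, amounts to
finding the longest run of trailing DNS labels shared by every name and
cutting it off. The following Python sketch shows one way to do this;
``common_suffix`` and ``strip_suffix`` are hypothetical names, and
comparing whole dot-separated labels (rather than raw characters) is an
assumption of the example, not a statement about the htools code.

```python
def common_suffix(names):
    """Return the longest dot-separated suffix shared by all names."""
    if not names:
        return ""
    split = [name.split(".")[::-1] for name in names]  # labels, last first
    shared = []
    for labels in zip(*split):
        if len(set(labels)) != 1:
            break
        shared.append(labels[0])
    return ".".join(reversed(shared))


def strip_suffix(names):
    """Map each full name to its shortened form (or to itself)."""
    suffix = common_suffix(names)
    if not suffix:
        return {name: name for name in names}
    cut = len(suffix) + 1  # also drop the separating dot
    # Keep the full name if stripping would leave nothing.
    return {name: (name[:-cut] or name) for name in names}
```

For ``node1.example.com``, ``node2.example.com`` and
``inst1.lab.example.com`` the shared suffix is ``example.com``, so the
short names become ``node1``, ``node2`` and ``inst1.lab``.
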
Version 0.0.4 (Sun, 15 Feb 2009)
--------------------------------

- better balancing algorithm in hbal
- implemented an RAPI collector, now the cluster data can be gathered
  automatically via RAPI and doesn't need manual export of node and
  instance list


Version 0.0.3 (Wed, 28 Jan 2009)
--------------------------------

- initial release of hbal, a cluster rebalancing tool
- input data format changed due to hbal requirements


Version 0.0.2 (Tue, 06 Jan 2009)
--------------------------------

- fix handling of some common cases (cluster N+1 compliant from the
  start, too big depth given, failure to compute solution)
- add option to print the needed command list for reaching the proposed
  solution


Version 0.0.1 (Tue, 06 Jan 2009)
--------------------------------

- initial release of hn1 tool

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: