Ganeti-htools release notes
===========================


Version 0.2.7 (Thu, 07 Oct 2010)
--------------------------------

Bug fixes:

- fixed the error message for hail multi-evacuation mode
- improved evacuation mode for offline secondary nodes (ignore available
  memory)

New features:

- add a new option ``-S`` to hbal and hspace that saves the cluster
  state at the end of the processing in the text format used by the
  ``-t`` option, for later re-processing
- add two new options to hbal, ``-g`` and ``--min-gain-limit``, that
  should help in limiting the number of balancing steps with a low gain
  in the final stages
- hbal, when executing jobs, will now wait for the current jobs to
  finish at the first stop request (e.g. ^C); if the user wants an
  immediate exit, another signal should be sent
- added “normalized” physical CPU units (NPU) to the hspace output;
  these represent units of physical CPUs free/used, based on the
  max-cpu ratio

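As a rough illustration of the NPU idea, the sketch below converts
allocated virtual CPUs into physical CPU units by dividing by the
max-cpu ratio. The exact formula used by hspace is not given in these
notes, so the arithmetic and the function name are assumptions.

```python
from typing import Tuple


def normalized_cpu_units(physical_cpus: int,
                         allocated_vcpus: int,
                         max_cpu_ratio: float) -> Tuple[float, float]:
    """Return (used, free) in units of physical CPUs.

    Assumption: one physical CPU is considered fully used once
    `max_cpu_ratio` virtual CPUs have been allocated on it.
    """
    if max_cpu_ratio <= 0:
        raise ValueError("max_cpu_ratio must be positive")
    used = allocated_vcpus / max_cpu_ratio
    free = physical_cpus - used
    return used, free


# 8 physical CPUs, 128 VCPUs allocated, ratio of 64 VCPUs per PCPU
used, free = normalized_cpu_units(8, 128, 64.0)
```

With these inputs, two physical CPU units count as used and six as
free.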

Version 0.2.6 (Mon, 26 Jul 2010)
--------------------------------

Exactly three months since the last release. Many internal changes, plus
a couple of important changes in the balancing algorithm.

First, the balancing may now introduce N+1 errors, if this solves other,
more critical problems. For the moment, this means that moving instances
away from offline nodes is allowed even if it creates N+1 errors, and
that means evacuation can be done in more cases.

Second, the scoring for N+1 has changed. In previous versions, it simply
counted the number of nodes failing N+1, which means that moving an
instance away from an N+1 failed node (but without the node 'clearing'
the N+1 status) was not reflected in the cluster score. As such, the
balancing algorithm managed to clear N+1 errors only sometimes, since
usually it takes more than one move for this, and the first prerequisite
move was not 'rewarded' appropriately and thus was not selected. Now, it
is possible to fix many more error cases than before: on a simulated
40-node cluster full of instances (symmetrically allocated on all
nodes), around five nodes can be evacuated before N+1 errors can be
solved, whereas 0.2.5 could evacuate at best one node.

There were some other internal changes to the scoring algorithm, such
that now the metrics have associated weights, and they are no longer all
of the same importance. As of now, the only change is that offline
instances have a higher weight, which should favour proper node
evacuations.

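The effect of weighted metrics can be sketched as follows. This is
only an illustration: the metric names, the weight value and the
"lower score is better" convention are assumptions, not taken from the
htools source.

```python
from typing import Dict


def cluster_score(metrics: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Weighted sum of metrics; a metric without a weight counts once."""
    return sum(weights.get(name, 1.0) * value
               for name, value in metrics.items())


# Hypothetical weights: only offline instances weigh more than the rest.
WEIGHTS = {"offline_instances": 4.0}

# Hypothetical cluster states before and after moving one instance away
# from an offline node, at the cost of one extra N+1 failure.
before = {"offline_instances": 2.0, "n1_failures": 1.0, "mem_spread": 0.25}
after = {"offline_instances": 1.0, "n1_failures": 2.0, "mem_spread": 0.25}

unweighted_gain = cluster_score(before, {}) - cluster_score(after, {})
weighted_gain = cluster_score(before, WEIGHTS) - cluster_score(after, WEIGHTS)
```

With equal weights the move brings no gain and would not be selected;
with the higher weight on offline instances the score improves, so the
evacuation move is chosen.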
Among the other changes:

- fixed the hspace KM_POOL_* metrics, which were returned as the final
  state and not as the delta between the initial and final states
- fixed hspace handling of N+1 failing clusters: before, it used to
  generate a 'fake' response, and the structure of this response was not
  always in sync with the real responses, leading to missing items;
  now it proceeds correctly through the code (skipping the
  computation), and uses the same display mechanisms as the normal case
- fixed hscan exit code for RAPI failures: previously it finished with
  success even if all the clusters failed, which was creating issues
  with the live-test script; now it exits with exit code 2 for RAPI
  failures (unfortunately this is still not optimal, as LUXI failures
  will use exit code 1, the same as command-line errors)
- changed the limit values for CPU/disk, which previously were used
  optionally, whereas now they are always used; the default CPU ratio
  limit is now 64 VCPUs per PCPU
- changed the internal handling of the short name vs. original
  (Ganeti-provided) name; now internally we always use the full name,
  and only in display routines do we show the shortened (called
  'alias') name; as a result, the ``-O`` and ``--excluded-instances``
  options now accept both the full name and the shortened name
- changed internal handling of JSON conversions and errors, such that
  now we show a better context for failure messages, which should help
  with diagnosing malformed messages
- changed the names for a few node fields, and added some more nodes;
  this is most likely to help with debugging, and not with regular
  operation though
- changed the node fields option to allow the '+' prefix to mean 'extend
  the default fields list' rather than start afresh (similar to
  Ganeti's implementation)
- a few internal changes related to the LUXI protocol implementation,
  which should make it safer against potential bugs, one optimisation
  that should help with large messages, and some patches in preparation
  for potential expansion of the LUXI backend functionality

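The '+' prefix behaviour for the node fields option can be sketched as
below. The default field names and the function name are hypothetical;
only the "extend the defaults versus start afresh" rule comes from the
notes above.

```python
from typing import List, Optional

# Hypothetical default field list, for illustration only.
DEFAULT_FIELDS = ["name", "status", "free_mem"]


def select_fields(option: Optional[str],
                  defaults: List[str] = DEFAULT_FIELDS) -> List[str]:
    """Resolve a comma-separated fields option against the defaults."""
    if not option:
        return list(defaults)
    if option.startswith("+"):
        # '+' prefix: keep the defaults and append the extra fields
        extra = [f for f in option[1:].split(",") if f]
        return list(defaults) + extra
    # no prefix: the given list replaces the defaults entirely
    return [f for f in option.split(",") if f]


extended = select_fields("+pcpu,vcpu")
replaced = select_fields("name,pcpu")
```

Here ``extended`` holds the three defaults followed by ``pcpu`` and
``vcpu``, while ``replaced`` holds only ``name`` and ``pcpu``.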
And finally, many improvements to the unittests and the live-test
script. Test coverage is much enhanced, and the test infrastructure has
better error reporting; this should lead, down the road, to better code
and fewer bugs…


Version 0.2.5 (Mon, 26 Apr 2010)
--------------------------------

Some internal cleanup plus a few user-visible changes:

- new option for marking instances as 'do-not-move' during rebalancing
- allow ``hscan`` to scan the local cluster via Luxi
- add more metrics to ``hspace`` which better show the delta between the
  original state and the final state (only valid for tiered allocation)


Version 0.2.4 (Mon, 22 Feb 2010)
--------------------------------

Two improvements for node evacuation:

- hbal takes a new parameter ``--evac-mode`` that restricts the
  instances to be moved to the ones on offline/drained nodes, which
  should reduce the work done
- hail supports the new ``multi-evacuate`` mode of the IAllocator
  protocol, which will be released in a minor release on the Ganeti 2.1
  branch


Version 0.2.3 (Thu,  4 Feb 2010)
--------------------------------

A small release:

- Fixes selection of secondary node: previously, if the cluster had
  many N+1 failures, an N+1 failed node could be selected as secondary
  even if it did not have enough memory to allow the instance to be
  migrated/failed over to it; this is bad for automated tools, since
  it can leave the cluster in an unhealthy state
- Switch the text backend to a single input file, which is now
  generated by hscan and shouldn't be generated manually via
  gnt-node/instance list anymore; this allows richer information to be
  kept in the file, and slightly simplifies the internals of the text
  backend

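The memory guard described in the first item can be sketched as
follows. This is not the htools selection logic, which these notes do
not spell out; the data model and the tie-breaking rule are
assumptions. It only shows that a node lacking memory for the instance
is never picked, whatever its N+1 status.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_mem: int            # free memory, in MB
    n1_failed: bool = False  # True if the node already fails N+1


def pick_secondary(nodes: List[Node], instance_mem: int,
                   primary: str) -> Optional[Node]:
    """Pick a secondary that can host the instance after a failover."""
    candidates = [n for n in nodes
                  if n.name != primary and n.free_mem >= instance_mem]
    if not candidates:
        return None
    # Assumed preference: N+1 healthy nodes first, then most free memory.
    return min(candidates, key=lambda n: (n.n1_failed, -n.free_mem))


nodes = [Node("a", 4096), Node("b", 512, True), Node("c", 2048, True)]
choice = pick_secondary(nodes, 1024, primary="a")
```

Node ``b`` is rejected because it cannot hold the 1024 MB instance, so
``c`` is chosen even though it already fails N+1.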

Version 0.2.2 (Tue, 29 Dec 2009)
--------------------------------

A small release; 0.2.1 was broken, and thus this one was released
earlier:

- Fixed the LUXI backend, which release 0.2.1 broke due to a typo
- Added a live-test script that should catch errors like the above one
  in the future (needs a working, non-empty cluster)
- Changed RAPI and LUXI backends to treat drained nodes as offline,
  similar to the IAllocator backend change in 0.2.0 (which was wrongly
  marked as affecting all backends)
- Changed the metrics for offline instances and N+1 score from percent
  to count, in order to increase the priority of evacuations
- Added a new metric (offline primary instances) which should fix the
  evacuation of an offline node in a 2-node cluster


Version 0.2.1 (Wed,  2 Dec 2009)
--------------------------------

- Added instance exclusion defined via instance tags
- Fixed the output of hspace to be parseable from the shell again


Version 0.2.0 (Tue, 10 Nov 2009)
--------------------------------

A significant release, with a few new major features:

- Added direct execution of the hbal solution when using the Luxi
  backend; the steps for each instance move are submitted as a single
  job, and the different jobs are submitted in groups in order to
  parallelise the execution of moves
- Added support for balancing based on dynamic utilisation data for
  instances, fed in via a text file; by default, all instances are
  considered equal, and this change also improves the equalisation of
  secondary instances per node
- Added support for tiered capacity calculation in hspace, where we
  start from a maximum instance spec and decrease the spec when we run
  out of resources; this should give a better measure of available
  capacity on 'fragmented' clusters; this is done separately from the
  current fixed-mode computation

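The tiered capacity idea can be sketched as below. It is a deliberately
simplified model: instances are non-mirrored, placement is first-fit,
and the spec is shrunk by halving every dimension. All three choices
are assumptions made for illustration; the notes only say that the spec
is decreased when resources run out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Spec:
    mem: int   # MB
    disk: int  # GB


@dataclass
class Node:
    free_mem: int
    free_disk: int


def place(nodes: List[Node], spec: Spec) -> bool:
    """First-fit placement; consumes resources on the chosen node."""
    for node in nodes:
        if node.free_mem >= spec.mem and node.free_disk >= spec.disk:
            node.free_mem -= spec.mem
            node.free_disk -= spec.disk
            return True
    return False


def tiered_capacity(nodes: List[Node], max_spec: Spec,
                    min_spec: Spec) -> List[Tuple[Spec, int]]:
    """Allocate at the largest spec until full, then shrink and repeat."""
    tiers = []
    spec = max_spec
    while spec.mem >= min_spec.mem and spec.disk >= min_spec.disk:
        count = 0
        while place(nodes, spec):
            count += 1
        if count:
            tiers.append((spec, count))
        spec = Spec(spec.mem // 2, spec.disk // 2)  # assumed shrink rule
    return tiers


cluster = [Node(3072, 300), Node(1024, 100)]
tiers = tiered_capacity(cluster, Spec(2048, 100), Spec(512, 25))
```

On this small 'fragmented' cluster a fixed 2048 MB spec would report a
capacity of one instance, whereas the tiered run also finds room for
two 1024 MB instances.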
Also there have been many minor improvements:

- Added option for showing instances (``--print-instances``), similar to
  the print nodes option
- Added support for customising the node list via an argument to the
  print nodes option in the form of a comma-separated list of field
  names; currently the field names are not documented, as further
  changes are expected in a future release
- Enhanced the error reporting in the Luxi and Rapi backends
- Changed the handling of drained nodes, now being treated the same as
  offline nodes, for Ganeti 2.0.4+ compatibility
- A number of internal changes, simplifying code and merging some
  disparate functions
- Simplified the build system in relation to creation of archives


Version 0.1.8 (Tue, 29 Sep 2009)
--------------------------------

- Brown-paper-bag release fixing haddock issues


Version 0.1.7 (Mon, 28 Sep 2009)
--------------------------------

- Fixed a bug in the Luxi backend for big responses
- Fixed test suite exit code in presence of test failures
- Changed the migrate operation to run a failover instead for instances
  which were marked as not running in the input data (the state could
  have changed since then, but this is better than the previous
  behaviour of always migrating)
- Added support for 'cheap' moves only (only migrate/failover) in
  balancing
- Added support for building without curl (thus no RAPI backend)


Version 0.1.6 (Wed, 19 Aug 2009)
--------------------------------

- Added support for Luxi (the native Ganeti protocol)
- Added support for simulated clusters (for hspace only)
- Added timeouts for the RAPI backend
- Fixed a few inconsistencies in the command line handling
- Fixed handling of errors while loading data
- The 'network' library is a new dependency due to the Luxi addition


Version 0.1.5 (Thu, 09 Jul 2009)
--------------------------------

- Removed obsolete hn1 program; this allowed removal of a lot of
  supporting code
- Lots of changes in hspace: the output is now a shell fragment, so
  that scripts can source it or parse it more easily; added failure
  reasons; optimised to use less memory for large clusters
- Optimized the scoring algorithm (used by all tools) so that now
  computations should be faster

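Since the hspace output is a shell fragment, a script that does not
want to source it can parse the ``KEY=VALUE`` assignments directly. The
sketch below uses Python's ``shlex`` module; the key names in the
sample are invented for illustration and are not actual hspace output.

```python
import shlex
from typing import Dict


def parse_shell_fragment(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE shell assignments into a dictionary."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # shlex handles shell-style quoting of the values
        for token in shlex.split(line):
            key, sep, value = token.partition("=")
            if sep:
                result[key] = value
    return result


# Hypothetical sample, not real hspace output.
sample = 'EXAMPLE_OK=1\nEXAMPLE_REASON="out of memory"\n'
values = parse_shell_fragment(sample)
```

The quoted value is returned without its quotes, as a shell would see
it after sourcing the fragment.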

Version 0.1.4 (Tue, 16 Jun 2009)
--------------------------------

- Added CPU count/ratio of virtual-to-physical CPUs to the cluster
  scoring methods; this means that now the balancer, the iallocator
  plugin and so on will try to keep the VCPU-to-PCPU ratio equal across
  the cluster
- Fixed some hscan bugs
- Fixed the way iallocator reads the total disk size (was broken and it
  was always falling back to summing the disk sizes)
- Internals: fixed most compile-time warnings

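One way to express "keep the VCPU-to-PCPU ratio equal across the
cluster" as a score component is to measure how much the per-node
ratios differ. The sketch below uses the population standard deviation
for that; the choice of statistic is an assumption, as the notes do not
say how the ratio enters the score.

```python
from statistics import pstdev
from typing import List, Tuple


def cpu_ratio_spread(nodes: List[Tuple[int, int]]) -> float:
    """Spread of VCPU/PCPU ratios; 0.0 means perfectly even.

    Each node is given as (allocated_vcpus, physical_cpus).
    """
    ratios = [vcpus / pcpus for vcpus, pcpus in nodes]
    return pstdev(ratios)


even = cpu_ratio_spread([(8, 4), (8, 4)])
uneven = cpu_ratio_spread([(16, 4), (0, 4)])
```

Both layouts allocate 16 VCPUs on 8 PCPUs, but only the uneven one is
penalised, so a balancer minimising this value would prefer the even
layout.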

Version 0.1.3 (Fri, 05 Jun 2009)
--------------------------------

- Fix a bug in the ReplacePrimary instance moves, affecting most of the
  tools


Version 0.1.2 (Tue, 02 Jun 2009)
--------------------------------

- Add a new program, ``hspace``, which computes the free space on a
  cluster (based on a given instance spec)
- Improvements in API docs and partially in the user docs
- Started adding unittests


Version 0.1.1 (Tue, 26 May 2009)
--------------------------------

- Add a new program, ``hail``, which is an iallocator plugin and can
  allocate/relocate instances
- Experimental support for non-mirrored instances (hail supports them,
  hbal should no longer abort when it finds such instances and should
  simply ignore them)
- The RAPI port and/or scheme can be overridden now, and even
  ``file://`` schemes can be used if the message body has been saved
  under the appropriate name
- Lots of code reorganization, esp. rewritten loading pipeline
- Better data checking and better error messages in case validation
  fails; tools now consider nodes with errors in the input data ('?'
  returned by Ganeti) as offline
- Small enhancement to the makefile for simpler packaging


Version 0.1.0 (Tue, 19 May 2009)
--------------------------------

- Drop compatibility with Ganeti 1.2
- Add a new minimum score option (with a very low default), which should
  help with very good clusters (but is still not optimal)
- Add a ``--quiet`` option to hbal
- Add support for reading offline nodes directly from the cluster


Version 0.0.8 (Tue, 21 Apr 2009)
--------------------------------

- hbal: abort when wrong (non-matching) node names are passed to ``-O``,
  instead of silently accepting the mismatch
- add the ability to write the commands (``-C``) to a script via
  ``-C<file>``, so that it can be later executed directly; this has also
  changed the commands to include the necessary ``-f`` flags to skip
  confirmations
- add checks for extra arguments in hbal and hn1, so that unintended
  errors are caught
- raise the accepted “missing” memory limit to 512MB, to cover usual Xen
  reservations


Version 0.0.7 (Mon, 23 Mar 2009)
--------------------------------

- added support for offline nodes, which are not used as targets for
  instance relocation, and if they hold instances the hbal algorithm
  will attempt to relocate these away
- added support for offline instances, which now will no longer skew the
  free memory estimation of nodes; the algorithm will no longer create
  conditions for N+1 failures when such instances are later started
- implemented a complete model of node resources, in order to prevent an
  unintended re-occurrence of cases like the offline instances one,
  where we miscalculate some node resource; this now gives a warning in
  case the node-reported free disk or free memory deviates by more than
  a set amount from the expected value
- a new tool *hscan* that can generate the input text-file for the other
  tools by collecting the data via RAPI
- some small changes to the build system to make it more friendly; also
  included the generated documentation in the source archive

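The consistency check from the resource model can be sketched for
memory as follows. The way the expected value is computed (total minus
the node's own reservation minus the memory of its instances) and all
names are assumptions; the notes only state that a warning is given
when the reported value deviates from the expected one by more than a
set amount.

```python
from typing import List, Optional


def check_free_memory(total_mb: int, node_mb: int,
                      instance_mb: List[int], reported_free_mb: int,
                      limit_mb: int) -> Optional[str]:
    """Return a warning if reported free memory is off by > limit_mb."""
    expected_free = total_mb - node_mb - sum(instance_mb)
    deviation = expected_free - reported_free_mb
    if abs(deviation) > limit_mb:
        return (f"free memory deviates by {deviation} MB "
                f"from the expected {expected_free} MB")
    return None


ok = check_free_memory(8192, 1024, [2048, 2048], 3072, limit_mb=256)
warning = check_free_memory(8192, 1024, [2048, 2048], 2048, limit_mb=256)
```

The first call returns ``None`` because the node reports exactly the
expected 3072 MB; the second returns a warning because 1024 MB are
unaccounted for.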

Version 0.0.6 (Mon, 16 Mar 2009)
--------------------------------

- re-factored the hbal algorithm to make it stable in the sense that it
  gives the same solution when restarted from the middle; barring
  rounding of disk/memory and incomplete reporting from Ganeti (for
  1.2), it should be now feasible to rely on its output without
  generating moves ad infinitum
- the hbal algorithm now uses two more variables: the node N+1 failures
  and the amount of reserved memory; the former tries to 'fix' the N+1
  status, while the latter tries to distribute secondaries more equally
- the hbal algorithm now uses two more moves at each step:
  replace+failover and failover+replace (besides the original failover,
  replace, and failover+replace+failover)
- slightly changed the build system to embed GIT version/tags into the
  binaries, so that we know from which tree a binary was built, either
  via ``--version`` or via ``strings hbal | grep version``
- changed the solution list and in general the hbal output to be
  clearer by default, and changed ``gnt-instance failover`` to
  ``gnt-instance migrate``
- added man pages for the two binaries

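The five move types listed above can be modelled on a (primary,
secondary) pair as below. The sketch assumes that "failover" swaps the
two roles and that "replace" moves the secondary to a new node; the
function and table names are invented for illustration.

```python
from typing import Callable, Dict, Tuple

Placement = Tuple[str, str]  # (primary node, secondary node)


def failover(p: Placement) -> Placement:
    """Swap primary and secondary."""
    pri, sec = p
    return (sec, pri)


def replace(p: Placement, new: str) -> Placement:
    """Move the secondary to a new node (assumed meaning of 'replace')."""
    pri, _ = p
    return (pri, new)


MOVES: Dict[str, Callable[[Placement, str], Placement]] = {
    "failover": lambda p, n: failover(p),
    "replace": lambda p, n: replace(p, n),
    "replace+failover": lambda p, n: failover(replace(p, n)),
    "failover+replace": lambda p, n: replace(failover(p), n),
    "failover+replace+failover":
        lambda p, n: failover(replace(failover(p), n)),
}


def candidate_placements(p: Placement, new: str) -> Dict[str, Placement]:
    """Resulting placement for every move type, using target node `new`."""
    return {name: move(p, new) for name, move in MOVES.items()}


results = candidate_placements(("A", "B"), "C")
```

Starting from primary ``A`` and secondary ``B`` with target ``C``, the
two moves added in this release yield ``("C", "A")`` and
``("B", "C")``, placements the three original moves cannot reach in one
step.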

Version 0.0.5 (Mon, 09 Mar 2009)
--------------------------------

- a few small improvements for hbal (possibly undone by later changes),
  hbal is now quite a bit faster
- fix documentation building
- allow hbal to work on non N+1 compliant clusters, but without
  guarantees that the end cluster will be compliant; in any case, this
  should give a smaller number of nodes that are not compliant if the
  cluster state permits it
- strip common domain suffix from nodes and instances, so that output is
  shorter and hopefully clearer

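Stripping the common domain suffix can be sketched as follows. This is
one possible approach, comparing DNS labels from the right; it is not
taken from the htools source, and the function names are hypothetical.

```python
from typing import List


def common_suffix(names: List[str]) -> str:
    """Longest shared run of trailing DNS labels, with a leading dot."""
    if not names:
        return ""
    split = [name.split(".")[::-1] for name in names]  # labels, reversed
    common = []
    for labels in zip(*split):
        if all(label == labels[0] for label in labels):
            common.append(labels[0])
        else:
            break
    # never strip a name down to nothing: keep at least one label
    shortest = min(len(labels) for labels in split)
    if len(common) >= shortest:
        common = common[:shortest - 1]
    if not common:
        return ""
    return "." + ".".join(reversed(common))


def strip_suffix(names: List[str]) -> List[str]:
    """Shorten all names by their common domain suffix."""
    suffix = common_suffix(names)
    return [name[:-len(suffix)] if suffix else name for name in names]


short = strip_suffix(["node1.example.com", "node2.example.com",
                      "inst1.example.com"])
```

Names that share no trailing labels are returned unchanged.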

Version 0.0.4 (Sun, 15 Feb 2009)
--------------------------------

- better balancing algorithm in hbal
- implemented a RAPI collector; now the cluster data can be gathered
  automatically via RAPI and doesn't need manual export of the node and
  instance lists


Version 0.0.3 (Wed, 28 Jan 2009)
--------------------------------

- initial release of hbal, a cluster rebalancing tool
- input data format changed due to hbal requirements


Version 0.0.2 (Tue, 06 Jan 2009)
--------------------------------

- fix handling of some common cases (cluster N+1 compliant from the
  start, too large a depth given, failure to compute solution)
- add option to print the needed command list for reaching the proposed
  solution


Version 0.0.1 (Tue, 06 Jan 2009)
--------------------------------

- initial release of hn1 tool

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: