Ganeti-htools release notes
===========================

**Note**: After version 0.3.1, the htools sources were integrated
into the ganeti core repository, and are released together with the
ganeti releases. Thus this NEWS file is obsolete.

Version 0.3.1 (Fri, 11 Mar 2011)
--------------------------------

- Fixed source archive generation: the hscolour.css file was an invalid
  symlink, and the man pages were not correctly timestamped (leading to
  unneeded build-time rebuilds)
- Improved the Luxi backend to show which attribute fails parsing
- Small improvements to the man pages, and also ship the HTML version of
  man pages in the source archive

Version 0.3.0 (Fri, 04 Feb 2011)
--------------------------------

A significant release that breaks compatibility with Ganeti versions
below 2.4 due to the node group changes. Only the RAPI backend can talk
to older clusters, but it is recommended to use this version only with

All commands are now multi-group aware (but to various degrees), so
allocation, balancing and capacity calculation respect the group layout
and will not create “broken” instances by using nodes from different
groups.

For a regular, single-group cluster, no changes should be directly
visible to the users. A multi-group cluster however will change some

- hbal will require a target group to operate on (no cluster-wide
  balancing)
- evacuation of (DRBD) instances from a node will be restricted to nodes
  in the same group, as inter-group moves are not implemented yet
- capacity, while showing correct data, will not give per-group details

There are other changes in this release:

- fixed a long-standing bug in hscan related to node memory data
- changed the text backend format, which unfortunately invalidates old
- error handling improvements, so that invalid input data reports better
- the simulation backend changed its syntax: it now takes the allocation
  policy too, and can generate multiple groups
- (internal) man page generation moved to pandoc from hand-written,
  which is helpful as it can also generate HTML versions
- the balancing algorithm has been changed to work in parallel, if the
  code is linked against the multi-threaded runtime; this gives a very
  good speedup (~80% on 4 cores, ~60-70% on 12 cores)

Version 0.2.8 (Thu, 23 Dec 2010)
--------------------------------

- fixed balancing function for big clusters, which will improve corner
  cases where hbal didn't see any solution even though the cluster was
  obviously not well balanced
- fixed exit code of hbal in case of (Luxi) job errors
- changed the signal handling in hbal in order to make hbal control
  easier: instead of synchronising on the count of signals, make SIGINT
  cause graceful termination, and SIGTERM an immediate one
- increased the tag exclusion weight so that it has greater importance
- slight improvement to the speed of balancing via algorithm tweaks

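
The signal policy from this release (SIGINT for a graceful stop,
SIGTERM for an immediate one) can be sketched as follows. This is a
minimal Python illustration rather than hbal's actual code: the names
``StopController`` and ``run_jobs`` and the exit status 1 are
assumptions made for the example.

```python
import signal
import sys


class StopController:
    """Tracks whether a graceful stop was requested (hypothetical helper)."""

    def __init__(self):
        self.stop_requested = False

    def on_sigint(self, signum, frame):
        # Graceful: remember the request and let the current job finish.
        self.stop_requested = True

    def on_sigterm(self, signum, frame):
        # Immediate: give up right away (exit status 1 is an assumption).
        sys.exit(1)

    def install(self):
        # Register both handlers with the running process.
        signal.signal(signal.SIGINT, self.on_sigint)
        signal.signal(signal.SIGTERM, self.on_sigterm)


def run_jobs(jobs, controller):
    """Run jobs in order; stop between jobs once a stop was requested."""
    done = []
    for job in jobs:
        if controller.stop_requested:
            break
        done.append(job())
    return done
```

The point of the design is that a single signal has a single meaning,
so a wrapper script does not need to count how many signals it sent.
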
Version 0.2.7 (Thu, 07 Oct 2010)
--------------------------------

- fixed the error message for hail multi-evacuation mode
- improve evacuation mode for offline secondary nodes (ignore available

- add a new option ``-S`` to hbal and hspace that saves the cluster
  state at the end of the processing in the text format used by the
  ``-t`` option, for later re-processing
- two new options for hbal, ``-g`` and ``--min-gain-limit``, that should
  help in limiting the number of balancing steps with a low gain in the
  final
- hbal, when executing jobs, will now wait for the current jobs to
  finish at the first stop (e.g. ^C); if the user wants immediate exit,
  another signal should be sent
- added “normalized” physical CPU units in hspace output (NPU), which
  represents units of physical CPUs free/used, based on the max-cpu

Version 0.2.6 (Mon, 26 Jul 2010)
--------------------------------

Exactly three months since the last release. Many internal changes, plus
a couple of important changes in the balancing algorithm.

First, the balancing may now introduce N+1 errors, if this solves other,
more critical problems. For the moment, this means that moving instances
away from offline nodes is allowed even if it creates N+1 errors, and
that means evacuation can be done in more cases.

Second, the scoring for N+1 has changed. In previous versions, it simply
counted the number of failing N+1 nodes, which means moving an instance
away from an N+1 failed node (but without the node 'clearing' the N+1
status) was not reflected in the cluster score. As such, the balancing
algorithm managed to clear N+1 errors only sometimes, since usually it
takes more than one move for this, and the first prerequisite move was
not 'rewarded' appropriately and thus it was not selected. Now, it is
possible to fix many more error cases than before: on a simulated 40
node cluster full of instances (symmetrically allocated on all nodes),
around five nodes can be evacuated before N+1 errors can be solved,
whereas 0.2.5 could evacuate at best one node.

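
To see why a pure count does not reward the first of several moves, the
following Python sketch compares the old count-based metric with a
graded one. The graded metric (total missing memory) is only an
assumption chosen for illustration; these notes do not state the
formula htools actually uses, and the numbers are invented.

```python
def count_score(shortfalls):
    """Old-style metric: number of nodes failing N+1 (shortfall > 0)."""
    return sum(1 for s in shortfalls if s > 0)


def graded_score(shortfalls):
    """Hypothetical graded metric: total memory missing across nodes."""
    return sum(max(s, 0) for s in shortfalls)


# Memory (in MiB) each node lacks to satisfy N+1; positive means failing.
before = [3072, 0, 0]
# One instance moved away from the first node: still failing, but less so.
after_first_move = [1024, 0, 0]

# The count is unchanged, so the move looks useless to the balancer...
unchanged = count_score(before) == count_score(after_first_move)
# ...while a graded metric registers the improvement.
improved = graded_score(after_first_move) < graded_score(before)
```
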
There were some other internal changes to the scoring algorithm, such
that now the metrics have associated weights, and they are not all of
the same importance anymore. As of now, the only change is that offline
instances have a higher weight, which should favour proper node

Among the other changes:

- fixed the hspace KM_POOL_* metrics, which were returned as the final
  state and not as the delta between the initial and final states
- fixed hspace handling of N+1 failing clusters: before, it used to
  generate a 'fake' response, and the structure of this response was not
  always in sync with the real responses, leading to missing items;
  currently it proceeds correctly through the code (skipping the
  computation), and uses the same display mechanisms as the normal case
- fixed hscan exit code for RAPI failures: previously it finished with
  success even if all the clusters failed, which was creating issues
  with the live-test script; now it exits with exit code 2 for RAPI
  failures (unfortunately this is still not optimal as LUXI failures
  will use exit code 1, the same as the command line)
- changed the limit values for CPU/disk, which previously were used
  optionally, whereas now they are always used; the default cpu ratio
  limit is now 64 VCPUs per PCPU
- changed the internal handling of the short name vs. original
  (Ganeti-provided) name; now internally we always use the full name,
  and only in display routines we show the shortened (called 'alias')
  name; as a result, the -O and --excluded-instances options now accept
  both the full name and the shortened name
- changed internal handling of JSON conversions and errors, such that
  now we show a better context for failure messages, which should help
  with diagnosing the malformed message
- changed the names for a few node fields, and added some more nodes;
  this is most likely to help with debugging, and not with regular
- changed the node fields option to allow the '+' prefix to mean 'extend
  the default fields list' rather than start from fresh (similar to
  Ganeti's implementation)
- a few internal changes related to the LUXI protocol implementation,
  which should make it safer against potential bugs, one
  optimization that should help with large messages, and some patches
  in preparation for potential expansion of the LUXI backend functionality

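
The '+' prefix rule for the node fields option can be illustrated with a
short Python sketch. The default field names below are placeholders and
not the real htools field list, and ``select_fields`` is a hypothetical
helper.

```python
# Hypothetical default field list; the real htools field names may differ.
DEFAULT_FIELDS = ["name", "free_mem", "free_disk"]


def select_fields(spec, defaults=DEFAULT_FIELDS):
    """Return the fields to display for a comma-separated spec.

    A leading '+' extends the defaults; otherwise the spec replaces them.
    """
    if spec.startswith("+"):
        extra = [f for f in spec[1:].split(",") if f]
        return list(defaults) + extra
    return [f for f in spec.split(",") if f]
```

With these placeholder names, ``select_fields("+pri_cnt")`` keeps the
three defaults and appends ``pri_cnt``, while
``select_fields("name,pri_cnt")`` shows only the two named fields.
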
And finally, many improvements on unittests and the live-test
script. Test coverage is much enhanced, and the test infrastructure has
better error reporting; this should lead down-the-road to better code

Version 0.2.5 (Mon, 26 Apr 2010)
--------------------------------

Some internal cleanup plus a few user-visible changes:

- new option for marking instances as 'do-not-move' during rebalancing
- allow ``hscan`` to scan the local cluster via Luxi
- add more metrics to ``hspace`` which show the delta between original
  state and final state better (only valid for tiered allocation)

Version 0.2.4 (Mon, 22 Feb 2010)
--------------------------------

Two improvements for node evacuation:

- hbal takes a new parameter ``--evac-mode`` that restricts the
  instances to be moved to the ones on offline/drained nodes, which
  should reduce the work done
- hail supports the new ``multi-evacuate`` mode of the IAllocator
  protocol, that will be released in a minor release on the Ganeti 2.1

Version 0.2.3 (Thu, 4 Feb 2010)
--------------------------------

- Fixes selection of secondary node: previously, if the cluster had
  many N+1 failures, an N+1 failed node could be selected as secondary
  even if it did not have enough memory to allow the instance to be
  migrated/failed over to it; this is bad for automated tools, since
  we can get the cluster in an unhealthy state
- Switch the text backend to a single input file, that is generated
  now by hscan and shouldn't be generated manually via
  gnt-node/instance list anymore; this allows richer information to be
  kept in the file, and slightly simplifies the internals of the text

Version 0.2.2 (Tue, 29 Dec 2009)
--------------------------------

Small release, 0.2.1 was broken and thus this was released earlier:

- Release 0.2.1 broke the LUXI backend due to a typo, fixed
- Added a live-test script that should catch errors like the above one
  in the future (needs a working, non-empty cluster)
- Changed RAPI and LUXI backends to treat drained nodes as offline,
  similar to the IAllocator backend change in 0.2.0 (which was wrongly
  marked as affecting all backends)
- Changed the metrics for offline instances and N1 score from percent to
  count, in order to increase the priority of evacuations
- Added a new metric (offline primary instances) which should fix the
  evacuation of an offline node in a 2-node cluster

Version 0.2.1 (Wed, 2 Dec 2009)
--------------------------------

- Added instance exclusion defined via instance tags
- Fixed the output of hspace to be again parseable from the shell

Version 0.2.0 (Tue, 10 Nov 2009)
--------------------------------

A significant release, with a few new major features:

- Added direct execution of the hbal solution when using the Luxi
  backend; the steps for each instance move are submitted as a single
  job, and the different jobs are submitted as groups in order to
  parallelise the execution of moves
- Added support for balancing based on dynamic utilisation data for
  instances, fed in via a text file; by default, all instances are
  considered equal and this change also improves the equalisation of
  secondary instances per node
- Added support for tiered capacity calculation in hspace, where we
  start from a maximum instance spec and decrease the spec when we run
  out of resources; this should give a better measure of available
  capacity on 'fragmented' clusters; this is done separately from the
  current fixed-mode computation

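
The idea behind tiered capacity calculation can be sketched in a few
lines of Python. This is a deliberately simplified model with one
resource (memory) and a naive placement rule; the function name, the
fixed ``step`` used to shrink the spec, and the choice of the emptiest
node are all assumptions, not a description of the hspace algorithm.

```python
def tiered_capacity(node_free, max_spec, min_spec, step):
    """Place instances, shrinking the spec whenever the current one no
    longer fits anywhere. Returns {spec: number of instances placed}."""
    if not node_free or step <= 0:
        return {}
    free = list(node_free)
    placed = {}
    spec = max_spec
    while spec >= min_spec:
        # Simplistic placement rule: use the node with the most free memory.
        best = max(range(len(free)), key=lambda i: free[i])
        if free[best] >= spec:
            free[best] -= spec
            placed[spec] = placed.get(spec, 0) + 1
        else:
            # Out of room at this tier: fall back to a smaller spec.
            spec -= step
    return placed
```

For two nodes with 5 and 3 units free, a fixed spec of 4 fits only one
instance, whereas ``tiered_capacity([5, 3], 4, 1, 1)`` also uses the
leftover fragments with smaller specs.
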
Also there have been many minor improvements:

- Added option for showing instances (``--print-instances``), similar to
  the print nodes option
- Added support for customising the node list via an argument to the
  print nodes option in the form of a comma-separated list of field
  names; currently the field names are not documented, expecting further
  changes in a next release
- Enhanced the error reporting in the Luxi and Rapi backends
- Changed the handling of drained nodes, now being treated the same as
  offline nodes, for Ganeti 2.0.4+ compatibility
- A number of internal changes, simplifying code and merging some
- Simplify the build system in relation to creation of archives

Version 0.1.8 (Tue, 29 Sep 2009)
--------------------------------

- Brown-paper-bag release fixing haddock issues

Version 0.1.7 (Mon, 28 Sep 2009)
--------------------------------

- Fixed a bug in the Luxi backend for big responses
- Fixed test suite exit code in presence of test failures
- Changed the migrate operation to run failover instead for instances
  which were marked as not running in the input data (this could have
  been changed since then, but it's better than today's always migrate)
- Added support for 'cheap' moves only (only migrate/failover) in
- Added support for building without curl (thus no RAPI backend)

Version 0.1.6 (Wed, 19 Aug 2009)
--------------------------------

- Added support for Luxi (the native Ganeti protocol)
- Added support for simulated clusters (for hspace only)
- Added timeouts for the RAPI backend
- Fixed a few inconsistencies in the command line handling
- Fixed handling of errors while loading data
- The 'network' is a new dependency due to the Luxi addition

Version 0.1.5 (Thu, 09 Jul 2009)
--------------------------------

- Removed obsolete hn1 program; this allowed removal of a lot of
- Lots of changes in hspace: the output is now a shell fragment, so
  that scripts can source it or parse it more easily; added failure
  reasons; optimised to use less memory for large clusters
- Optimized the scoring algorithm (used by all tools) so that now
  computations should be faster

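
Because the hspace output is a shell fragment, a script that does not
want to ``source`` it can still read it as ``KEY=VALUE`` pairs. The
Python sketch below shows one way to do that with the standard ``shlex``
module; the key names in the usage note are invented for the example and
are not real hspace output keys.

```python
import shlex


def parse_shell_fragment(text):
    """Parse simple KEY=VALUE shell assignments into a dictionary.

    Only plain assignments are handled; shell quoting is honoured and
    '#' comments are skipped. Anything else in the fragment is ignored.
    """
    values = {}
    for token in shlex.split(text, comments=True):
        key, sep, value = token.partition("=")
        if sep and key:
            values[key] = value
    return values
```

For instance, a fragment containing ``EXAMPLE_OK=1`` and
``EXAMPLE_MSG="two words"`` on separate lines yields a dictionary with
the two keys mapped to the strings ``1`` and ``two words``.
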
Version 0.1.4 (Tue, 16 Jun 2009)
--------------------------------

- Added CPU count/ratio of virtual-to-physical CPUs to the cluster
  scoring methods; this means that now the balancer, the iallocator
  plugin and so on will try to keep the VCPU-to-PCPU ratio equal across
- Fixed some hscan bugs
- Fixed the way iallocator reads the total disk size (was broken and it
  was always falling back to summing the disk sizes)
- Internals: fixed most compile-time warnings

Version 0.1.3 (Fri, 05 Jun 2009)
--------------------------------

- Fix a bug in the ReplacePrimary instance moves, affecting most of the

Version 0.1.2 (Tue, 02 Jun 2009)
--------------------------------

- Add a new program, “hspace”, which computes the free space on a
  cluster (based on a given instance spec)
- Improvements in API docs and partially in the user docs
- Started adding unittests

Version 0.1.1 (Tue, 26 May 2009)
--------------------------------

- Add a new program, “hail”, which is an iallocator plugin and can
  allocate/relocate instances
- Experimental support for non-mirrored instances (hail supports them,
  hbal should no longer abort when it finds such instances and simply
- The RAPI port and/or scheme can be overridden now, and even “file://”
  schemes can be used if the message body has been saved under the
- Lots of code reorganization, esp. rewritten loading pipeline
- Better data checking and better error messages in case validation
  fails; tools now consider nodes with error in input data (‘?’ returned
  by ganeti) as offline
- Small enhancement to the makefile for simpler packaging

Version 0.1.0 (Tue, 19 May 2009)
--------------------------------

- Drop compatibility with Ganeti 1.2
- Add a new minimum score option (with a very low default), should help
  with very good clusters (but is still not optimal)
- Add a --quiet option to hbal
- Add support for reading offline nodes directly from the cluster

Version 0.0.8 (Tue, 21 Apr 2009)
--------------------------------

- hbal: prevent mismatches in wrong node names being passed to -O, by
  aborting in this case
- add the ability to write the commands (-C) to a script via (-C<file>),
  so that it can be later executed directly; this has also changed the
  commands to include the necessary -f flags to skip confirmations
- add checks for extra argument in hbal and hn1, so that unintended
- raise the accepted “missing” memory limit to 512MB, to cover usual Xen

Version 0.0.7 (Mon, 23 Mar 2009)
--------------------------------

- added support for offline nodes, which are not used as targets for
  instance relocation and if they hold instances the hbal algorithm will
  attempt to relocate these away
- added support for offline instances, which now will no longer skew the
  free memory estimation of nodes; the algorithm will no longer create
  conditions for N+1 failures when such instances are later started
- implemented a complete model of node resources, in order to prevent an
  unintended re-occurrence of cases like the offline instance where we
  miscalculate some node resource; this now gives a warning in case the
  node reported free disk or free memory deviates by more than a set
  amount from the expected value
- a new tool *hscan* that can generate the input text-file for the other
  tools by collection via RAPI
- some small changes to the build system to make it more friendly; also
  included the generated documentation in the source archive

Version 0.0.6 (Mon, 16 Mar 2009)
--------------------------------

- re-factored the hbal algorithm to make it stable in the sense that it
  gives the same solution when restarted from the middle; barring
  rounding of disk/memory and incomplete reporting from Ganeti (for
  1.2), it should be now feasible to rely on its output without
  generating moves ad infinitum
- the hbal algorithm now uses two more variables: the node N+1 failures
  and the amount of reserved memory; the former tries to ‘fix’
  the N+1 status, the latter tries to distribute secondaries more
- the hbal algorithm now uses two more moves at each step:
  replace+failover and failover+replace (besides the original failover,
  replace, and failover+replace+failover)
- slightly changed the build system to embed GIT version/tags into the
  binaries so that we can tell which tree a binary was built from,
  either via ‘--version’ or via “strings hbal|grep version”
- changed the solution list and in general the hbal output to be more
  clear by default, and changed “gnt-instance failover” to “gnt-instance
- added man pages for the two binaries

Version 0.0.5 (Mon, 09 Mar 2009)
--------------------------------

- a few small improvements for hbal (possibly undone by later changes),
  hbal is now considerably faster
- fix documentation building
- allow hbal to work on non N+1 compliant clusters, but without
  guarantees that the end cluster will be compliant; in any case, this
  should give a smaller number of nodes that are not compliant if the
  cluster state permits it
- strip common domain suffix from nodes and instances, so that output is
  shorter and hopefully clearer

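
Stripping a common domain suffix can be sketched as below. This Python
version is an illustration under stated assumptions, not the htools
implementation: it compares whole dot-separated labels and always keeps
at least the first label of each name, and both function names are
hypothetical.

```python
def common_domain_suffix(names):
    """Longest suffix of whole dot-separated labels shared by all names."""
    if not names:
        return ""
    split = [n.split(".")[::-1] for n in names]  # labels, last one first
    common = []
    for labels in zip(*split):
        if len(set(labels)) != 1:
            break
        common.append(labels[0])
    # Never strip a name down to nothing: keep at least its first label.
    shortest = min(len(s) for s in split)
    common = common[:shortest - 1]
    return "." + ".".join(reversed(common)) if common else ""


def shorten(names):
    """Remove the shared domain suffix from every name."""
    suffix = common_domain_suffix(names)
    return [n[:len(n) - len(suffix)] if suffix else n for n in names]
```

Names that share no trailing labels are returned unchanged, so the
shortened output stays unambiguous.
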
Version 0.0.4 (Sun, 15 Feb 2009)
--------------------------------

- better balancing algorithm in hbal
- implemented an RAPI collector, now the cluster data can be gathered
  automatically via RAPI and doesn't need manual export of node and

Version 0.0.3 (Wed, 28 Jan 2009)
--------------------------------

- initial release of hbal, a cluster rebalancing tool
- input data format changed due to hbal requirements

Version 0.0.2 (Tue, 06 Jan 2009)
--------------------------------

- fix handling of some common cases (cluster N+1 compliant from the
  start, too big depth given, failure to compute solution)
- add option to print the needed command list for reaching the proposed

Version 0.0.1 (Tue, 06 Jan 2009)
--------------------------------

- initial release of hn1 tool

.. vim: set textwidth=72 :