Ganeti-htools release notes
===========================

    
Version 0.2.7 (Unreleased)
--------------------------

Bug fixes:

- fixed the error message for hail multi-evacuation mode

New features:

- add a new option ``-S`` to hbal and hspace that saves the cluster
  state at the end of the processing in the text format used by the
  ``-t`` option, for later re-processing

    
Version 0.2.6 (Mon, 26 Jul 2010)
--------------------------------

Exactly three months since the last release. Many internal changes, plus
a couple of important changes in the balancing algorithm.

First, the balancing may now introduce N+1 errors, if this solves other,
more critical problems. For the moment, this means that moving instances
away from offline nodes is allowed even if it creates N+1 errors, and
that means evacuation can be done in more cases.

    
Second, the scoring for N+1 has changed. In previous versions, it simply
counted the number of failing N+1 nodes, which means moving an instance
away from an N+1 failed node (but without the node 'clearing' the N+1
status) was not reflected in the cluster score. As such, the balancing
algorithm managed to clear N+1 errors only sometimes, since usually it
takes more than one move for this, and the first prerequisite move was
not 'rewarded' appropriately and thus it was not selected. Now, it is
possible to fix many more error cases than before: on a simulated 40
node cluster full of instances (symmetrically allocated on all nodes),
around five nodes can be evacuated before N+1 errors can no longer be
solved, whereas 0.2.5 could evacuate at best one node.
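
The effect of the scoring change can be illustrated with a small
sketch. It assumes, purely for illustration, that the new metric counts
the instances living on N+1 failing nodes; these notes do not spell out
the formula, and the ``Node`` class and both function names are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy node model; field names are illustrative, not htools' own."""
    name: str
    instances: int   # number of instances hosted on the node
    fails_n1: bool   # True while the node fails the N+1 check


def old_n1_score(nodes):
    """Behaviour described for earlier versions: count failing nodes."""
    return sum(1 for n in nodes if n.fails_n1)


def graded_n1_score(nodes):
    """Assumed alternative: count instances living on failing nodes."""
    return sum(n.instances for n in nodes if n.fails_n1)


before = [Node("node1", 5, True), Node("node2", 3, False)]
# Move one instance off node1; node1 still fails N+1 afterwards.
after = [Node("node1", 4, True), Node("node2", 4, False)]

print(old_n1_score(before), old_n1_score(after))
print(graded_n1_score(before), graded_n1_score(after))
```

With the plain node count the intermediate move leaves the score
unchanged, so a greedy balancer has no reason to select it; with a
graded metric the score drops and the prerequisite move is rewarded.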

    
There were some other internal changes to the scoring algorithm, such
that now the metrics have associated weights, and they are not all of
the same importance anymore. As of now, the only change is that offline
instances have a higher weight, which should favour proper node
evacuations.
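
A weighted score of this kind can be sketched as follows. The metric
names and the weight values are assumptions made for the example, not
the ones used by htools; only the idea that the offline-instances metric
weighs more than the others comes from the text above.

```python
# Hypothetical metric names and weights, chosen only for illustration.
WEIGHTS = {
    "free_mem_cv": 1.0,
    "free_disk_cv": 1.0,
    "n1_score": 1.0,
    "offline_instances": 4.0,  # weighted higher to favour evacuation
}


def cluster_score(metrics, weights=WEIGHTS):
    """Weighted sum of the metric values; lower is better."""
    return sum(weights[name] * value for name, value in metrics.items())


# Well balanced, but one instance still sits on an offline node.
unevacuated = {
    "free_mem_cv": 0.1, "free_disk_cv": 0.1,
    "n1_score": 0.0, "offline_instances": 1.0,
}
# Less balanced, but the offline node has been emptied.
evacuated = {
    "free_mem_cv": 0.8, "free_disk_cv": 0.8,
    "n1_score": 0.0, "offline_instances": 0.0,
}

unit_weights = dict.fromkeys(WEIGHTS, 1.0)
print(cluster_score(unevacuated, unit_weights) < cluster_score(evacuated, unit_weights))
print(cluster_score(evacuated) < cluster_score(unevacuated))
```

With equal weights the unevacuated layout scores better; once the
offline-instances metric is weighted more heavily, the evacuated layout
wins, which is the behaviour the change is meant to favour.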

    
Among the other changes:

- fixed the hspace ``KM_POOL_*`` metrics, which were returned as the
  final state and not as the delta between the initial and final states
- fixed hspace handling of N+1 failing clusters: before, it used to
  generate a 'fake' response, and the structure of this response was not
  always in sync with the real responses, leading to missing items;
  currently it proceeds correctly through the code (skipping the
  computation), and uses the same display mechanisms as the normal case
- fixed hscan exit code for RAPI failures: previously it finished with
  success even if all the clusters failed, which was creating issues
  with the live-test script; now it exits with exit code 2 for RAPI
  failures (unfortunately this is still not optimal as LUXI failures
  will use exit code 1, the same as command line errors)
- changed the limit values for CPU/disk, which previously were used
  optionally, whereas now they are always used; the default CPU ratio
  limit is now 64 VCPUs per PCPU
- changed the internal handling of the short name vs. original
  (Ganeti-provided) name; now internally we always use the full name,
  and only in display routines we show the shortened (called 'alias')
  name; as a result, the -O and --excluded-instances options now accept
  both the full name and the shortened name
- changed internal handling of JSON conversions and errors, such that
  now we show a better context for failure messages, which should help
  with diagnosing the malformed message
- changed the names for a few node fields, and added some more nodes;
  this is most likely to help with debugging, and not with regular
  operation though
- changed the node fields option to allow the '+' prefix to mean 'extend
  the default fields list' rather than start from fresh (similar to
  Ganeti's implementation)
- a few internal changes related to the LUXI protocol implementation,
  which should make it safer against potential bugs, one optimization
  that should help with large messages, and some patches in preparation
  for potential expansion of the LUXI backend functionality
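
The '+' prefix semantics for the node fields option can be sketched as
below. The default field list and the field names are hypothetical, and
the function is a minimal illustration rather than the htools
implementation.

```python
# Hypothetical default field list; the real htools field names differ.
DEFAULT_FIELDS = ["name", "status", "free_mem", "free_disk"]


def parse_fields(arg, defaults=DEFAULT_FIELDS):
    """Return the fields to display for a comma-separated argument.

    A leading '+' extends the defaults; otherwise the argument
    replaces them entirely.
    """
    if arg.startswith("+"):
        extra = [f for f in arg[1:].split(",") if f]
        return list(defaults) + extra
    return [f for f in arg.split(",") if f]


print(parse_fields("+score,load"))  # defaults plus two extra fields
print(parse_fields("name,score"))   # only the listed fields
```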

    
And finally, many improvements to the unit tests and the live-test
script. Test coverage is much enhanced, and the test infrastructure has
better error reporting; this should lead, down the road, to better code
and fewer bugs.

    
Version 0.2.5 (Mon, 26 Apr 2010)
--------------------------------

Some internal cleanup plus a few user-visible changes:

- new option for marking instances as 'do-not-move' during rebalancing
- allow ``hscan`` to scan the local cluster via Luxi
- add more metrics to ``hspace`` which show the delta between original
  state and final state better (only valid for tiered allocation)

    
Version 0.2.4 (Mon, 22 Feb 2010)
--------------------------------

Two improvements for node evacuation:

- hbal takes a new parameter ``--evac-mode`` that restricts the
  instances to be moved to the ones on offline/drained nodes, which
  should reduce the work done
- hail supports the new ``multi-evacuate`` mode of the IAllocator
  protocol, which will be released in a minor release on the Ganeti 2.1
  branch

    
Version 0.2.3 (Thu,  4 Feb 2010)
--------------------------------

A small release:

- Fixes selection of secondary node: previously, if the cluster had
  many N+1 failures, an N+1 failed node could be selected as secondary
  even if it did not have enough memory to allow the instance to be
  migrated/failed over to it; this is bad for automated tools, since
  we can get the cluster in an unhealthy state
- Switch the text backend to a single input file, which is now
  generated by hscan and shouldn't be generated manually via
  gnt-node/instance list anymore; this allows richer information to be
  kept in the file, and slightly simplifies the internals of the text
  backend

    
Version 0.2.2 (Tue, 29 Dec 2009)
--------------------------------

Small release, 0.2.1 was broken and thus this was released earlier:

- Release 0.2.1 broke the LUXI backend due to a typo, fixed
- Added a live-test script that should catch errors like the above one
  in the future (needs a working, non-empty cluster)
- Changed RAPI and LUXI backends to treat drained nodes as offline,
  similar to the IAllocator backend change in 0.2.0 (which was wrongly
  marked as affecting all backends)
- Changed the metrics for offline instances and N1 score from percent to
  count, in order to increase the priority of evacuations
- Added a new metric (offline primary instances) which should fix the
  evacuation of an offline node in a 2-node cluster

    
Version 0.2.1 (Wed,  2 Dec 2009)
--------------------------------

- Added instance exclusion defined via instance tags
- Fixed the output of hspace to be again parseable from the shell

    
Version 0.2.0 (Tue, 10 Nov 2009)
--------------------------------

A significant release, with a few new major features:

- Added direct execution of the hbal solution when using the Luxi
  backend; the steps for each instance move are submitted as a single
  job, and the different jobs are submitted as groups in order to
  parallelise the execution of moves
- Added support for balancing based on dynamic utilisation data for
  instances, fed in via a text file; by default, all instances are
  considered equal and this change also improves the equalisation of
  secondary instances per node
- Added support for tiered capacity calculation in hspace, where we
  start from a maximum instance spec and decrease the spec when we run
  out of resources; this should give a better measure of available
  capacity on 'fragmented' clusters; this is done separately from the
  current fixed-mode computation
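
The tiered capacity idea from the last item can be sketched as follows.
This is a simplified model under stated assumptions: only memory is
tracked, the spec shrinks by a fixed step, and instances are placed on
the node with the most free memory. The function name and parameters are
hypothetical and do not reflect the hspace internals.

```python
def tiered_capacity(free_mem, max_spec, min_spec, step):
    """Count how many instances fit per spec size, largest spec first.

    free_mem: free memory per node (MiB). Only memory is modelled here;
    a real computation would track disk and CPU as well.
    """
    if not free_mem or step <= 0:
        raise ValueError("need at least one node and a positive step")
    free = list(free_mem)
    allocated = {}
    spec = max_spec
    while spec >= min_spec:
        # Try the node with the most free memory.
        best = max(range(len(free)), key=lambda i: free[i])
        if free[best] >= spec:
            free[best] -= spec
            allocated[spec] = allocated.get(spec, 0) + 1
        else:
            # Out of resources at this size: decrease the spec.
            spec -= step
    return allocated


result = tiered_capacity([4096, 3072, 1024],
                         max_spec=2048, min_spec=512, step=512)
print(result)
```

In this example a fixed 2048 MiB spec fits three times, and the
remaining fragments still hold two 1024 MiB instances, which a
fixed-mode computation would not report.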

    
Also there have been many minor improvements:

- Added option for showing instances (“--print-instances”), similar to
  the print nodes option
- Added support for customising the node list via an argument to the
  print nodes option in the form of a comma-separated list of field
  names; currently the field names are not documented, as further
  changes are expected in a later release
- Enhanced the error reporting in the Luxi and Rapi backends
- Changed the handling of drained nodes, now being treated the same as
  offline nodes, for Ganeti 2.0.4+ compatibility
- A number of internal changes, simplifying code and merging some
  disparate functions
- Simplify the build system in relation to creation of archives

    
Version 0.1.8 (Tue, 29 Sep 2009)
--------------------------------

- Brown-paper-bag release fixing haddock issues

    
Version 0.1.7 (Mon, 28 Sep 2009)
--------------------------------

- Fixed a bug in the Luxi backend for big responses
- Fixed test suite exit code in presence of test failures
- Changed the migrate operation to run failover instead for instances
  which were marked as not running in the input data (this could have
  changed since then, but it's better than always migrating, as was
  done until now)
- Added support for 'cheap' moves only (only migrate/failover) in
  balancing
- Added support for building without curl (thus no RAPI backend)

    
Version 0.1.6 (Wed, 19 Aug 2009)
--------------------------------

- Added support for Luxi (the native Ganeti protocol)
- Added support for simulated clusters (for hspace only)
- Added timeouts for the RAPI backend
- Fixed a few inconsistencies in the command line handling
- Fixed handling of errors while loading data
- The 'network' is a new dependency due to the Luxi addition

    
Version 0.1.5 (Thu, 09 Jul 2009)
--------------------------------

- Removed obsolete hn1 program; this allowed removal of a lot of
  supporting code
- Lots of changes in hspace: the output is now a shell fragment, so that
  scripts can source or parse it more easily; added failure reasons;
  optimised to use less memory for large clusters
- Optimized the scoring algorithm (used by all tools) so that now
  computations should be faster

    
Version 0.1.4 (Tue, 16 Jun 2009)
--------------------------------

- Added CPU count/ratio of virtual-to-physical CPUs to the cluster
  scoring methods; this means that now the balancer, the iallocator
  plugin and so on will try to keep the VCPU-to-PCPU ratio equal across
  the cluster
- Fixed some hscan bugs
- Fixed the way iallocator reads the total disk size (was broken and it
  was always falling back to summing the disk sizes)
- Internals: fixed most compile-time warnings
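
One way to express "keeping the VCPU-to-PCPU ratio equal across the
cluster" as a score is to measure the spread of the per-node ratios. The
sketch below uses the population standard deviation for this; that
choice, the function name and the input layout are assumptions for
illustration, not a description of the scoring code.

```python
from statistics import pstdev


def cpu_ratio_spread(nodes):
    """Spread of the VCPU-to-PCPU ratio across nodes (0 means equal).

    nodes: iterable of (vcpus_allocated, physical_cpus) pairs.
    """
    ratios = [vcpus / pcpus for vcpus, pcpus in nodes]
    return pstdev(ratios)


uneven = [(16, 4), (4, 4), (4, 4)]   # ratios 4.0, 1.0, 1.0
even = [(8, 4), (8, 4), (8, 4)]      # ratios 2.0, 2.0, 2.0

# A balancer minimising this value prefers the even layout.
print(cpu_ratio_spread(uneven) > cpu_ratio_spread(even))
```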

    
Version 0.1.3 (Fri, 05 Jun 2009)
--------------------------------

- Fix a bug in the ReplacePrimary instance moves, affecting most of the
  tools

    
Version 0.1.2 (Tue, 02 Jun 2009)
--------------------------------

- Add a new program, “hspace”, which computes the free space on a
  cluster (based on a given instance spec)
- Improvements in API docs and partially in the user docs
- Started adding unit tests

    
Version 0.1.1 (Tue, 26 May 2009)
--------------------------------

- Add a new program, “hail”, which is an iallocator plugin and can
  allocate/relocate instances
- Experimental support for non-mirrored instances (hail supports them,
  hbal should no longer abort when it finds such instances and simply
  ignore them)
- The RAPI port and/or scheme can be overridden now, and even “file://”
  schemes can be used if the message body has been saved under the
  appropriate name
- Lots of code reorganization, esp. rewritten loading pipeline
- Better data checking and better error messages in case validation
  fails; tools now consider nodes with errors in their input data (‘?’
  returned by Ganeti) as offline
- Small enhancement to the makefile for simpler packaging

    
Version 0.1.0 (Tue, 19 May 2009)
--------------------------------

- Drop compatibility with Ganeti 1.2
- Add a new minimum score option (with a very low default), should help
  with very good clusters (but is still not optimal)
- Add a --quiet option to hbal
- Add support for reading offline nodes directly from the cluster

    
Version 0.0.8 (Tue, 21 Apr 2009)
--------------------------------

- hbal: prevent wrong node names from being passed to -O, by aborting
  in this case
- add the ability to write the commands (-C) to a script via (-C<file>),
  so that it can be later executed directly; this has also changed the
  commands to include the necessary -f flags to skip confirmations
- add checks for extra arguments in hbal and hn1, so that unintended
  errors are caught
- raise the accepted “missing” memory limit to 512MB, to cover usual Xen
  reservations

    
Version 0.0.7 (Mon, 23 Mar 2009)
--------------------------------

- added support for offline nodes, which are not used as targets for
  instance relocation and if they hold instances the hbal algorithm will
  attempt to relocate these away
- added support for offline instances, which now will no longer skew the
  free memory estimation of nodes; the algorithm will no longer create
  conditions for N+1 failures when such instances are later started
- implemented a complete model of node resources, in order to prevent an
  unintended re-occurrence of cases like the offline instance where we
  miscalculate some node resource; this now gives a warning in case the
  node reported free disk or free memory deviates by more than a set
  amount from the expected value
- a new tool *hscan* that can generate the input text-file for the other
  tools by collecting the data via RAPI
- some small changes to the build system to make it more friendly; also
  included the generated documentation in the source archive
    
321

    
322
Version 0.0.6 (Mon, 16 Mar 2009)
323
--------------------------------
324

    
325
- re-factored the hbal algorithm to make it stable in the sense that it
326
  gives the same solution when restarted from the middle; barring
327
  rounding of disk/memory and incomplete reporting from Ganeti (for
328
  1.2), it should be now feasible to rely on its output without
329
  generating moves ad infinitum
330
- the hbal algorithm now uses two more variables: the node N+1 failures
331
  and the amount of reserved memory; the first of which tries to ‘fix’
332
  the N+1 status, the latter tries to distribute secondaries more
333
  equally
334
- the hbal algorithm now uses two more moves at each step:
335
  replace+failover and failover+replace (besides the original failover,
336
  replace, and failover+replace+failover)
337
- slightly changed the build system to embed GIT version/tags into the
338
  binaries so that we know for a binary from which tree it was done,
339
  either via ‘--version’ or via “strings hbal|grep version”
340
- changed the solution list and in general the hbal output to be more
341
  clear by default, and changed “gnt-instance failover” to “gnt-instance
342
  migrate”
343
- added man pages for the two binaries
344

    
345

    
346
Version 0.0.5 (Mon, 09 Mar 2009)
347
--------------------------------
348

    
349
- a few small improvements for hbal (possibly undone by later changes),
350
  hbal is now quite faster
351
- fix documentation building
352
- allow hbal to work on non N+1 compliant clusters, but without
353
  guarantees that the end cluster will be compliant; in any case, this
354
  should give a smaller number of nodes that are not compliant if the
355
  cluster state permits it
356
- strip common domain suffix from nodes and instances, so that output is
357
  shorter and hopefully clearer
358

    
359

    
360
Version 0.0.4 (Sun, 15 Feb 2009)
361
--------------------------------
362

    
363
- better balancing algorithm in hbal
364
- implemented an RAPI collector, now the cluster data can be gathered
365
  automatically via RAPI and doesn't need manual export of node and
366
  instance list
367

    
368

    
369
Version 0.0.3 (Wed, 28 Jan 2009)
370
--------------------------------
371

    
372
- initial release of the hbal, a cluster rebalancing tool
373
- input data format changed due to hbal requirements
374

    
375

    
376
Version 0.0.2 (Tue, 06 Jan 2009)
377
--------------------------------
378

    
379
- fix handling of some common cases (cluster N+1 compliant from the
380
  start, too big depth given, failure to compute solution)
381
- add option to print the needed command list for reaching the proposed
382
  solution
383

    
384

    
385
Version 0.0.1 (Tue, 06 Jan 2009)
386
--------------------------------
387

    
388
- initial release of hn1 tool
389

    
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: