Ganeti-htools release notes
===========================

    
Version 0.2.7 (Thu, 07 Oct 2010)
--------------------------------

Bug fixes:

- fixed the error message for hail multi-evacuation mode
- improve evacuation mode for offline secondary nodes (ignore available
  memory)

New features:

- add a new option ``-S`` to hbal and hspace that saves the cluster
  state at the end of the processing in the text format used by the
  ``-t`` option, for later re-processing
- two new options to hbal, ``-g`` and ``--min-gain-limit``, that should
  help in limiting the number of balancing steps with a low gain in the
  final stages
- hbal, when executing jobs, will now wait for the current jobs to
  finish at the first stop (e.g. ^C); if the user wants immediate exit,
  another signal should be sent
- added "normalized" physical CPU units in hspace output (NPU), which
  represents units of physical CPUs free/used, based on the max-cpu
  ratio

    
Version 0.2.6 (Mon, 26 Jul 2010)
--------------------------------

Exactly three months since the last release. Many internal changes, plus
a couple of important changes in the balancing algorithm.

First, the balancing may now introduce N+1 errors, if this solves other,
more critical problems. For the moment, this means that moving instances
away from offline nodes is allowed even if it creates N+1 errors, and
that means evacuation can be done in more cases.

Second, the scoring for N+1 has changed. In previous versions, it simply
counted the number of failing N+1 nodes, which means moving an instance
away from an N+1 failed node (but without the node 'clearing' the N+1
status) was not reflected in the cluster score. As such, the balancing
algorithm managed to clear N+1 errors only sometimes, since usually it
takes more than one move for this, and the first prerequisite move was
not 'rewarded' appropriately and thus it was not selected. Now, it is
possible to fix many more error cases than before: on a simulated
40-node cluster full of instances (symmetrically allocated on all
nodes), around five nodes can be evacuated before N+1 errors can no
longer be solved, whereas 0.2.5 could evacuate at best one node.
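
The effect of a count-only N+1 metric can be illustrated with a toy
model. The sketch below is not the htools implementation: the ``Node``
class, its fields, and the graded metric (total memory shortfall) are
assumptions made purely for illustration, since these notes do not
state the new formula. It only shows why a move that helps a failing
node without clearing its N+1 status is invisible to a plain count.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy node model (hypothetical, not the htools data structure)."""
    free_mem: int      # MiB currently free on the node
    reserved_mem: int  # MiB needed to host secondaries if a peer fails

    def fails_n1(self) -> bool:
        # The node fails N+1 when it could not host its secondaries.
        return self.free_mem < self.reserved_mem

    def shortfall(self) -> int:
        # Distance from passing the N+1 check (0 when the node passes).
        return max(0, self.reserved_mem - self.free_mem)


def count_score(nodes):
    """Count-only metric: the number of N+1 failing nodes."""
    return sum(1 for n in nodes if n.fails_n1())


def graded_score(nodes):
    """Assumed graded metric: total memory shortfall over all nodes."""
    return sum(n.shortfall() for n in nodes)


before = [Node(free_mem=1024, reserved_mem=3072),
          Node(free_mem=4096, reserved_mem=1024)]
# One move frees 1024 MiB on the failing node, which still fails N+1.
after = [Node(free_mem=2048, reserved_mem=3072),
         Node(free_mem=3072, reserved_mem=1024)]

print(count_score(before), count_score(after))    # 1 1
print(graded_score(before), graded_score(after))  # 2048 1024
```

With the count-only metric the first move looks useless (1 versus 1),
so a greedy balancer would not pick it; any metric that grades partial
progress makes the same move visible as an improvement.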

    
There were some other internal changes to the scoring algorithm, such
that now the metrics have associated weights, and they are not all of
the same importance anymore. As of now, the only change is that offline
instances have a higher weight, which should favour proper node
evacuations.

Among the other changes:

- fixed the hspace KM_POOL_* metrics, which were returned as the final
  state and not as the delta between the initial and final states
- fixed hspace handling of N+1 failing clusters: before, it used to
  generate a 'fake' response, and the structure of this response was not
  always in sync with the real responses, leading to missing items;
  currently it proceeds correctly through the code (skipping the
  computation), and uses the same display mechanisms as the normal case
- fixed hscan exit code for RAPI failures: previously it finished with
  success even if all the clusters failed, which was creating issues
  with the live-test script; now it exits with exit code 2 for RAPI
  failures (unfortunately this is still not optimal as LUXI failures
  will use exit code 1, the same as the command line)
- changed the limit values for CPU/disk, which previously were used
  optionally, whereas now they are always used; the default cpu ratio
  limit is now 64 VCPUs per PCPU
- changed the internal handling of the short name vs. original
  (Ganeti-provided) name; now internally we always use the full name,
  and only in display routines we show the shortened (called 'alias')
  name; as a result, the -O and --excluded-instances options now accept
  both the full name and the shortened name
- changed internal handling of JSON conversions and errors, such that
  now we show a better context for failure messages, which should help
  with diagnosing the malformed message
- changed the names for a few node fields, and added some more nodes;
  this is most likely to help with debugging, and not with regular
  operation though
- changed the node fields option to allow the '+' prefix to mean 'extend
  the default fields list' rather than start from fresh (similar to
  Ganeti's implementation)
- a few internal changes related to the LUXI protocol implementation,
  which should make it safer against potential bugs, one optimization
  that should help with large messages, and some patches in preparation
  for potential expansion of the LUXI backend functionality
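
The '+' prefix semantics of the node fields option mentioned in the
list above can be sketched as follows. This is only an illustration:
the function name and the default field names are hypothetical and are
not taken from the htools sources.

```python
# Hypothetical default field names, for illustration only.
DEFAULT_FIELDS = ["name", "status", "free_mem"]


def resolve_fields(arg, defaults=DEFAULT_FIELDS):
    """Return the fields selected by a comma-separated argument.

    A leading '+' extends the default list; without it, the argument
    replaces the defaults. An empty argument keeps the defaults.
    """
    if not arg:
        return list(defaults)
    if arg.startswith("+"):
        extra = [f for f in arg[1:].split(",") if f]
        return list(defaults) + extra
    return [f for f in arg.split(",") if f]


print(resolve_fields("+free_disk,cpu_ratio"))
print(resolve_fields("name,cpu_ratio"))
```

The first call keeps the three defaults and appends the two extra
fields; the second call starts from scratch and returns only the two
fields given.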

    
And finally, many improvements on unittests and the live-test
script. Test coverage is much enhanced, and the test infrastructure has
better error reporting; this should lead down the road to better code
and fewer bugs...

    
Version 0.2.5 (Mon, 26 Apr 2010)
--------------------------------

Some internal cleanup plus a few user-visible changes:

- new option for marking instances as 'do-not-move' during rebalancing
- allow ``hscan`` to scan the local cluster via Luxi
- add more metrics to ``hspace`` which show the delta between original
  state and final state better (only valid for tiered allocation)

    
Version 0.2.4 (Mon, 22 Feb 2010)
--------------------------------

Two improvements for node evacuation:

- hbal takes a new parameter ``--evac-mode`` that restricts the
  instances to be moved to the ones on offline/drained nodes, which
  should reduce the work done
- hail supports the new ``multi-evacuate`` mode of the IAllocator
  protocol, that will be released in a minor release on the Ganeti 2.1
  branch

    
Version 0.2.3 (Thu,  4 Feb 2010)
--------------------------------

A small release:

- Fixes selection of secondary node: previously, if the cluster had
  many N+1 failures, an N+1 failed node could be selected as secondary
  even if it did not have enough memory to allow the instance to be
  migrated/failed over to it; this is bad for automated tools, since
  we can get the cluster in an unhealthy state
- Switch the text backend to a single input file, that is generated
  now by hscan and shouldn't be generated manually via
  gnt-node/instance list anymore; this allows richer information to be
  kept in the file, and slightly simplifies the internals of the text
  backend

    
Version 0.2.2 (Tue, 29 Dec 2009)
--------------------------------

Small release, 0.2.1 was broken and thus this was released earlier:

- Release 0.2.1 broke the LUXI backend due to a typo, fixed
- Added a live-test script that should catch errors like the above one
  in the future (needs a working, non-empty cluster)
- Changed RAPI and LUXI backends to treat drained nodes as offline,
  similar to the IAllocator backend change in 0.2.0 (which was wrongly
  marked as affecting all backends)
- Changed the metrics for offline instances and N1 score from percent to
  count, in order to increase the priority of evacuations
- Added a new metric (offline primary instances) which should fix the
  evacuation of an offline node in a 2-node cluster

    
Version 0.2.1 (Wed,  2 Dec 2009)
--------------------------------

- Added instance exclusion defined via instance tags
- Fixed the output of hspace to be again parseable from the shell

    
Version 0.2.0 (Tue, 10 Nov 2009)
--------------------------------

A significant release, with a few new major features:

- Added direct execution of the hbal solution when using the Luxi
  backend; the steps for each instance move are submitted as a single
  job, and the different jobs are submitted as groups in order to
  parallelise the execution of moves
- Added support for balancing based on dynamic utilisation data for
  instances, fed in via a text file; by default, all instances are
  considered equal and this change also improves the equalisation of
  secondary instances per node
- Added support for tiered capacity calculation in hspace, where we
  start from a maximum instance spec and decrease the spec when we run
  out of resources; this should give a better measure of available
  capacity on 'fragmented' clusters; this is done separately from the
  current fixed-mode computation
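
The idea behind the tiered capacity calculation can be sketched with a
deliberately simplified model. The code below is an assumption-laden
illustration, not the hspace algorithm: it tracks a single resource
(memory), ignores mirroring and N+1 checks, and shrinks the instance
spec by a fixed, hypothetical ``step``.

```python
def tiered_capacity(free_mem, max_spec, min_spec, step):
    """Count allocations per instance size, shrinking the size when stuck.

    free_mem: free memory per node, in MiB (single resource only).
    Returns a list of (spec, count) pairs, largest spec first.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    free = list(free_mem)
    result = []
    spec = max_spec
    while spec >= min_spec:
        count = 0
        for i in range(len(free)):
            # Place as many instances of this size as the node can hold.
            fits = free[i] // spec
            count += fits
            free[i] -= fits * spec
        if count:
            result.append((spec, count))
        # Out of room for this size: try the next smaller spec.
        spec -= step
    return result


print(tiered_capacity([5000, 3000, 900], max_spec=4096, min_spec=512, step=512))
# [(4096, 1), (2560, 1), (512, 2)]
```

In this toy run a fixed spec of 4096 MiB would report a capacity of one
instance, while the tiered walk also finds room for one 2560 MiB and two
512 MiB instances in the leftover fragments.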

    
Also there have been many minor improvements:

- Added option for showing instances (``--print-instances``), similar to
  the print nodes option
- Added support for customising the node list via an argument to the
  print nodes option in the form of a comma-separated list of field
  names; currently the field names are not documented, as further
  changes are expected in a future release
- Enhanced the error reporting in the Luxi and Rapi backends
- Changed the handling of drained nodes, now being treated the same as
  offline nodes, for Ganeti 2.0.4+ compatibility
- A number of internal changes, simplifying code and merging some
  disparate functions
- Simplified the build system in relation to creation of archives
    
201
Version 0.1.8 (Tue, 29 Sep 2009)
202
--------------------------------
203

    
204
- Brown-paper-bag release fixing haddock issues
205

    
206

    
Version 0.1.7 (Mon, 28 Sep 2009)
--------------------------------

- Fixed a bug in the Luxi backend for big responses
- Fixed test suite exit code in presence of test failures
- Changed the migrate operation to run failover instead for instances
  which were marked as not running in the input data (their state could
  have changed since then, but this is better than the current behaviour
  of always migrating)
- Added support for 'cheap' moves only (only migrate/failover) in
  balancing
- Added support for building without curl (thus no RAPI backend)

    
Version 0.1.6 (Wed, 19 Aug 2009)
--------------------------------

- Added support for Luxi (the native Ganeti protocol)
- Added support for simulated clusters (for hspace only)
- Added timeouts for the RAPI backend
- Fixed a few inconsistencies in the command line handling
- Fixed handling of errors while loading data
- The 'network' library is a new dependency due to the Luxi addition

    
Version 0.1.5 (Thu, 09 Jul 2009)
--------------------------------

- Removed obsolete hn1 program; this allowed removal of a lot of
  supporting code
- Lots of changes in hspace: the output now is a shell fragment so that
  scripts can source it or parse it more easily; added failure reasons;
  optimised to use less memory for large clusters
- Optimized the scoring algorithm (used by all tools) so that now
  computations should be faster
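
Since the hspace output is described above as a shell fragment, a script
that does not want to ``source`` it can parse it instead. The following
is a minimal sketch assuming the output consists of ``KEY=VALUE`` lines
with optional shell quoting; the key names in the sample are made up and
do not correspond to real hspace output.

```python
import shlex


def parse_shell_fragment(text):
    """Parse KEY=VALUE lines of a shell fragment into a dict of strings."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # shlex.split removes shell quoting, e.g. KEY='a b' -> KEY=a b
        for token in shlex.split(line):
            key, sep, value = token.partition("=")
            if sep:
                values[key] = value
    return values


# Hypothetical sample; real hspace keys and values will differ.
sample = """\
HTS_EXAMPLE_NODES=4
HTS_EXAMPLE_REASON='out of memory'
"""
print(parse_shell_fragment(sample))
```

All values are returned as strings; converting numeric fields is left to
the caller, which knows which keys it expects.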

    
Version 0.1.4 (Tue, 16 Jun 2009)
--------------------------------

- Added CPU count/ratio of virtual-to-physical CPUs to the cluster
  scoring methods; this means that now the balancer, the iallocator
  plugin and so on will try to keep the VCPU-to-PCPU ratio equal across
  the cluster
- Fixed some hscan bugs
- Fixed the way iallocator reads the total disk size (was broken and it
  was always falling back to summing the disk sizes)
- Internals: fixed most compile-time warnings

    
Version 0.1.3 (Fri, 05 Jun 2009)
--------------------------------

- Fix a bug in the ReplacePrimary instance moves, affecting most of the
  tools

    
Version 0.1.2 (Tue, 02 Jun 2009)
--------------------------------

- Add a new program, "hspace", which computes the free space on a
  cluster (based on a given instance spec)
- Improvements in API docs and partially in the user docs
- Started adding unittests

    
Version 0.1.1 (Tue, 26 May 2009)
--------------------------------

- Add a new program, "hail", which is an iallocator plugin and can
  allocate/relocate instances
- Experimental support for non-mirrored instances (hail supports them,
  hbal should no longer abort when it finds such instances and simply
  ignore them)
- The RAPI port and/or scheme can be overridden now, and even "file://"
  schemes can be used if the message body has been saved under the
  appropriate name
- Lots of code reorganization, esp. rewritten loading pipeline
- Better data checking and better error messages in case validation
  fails; tools now consider nodes with errors in input data ('?'
  returned by Ganeti) as offline
- Small enhancement to the makefile for simpler packaging

    
Version 0.1.0 (Tue, 19 May 2009)
--------------------------------

- Drop compatibility with Ganeti 1.2
- Add a new minimum score option (with a very low default), should help
  with very good clusters (but is still not optimal)
- Add a --quiet option to hbal
- Add support for reading offline nodes directly from the cluster

    
Version 0.0.8 (Tue, 21 Apr 2009)
--------------------------------

- hbal: prevent mismatches caused by wrong node names being passed to
  -O, by aborting in this case
- add the ability to write the commands (-C) to a script via (-C<file>),
  so that it can be later executed directly; this has also changed the
  commands to include the necessary -f flags to skip confirmations
- add checks for extra arguments in hbal and hn1, so that unintended
  errors are caught
- raise the accepted "missing" memory limit to 512MB, to cover usual Xen
  reservations

    
Version 0.0.7 (Mon, 23 Mar 2009)
--------------------------------

- added support for offline nodes, which are not used as targets for
  instance relocation and if they hold instances the hbal algorithm will
  attempt to relocate these away
- added support for offline instances, which now will no longer skew the
  free memory estimation of nodes; the algorithm will no longer create
  conditions for N+1 failures when such instances are later started
- implemented a complete model of node resources, in order to prevent an
  unintended re-occurrence of cases like the offline instance one, where
  we miscalculate some node resource; this now gives a warning in case
  the node-reported free disk or free memory deviates by more than a set
  amount from the expected value
- a new tool *hscan* that can generate the input text-file for the other
  tools by collecting data via RAPI
- some small changes to the build system to make it more friendly; also
  included the generated documentation in the source archive

    
Version 0.0.6 (Mon, 16 Mar 2009)
--------------------------------

- re-factored the hbal algorithm to make it stable in the sense that it
  gives the same solution when restarted from the middle; barring
  rounding of disk/memory and incomplete reporting from Ganeti (for
  1.2), it should be now feasible to rely on its output without
  generating moves ad infinitum
- the hbal algorithm now uses two more variables: the node N+1 failures
  and the amount of reserved memory; the former tries to 'fix' the N+1
  status, the latter tries to distribute secondaries more equally
- the hbal algorithm now uses two more moves at each step:
  replace+failover and failover+replace (besides the original failover,
  replace, and failover+replace+failover)
- slightly changed the build system to embed GIT version/tags into the
  binaries so that we know from which tree a binary was built, either
  via '--version' or via "strings hbal|grep version"
- changed the solution list and in general the hbal output to be more
  clear by default, and changed "gnt-instance failover" to "gnt-instance
  migrate"
- added man pages for the two binaries

    
Version 0.0.5 (Mon, 09 Mar 2009)
--------------------------------

- a few small improvements for hbal (possibly undone by later changes),
  hbal is now quite a bit faster
- fix documentation building
- allow hbal to work on non N+1 compliant clusters, but without
  guarantees that the end cluster will be compliant; in any case, this
  should give a smaller number of nodes that are not compliant if the
  cluster state permits it
- strip common domain suffix from nodes and instances, so that output is
  shorter and hopefully clearer

    
Version 0.0.4 (Sun, 15 Feb 2009)
--------------------------------

- better balancing algorithm in hbal
- implemented an RAPI collector; now the cluster data can be gathered
  automatically via RAPI and doesn't need manual export of the node and
  instance lists

    
Version 0.0.3 (Wed, 28 Jan 2009)
--------------------------------

- initial release of hbal, a cluster rebalancing tool
- input data format changed due to hbal requirements

    
Version 0.0.2 (Tue, 06 Jan 2009)
--------------------------------

- fix handling of some common cases (cluster N+1 compliant from the
  start, too big depth given, failure to compute solution)
- add option to print the needed command list for reaching the proposed
  solution

    
Version 0.0.1 (Tue, 06 Jan 2009)
--------------------------------

- initial release of hn1 tool

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: