Revision d2ac5526 hbal.1

b/hbal.1
1
.TH HBAL 1 2009-03-14 htools "Ganeti H-tools"
1
.TH HBAL 1 2009-03-22 htools "Ganeti H-tools"
2 2
.SH NAME
3 3
hbal \- Cluster balancer for Ganeti
4 4

  
......
7 7
.B "[-C]"
8 8
.B "[-p]"
9 9
.B "[-o]"
10
.B "-l"
11
.BI "[ -m " cluster "]"
10
.BI "[-l " limit "]"
11
.BI "[-O " name... "]"
12
.BI "[-m " cluster "]"
12 13
.BI "[-n " nodes-file " ]"
13
.BI "[ -i " instances-file "]"
14
.BI "[-i " instances-file "]"
14 15

  
15 16
.B hbal
16 17
.B --version
......
61 62
  - coefficient of variance of the percent of reserved memory
62 63
  - coefficient of variance of the percent of free disk
63 64
  - percentage of nodes failing N+1 check
65
  - percentage of instances living (either as primary or secondary) on
66
    offline nodes
64 67

  
65 68
The free memory and free disk values help ensure that all nodes are
66 69
somewhat balanced in their resource usage. The reserved memory helps
......
69 72
N+1. And finally, the N+1 percentage helps guide the algorithm towards
70 73
eliminating N+1 failures, if possible.
71 74

  
72
Except for the N+1 failures, we use the coefficient of variance since
73
this brings the values into the same unit so to speak, and with a
74
restricted domain of values (between zero and one). The percentage of
75
N+1 failures, while also in this numeric range, doesn't actually have
76
the same meaning, but it has been shown to work well.
75
Except for the N+1 failures and offline instances percentage, we use
76
the coefficient of variance since this brings the values into the same
77
unit so to speak, and with a restricted domain of values (between zero
78
and one). The percentage of N+1 failures, while also in this numeric
79
range, doesn't actually have the same meaning, but it has been shown to
80
work well.
77 81

  
78 82
The other alternative, using for N+1 checks the coefficient of
79 83
variance of (N+1 fail=1, N+1 pass=0) across nodes could hint the
......
82 86
rules of the algorithm, so the N+1 checks would simply not work
83 87
anymore in this case.
84 88

  
89
The offline instances percentage (meaning the percentage of instances
90
living on offline nodes) will cause the algorithm to actively move
91
instances away from offline nodes. This, coupled with the restriction
92
on placement given by offline nodes, will cause evacuation of such
93
nodes.
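
The score components described above can be illustrated with a small
sketch. It is not taken from the hbal sources: the function names and
the dictionary keys (t_mem, f_mem, r_mem, t_dsk, f_dsk, n1_fail) are
hypothetical, values are kept as fractions between zero and one rather
than percentages, and "coefficient of variance" is assumed to mean the
population standard deviation divided by the mean.

```python
from statistics import mean, pstdev


def coeff_of_variance(values):
    """Population standard deviation divided by the mean.

    Returns 0.0 for an empty list or a zero mean (an assumption made
    here to avoid dividing by zero).
    """
    if not values:
        return 0.0
    avg = mean(values)
    return pstdev(values) / avg if avg else 0.0


def score_components(nodes):
    """Compute the per-cluster components from a list of online nodes.

    Each node is a dict with the hypothetical keys
    t_mem, f_mem, r_mem, t_dsk, f_dsk and n1_fail.
    """
    if not nodes:
        return {}
    return {
        "free_mem": coeff_of_variance([n["f_mem"] / n["t_mem"] for n in nodes]),
        "reserved_mem": coeff_of_variance([n["r_mem"] / n["t_mem"] for n in nodes]),
        "free_disk": coeff_of_variance([n["f_dsk"] / n["t_dsk"] for n in nodes]),
        # Plain fraction of failing nodes, not a coefficient of variance.
        "n1_failures": sum(1 for n in nodes if n["n1_fail"]) / len(nodes),
    }
```

With identical nodes every component comes out as zero, which matches
the perfectly balanced case; how the components are combined into a
single score is not shown here.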
94

  
85 95
On a perfectly balanced cluster (all nodes the same size, all
86 96
instances the same size and spread across the nodes equally), all
87 97
values would be zero. This doesn't happen too often in practice :)
......
106 116
the user to understand the node's most important parameters.
107 117

  
108 118
The node list will contain this information:
109
  - a character denoting the status of the node, with '-' meaning an
110
    offline node, '*' meaning N+1 failure and blank meaning a good
111
    node
112
  - the node name
113
  - the total node memory
114
  - the memory used by the node itself
115
  - the free node memory
116
  - the reserved node memory, which is the amount of free memory
117
    needed for N+1 compliance
118
  - total disk
119
  - free disk
120
  - number of primary instances
121
  - number of secondary instances
122
  - percent of free memory
123
  - percent of free disk
119
.RS
120
.TP
121
.B F
122
a character denoting the status of the node, with '-' meaning an
123
offline node, '*' meaning N+1 failure and blank meaning a good node
124
.TP
125
.B Name
126
the node name
127
.TP
128
.B t_mem
129
the total node memory
130
.TP
131
.B n_mem
132
the memory used by the node itself
133
.TP
134
.B i_mem
135
the memory used by instances
136
.TP
137
.B x_mem
138
amount of memory which seems to be in use but whose user cannot be
139
determined; usually this means that the hypervisor has some
140
overhead or that there are other reporting errors
141
.TP
142
.B f_mem
143
the free node memory
144
.TP
145
.B r_mem
146
the reserved node memory, which is the amount of free memory needed
147
for N+1 compliance
148
.TP
149
.B t_dsk
150
total disk
151
.TP
152
.B f_dsk
153
free disk
154
.TP
155
.B pri
156
number of primary instances
157
.TP
158
.B sec
159
number of secondary instances
160
.TP
161
.B p_fmem
162
percent of free memory
163
.TP
164
.B p_fdsk
165
percent of free disk
166
.RE
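
As a rough illustration of how the derived columns relate to the raw
ones, the following sketch computes the status character, the
percentages and the unaccounted memory for one node. The function names
are hypothetical. Two things are assumptions rather than documented
behaviour: that x_mem is the total memory minus the node, instance and
free memory, and that the offline flag takes precedence over an N+1
failure.

```python
def status_flag(offline, n1_fail):
    """Return the F column: '-' offline, '*' N+1 failure, ' ' good node.

    Assumption: an offline node is shown as '-' even if it also fails N+1.
    """
    if offline:
        return "-"
    if n1_fail:
        return "*"
    return " "


def percent(free, total):
    """Percentage of free resource (p_fmem or p_fdsk); 0.0 if total is 0."""
    return 100.0 * free / total if total else 0.0


def unaccounted_memory(t_mem, n_mem, i_mem, f_mem):
    """Assumed definition of x_mem: memory not attributed to anything."""
    return t_mem - n_mem - i_mem - f_mem
```

For example, a node with 4096 total, 512 node, 2048 instance and 1024
free memory would have 512 unaccounted under this assumed definition.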
124 167

  
125 168
.TP
126 169
.B -o, --oneline
......
135 178
  - improvement in the cluster score
136 179

  
137 180
.TP
181
.BI "-O " name
182
This option (which can be given multiple times) will mark nodes as
183
being \fIoffline\fR. This means a couple of things:
184
.RS
185
.TP
186
-
187
instances won't be placed on these nodes, not even temporarily;
188
e.g. the \fIreplace primary\fR move is not available if the secondary
189
node is offline, since this move requires a failover.
190
.TP
191
-
192
these nodes will not be included in the score calculation (except for
193
the percentage of instances on offline nodes)
194
.RE
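
The percentage of instances on offline nodes mentioned above can be
sketched as follows. This is only an illustration under stated
assumptions: instances are represented as hypothetical (primary,
secondary) node-name pairs, the offline set stands for the names given
with \-O, and the result is a fraction between zero and one.

```python
def offline_instance_fraction(instances, offline_nodes):
    """Fraction of instances with primary or secondary on an offline node.

    instances: iterable of (primary, secondary) node-name pairs
    offline_nodes: iterable of node names marked offline
    """
    instances = list(instances)
    if not instances:
        return 0.0
    offline = set(offline_nodes)
    affected = sum(
        1 for primary, secondary in instances
        if primary in offline or secondary in offline
    )
    return affected / len(instances)
```

Moving an instance so that neither of its nodes is offline lowers this
value, which is consistent with the algorithm evacuating such nodes.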
195

  
196
.TP
138 197
.BI "-n" nodefile ", --nodes=" nodefile
139 198
The name of the file holding node information (if not collecting via
140 199
RAPI), instead of the default
......
188 247

  
189 248
.SH EXAMPLE
190 249

  
250
Note that these examples are not for the latest version (they don't have
251
full node data).
252

  
191 253
.SS Default output
192 254

  
193 255
With the default options, the program shows each individual step and
......
362 424
list (but hopefully will end in the same state).
363 425

  
364 426
.SH SEE ALSO
365
hn1(1), ganeti(7), gnt-instance(8), gnt-node(8)
427
.BR hn1 "(1), " hscan "(1), " ganeti "(7), " gnt-instance "(8), "
428
.BR gnt-node "(8)"
