Revision 53c24840 hbal.1
--- a/hbal.1
+++ b/hbal.1

@@ -120,27 +120,32 @@
 .RS 4
 .TP 3
 \(em
-coefficient of variance of the percent of free memory
+standard deviation of the percent of free memory
 .TP
 \(em
-coefficient of variance of the percent of reserved memory
+standard deviation of the percent of reserved memory
 .TP
 \(em
-coefficient of variance of the percent of free disk
+standard deviation of the percent of free disk
 .TP
 \(em
-percentage of nodes failing N+1 check
+count of nodes failing N+1 check
 .TP
 \(em
-percentage of instances living (either as primary or secondary) on
+count of instances living (either as primary or secondary) on
 offline nodes
 .TP
 \(em
-coefficient of variance of the ratio of virtual-to-physical cpus (for
-primary instances of the node)
+count of instances living (as primary) on offline nodes; this differs
+from the above metric by helping failover of such instances in 2-node
+clusters
 .TP
 \(em
-coefficients of variance of the dynamic load on the nodes, for cpus,
+standard deviation of the ratio of virtual-to-physical cpus (for
+primary instances of the node)
+.TP
+\(em
+standard deviation of the dynamic load on the nodes, for cpus,
 memory, disk and network
 .RE
 
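As a rough illustration of the new metric list (a sketch with made-up per-node numbers, not hbal's actual implementation), each percent-based component of the cluster score is the standard deviation of that metric across all nodes:

```python
from statistics import pstdev

# Hypothetical per-node data: percent of free memory, reserved memory
# and free disk, each expressed as a value between zero and one.
nodes = [
    {"free_mem": 0.40, "reserved_mem": 0.10, "free_disk": 0.55},
    {"free_mem": 0.42, "reserved_mem": 0.10, "free_disk": 0.50},
    {"free_mem": 0.10, "reserved_mem": 0.25, "free_disk": 0.20},
]

# One score component per metric: the (population) standard deviation
# of that metric across all nodes; a perfectly balanced cluster
# scores zero on every component.
components = {
    metric: pstdev(node[metric] for node in nodes)
    for metric in ("free_mem", "reserved_mem", "free_disk")
}

score = sum(components.values())
```

The third node's low free memory and disk drives all three components, and thus the total score, away from zero.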
@@ -151,25 +156,18 @@
 N+1. And finally, the N+1 percentage helps guide the algorithm towards
 eliminating N+1 failures, if possible.
 
-Except for the N+1 failures and offline instances percentage, we use
-the coefficient of variance since this brings the values into the same
-unit so to speak, and with a restricted domain of values (between zero
-and one). The percentage of N+1 failures, while also in this numeric
-range, doesn't actually have the same meaning, but it has shown to work
-well.
-
-The other alternative, using for N+1 checks the coefficient of
-variance of (N+1 fail=1, N+1 pass=0) across nodes, could hint the
-algorithm to make more N+1 failures if most nodes are N+1 fail
-already. Since this (making N+1 failures) is not allowed by other
-rules of the algorithm, the N+1 checks would simply not work
-anymore in this case.
-
-The offline instances percentage (meaning the percentage of instances
-living on offline nodes) will cause the algorithm to actively move
-instances away from offline nodes. This, coupled with the restriction
-on placement given by offline nodes, will cause evacuation of such
-nodes.
+Except for the N+1 failures and offline instances counts, we use the
+standard deviation since when used with values within a fixed range
+(we use percents expressed as values between zero and one) it gives
+consistent results across all metrics (there are some small issues
+related to different means, but it works generally well). The 'count'
+type values will have a higher score and thus will matter more for
+balancing; thus these are better for hard constraints (like evacuating
+nodes and fixing N+1 failures). For example, the offline instances
+count (i.e. the number of instances living on offline nodes) will
+cause the algorithm to actively move instances away from offline
+nodes. This, coupled with the restriction on placement given by
+offline nodes, will cause evacuation of such nodes.
 
 The dynamic load values need to be read from an external file (Ganeti
 doesn't supply them), and are computed for each node as: sum of
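The "counts score higher" argument in the new text can be made concrete with a toy comparison (assumed numbers, not hbal's actual scoring): a percent metric confined to [0, 1] has a standard deviation that can never exceed 0.5, while a count contributes at least 1.0 per offending instance, so it dominates the total score until driven to zero:

```python
from statistics import pstdev

# Percent-based metric: values are confined to [0, 1], so the standard
# deviation across nodes is bounded; the worst possible spread is 0.5
# (half the nodes at 0.0, half at 1.0).
worst_case_spread = pstdev([0.0, 1.0])

# Count-based metric: a few instances on an offline node already
# contribute more than any percent-based component ever can.
offline_instance_count = 3

# The count term dwarfs the spread term, so the algorithm addresses
# the hard constraint (evacuating offline nodes) before fine-tuning
# the percent-based balance.
dominates = offline_instance_count > worst_case_spread
```

This is why the revision switches N+1 failures and offline instances to plain counts instead of normalised percentages.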
@@ -182,10 +180,11 @@
 values, and feed that via the \fIU\fR option for all instances (and
 keep the other metrics as one). For the algorithm to work, all that is
 needed is that the values are consistent for a metric across all
-instances (e.g. all instances use cpu% to report cpu usage, but they
-could represent network bandwidth in Gbps). Note that it's recommended
-to not have zero as the load value for any instance metric since then
-secondary instances are not well balanced.
+instances (e.g. all instances use cpu% to report cpu usage, and not
+something related to the number of CPU seconds used if the CPUs are
+different), and that they are normalised to between zero and one. Note
+that it's recommended to not have zero as the load value for any
+instance metric since then secondary instances are not well balanced.
 
 On a perfectly balanced cluster (all nodes the same size, all
 instances the same size and spread across the nodes equally), the
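The normalisation requirement added here can be sketched as follows (hypothetical raw numbers and instance names; the actual input format is described elsewhere in the man page): scale each raw reading against a common full-scale value and floor it slightly above zero, since zero loads leave secondary instances unbalanced:

```python
# Hypothetical raw CPU load samples for three instances, all reported
# as cpu% by the same monitoring tool (so they are mutually consistent).
raw_cpu = {"inst1": 35.0, "inst2": 70.0, "inst3": 14.0, "inst4": 0.0}

# Normalise against a common full-scale value (here 100%), keeping the
# results within (0, 1].  A small floor avoids the zero load values
# that the text warns about.
FLOOR = 0.01
normalised = {
    name: max(value / 100.0, FLOOR) for name, value in raw_cpu.items()
}
```

The same scaling must be applied identically to every instance, otherwise the per-metric consistency the text requires is lost.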
@@ -202,8 +201,8 @@
 
 .SS EXCLUSION TAGS
 
-The exclusion tags mecanism is designed to prevent instances which run
-the same workload (e.g. two DNS servers) to land on the same node,
+The exclusion tags mechanism is designed to prevent instances which
+run the same workload (e.g. two DNS servers) from landing on the same node,
 which would make the respective node a SPOF for the given service.
 
 It works by tagging instances with certain tags and then building
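The idea behind the exclusion check can be sketched like this (a hypothetical standalone check with invented instance and node names, not hbal's code): group instances by exclusion tag and flag any node hosting two instances that share one:

```python
from collections import defaultdict

# Hypothetical placement: instance -> (exclusion tag, primary node).
placement = {
    "dns-a": ("dns", "node1"),
    "dns-b": ("dns", "node1"),  # same tag on the same node: a SPOF
    "web-a": ("web", "node1"),
    "web-b": ("web", "node2"),
}

# Count instances per (tag, node) pair; any pair seen more than once
# means that node is a single point of failure for that workload.
seen = defaultdict(int)
for tag, node in placement.values():
    seen[(tag, node)] += 1

violations = [pair for pair, count in seen.items() if count > 1]
```

Here the two DNS servers on node1 are flagged, which is exactly the co-location the text says the mechanism prevents.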