====================================
Synchronising htools to Ganeti 2.3
====================================

Ganeti 2.3 introduces a number of new features that change the cluster
internals significantly enough that the htools suite needs to be
updated accordingly in order to function correctly.

Shared storage support
======================

Currently, the htools algorithms presume a model where all of an
instance's resources are served from within the cluster, more
specifically from the nodes comprising the cluster. While this is
usual for memory and CPU, deployments which use shared storage will
invalidate this assumption for storage.

To account for this, we need to move some assumptions from being
implicit (and hardcoded) to being explicitly exported from Ganeti.


New instance parameters
-----------------------

It is presumed that Ganeti will export for all instances a new
``storage_type`` parameter, that will denote either internal storage
(e.g. *plain* or *drbd*), or external storage.

Furthermore, a new ``storage_pool`` parameter will classify, for both
internal and external storage, the pool out of which the storage is
allocated. For internal storage, this will be either ``lvm`` (the pool
that provides space to both ``plain`` and ``drbd`` instances) or
``file`` (for file-storage-based instances). For external storage,
this will be the respective NAS/SAN/cloud storage that backs the
instance. Note that for htools, external storage pools are opaque; we
only care that they have an identifier, so that we can distinguish
between two different pools.

If these two parameters are not present, the instances will be
presumed to be ``internal/lvm``.
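
As an illustration, a minimal Haskell sketch (htools being written in
Haskell) of how these two parameters could be modelled; the type and
field names are hypothetical, not actual htools definitions::

  import Data.Maybe (fromMaybe)

  -- Hypothetical model of the two new instance parameters.
  data StorageType = Internal | External
    deriving (Show, Eq)

  data Storage = Storage
    { sType :: StorageType
    , sPool :: String      -- "lvm", "file", or an opaque external pool id
    } deriving (Show, Eq)

  -- Instances missing the two parameters default to internal/lvm.
  parseStorage :: Maybe StorageType -> Maybe String -> Storage
  parseStorage mtype mpool =
    Storage (fromMaybe Internal mtype) (fromMaybe "lvm" mpool)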

New node parameters
-------------------

For each node, it is expected that Ganeti will export what storage
types it supports and pools it has access to. So a classic 2.2 cluster
will have all nodes supporting ``internal/lvm`` and/or
``internal/file``, whereas a new shared-storage-only 2.3 cluster could
have ``external/my-nas`` storage.

Whatever the mechanism that Ganeti will use internally to configure
the associations between nodes and storage pools, we consider that
we'll have two node attributes available inside htools: the lists of
internal and external storage pools.
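
Continuing the sketch above, the two attributes and the resulting
access check could look as follows (again, all names are illustrative
only)::

  -- Hypothetical node record carrying the two proposed attributes.
  data Node = Node
    { nName          :: String
    , nInternalPools :: [String]   -- e.g. ["lvm", "file"]
    , nExternalPools :: [String]   -- e.g. ["my-nas"]
    } deriving (Show)

  -- Can this node serve storage from the given type/pool pair?
  hasAccess :: Node -> Storage -> Bool
  hasAccess node (Storage Internal pool) = pool `elem` nInternalPools node
  hasAccess node (Storage External pool) = pool `elem` nExternalPools node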

External storage and instances
------------------------------

Currently, for an instance we allow one cheap move type: failover to
the current secondary, if it is a healthy node, and four other
“expensive” (as in, including data copies) moves that involve changing
either the secondary or the primary node or both.

In the presence of an external storage type, the following things will
change:

- the disk-based moves will be disallowed; this is already a feature
  in the algorithm, controlled by a boolean switch, so adapting to
  external storage here will be trivial
- instead of the current single secondary node, the secondaries will
  become a list of potential secondaries, based on access to the
  instance's storage pool

Except for this, the basic move algorithm remains unchanged.
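
Reusing the sketches above, the list of potential secondaries is then
simply the set of nodes, other than the primary, with access to the
instance's storage (the function name is hypothetical)::

  -- All nodes, except the primary, that can host the instance's disks.
  potentialSecondaries :: Storage -> Node -> [Node] -> [Node]
  potentialSecondaries stor primary nodes =
    [ n | n <- nodes, nName n /= nName primary, hasAccess n stor ]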

External storage and nodes
--------------------------

Two separate areas will have to change for nodes and external storage.

First, when allocating instances (either as part of a move or a new
allocation), if the instance is using external storage, then the
internal disk metrics should be ignored (for both the primary and
secondary cases).

Second, the per-node metrics used in the cluster scoring must take
into account that nodes might not have internal storage at all, and
handle this as a well-balanced case (score 0).
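
One possible shape for such a metric, with nodes lacking internal
storage contributing a perfectly balanced value (a sketch only, not
the actual htools scoring code)::

  -- Fraction of internal disk used on a node; nodes without internal
  -- storage are treated as perfectly balanced (score 0).
  diskUsageScore :: Node -> Double -> Double -> Double
  diskUsageScore node usedDisk totalDisk
    | null (nInternalPools node) = 0
    | totalDisk <= 0             = 0
    | otherwise                  = usedDisk / totalDisk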

N+1 status
----------

Currently, computing the N+1 status of a node is simple:

- group the current secondary instances by their primary node, and
  compute the sum of each instance group's memory
- choose the maximum sum, and check if it's smaller than the current
  available memory on this node

In effect, computing the N+1 status is a per-node matter. However,
with shared storage, we don't have secondary nodes, just potential
secondaries. Thus computing the N+1 status will be a cluster-level
matter, and much more expensive.

A simple version of the N+1 checks would be that for each instance
having said node as primary, we have enough memory in the cluster for
relocation. This means we would actually need to run allocation
checks, and update the cluster status from within allocation on one
node, while being careful that we don't recursively check N+1 status
during this relocation, which is too expensive.

However, the shared storage model has some properties that change the
rules of the computation. Speaking broadly (and ignoring hard
restrictions like tag based exclusion and CPU limits), the exact
location of an instance in the cluster doesn't matter as long as
memory is available. This results in two changes:

- simply tracking the amount of free memory is enough, cluster-wide
- moving an instance from one node to another would not change the N+1
  status of any node, and only allocation needs to deal with N+1
  checks

Unfortunately, this very cheap solution fails in the case of any other
exclusion or prevention factors.
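
A sketch of this very cheap check, under the simplifying assumption
above that memory is the only constraint (all names are hypothetical)::

  -- A node passes the simplified N+1 check if the memory of the
  -- instances it hosts as primary fits into the free memory of the
  -- rest of the cluster (the node's own free memory cannot be counted).
  simpleN1Pass :: Int    -- free memory on the node under test
               -> Int    -- free memory cluster-wide (including this node)
               -> [Int]  -- memory of the instances primaried on this node
               -> Bool
  simpleN1Pass nodeFree clusterFree primariesMem =
    sum primariesMem <= clusterFree - nodeFree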

TODO: find a solution for N+1 checks.


Node groups support
===================

The addition of node groups has a small impact on the actual
algorithms, which will simply operate at node group level instead of
cluster level, but it requires the addition of new algorithms for
inter-node group operations.

The following three definitions will be used in the following
paragraphs:

local group
  The local group refers to a node's own node group, or when speaking
  about an instance, the node group of its primary node

regular cluster
  A cluster composed of a single node group, or a pre-2.3 cluster

super cluster
  This term refers to a cluster which comprises multiple node groups,
  as opposed to a 2.2 and earlier cluster with a single node group

In all the below operations, it's assumed that Ganeti can gather the
entire super cluster state cheaply.


Balancing changes
-----------------

Balancing will move from cluster-level balancing to group
balancing. In order to achieve a reasonable improvement in a super
cluster, without needing to keep state of what groups have been
already balanced previously, the balancing algorithm will run as
follows:

#. the cluster data is gathered
#. if this is a regular cluster, as opposed to a super cluster,
   balancing will proceed normally as before
#. otherwise, compute the cluster scores for all groups
#. choose the group with the worst score and see if we can improve it;
   if not, choose the next-worst group, and so on
#. once a group has been identified, run the balancing for it

Of course, explicit selection of a group will be allowed.
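
Steps 3-5 could be sketched as follows, with ``tryBalance`` standing
in for the existing single-group balancing (a hypothetical helper, not
an actual htools function); here a higher score is assumed to mean a
worse group::

  import Data.List  (sortBy)
  import Data.Maybe (isJust, listToMaybe)
  import Data.Ord   (Down (..), comparing)

  -- Pick the worst-scoring group that the balancer can still improve.
  selectGroup :: (g -> Double)        -- current score of a group
              -> (g -> Maybe Double)  -- tryBalance: improved score, if any
              -> [g]
              -> Maybe g
  selectGroup score tryBalance groups =
    listToMaybe [ g | g <- sortBy (comparing (Down . score)) groups
                    , isJust (tryBalance g) ]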

Super cluster operations
++++++++++++++++++++++++

Besides the regular group balancing, in a super cluster we have more
operations.


Redistribution
^^^^^^^^^^^^^^

In a regular cluster, once we run out of resources (offline nodes
which can't be fully evacuated, N+1 failures, etc.) there is nothing
we can do unless nodes are added or instances are removed.

In a super cluster however, there might be resources available in
another group, so there is the possibility of relocating instances
between groups to re-establish N+1 success within each group.

One difficulty in the presence of both super clusters and shared
storage is that the move paths of instances are quite complicated;
basically an instance can move inside its local group, and to any
other group which has access to the same storage type and storage
pool pair. In effect, the super cluster is composed of multiple
‘partitions’, each containing one or more groups, but a node is
simultaneously present in multiple partitions, one for each storage
type and storage pool it supports. As such, the interactions between
the individual partitions are too complex for non-trivial clusters to
assume we can compute a perfect solution: we might need to move some
instances using shared storage pool ‘A’ in order to clear some more
memory to accept an instance using local storage, which will further
clear more VCPUs in a third partition, etc. As such, we'll limit
ourselves to simple relocation steps within a single partition.

Algorithm:

#. read super cluster data, and exit if the cluster doesn't allow
   inter-group moves
#. filter out any groups that are “alone” in their partition
   (i.e. no other group sharing at least one storage method)
#. determine the list of healthy versus unhealthy groups:

   #. a group which contains offline nodes still hosting instances is
      definitely not healthy
   #. a group which has nodes failing N+1 is ‘weakly’ unhealthy

#. if either list is empty, exit (no work to do, or no way to fix problems)
#. for each unhealthy group:

   #. compute the instances that are causing the problems: all
      instances living on offline nodes, all instances living as
      secondaries on N+1 failing nodes, all instances living as
      primaries on N+1 failing nodes (in this order)
   #. remove instances, one by one, until the source group is healthy
      again
   #. try to run a standard allocation procedure for each instance on
      all potential groups in its partition
   #. if all instances were relocated successfully, it means we have a
      solution for repairing the original group
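
As an illustration of step 3, the health classification could be
sketched like this (the record fields are hypothetical per-group
summaries, not actual htools structures)::

  data GroupHealth = Healthy | WeaklyUnhealthy | Unhealthy
    deriving (Show, Eq)

  -- Hypothetical per-group summary used only for classification.
  data GroupState = GroupState
    { gOfflineHostingInsts :: Bool  -- offline nodes still hosting instances
    , gN1FailingNodes      :: Int   -- number of nodes failing N+1
    }

  classifyGroup :: GroupState -> GroupHealth
  classifyGroup g
    | gOfflineHostingInsts g = Unhealthy
    | gN1FailingNodes g > 0  = WeaklyUnhealthy
    | otherwise              = Healthy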

Compression
^^^^^^^^^^^

In a super cluster which has had many instance reclamations, it is
possible that while none of the groups is empty, overall there is
enough empty capacity that an entire group could be removed.

The algorithm for “compressing” the super cluster is as follows:

#. read super cluster data
#. compute total *(memory, disk, cpu)*, and free *(memory, disk, cpu)*
   for the super cluster
#. compute per-group used and free *(memory, disk, cpu)*
#. select candidate groups for evacuation:

   #. they must be connected to other groups via a common storage type
      and pool
   #. they must have fewer used resources than the global free
      resources (minus their own free resources)

#. for each of these groups, try to relocate all its instances to
   connected peer groups
#. report the list of groups that could be evacuated, or if instructed
   so, perform the evacuation of the group with the largest free
   resources (i.e. in order to reclaim the most capacity)
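
The candidate test in step 4.2 amounts to a component-wise comparison
of resource triples, for example (the ``Res`` type and the function
names are illustrative only)::

  -- (memory, disk, cpu) triples, as used in steps 2 and 3 above.
  type Res = (Int, Int, Int)

  minus :: Res -> Res -> Res
  minus (m, d, c) (m', d', c') = (m - m', d - d', c - c')

  fitsIn :: Res -> Res -> Bool
  fitsIn (m, d, c) (m', d', c') = m <= m' && d <= d' && c <= c'

  -- A group is a candidate only if its used resources fit into the
  -- free resources of the rest of the super cluster.
  evacCandidate :: Res   -- group used resources
                -> Res   -- group free resources
                -> Res   -- super-cluster free resources
                -> Bool
  evacCandidate used ownFree globalFree =
    used `fitsIn` (globalFree `minus` ownFree)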

Load balancing
^^^^^^^^^^^^^^

Assuming a super cluster using shared storage, where instance failover
is cheap, it should be possible to do a load-based balancing across
groups.

As opposed to the normal balancing, where we want to balance on all
node attributes, here we should look only at the load attributes; in
other words, compare the available (total) node capacity with the
(total) load generated by instances in a given group, compute such
scores for all groups, and check whether any group is an outlier.

Once a reliable load-weighting method for groups exists, we can apply
a modified version of the cluster scoring method to score not
imbalances across nodes, but imbalances across groups, resulting in a
super cluster load-related score.
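
For instance, given a per-group ratio of instance-generated load to
total node capacity, the imbalance across groups could be measured
with the same standard-deviation approach used for node metrics (a
sketch only; the actual load-weighting method remains to be defined)::

  -- Standard deviation of the per-group load ratios; 0 means the
  -- groups are perfectly balanced against each other.
  groupImbalance :: [Double] -> Double
  groupImbalance []     = 0
  groupImbalance ratios = sqrt (sum sqDiffs / n)
    where
      n       = fromIntegral (length ratios)
      mean    = sum ratios / n
      sqDiffs = [ (r - mean) * (r - mean) | r <- ratios ]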

Allocation changes
------------------

It is important to keep the allocation method across groups internal
(in the Ganeti/Iallocator combination), instead of delegating it to an
external party (e.g. a RAPI client). For this, the IAllocator protocol
should be extended to provide proper group support.

For htools, the new algorithm will work as follows:

#. read/receive cluster data from Ganeti
#. filter out any groups that do not support the requested storage
   method
#. for remaining groups, try allocation and compute scores after
   allocation
#. sort valid allocation solutions accordingly and return the entire
   list to Ganeti
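
A compact sketch of these four steps, with ``tryAlloc`` standing in
for the existing per-group allocation (the signatures here are
hypothetical)::

  import Data.List (sortBy)
  import Data.Ord  (comparing)

  -- Try allocation in every group that supports the requested storage
  -- and return all valid solutions, best (lowest) score first.
  allocateAcrossGroups :: (g -> Bool)                -- supports the storage?
                       -> (g -> Maybe (Double, sol)) -- tryAlloc result
                       -> [g]
                       -> [(Double, sol)]
  allocateAcrossGroups supports tryAlloc groups =
    sortBy (comparing fst)
      [ result | g <- groups, supports g, Just result <- [tryAlloc g] ]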

The rationale for returning the entire group list, and not only the
best choice, is that we have the list anyway, and Ganeti might have
other criteria (e.g. the best group might be busy/locked down, etc.),
so even if from the point of view of resources it is the best choice,
it might not be the overall best one.

Node evacuation changes
-----------------------

While the basic concept in the ``multi-evac`` iallocator mode remains
unchanged (it's a simple local group issue), when failing to evacuate
and running in a super cluster, we could have resources available
elsewhere in the cluster for evacuation.

The algorithm for computing this will be the same as the one for super
cluster compression and redistribution, except that the list of
instances is fixed to the ones living on the nodes to be evacuated.

If the inter-group relocation is successful, the result to Ganeti will
not be a local group evacuation target, but instead (for each
instance) a pair *(remote group, nodes)*. Ganeti itself will have to
decide (based on user input) whether to continue with inter-group
evacuation or not.

If Ganeti doesn't provide complete cluster data, but only the local
group, the inter-group relocation won't be attempted.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: