Revision 92902e91
b/Makefile.am

 	doc/design-2.1.rst \
 	doc/design-2.2.rst \
 	doc/design-2.3.rst \
+	doc/design-htools-2.3.rst \
 	doc/design-2.4.rst \
 	doc/design-draft.rst \
 	doc/design-oob.rst \

b/doc/design-htools-2.3.rst (new file)

====================================
Synchronising htools to Ganeti 2.3
====================================

Ganeti 2.3 introduces a number of new features that change the cluster
internals significantly enough that the htools suite needs to be
updated accordingly in order to function correctly.

Shared storage support
======================

Currently, the htools algorithms presume a model where all of an
instance's resources are served from within the cluster, more
specifically from the nodes comprising the cluster. While this is
usual for memory and CPU, deployments which use shared storage will
invalidate this assumption for storage.

To account for this, we need to move some assumptions from being
implicit (and hardcoded) to being explicitly exported from Ganeti.


New instance parameters
-----------------------

It is presumed that Ganeti will export for all instances a new
``storage_type`` parameter, denoting either internal storage
(e.g. *plain* or *drbd*) or external storage.

Furthermore, a new ``storage_pool`` parameter will classify, for both
internal and external storage, the pool out of which the storage is
allocated. For internal storage, this will be either ``lvm`` (the pool
that provides space to both ``plain`` and ``drbd`` instances) or
``file`` (for file-storage-based instances). For external storage,
this will be the respective NAS/SAN/cloud storage that backs the
instance. Note that for htools, external storage pools are opaque; we
only care that they have an identifier, so that we can distinguish
between two different pools.

If these two parameters are not present, the instances will be
presumed to be ``internal/lvm``.
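
As an illustration (not part of the design), the defaulting rule could be
sketched as follows; the ``storage_key`` helper and the dict-based
parameter representation are assumptions of this sketch, not the actual
Ganeti export format:

```python
def storage_key(params):
    """Return the (storage_type, storage_pool) pair for an instance.

    Missing parameters fall back to the documented default,
    internal/lvm.  External pools are opaque identifiers; we only
    need them to compare equal or not.
    """
    stype = params.get("storage_type", "internal")
    pool = params.get("storage_pool", "lvm")
    return (stype, pool)

# An instance with no explicit storage parameters:
print(storage_key({}))                        # ('internal', 'lvm')
# A file-based instance:
print(storage_key({"storage_type": "internal",
                   "storage_pool": "file"}))  # ('internal', 'file')
# An instance on an external NAS pool:
print(storage_key({"storage_type": "external",
                   "storage_pool": "my-nas"}))  # ('external', 'my-nas')
```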

New node parameters
-------------------

For each node, it is expected that Ganeti will export which storage
types it supports and which pools it has access to. So a classic 2.2
cluster will have all nodes supporting ``internal/lvm`` and/or
``internal/file``, whereas a new shared-storage-only 2.3 cluster could
have ``external/my-nas`` storage.

Whatever mechanism Ganeti uses internally to configure the
associations between nodes and storage pools, we consider that two
node attributes will be available inside htools: the list of internal
and the list of external storage pools.
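
A minimal sketch of how these two hypothesised node attributes might be
used inside htools, representing them as a single set of
*(storage_type, storage_pool)* pairs; the names here are illustrative
only:

```python
def node_can_access(node_pools, instance_key):
    """Check whether a node has access to an instance's storage pool.

    node_pools is the node's set of (storage_type, storage_pool)
    pairs; instance_key is one such pair for the instance.
    """
    return instance_key in node_pools

# A classic 2.2-style node versus a shared-storage-only node:
classic_node = {("internal", "lvm"), ("internal", "file")}
shared_node = {("external", "my-nas")}

print(node_can_access(classic_node, ("internal", "lvm")))    # True
print(node_can_access(shared_node, ("internal", "lvm")))     # False
print(node_can_access(shared_node, ("external", "my-nas")))  # True
```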

External storage and instances
------------------------------

Currently, for an instance we allow one cheap move type: failover to
the current secondary, if it is a healthy node, and four other
“expensive” (as in, including data copies) moves that involve changing
either the secondary or the primary node or both.

In the presence of an external storage type, the following things will
change:

- the disk-based moves will be disallowed; this is already a feature
  in the algorithm, controlled by a boolean switch, so adapting this
  to external storage will be trivial
- instead of the current single secondary node, the secondaries will
  become a list of potential secondaries, based on access to the
  instance's storage pool

Except for this, the basic move algorithm remains unchanged.
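
The move-enumeration change above could be sketched as follows; the move
names mirror the prose (failover is the one cheap move, the others copy
data), but the function and set layout are illustrative, not htools
code:

```python
# The four "expensive" (data-copying) moves versus the one cheap move.
DISK_MOVES = {"replace-primary", "replace-secondary",
              "replace-both", "failover-and-replace"}
ALL_MOVES = DISK_MOVES | {"failover"}

def allowed_moves(uses_external_storage):
    """Return the move types allowed for an instance.

    For externally stored instances, the disk-based moves are
    disallowed, leaving only failover; this corresponds to the
    existing boolean switch in the algorithm.
    """
    if uses_external_storage:
        return ALL_MOVES - DISK_MOVES
    return ALL_MOVES

print(sorted(allowed_moves(True)))  # ['failover']
print(len(allowed_moves(False)))    # 5
```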

External storage and nodes
--------------------------

Two separate areas will have to change for nodes and external storage.

First, when allocating instances (either as part of a move or a new
allocation), if the instance is using external storage, the internal
disk metrics should be ignored (for both the primary and secondary
cases).

Second, the per-node metrics used in the cluster scoring must take
into account that nodes might not have internal storage at all, and
handle this as a well-balanced case (score 0).

N+1 status
----------

Currently, computing the N+1 status of a node is simple:

- group the current secondary instances by their primary node, and
  compute the sum of each group's memory
- take the maximum such sum, and check whether it is smaller than the
  currently available memory on this node
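
The current per-node check described above can be sketched directly; the
data shapes are illustrative, not the actual htools types:

```python
from collections import defaultdict

def n1_ok(free_mem, secondary_instances):
    """Per-node N+1 check.

    secondary_instances is a list of (primary_node, memory) pairs for
    the instances that have this node as secondary.  Group them by
    primary, take the largest total, and compare with free memory.
    """
    per_primary = defaultdict(int)
    for primary, mem in secondary_instances:
        per_primary[primary] += mem
    worst = max(per_primary.values(), default=0)
    return worst < free_mem

# Instances from two primaries; worst case is losing node1 (3072 MB):
secs = [("node1", 1024), ("node1", 2048), ("node2", 512)]
print(n1_ok(4096, secs))  # True
print(n1_ok(2048, secs))  # False
```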

In effect, computing the N+1 status is a per-node matter. However,
with shared storage, we don't have secondary nodes, just potential
secondaries. Thus computing the N+1 status becomes a cluster-level
matter, and a much more expensive one.

A simple version of the N+1 check would be that, for each instance
having a given node as primary, we have enough memory in the cluster
for relocation. This means we would actually need to run allocation
checks, and update the cluster status from within allocation on one
node, while being careful that we don't recursively check N+1 status
during this relocation, which would be too expensive.

However, the shared storage model has some properties that change the
rules of the computation. Broadly speaking (and ignoring hard
restrictions like tag-based exclusion and CPU limits), the exact
location of an instance in the cluster doesn't matter as long as
memory is available. This results in two changes:

- simply tracking the cluster-wide amount of free memory is enough
- moving an instance from one node to another would not change the N+1
  status of any node, and only allocation needs to deal with N+1
  checks

Unfortunately, this very cheap solution fails in the presence of any
other exclusion or prevention factors.
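
To make the trade-off concrete, the very cheap check could be sketched
as below; it deliberately ignores exclusion tags, CPU limits and
placement restrictions, which is exactly why it fails as soon as such
factors exist (names and data shapes are assumptions):

```python
def cluster_n1_ok(node_free_mem, instance_mems):
    """Cheap cluster-wide N+1 check under pure shared storage.

    node_free_mem maps node -> free memory; instance_mems lists the
    memory of the instances that would need relocation.  With shared
    storage and no other restrictions, an instance fits anywhere
    memory is free, so one cluster-wide total answers the question.
    """
    total_free = sum(node_free_mem.values())
    return sum(instance_mems) <= total_free

free = {"node1": 1024, "node2": 2048, "node3": 512}
print(cluster_n1_ok(free, [1024, 2048]))  # True  (3072 <= 3584)
print(cluster_n1_ok(free, [4096]))        # False
```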

TODO: find a solution for N+1 checks.


Node groups support
===================

The addition of node groups has a small impact on the actual
algorithms, which will simply operate at node group level instead of
cluster level, but it requires the addition of new algorithms for
inter-node-group operations.

The following definitions will be used in the paragraphs below:

local group
  The local group refers to a node's own node group, or, when speaking
  about an instance, the node group of its primary node

regular cluster
  A cluster composed of a single node group, i.e. a pre-2.3 cluster

super cluster
  A cluster which comprises multiple node groups, as opposed to a 2.2
  and earlier cluster with a single node group

In all the operations below, it is assumed that Ganeti can gather the
entire super cluster state cheaply.

Balancing changes
-----------------

Balancing will move from cluster-level balancing to group-level
balancing. In order to achieve a reasonable improvement in a super
cluster, without needing to keep state about which groups have already
been balanced previously, the balancing algorithm will run as follows:

#. the cluster data is gathered
#. if this is a regular cluster, as opposed to a super cluster,
   balancing proceeds normally as before
#. otherwise, compute the cluster scores for all groups
#. choose the group with the worst score and see if we can improve it;
   if not, choose the next-worst group, and so on
#. once a group has been identified, run the balancing for it

Of course, explicit selection of a group will be allowed.
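
Steps 3–5 above can be sketched as follows; ``score`` and
``try_balance`` stand in for the real htools scoring and balancing
functions and are assumptions of this sketch (in htools, lower scores
are better):

```python
def pick_group_to_balance(groups, score, try_balance):
    """Walk groups from worst score downwards; return (group, plan)
    for the first one that balancing can improve, or None if no group
    can be improved."""
    for group in sorted(groups, key=score, reverse=True):
        plan = try_balance(group)
        if plan is not None:
            return group, plan
    return None

scores = {"group1": 2.5, "group2": 7.0, "group3": 4.1}
# Pretend the worst group (group2) cannot be improved, but group3 can:
plans = {"group3": ["move instance5 to node9"]}
result = pick_group_to_balance(scores, scores.get, plans.get)
print(result)  # ('group3', ['move instance5 to node9'])
```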

Super cluster operations
++++++++++++++++++++++++

Besides the regular group balancing, in a super cluster we have more
operations.


Redistribution
^^^^^^^^^^^^^^

In a regular cluster, once we run out of resources (offline nodes
which can't be fully evacuated, N+1 failures, etc.) there is nothing
we can do unless nodes are added or instances are removed.

In a super cluster, however, there might be resources available in
another group, so there is the possibility of relocating instances
between groups to re-establish N+1 success within each group.

One difficulty in the presence of both super clusters and shared
storage is that the move paths of instances are quite complicated:
basically an instance can move inside its local group, and to any
other group which has access to the same storage type and storage
pool pair. In effect, the super cluster is composed of multiple
‘partitions’, each containing one or more groups, but a node is
simultaneously present in multiple partitions, one for each storage
type and storage pool it supports. As such, the interactions between
the individual partitions are too complex for non-trivial clusters to
assume we can compute a perfect solution: we might need to move some
instances using shared storage pool ‘A’ in order to free more memory
to accept an instance using local storage, which would in turn free
more VCPUs in a third partition, and so on. Therefore, we'll limit
ourselves to simple relocation steps within a single partition.
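
The ‘partition’ notion can be illustrated as below: starting from each
group's set of *(storage_type, storage_pool)* pairs, collect, per pair,
the groups sharing it. A group belongs to one partition per storage
method it supports, so partitions overlap; the data layout here is an
assumption of the sketch:

```python
from collections import defaultdict

def partitions(group_storage):
    """group_storage maps group name -> set of (type, pool) pairs.
    Return a map from (type, pool) -> list of member groups."""
    parts = defaultdict(list)
    for group, methods in sorted(group_storage.items()):
        for method in methods:
            parts[method].append(group)
    return dict(parts)

storage = {
    "group1": {("internal", "lvm"), ("external", "my-nas")},
    "group2": {("external", "my-nas")},
    "group3": {("internal", "lvm")},
}
parts = partitions(storage)
print(parts[("external", "my-nas")])  # ['group1', 'group2']
print(parts[("internal", "lvm")])     # ['group1', 'group3']
```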
Algorithm:

#. read super cluster data, and exit if the cluster doesn't allow
   inter-group moves
#. filter out any groups that are “alone” in their partition
   (i.e. with no other group sharing at least one storage method)
#. determine the lists of healthy versus unhealthy groups:

   #. a group which contains offline nodes still hosting instances is
      definitely not healthy
   #. a group which has nodes failing N+1 is ‘weakly’ unhealthy

#. if either list is empty, exit (no work to do, or no way to fix
   problems)
#. for each unhealthy group:

   #. compute the instances that are causing the problems: all
      instances living on offline nodes, all instances living as
      secondaries on N+1-failing nodes, all instances living as
      primaries on N+1-failing nodes (in this order)
   #. remove instances, one by one, until the source group is healthy
      again
   #. try to run a standard allocation procedure for each instance on
      all potential groups in its partition
   #. if all instances were relocated successfully, it means we have a
      solution for repairing the original group

Compression
^^^^^^^^^^^

In a super cluster which has had many instance reclamations, it is
possible that, while none of the groups is empty, overall there is
enough free capacity that an entire group could be removed.

The algorithm for “compressing” the super cluster is as follows:

#. read super cluster data
#. compute total *(memory, disk, cpu)* and free *(memory, disk, cpu)*
   for the super cluster
#. compute per-group used and free *(memory, disk, cpu)*
#. select candidate groups for evacuation:

   #. they must be connected to other groups via a common storage type
      and pool
   #. they must have fewer used resources than the global free
      resources (minus their own free resources)

#. for each of these groups, try to relocate all their instances to
   connected peer groups
#. report the list of groups that could be evacuated, or, if
   instructed to do so, perform the evacuation of the group with the
   largest free resources (i.e. in order to reclaim the most capacity)
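
The candidate-selection step (item 2 of the selection criteria) can be
sketched numerically: a group qualifies if its used resources fit into
the free resources of the rest of the super cluster. Resource vectors
are *(memory, disk, cpu)* tuples; connectivity (item 1) is checked
elsewhere, and the data layout is an assumption of this sketch:

```python
def evacuation_candidates(groups):
    """groups maps name -> dict with 'used' and 'free' resource
    tuples (mem, disk, cpu).  Return the groups whose instances
    could, resource-wise, be absorbed by the remaining groups."""
    total_free = [sum(g["free"][i] for g in groups.values())
                  for i in range(3)]
    candidates = []
    for name, g in groups.items():
        # Global free resources minus the group's own free resources.
        free_elsewhere = [total_free[i] - g["free"][i] for i in range(3)]
        if all(g["used"][i] <= free_elsewhere[i] for i in range(3)):
            candidates.append(name)
    return candidates

groups = {
    "group1": {"used": (4096, 500, 8), "free": (1024, 100, 2)},
    "group2": {"used": (2048, 200, 4), "free": (8192, 900, 16)},
}
print(evacuation_candidates(groups))  # ['group1']
```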

Load balancing
^^^^^^^^^^^^^^

Assuming a super cluster using shared storage, where instance failover
is cheap, it should be possible to do load-based balancing across
groups.

As opposed to normal balancing, where we want to balance on all node
attributes, here we should look only at the load attributes; in other
words, compare the available (total) node capacity with the (total)
load generated by instances in a given group, compute such scores for
all groups, and see whether there are any outliers.

Once a reliable load-weighting method for groups exists, we can apply
a modified version of the cluster scoring method to score not
imbalances across nodes, but imbalances across groups, resulting in a
super cluster load-related score.

Allocation changes
------------------

It is important to keep the allocation method across groups internal
(in the Ganeti/IAllocator combination), instead of delegating it to an
external party (e.g. a RAPI client). For this, the IAllocator protocol
should be extended to provide proper group support.

For htools, the new algorithm will work as follows:

#. read/receive cluster data from Ganeti
#. filter out any groups that do not support the requested storage
   method
#. for the remaining groups, try allocation and compute scores after
   allocation
#. sort valid allocation solutions accordingly and return the entire
   list to Ganeti

The rationale for returning the entire group list, and not only the
best choice, is that we have the list anyway, and Ganeti might have
other criteria (e.g. the best group might be busy/locked down, etc.),
so even if a group is the best choice from the point of view of
resources, it might not be the overall best one.
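
The four-step flow above can be sketched as follows; ``try_alloc`` and
the data shapes are assumptions of this sketch, not the IAllocator
protocol itself:

```python
def allocate(groups, storage_method, try_alloc):
    """groups: map group name -> set of supported (type, pool) pairs.
    try_alloc(group) -> score (lower is better) or None on failure.
    Return group names sorted from best to worst score."""
    solutions = []
    for name, methods in groups.items():
        if storage_method not in methods:
            continue  # step 2: group cannot host this storage method
        score = try_alloc(name)  # step 3: attempt allocation, score it
        if score is not None:
            solutions.append((score, name))
    # Step 4: return the entire sorted list, not just the winner.
    return [name for _, name in sorted(solutions)]

groups = {"group1": {("internal", "lvm")},
          "group2": {("external", "my-nas")},
          "group3": {("external", "my-nas")}}
scores = {"group2": 3.2, "group3": 1.7}
print(allocate(groups, ("external", "my-nas"), scores.get))
# ['group3', 'group2']
```

Returning the whole sorted list matches the rationale above: the caller
keeps the freedom to skip the best-scored group for its own reasons.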

Node evacuation changes
-----------------------

While the basic concept of the ``multi-evac`` iallocator mode remains
unchanged (it is a simple local-group issue), when an evacuation fails
in a super cluster we could have resources available elsewhere in the
cluster for the evacuation.

The algorithm for computing this will be the same as the one for super
cluster compression and redistribution, except that the list of
instances is fixed to the ones living on the to-be-evacuated nodes.

If the inter-group relocation is successful, the result returned to
Ganeti will not be a local-group evacuation target, but instead (for
each instance) a pair *(remote group, nodes)*. Ganeti itself will have
to decide (based on user input) whether to continue with inter-group
evacuation or not.

In case Ganeti doesn't provide complete cluster data, just the local
group, the inter-group relocation won't be attempted.
b/doc/index.rst

    design-2.1.rst
    design-2.2.rst
    design-2.3.rst
+   design-htools-2.3.rst
    design-2.4.rst
    design-draft.rst
    cluster-merge.rst