Revision 92902e91

b/Makefile.am
 	doc/design-2.1.rst \
 	doc/design-2.2.rst \
 	doc/design-2.3.rst \
+	doc/design-htools-2.3.rst \
 	doc/design-2.4.rst \
 	doc/design-draft.rst \
 	doc/design-oob.rst \

b/doc/design-htools-2.3.rst

====================================
 Synchronising htools to Ganeti 2.3
====================================

Ganeti 2.3 introduces a number of new features that change the cluster
internals significantly enough that the htools suite needs to be
updated accordingly in order to function correctly.

Shared storage support
======================

Currently, the htools algorithms presume a model where all of an
instance's resources are served from within the cluster, more
specifically from the nodes comprising the cluster. While this is
usual for memory and CPU, deployments which use shared storage will
invalidate this assumption for storage.

To account for this, we need to move some assumptions from being
implicit (and hardcoded) to being explicitly exported from Ganeti.


New instance parameters
-----------------------

It is presumed that Ganeti will export for all instances a new
``storage_type`` parameter that will denote either internal storage
(e.g. *plain* or *drbd*) or external storage.

Furthermore, a new ``storage_pool`` parameter will classify, for both
internal and external storage, the pool out of which the storage is
allocated. For internal storage, this will be either ``lvm`` (the pool
that provides space to both ``plain`` and ``drbd`` instances) or
``file`` (for file-storage-based instances). For external storage,
this will be the respective NAS/SAN/cloud storage that backs the
instance. Note that for htools, external storage pools are opaque; we
only care that they have an identifier, so that we can distinguish
between two different pools.

If these two parameters are not present, the instances will be
presumed to be ``internal/lvm``.
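
As a purely illustrative sketch (in Haskell, since that is the
language htools is written in), the instance-level storage parameters
could be modelled as below. The type and field names are assumptions
of this example, not the actual htools data structures:

.. code-block:: haskell

   -- Hypothetical representation of the new instance storage parameters.
   data StorageType = Internal | External
     deriving (Show, Eq)

   type StoragePool = String

   data InstStorage = InstStorage
     { stType :: StorageType
     , stPool :: StoragePool
     } deriving (Show, Eq)

   -- Default used when Ganeti does not export the two parameters.
   defaultStorage :: InstStorage
   defaultStorage = InstStorage Internal "lvm"

   -- Build the storage spec from the optionally-exported values.
   parseStorage :: Maybe String -> Maybe String -> InstStorage
   parseStorage mtype mpool =
     case (mtype, mpool) of
       (Just "external", Just pool) -> InstStorage External pool
       (Just _,          Just pool) -> InstStorage Internal pool
       _                            -> defaultStorage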

New node parameters
-------------------

For each node, it is expected that Ganeti will export what storage
types it supports and which pools it has access to. So a classic 2.2
cluster will have all nodes supporting ``internal/lvm`` and/or
``internal/file``, whereas a new shared-storage-only 2.3 cluster could
have ``external/my-nas`` storage.

Whatever mechanism Ganeti will use internally to configure the
associations between nodes and storage pools, we assume that two node
attributes will be available inside htools: the lists of internal
and external storage pools.
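
Similarly, the node-level view could be sketched as follows; the
record and its fields are again illustrative only:

.. code-block:: haskell

   -- Hypothetical node attributes listing the accessible storage pools.
   type StoragePool = String

   data NodeStorage = NodeStorage
     { internalPools :: [StoragePool]  -- e.g. ["lvm", "file"]
     , externalPools :: [StoragePool]  -- e.g. ["my-nas"]
     } deriving (Show)

   -- A classic 2.2-style node versus a shared-storage-only 2.3 node.
   classicNode, sharedNode :: NodeStorage
   classicNode = NodeStorage ["lvm", "file"] []
   sharedNode  = NodeStorage []              ["my-nas"]

   -- Can this node serve storage from the given pool, internal or external?
   hasPool :: NodeStorage -> StoragePool -> Bool
   hasPool node pool =
     pool `elem` internalPools node || pool `elem` externalPools node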

External storage and instances
------------------------------

Currently, for an instance we allow one cheap move type: failover to
the current secondary, if it is a healthy node, and four other
“expensive” (as in, involving data copies) moves that change either
the secondary or the primary node, or both.

In the presence of an external storage type, the following things will
change:

- the disk-based moves will be disallowed; this is already a feature
  in the algorithm, controlled by a boolean switch, so adapting it to
  external storage will be trivial
- instead of the current single secondary node, the secondaries will
  become a list of potential secondaries, based on access to the
  instance's storage pool

Except for this, the basic move algorithm remains unchanged.
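
A minimal sketch of how the list of potential secondaries could be
computed, assuming simplified (hypothetical) node and instance records
that carry the pool information described earlier:

.. code-block:: haskell

   -- Simplified, illustrative records; not the actual htools types.
   data Node = Node
     { nodeName  :: String
     , nodePools :: [String]  -- storage pools this node can access
     } deriving (Show)

   data Instance = Instance
     { instName    :: String
     , instPrimary :: String  -- name of the primary node
     , instPool    :: String  -- storage pool backing the instance
     } deriving (Show)

   -- All nodes other than the primary that can access the instance's pool.
   potentialSecondaries :: [Node] -> Instance -> [Node]
   potentialSecondaries nodes inst =
     [ n | n <- nodes
         , nodeName n /= instPrimary inst
         , instPool inst `elem` nodePools n ]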

External storage and nodes
--------------------------

Two separate areas will have to change for nodes and external storage.

First, when allocating instances (either as part of a move or a new
allocation), if the instance is using external storage, then the
internal disk metrics should be ignored (for both the primary and
secondary cases).

Second, the per-node metrics used in the cluster scoring must take
into account that nodes might not have internal storage at all, and
handle this as a well-balanced case (score 0).
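
One possible reading of the per-node disk metric, shown only as a
sketch under the assumption that the metric is a simple utilisation
ratio (the real htools scoring aggregates such values differently):

.. code-block:: haskell

   -- Illustrative per-node disk metric; nodes without internal storage
   -- are treated as perfectly balanced (score 0).
   data NodeDisk = NodeDisk
     { totalDisk :: Double  -- total internal disk, zero if none
     , freeDisk  :: Double  -- free internal disk
     } deriving (Show)

   diskMetric :: NodeDisk -> Double
   diskMetric nd
     | totalDisk nd <= 0 = 0.0  -- no internal storage: well-balanced case
     | otherwise         = (totalDisk nd - freeDisk nd) / totalDisk nd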

N+1 status
----------

Currently, computing the N+1 status of a node is simple (a sketch
follows):

- group the current secondary instances by their primary node, and
  compute the sum of the memory of each such group
- choose the maximum sum, and check whether it is smaller than the
  currently available memory on this node
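
Using simplified, hypothetical record fields, this per-node check
boils down to:

.. code-block:: haskell

   import qualified Data.Map.Strict as Map

   -- Simplified instance view: its primary node and its memory.
   data Inst = Inst
     { primaryOf :: String
     , instMem   :: Int
     } deriving (Show)

   -- N+1 check for one node, given the instances for which this node is
   -- the secondary and the node's currently available memory.
   nPlusOneOk :: [Inst] -> Int -> Bool
   nPlusOneOk secondaries freeMem =
     let perPrimary = Map.fromListWith (+)
                        [ (primaryOf i, instMem i) | i <- secondaries ]
         worstCase  = if Map.null perPrimary
                        then 0
                        else maximum (Map.elems perPrimary)
     in worstCase <= freeMem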

In effect, computing the N+1 status is a per-node matter. However,
with shared storage, we don't have secondary nodes, just potential
secondaries. Thus computing the N+1 status will be a cluster-level
matter, and much more expensive.

A simple version of the N+1 checks would be that for each instance
having said node as primary, we have enough memory in the cluster for
relocation. This means we would actually need to run allocation
checks, and update the cluster status from within allocation on one
node, while being careful that we don't recursively check the N+1
status during this relocation, which would be too expensive.

However, the shared storage model has some properties that change the
rules of the computation. Speaking broadly (and ignoring hard
restrictions like tag-based exclusion and CPU limits), the exact
location of an instance in the cluster doesn't matter as long as
memory is available. This results in two changes:

- simply tracking the amount of free memory buckets is enough,
  cluster-wide
- moving an instance from one node to another would not change the N+1
  status of any node, and only allocation needs to deal with N+1
  checks

Unfortunately, this very cheap solution fails in the presence of any
other exclusion or prevention factors.

TODO: find a solution for N+1 checks.


Node groups support
===================

The addition of node groups has a small impact on the actual
algorithms, which will simply operate at node group level instead of
cluster level, but it requires the addition of new algorithms for
inter-node-group operations.

The following definitions will be used in the coming paragraphs:

local group
  The local group refers to a node's own node group or, when speaking
  about an instance, the node group of its primary node

regular cluster
  A cluster composed of a single node group, i.e. a pre-2.3 cluster

super cluster
  A cluster which comprises multiple node groups, as opposed to a 2.2
  and earlier cluster with a single node group

In all the operations below, it's assumed that Ganeti can gather the
entire super cluster state cheaply.


Balancing changes
-----------------

Balancing will move from cluster-level balancing to group
balancing. In order to achieve a reasonable improvement in a super
cluster, without needing to keep state about which groups have already
been balanced, the balancing algorithm will run as follows:

#. the cluster data is gathered
#. if this is a regular cluster, as opposed to a super cluster,
   balancing will proceed as before
#. otherwise, compute the cluster scores for all groups
#. choose the group with the worst score and see if we can improve it;
   if not, choose the next-worst group, and so on
#. once a group has been identified, run the balancing for it

Of course, explicit selection of a group will be allowed.
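
As an illustration of the group-selection step above, ``groupScore``
and ``tryImprove`` below are stand-ins for the existing htools scoring
and per-group balancing entry points, and are assumptions of this
example:

.. code-block:: haskell

   import Data.List (sortBy)
   import Data.Ord (Down (..), comparing)

   -- Pick the worst-scoring group that can actually be improved.
   selectGroup :: (g -> Double)    -- placeholder: cluster score of a group
               -> (g -> Maybe g)   -- placeholder: balanced group, if improved
               -> [g]              -- all groups of the super cluster
               -> Maybe (g, g)     -- (original, improved) group, if any
   selectGroup groupScore tryImprove groups =
     let worstFirst = sortBy (comparing (Down . groupScore)) groups
     in case [ (g, g') | g <- worstFirst, Just g' <- [tryImprove g] ] of
          (result:_) -> Just result
          []         -> Nothing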

Super cluster operations
++++++++++++++++++++++++

Besides the regular group balancing, in a super cluster we have more
operations.


Redistribution
^^^^^^^^^^^^^^

In a regular cluster, once we run out of resources (offline nodes
which can't be fully evacuated, N+1 failures, etc.) there is nothing
we can do unless nodes are added or instances are removed.

In a super cluster, however, there might be resources available in
another group, so there is the possibility of relocating instances
between groups to re-establish N+1 success within each group.

One difficulty in the presence of both super clusters and shared
storage is that the move paths of instances are quite complicated;
basically an instance can move inside its local group, and to any
other group which has access to the same storage type and storage
pool pair. In effect, the super cluster is composed of multiple
‘partitions’, each containing one or more groups, but a node is
simultaneously present in multiple partitions, one for each storage
type and storage pool it supports. As such, the interactions between
the individual partitions are too complex for non-trivial clusters to
assume we can compute a perfect solution: we might need to move some
instances using shared storage pool ‘A’ in order to clear some more
memory to accept an instance using local storage, which will further
clear more VCPUs in a third partition, etc. As such, we'll limit
ourselves to simple relocation steps within a single partition.
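
The partition structure can be illustrated with a short sketch: every
(storage type, storage pool) pair defines one partition, and a group
belongs to every partition whose storage method it supports. The types
below are assumptions of this example:

.. code-block:: haskell

   import qualified Data.Map.Strict as Map

   type StorageMethod = (String, String)  -- (storage type, storage pool)

   data Group = Group
     { groupName    :: String
     , groupMethods :: [StorageMethod]
     } deriving (Show)

   -- Map each storage method to the groups supporting it; each map entry
   -- corresponds to one ‘partition’ of the super cluster.
   partitions :: [Group] -> Map.Map StorageMethod [String]
   partitions groups =
     Map.fromListWith (++)
       [ (m, [groupName g]) | g <- groups, m <- groupMethods g ]

   -- Groups that are “alone” in all their partitions (step 2 of the
   -- algorithm below) can be filtered out for redistribution purposes.
   lonelyGroups :: [Group] -> [String]
   lonelyGroups groups =
     let parts = partitions groups
         shared g = any (\m -> length (Map.findWithDefault [] m parts) > 1)
                        (groupMethods g)
     in [ groupName g | g <- groups, not (shared g) ]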

Algorithm:

#. read super cluster data, and exit if the cluster doesn't allow
   inter-group moves
#. filter out any groups that are “alone” in their partition
   (i.e. with no other group sharing at least one storage method)
#. determine the lists of healthy versus unhealthy groups:

    #. a group which contains offline nodes still hosting instances is
       definitely not healthy
    #. a group which has nodes failing N+1 is ‘weakly’ unhealthy

#. if either list is empty, exit (no work to do, or no way to fix problems)
#. for each unhealthy group:

    #. compute the instances that are causing the problems: all
       instances living on offline nodes, all instances living as
       secondaries on N+1 failing nodes, all instances living as
       primaries on N+1 failing nodes (in this order)
    #. remove instances, one by one, until the source group is healthy
       again
    #. try to run a standard allocation procedure for each instance on
       all potential groups in its partition
    #. if all instances were relocated successfully, it means we have a
       solution for repairing the original group

Compression
^^^^^^^^^^^

In a super cluster which has had many instance reclamations, it is
possible that, while none of the groups is empty, overall there is
enough empty capacity that an entire group could be removed.

The algorithm for “compressing” the super cluster is as follows:

#. read super cluster data
#. compute the total and free *(memory, disk, cpu)* for the super
   cluster
#. compute the per-group used and free *(memory, disk, cpu)*
#. select candidate groups for evacuation (see the sketch after this
   list):

    #. they must be connected to other groups via a common storage type
       and pool
    #. they must have fewer used resources than the global free
       resources (minus their own free resources)

#. for each of these groups, try to relocate all its instances to
   connected peer groups
#. report the list of groups that could be evacuated or, if instructed
   to do so, perform the evacuation of the group with the largest free
   resources (i.e. in order to reclaim the most capacity)
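
The candidate-selection step referenced above could be sketched as
follows; the resource triple and the record fields are assumptions of
this example:

.. code-block:: haskell

   -- Resource triple used throughout this sketch: (memory, disk, cpu).
   type Res = (Double, Double, Double)

   minus :: Res -> Res -> Res
   minus (a, b, c) (x, y, z) = (a - x, b - y, c - z)

   fitsIn :: Res -> Res -> Bool
   fitsIn (a, b, c) (x, y, z) = a <= x && b <= y && c <= z

   data GroupRes = GroupRes
     { grName      :: String
     , grUsed      :: Res   -- resources used by the group's instances
     , grFree      :: Res   -- free resources inside the group
     , grConnected :: Bool  -- shares a storage method with another group
     } deriving (Show)

   -- A group is a candidate for evacuation if it is connected to peers
   -- and its used resources fit into the global free resources minus its
   -- own free resources.
   candidates :: Res -> [GroupRes] -> [GroupRes]
   candidates globalFree groups =
     [ g | g <- groups
         , grConnected g
         , grUsed g `fitsIn` (globalFree `minus` grFree g) ]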

Load balancing
^^^^^^^^^^^^^^

Assuming a super cluster using shared storage, where instance failover
is cheap, it should be possible to do load-based balancing across
groups.

As opposed to the normal balancing, where we want to balance on all
node attributes, here we should look only at the load attributes; in
other words, compare the available (total) node capacity with the
(total) load generated by the instances in a given group, compute such
scores for all groups, and try to see if there are any outliers.

Once a reliable load-weighting method for groups exists, we can apply
a modified version of the cluster scoring method to score not
imbalances across nodes, but imbalances across groups, which results
in a super cluster load-related score.
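
As an illustration only, the per-group load score could be the ratio
of instance-generated load to total node capacity, with outliers
flagged relative to the mean; the threshold and the exact weighting
are assumptions of this sketch:

.. code-block:: haskell

   data GroupLoad = GroupLoad
     { glName     :: String
     , glCapacity :: Double  -- total node capacity in the group
     , glLoad     :: Double  -- total load generated by its instances
     } deriving (Show)

   loadScore :: GroupLoad -> Double
   loadScore g = glLoad g / glCapacity g

   -- Groups whose load score deviates from the mean by more than the
   -- given threshold are reported as outliers.
   outliers :: Double -> [GroupLoad] -> [GroupLoad]
   outliers threshold groups =
     let scores = map loadScore groups
         mean   = sum scores / fromIntegral (length groups)
     in [ g | g <- groups, abs (loadScore g - mean) > threshold ]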

Allocation changes
------------------

It is important to keep the allocation method across groups internal
(in the Ganeti/IAllocator combination), instead of delegating it to an
external party (e.g. a RAPI client). For this, the IAllocator protocol
should be extended to provide proper group support.

For htools, the new algorithm will work as follows:

#. read/receive cluster data from Ganeti
#. filter out any groups that do not support the requested storage
   method
#. for the remaining groups, try the allocation and compute the scores
   after allocation
#. sort the valid allocation solutions accordingly and return the
   entire list to Ganeti

The rationale for returning the entire group list, and not only the
best choice, is that we have the list anyway, and Ganeti might have
other criteria (e.g. the best group might be busy or locked down), so
even if it is the best choice from a resources point of view, it might
not be the overall best one.
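
A sketch of this flow; ``tryAlloc`` below is a placeholder for the
existing per-group allocation routine and, as in the other sketches,
the names are assumptions of this example (lower scores are assumed to
be better, as for the existing cluster score):

.. code-block:: haskell

   import Data.List (sortBy)
   import Data.Ord (comparing)

   -- For every group supporting the requested storage method, try the
   -- allocation and keep the resulting cluster score; return all valid
   -- solutions, best score first, so Ganeti can apply its own criteria.
   allocateAcrossGroups :: (g -> Bool)          -- supports storage method?
                        -> (g -> Maybe Double)  -- score, if allocation fits
                        -> [g]
                        -> [(g, Double)]
   allocateAcrossGroups supportsMethod tryAlloc groups =
     sortBy (comparing snd)
       [ (g, score) | g <- groups
                    , supportsMethod g
                    , Just score <- [tryAlloc g] ]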

Node evacuation changes
-----------------------

While the basic concept of the ``multi-evac`` iallocator mode remains
unchanged (it's a simple local group issue), when failing to evacuate
while running in a super cluster, we could have resources available
elsewhere in the cluster for evacuation.

The algorithm for computing this will be the same as the one for super
cluster compression and redistribution, except that the list of
instances is fixed to the ones living on the nodes to be evacuated.

If the inter-group relocation is successful, the result to Ganeti will
not be a local group evacuation target, but instead (for each
instance) a pair *(remote group, nodes)*. Ganeti itself will have to
decide (based on user input) whether to continue with the inter-group
evacuation or not.
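
The shape of such a result could be sketched as follows (purely
illustrative types, not the actual IAllocator response format):

.. code-block:: haskell

   -- Per-instance evacuation target: either nodes within the local group,
   -- or a (remote group, nodes) pair for inter-group evacuation.
   data EvacTarget
     = LocalTarget  [String]         -- node names inside the local group
     | RemoteTarget String [String]  -- remote group name and its nodes
     deriving (Show)

   -- One entry per instance, keyed by instance name.
   type EvacResult = [(String, EvacTarget)]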

In case Ganeti doesn't provide complete cluster data, but just the
local group, the inter-group relocation won't be attempted.

b/doc/index.rst
    design-2.1.rst
    design-2.2.rst
    design-2.3.rst
+   design-htools-2.3.rst
    design-2.4.rst
    design-draft.rst
    cluster-merge.rst