====================================
 Synchronising htools to Ganeti 2.3
====================================

Ganeti 2.3 introduces a number of new features that change the cluster
internals significantly enough that the htools suite needs to be
updated accordingly in order to function correctly.

Shared storage support
======================

Currently, the htools algorithms presume a model where all of an
instance's resources are served from within the cluster, more
specifically from the nodes comprising the cluster. While this is
usual for memory and CPU, deployments which use shared storage will
invalidate this assumption for storage.

To account for this, we need to move some assumptions from being
implicit (and hardcoded) to being explicitly exported from Ganeti.


New instance parameters
-----------------------

It is presumed that Ganeti will export for all instances a new
``storage_type`` parameter that will denote either internal storage
(e.g. *plain* or *drbd*), or external storage.

Furthermore, a new ``storage_pool`` parameter will classify, for both
internal and external storage, the pool out of which the storage is
allocated. For internal storage, this will be either ``lvm`` (the pool
that provides space to both ``plain`` and ``drbd`` instances) or
``file`` (for file-storage-based instances). For external storage,
this will be the respective NAS/SAN/cloud storage that backs the
instance. Note that for htools, external storage pools are opaque; we
only care that they have an identifier, so that we can distinguish
between two different pools.

If these two parameters are not present, the instances will be
presumed to be ``internal/lvm``.

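As an illustration only, the fallback rule could look like the
following Python sketch. The ``Storage`` tuple, the
``instance_storage`` helper and the dictionary layout are hypothetical
and not part of Ganeti or htools; rejecting a half-specified pair is
our assumption, since the design only covers the case where both
parameters are missing:

.. code-block:: python

  from typing import NamedTuple


  class Storage(NamedTuple):
      storage_type: str  # "internal" or "external"
      storage_pool: str  # "lvm", "file", or an opaque external pool id


  # Documented fallback when neither parameter is exported.
  DEFAULT_STORAGE = Storage("internal", "lvm")


  def instance_storage(params: dict) -> Storage:
      """Classify one exported instance (hypothetical dict layout)."""
      stype = params.get("storage_type")
      pool = params.get("storage_pool")
      if stype is None and pool is None:
          return DEFAULT_STORAGE
      if stype is None or pool is None:
          # Assumption: a half-specified pair is treated as an error.
          raise ValueError("storage_type and storage_pool go together")
      return Storage(stype, pool)
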
New node parameters
-------------------

For each node, it is expected that Ganeti will export what storage
types it supports and pools it has access to. So a classic 2.2 cluster
will have all nodes supporting ``internal/lvm`` and/or
``internal/file``, whereas a new shared-storage-only 2.3 cluster could
have ``external/my-nas`` storage.

Whatever the mechanism that Ganeti will use internally to configure
the associations between nodes and storage pools, we consider that
we'll have available two node attributes inside htools: the list of
internal storage pools and the list of external storage pools.

External storage and instances
------------------------------

Currently, for an instance we allow one cheap move type: failover to
the current secondary, if it is a healthy node, and four other
“expensive” (as in, including data copies) moves that involve changing
either the secondary or the primary node or both.

In the presence of an external storage type, the following things will
change:

- the disk-based moves will be disallowed; this is already a feature
  in the algorithm, controlled by a boolean switch, so adapting it to
  external storage will be trivial
- instead of the current one secondary node, the secondaries will
  become a list of potential secondaries, based on access to the
  instance's storage pool

Except for this, the basic move algorithm remains unchanged.

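A minimal Python sketch of how the list of potential secondaries might
be derived from the node attributes. The dictionary keys
(``primary``, ``storage_pool``, ``offline``, ``external_pools``) are
hypothetical names chosen for this example, and treating "healthy" as
"not offline" is an assumption:

.. code-block:: python

  def potential_secondaries(instance, nodes):
      """Names of nodes that could take over an external-storage instance.

      instance: {"primary": str, "storage_pool": str}
      nodes:    {name: {"offline": bool, "external_pools": [str, ...]}}
      """
      pool = instance["storage_pool"]
      return sorted(
          name
          for name, node in nodes.items()
          if name != instance["primary"]      # the primary is not a target
          and not node["offline"]             # assumption: healthy = online
          and pool in node["external_pools"]  # must reach the same pool
      )
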
External storage and nodes
--------------------------

Two separate areas will have to change for nodes and external storage.

First, when allocating instances (either as part of a move or a new
allocation), if the instance is using external storage, then the
internal disk metrics should be ignored (for both the primary and
secondary cases).

Second, the per-node metrics used in the cluster scoring must take
into account that nodes might not have internal storage at all, and
handle this as a well-balanced case (score 0).

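Both changes can be sketched in a few lines of Python. This is not the
htools implementation: the field names are hypothetical, and using the
population standard deviation of the free-disk ratio as the disk
metric is an assumption made only to have something concrete to score:

.. code-block:: python

  from statistics import pstdev


  def fits(node, inst):
      """Memory always counts; disk only for internally stored instances."""
      if inst["mem"] > node["free_mem"]:
          return False
      if inst["storage_type"] == "external":
          return True  # internal disk metrics are ignored
      return inst["disk"] <= node["free_disk"]


  def disk_score(nodes):
      """Spread of the free-disk ratio over nodes with internal storage."""
      ratios = [
          n["free_disk"] / n["total_disk"]
          for n in nodes
          if n["total_disk"] > 0  # skip nodes without internal storage
      ]
      # No internal storage anywhere counts as well balanced (score 0).
      return pstdev(ratios) if ratios else 0.0
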
N+1 status
----------

Currently, computing the N+1 status of a node is simple:

- group the current secondary instances by their primary node, and
  compute the memory sum of each such instance group
- choose the maximum sum, and check if it's smaller than the current
  available memory on this node

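The two steps above can be sketched in Python as follows. The instance
dictionaries are a hypothetical layout, and accepting a sum equal to
the free memory is an assumption (it keeps a node without secondaries
passing even when it has no free memory):

.. code-block:: python

  from collections import defaultdict


  def passes_n1(node_name, free_mem, instances):
      """Per-node N+1 check.

      instances: iterable of {"primary": str, "secondary": str, "mem": int}
      """
      # Step 1: memory of our secondary instances, summed per primary node.
      by_primary = defaultdict(int)
      for inst in instances:
          if inst["secondary"] == node_name:
              by_primary[inst["primary"]] += inst["mem"]
      # Step 2: the worst single-node failure must fit in our free memory.
      worst = max(by_primary.values(), default=0)
      return worst <= free_mem
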
In effect, computing the N+1 status is a per-node matter. However,
with shared storage, we don't have secondary nodes, just potential
secondaries. Thus computing the N+1 status will be a cluster-level
matter, and much more expensive.

A simple version of the N+1 checks would be that for each instance
having said node as primary, we have enough memory in the cluster for
relocation. This means we would actually need to run allocation
checks, and update the cluster status from within allocation on one
node, while being careful that we don't recursively check N+1 status
during this relocation, which is too expensive.

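A deliberately naive Python sketch of this simple check, under strong
assumptions: only memory is considered, every other node is assumed to
reach the instance's storage pool, and a greedy placement (largest
instance first, onto the node with the most free memory) stands in for
the real allocation checks. Being a heuristic, it can reject a case
that a smarter placement would accept. All names are hypothetical:

.. code-block:: python

  def can_relocate_primaries(node_name, free_mem, instances):
      """Greedy check: do all primaries of node_name fit elsewhere?

      free_mem:  {node name: free memory}
      instances: iterable of {"primary": str, "mem": int}
      """
      remaining = {n: m for n, m in free_mem.items() if n != node_name}
      to_move = sorted(
          (i["mem"] for i in instances if i["primary"] == node_name),
          reverse=True,
      )
      for mem in to_move:
          # Node with the most free memory; no recursive N+1 check here.
          target = max(remaining, key=remaining.get, default=None)
          if target is None or remaining[target] < mem:
              return False
          remaining[target] -= mem  # update the simulated cluster state
      return True
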
However, the shared storage model has some properties that change the
rules of the computation. Speaking broadly (and ignoring hard
restrictions like tag-based exclusion and CPU limits), the exact
location of an instance in the cluster doesn't matter as long as
memory is available. This results in two changes:

- simply tracking the amount of free memory buckets is enough,
  cluster-wide
- moving an instance from one node to another would not change the N+1
  status of any node, and only allocation needs to deal with N+1
  checks

Unfortunately, this very cheap solution fails in case of any other
exclusion or prevention factors.

TODO: find a solution for N+1 checks.


Node groups support
===================

The addition of node groups has a small impact on the actual
algorithms, which will simply operate at node group level instead of
cluster level, but it requires the addition of new algorithms for
inter-node group operations.

The following definitions will be used in the paragraphs below:

local group
  The local group refers to a node's own node group, or when speaking
  about an instance, the node group of its primary node

regular cluster
  A cluster composed of a single node group, or pre-2.3 cluster

super cluster
  This term refers to a cluster which comprises multiple node groups,
  as opposed to a 2.2 and earlier cluster with a single node group

In all the below operations, it's assumed that Ganeti can gather the
entire super cluster state cheaply.


Balancing changes
-----------------

Balancing will move from cluster-level balancing to group
balancing. In order to achieve a reasonable improvement in a super
cluster, without needing to keep state of what groups have already
been balanced, the balancing algorithm will run as follows:

#. the cluster data is gathered
#. if this is a regular cluster, as opposed to a super cluster,
   balancing will proceed normally as previously
#. otherwise, compute the cluster scores for all groups
#. choose the group with the worst score and see if we can improve it;
   if not, choose the next-worst group, and so on
#. once a group has been identified, run the balancing for it

Of course, explicit selection of a group will be allowed.

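The group selection can be sketched as below. This is a Python
illustration with hypothetical names: ``can_improve`` is a placeholder
for "try balancing this group and see whether the score drops", and a
higher score is taken to mean a worse balance, consistent with score 0
being the well-balanced case:

.. code-block:: python

  def pick_group(group_scores, can_improve, requested=None):
      """Return the group to balance, or None if nothing can be improved.

      group_scores: {group name: cluster score of that group}
      can_improve:  callable(group name) -> bool (hypothetical hook)
      requested:    group explicitly selected by the user, if any
      """
      if requested is not None:
          return requested
      if len(group_scores) == 1:
          # Regular cluster: balance the only group, as before.
          return next(iter(group_scores))
      # Super cluster: worst score first, then next-worst, and so on.
      for group in sorted(group_scores, key=group_scores.get, reverse=True):
          if can_improve(group):
              return group
      return None

Because the choice is recomputed from the current scores on every run,
no record of previously balanced groups is needed.
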
Super cluster operations
++++++++++++++++++++++++

Besides the regular group balancing, in a super cluster we have more
operations.


Redistribution
^^^^^^^^^^^^^^

In a regular cluster, once we run out of resources (offline nodes
which can't be fully evacuated, N+1 failures, etc.) there is nothing
we can do unless nodes are added or instances are removed.

In a super cluster however, there might be resources available in
another group, so there is the possibility of relocating instances
between groups to re-establish N+1 success within each group.

One difficulty in the presence of both super clusters and shared
storage is that the move paths of instances are quite complicated;
basically an instance can move inside its local group, and to any
other groups which have access to the same storage type and storage
pool pair. In effect, the super cluster is composed of multiple
‘partitions’, each containing one or more groups, but a node is
simultaneously present in multiple partitions, one for each storage
type and storage pool it supports. As such, the interactions between
the individual partitions are too complex for non-trivial clusters to
assume we can compute a perfect solution: we might need to move some
instances using shared storage pool ‘A’ in order to clear some more
memory to accept an instance using local storage, which will further
clear more VCPUs in a third partition, etc. As such, we'll limit
ourselves to simple relocation steps within a single partition.

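To make the notion of partitions concrete, here is a small Python
sketch. It assumes (for the sake of the example) that the storage
support of a group has already been summarised as a set of
``(storage_type, storage_pool)`` pairs; the function names are
hypothetical:

.. code-block:: python

  from collections import defaultdict


  def partitions(group_storage):
      """Map each (storage_type, storage_pool) pair to its groups.

      group_storage: {group name: set of (storage_type, storage_pool)}
      """
      result = defaultdict(set)
      for group, pairs in group_storage.items():
          for pair in pairs:
              result[pair].add(group)
      return dict(result)


  def move_targets(group, pair, group_storage):
      """Other groups an instance stored on `pair` could be moved to."""
      return partitions(group_storage).get(pair, set()) - {group}

A group supporting several pairs shows up in several partitions, which
is exactly what makes a globally optimal solution hard to compute.
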
Algorithm:

#. read super cluster data, and exit if cluster doesn't allow
   inter-group moves
#. filter out any groups that are “alone” in their partition
   (i.e. no other group sharing at least one storage method)
#. determine list of healthy versus unhealthy groups:

    #. a group which contains offline nodes still hosting instances is
       definitely not healthy
    #. a group which has nodes failing N+1 is ‘weakly’ unhealthy

#. if either list is empty, exit (no work to do, or no way to fix problems)
#. for each unhealthy group:

    #. compute the instances that are causing the problems: all
       instances living on offline nodes, all instances living as
       secondary on N+1 failing nodes, all instances living as primaries
       on N+1 failing nodes (in this order)
    #. remove instances, one by one, until the source group is healthy
       again
    #. try to run a standard allocation procedure for each instance on
       all potential groups in its partition
    #. if all instances were relocated successfully, it means we have a
       solution for repairing the original group

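The health classification and the ordering of problem instances from
the steps above can be sketched in Python as follows. Only these two
pieces are shown; removal and re-allocation are left out. The node
layout (``offline``, ``fails_n1``, ``primaries``, ``secondaries``) is
hypothetical:

.. code-block:: python

  def group_health(nodes):
      """Classify a group as "unhealthy", "weak" or "healthy"."""
      if any(n["offline"] and (n["primaries"] or n["secondaries"])
             for n in nodes):
          return "unhealthy"  # offline node still hosting instances
      if any(n["fails_n1"] for n in nodes):
          return "weak"       # 'weakly' unhealthy: N+1 failures only
      return "healthy"


  def problem_instances(nodes):
      """Instances to relocate, in the order given in the algorithm."""
      offline, secondaries, primaries = [], [], []
      for n in nodes:
          if n["offline"]:
              offline += n["primaries"] + n["secondaries"]
          elif n["fails_n1"]:
              secondaries += n["secondaries"]
              primaries += n["primaries"]
      # Keep the first occurrence of each instance, preserving order.
      return list(dict.fromkeys(offline + secondaries + primaries))
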
Compression
^^^^^^^^^^^

In a super cluster which has had many instance reclamations, it is
possible that while none of the groups is empty, overall there is
enough empty capacity that an entire group could be removed.

The algorithm for “compressing” the super cluster is as follows:

#. read super cluster data
#. compute total *(memory, disk, cpu)*, and free *(memory, disk, cpu)*
   for the super-cluster
#. compute per-group used and free *(memory, disk, cpu)*
#. select candidate groups for evacuation:

    #. they must be connected to other groups via a common storage type
       and pool
    #. they must have fewer used resources than the global free
       resources (minus their own free resources)

#. for each of these groups, try to relocate all its instances to
   connected peer groups
#. report the list of groups that could be evacuated, or if instructed
   so, perform the evacuation of the group with the largest free
   resources (i.e. in order to reclaim the most capacity)

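The candidate selection step can be illustrated with the Python sketch
below. The data layout is hypothetical, connectivity is passed in as a
precomputed set, and requiring the condition to hold for each of the
three resources separately is our reading of "fewer used resources":

.. code-block:: python

  RESOURCES = ("memory", "disk", "cpu")


  def evacuation_candidates(groups, connected):
      """Groups whose load could, in aggregate, fit into the others.

      groups:    {name: {"used": {res: n}, "free": {res: n}}}
      connected: names of groups sharing a storage type and pool
                 with at least one other group
      """
      total_free = {
          r: sum(g["free"][r] for g in groups.values()) for r in RESOURCES
      }
      result = []
      for name, g in groups.items():
          if name not in connected:
              continue
          # Used resources must be below the free resources elsewhere.
          if all(g["used"][r] < total_free[r] - g["free"][r]
                 for r in RESOURCES):
              result.append(name)
      return result

This is only a necessary condition: the aggregate numbers say nothing
about fragmentation, which is why the algorithm still has to try the
actual relocation for each candidate.
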
Load balancing
^^^^^^^^^^^^^^

Assuming a super cluster using shared storage, where instance failover
is cheap, it should be possible to do a load-based balancing across
groups.

As opposed to the normal balancing, where we want to balance on all
node attributes, here we should look only at the load attributes; in
other words, compare the available (total) node capacity with the
(total) load generated by instances in a given group and, computing
such scores for all groups, try to see if we have any outliers.

Once a reliable load-weighting method for groups exists, we can apply
a modified version of the cluster scoring method to score not
imbalances across nodes, but imbalances across groups which result in
a super cluster load-related score.

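Purely as an illustration of the idea, and not as the load-weighting
method itself (which is still to be defined), a Python sketch could
reduce each group to a single load/capacity ratio and score the spread
of these ratios. Both the scalar load and the use of the population
standard deviation are assumptions of this example:

.. code-block:: python

  from statistics import pstdev


  def group_load_ratios(groups):
      """groups: {name: (total_load, total_capacity)} -> {name: ratio}"""
      return {
          name: load / capacity
          for name, (load, capacity) in groups.items()
          if capacity > 0
      }


  def super_cluster_load_score(groups):
      """Spread of the per-group load ratios; 0 means evenly loaded."""
      ratios = list(group_load_ratios(groups).values())
      return pstdev(ratios) if ratios else 0.0
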
Allocation changes
------------------

It is important to keep the allocation method across groups internal
(in the Ganeti/IAllocator combination), instead of delegating it to an
external party (e.g. a RAPI client). For this, the IAllocator protocol
should be extended to provide proper group support.

For htools, the new algorithm will work as follows:

#. read/receive cluster data from Ganeti
#. filter out any groups that do not support the requested storage
   method
#. for remaining groups, try allocation and compute scores after
   allocation
#. sort valid allocation solutions accordingly and return the entire
   list to Ganeti

The rationale for returning the entire group list, and not only the
best choice, is that we have the list anyway, and Ganeti might have
other criteria (e.g. the best group might be busy/locked down, etc.)
so even if from the point of view of resources it is the best choice,
it might not be the overall best one.

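A Python sketch of the group-level allocation steps above.
``try_allocate`` is a hypothetical hook standing in for the existing
single-group allocation, the data layout is invented for the example,
and "lower score is better" is assumed, in line with score 0 being the
well-balanced case:

.. code-block:: python

  def allocate_across_groups(groups, request, try_allocate):
      """Return every valid solution as (score, group, nodes), best first.

      groups:       {name: {"storage": set of (type, pool) pairs}}
      request:      {"storage_type": str, "storage_pool": str, ...}
      try_allocate: callable(group, request) -> (score, nodes) or None
      """
      wanted = (request["storage_type"], request["storage_pool"])
      solutions = []
      for name, group in groups.items():
          if wanted not in group["storage"]:
              continue  # group lacks the requested storage method
          outcome = try_allocate(name, request)
          if outcome is not None:
              score, nodes = outcome
              solutions.append((score, name, nodes))
      # The whole sorted list is returned; the caller makes the choice.
      return sorted(solutions, key=lambda s: s[0])
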
Node evacuation changes
-----------------------

While the basic concept in the ``multi-evac`` iallocator
mode remains unchanged (it's a simple local group issue), when failing
to evacuate and running in a super cluster, we could have resources
available elsewhere in the cluster for evacuation.

The algorithm for computing this will be the same as the one for super
cluster compression and redistribution, except that the list of
instances is fixed to the ones living on the nodes to be evacuated.

If the inter-group relocation is successful, the result to Ganeti will
not be a local group evacuation target, but instead (for each
instance) a pair *(remote group, nodes)*. Ganeti itself will have to
decide (based on user input) whether to continue with inter-group
evacuation or not.

In case Ganeti doesn't provide complete cluster data, but just the
local group, the inter-group relocation won't be attempted.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: