Revision bcc6f36d
b/doc/design-2.3.rst
filling all groups first, or to have their own strategy based on the
instance needs.

Internal changes
++++++++++++++++

We expect the following changes for cluster management:

[...]

hypervisor has block-migrate functionality, and we implement support for
it (this would be theoretically possible, today, with KVM, for example).
Scalability issues with big clusters
------------------------------------

Current and future issues
~~~~~~~~~~~~~~~~~~~~~~~~~

Assuming the node groups feature will enable bigger clusters, other
parts of Ganeti will be impacted even more by the (in effect) bigger
clusters.

While many areas will be impacted, one is the most important: the fact
that the watcher still needs to be able to repair instance data on the
current five-minute time-frame (a shorter time-frame would be even
better). This means that the watcher itself needs to have parallelism
when dealing with node groups.

Also, the iallocator plugins are being fed data from Ganeti but also
need access to the full cluster state, and in general we still rely on
being able to compute the full cluster state somewhat “cheaply” and
on-demand. This conflicts with the goal of disconnecting the different
node groups, and of keeping the same parallelism while growing the
cluster size.

Another issue is that the current capacity calculations are done
completely outside Ganeti (and they need access to the entire cluster
state), and this prevents keeping the capacity numbers in sync with the
cluster state. While this is still acceptable for smaller clusters,
where a small number of allocations/removals are presumed to occur
between two periodic capacity calculations, on bigger clusters, where we
aim to parallelise heavily between node groups, this is no longer true.

The main proposed change is introducing a cluster state cache (not
serialised to disk), and updating many of the LUs and cluster operations
to account for it. Furthermore, the capacity calculations will be
integrated via a new OpCode/LU, so that we have faster feedback (instead
of periodic computation).
Cluster state cache
~~~~~~~~~~~~~~~~~~~

A new cluster state cache will be introduced. The cache relies on two
main ideas:

- the total node memory and CPU count change very seldom; the total node
  disk space is also slow-changing, but can change at runtime; the free
  memory and free disk will change significantly for some jobs, but on a
  short timescale; in general, these values will be mostly “constant”
  during the lifetime of a job
- we already have a periodic set of jobs that query the node and
  instance state, driven by the :command:`ganeti-watcher` command, and
  we're just discarding the results after acting on them

Given the above, it makes sense to cache inside the master daemon the
results of node and instance state queries (with a focus on the node
state).

The cache will not be serialised to disk, and will be for the most part
transparent to the outside of the master daemon.
Cache structure
+++++++++++++++

The cache will be organised with a focus on node groups, so that it will
be easy to invalidate an entire node group, a subset of nodes, or the
entire cache. The instances will be stored in the node group of their
primary node.

Furthermore, since the node and instance properties determine the
capacity statistics in a deterministic way, the cache will also hold, at
each node group level, the total capacity as determined by the new
capacity iallocator mode.
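As an illustration only (the class and attribute names below are
invented for this sketch, not taken from the actual code base), a
group-oriented cache supporting these three invalidation granularities
could look like::

  import time

  class GroupCache(object):
    """Cached state for a single node group."""
    def __init__(self):
      self.nodes = {}      # node name -> full node state
      self.instances = {}  # instances whose primary node is in this group
      self.capacity = None # tiered-spec data, or None when invalidated
      self.mtime = time.time()

  class ClusterStateCache(object):
    """In-memory cache; never serialised to disk."""
    def __init__(self):
      self._groups = {}  # node group UUID -> GroupCache

    def invalidate_group(self, group_uuid):
      self._groups.pop(group_uuid, None)

    def invalidate_nodes(self, group_uuid, node_names):
      group = self._groups.get(group_uuid)
      if group is None:
        return
      for name in node_names:
        group.nodes.pop(name, None)
      group.capacity = None  # node properties feed the capacity numbers

    def invalidate_all(self):
      self._groups.clear()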
Cache updates
+++++++++++++

The cache will be updated whenever a query for a node state returns
“full” node information (so as to keep the cache state for a given node
consistent). Partial results will not update the cache (see the next
paragraph).

Since there will be no way to feed the cache from outside, and we would
like to have a consistent cache view when driven by the watcher, we'll
introduce a new OpCode/LU for the watcher to run, instead of the current
separate opcodes (see below in the watcher section).

Updates that change a node's specs “downward” (e.g. less memory) will
invalidate the capacity data. Updates that increase the node's specs
will not invalidate the capacity, as we're more interested in “at least
available” correctness, not “at most available”.
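A sketch of this update rule, continuing the illustrative naming from
above (the state dictionary keys are invented for the example): a full
node result refreshes the cache, and only “downward” changes drop the
capacity data::

  def update_node(group, name, new_state, full=True):
    """Apply a node query result to a GroupCache."""
    if not full:
      # Partial results never update the cache; a mismatch instead
      # invalidates the node state (see “Cache invalidations”).
      return
    old = group.nodes.get(name)
    if old is not None:
      shrunk = (new_state["total_memory"] < old["total_memory"] or
                new_state["total_disk"] < old["total_disk"] or
                new_state["total_cpus"] < old["total_cpus"])
      if shrunk:
        # “at least available” correctness: only downward spec
        # changes invalidate the capacity data
        group.capacity = None
    group.nodes[name] = new_state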
Cache invalidations
+++++++++++++++++++

If a partial node query is done (e.g. just for the node free space), and
the returned values don't match the cache, then the entire node state
will be invalidated.

By default, all LUs will invalidate the caches for all nodes and
instances they lock. If an LU uses the BGL, then it will invalidate the
entire cache. In time, it is expected that LUs will be modified not to
invalidate the cache if they are not expected to change the node's
and/or instance's state (e.g. ``LUConnectConsole`` or
``LUActivateInstanceDisks``).

Invalidation of a node's properties will also invalidate the capacity
data associated with that node.
Cache lifetime
++++++++++++++

The cache elements will have an upper bound on their lifetime; the
proposal is to make this an hour, which should be a high enough value to
cover the watcher being blocked by a medium-term job (e.g. 20-30
minutes).

Cache usage
+++++++++++
The cache will be used by default for most queries (e.g. a Luxi call,
without locks, for the entire cluster). Since this will be a change from
the current behaviour, we'll need to allow non-cached responses,
e.g. via a ``--cache=off`` or similar argument (which will force the
query).

The cache will also be used for the iallocator runs, so that computing
an allocation solution can proceed independently from other jobs which
lock parts of the cluster. This is important, as we need to separate
allocation on one group from exclusive blocking jobs on other node
groups.

The capacity calculations will also use the cache; this is detailed in
the respective sections.
Watcher operation
~~~~~~~~~~~~~~~~~

As detailed in the cluster cache section, the watcher also needs
improvements in order to scale with the cluster size.

As a first improvement, the proposal is to introduce a new OpCode/LU
pair that runs with locks held over the entire query sequence (the
current watcher runs a job with two opcodes, which grab and release the
locks individually). The new opcode will be called
``OpUpdateNodeGroupCache`` and will do the following (a sketch follows
the list):

- try to acquire all node/instance locks (to examine in more depth, and
  possibly alter) in the given node group
- invalidate the cache for the node group
- acquire node and instance state (possibly via a new single RPC call
  that combines node and instance information)
- update the cache
- return the needed data
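A minimal sketch of that flow, with the locking and RPC layers passed in
as stand-ins (``cache.update_group`` and the two callables are assumed
for this example, not taken from the actual code)::

  def update_node_group_cache(cache, group_uuid, acquire_locks, query_state):
    """Illustrative flow of the proposed OpUpdateNodeGroupCache LU."""
    locks = acquire_locks(group_uuid)  # all node/instance locks in the group
    try:
      cache.invalidate_group(group_uuid)
      # possibly a single combined RPC for node and instance information
      node_state, inst_state = query_state(group_uuid)
      cache.update_group(group_uuid, node_state, inst_state)
      return node_state, inst_state    # the data the caller needs
    finally:
      locks.release()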
The reason for the per-node group query is that we don't want a busy
node group to prevent instance maintenance in other node
groups. Therefore, the watcher will introduce parallelism across node
groups, and it will be possible to have overlapping watcher runs. The
new execution sequence will be:

- the parent watcher process acquires the global watcher lock
- query the list of node groups (lockless or very short locks only)
- fork N children, one for each node group
- release the global lock
- poll/wait for the children to finish
Each forked child will do the following (a sketch of both the parent and
child sequences is given after this list):

- try to acquire the per-node group watcher lock
- if it fails to acquire the lock, exit with a special code telling the
  parent that the node group is already being managed by a watcher
  process
- otherwise, submit an ``OpUpdateNodeGroupCache`` job
- get the results (possibly after a long time, due to a busy group)
- run the needed maintenance operations for the current group
This new mode of execution means that the master watcher processes might |
|
326 |
overlap in running, but not the individual per-node group child |
|
327 |
processes. |
|
328 |
|
|
329 |
This change allows us to keep (almost) the same parallelism when using a |
|
330 |
bigger cluster with node groups versus two separate clusters. |
|
331 |
|
|
332 |
|
|
Cost of periodic cache updating
+++++++++++++++++++++++++++++++

Currently the watcher only does “small” queries for the node and
instance state, and at first sight changing it to use the new OpCode,
which populates the cache with the entire state, might introduce
additional costs, which must be paid every five minutes.

However, the OpCodes that the watcher submits use the so-called dynamic
fields (which need to contact the remote nodes), and the LUs are not
selective; they always grab all the node and instance state. So in the
end, we have the same cost; it just becomes explicit rather than
implicit.

This “grab all node state” behaviour is what makes the cache worth
implementing.
Intra-node group scalability
++++++++++++++++++++++++++++

The design above only deals with inter-node group issues. It still makes
sense to run instance maintenance for nodes A and B if only node C is
locked (all being in the same node group).

This problem was commonly encountered in previous Ganeti versions, and
it should be handled similarly, by tweaking lock lifetime in
long-duration jobs.

TODO: add more ideas here.
State file maintenance
++++++++++++++++++++++

The splitting of node group maintenance into different children which
will run in parallel requires that the state file handling changes from
monolithic updates to partial ones.

There are two files that the watcher maintains:

- ``$LOCALSTATEDIR/lib/ganeti/watcher.data``, its internal state file,
  used for deciding internal actions
- ``$LOCALSTATEDIR/run/ganeti/instance-status``, a file designed for
  external consumption

For the first file, since it's used only internally by the watchers, we
can move to a per-node group configuration.

For the second file, even if it's used as an external interface, we will
need to make some changes to it: because the different node groups can
return results at different times, we need to either split the file into
per-group files or keep the single file and add a per-instance timestamp
(currently the file holds only the instance name and state).

The proposal is that each child process maintains its own node group
file, and the master process will, right after querying the node group
list, delete any extra per-node group state files. This leaves the
consumers to run a simple ``cat instance-status.group-*`` to obtain the
entire list of instances and their states. If needed, the modification
timestamp of each file can be used to determine the age of the results.
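A sketch of the per-group file handling (the helper names are invented;
the file naming follows the ``instance-status.group-*`` convention from
above). Writing to a temporary file and renaming it keeps each per-group
file complete at all times, so the ``cat`` aggregation never sees a
half-written file::

  import os
  import tempfile

  def write_group_status(directory, group_uuid, statuses):
    """Child process: atomically replace this group's status file."""
    path = os.path.join(directory, "instance-status.group-%s" % group_uuid)
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as out:
      for name, state in sorted(statuses.items()):
        out.write("%s %s\n" % (name, state))
    os.chmod(tmp, 0o644)  # mkstemp creates the file private by default
    os.rename(tmp, path)  # atomic on POSIX filesystems

  def prune_stale_files(directory, current_uuids):
    """Master process: delete files of node groups that no longer exist."""
    prefix = "instance-status.group-"
    for entry in os.listdir(directory):
      if entry.startswith(prefix) and entry[len(prefix):] not in current_uuids:
        os.remove(os.path.join(directory, entry))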
Capacity calculations
~~~~~~~~~~~~~~~~~~~~~

Currently, the capacity calculations are done completely outside
Ganeti. As explained in the current problems section, this needs to
account better for cluster state changes.

Therefore a new OpCode will be introduced, ``OpComputeCapacity``, that
will either return the current capacity numbers (if available), or
trigger a new capacity calculation via the iallocator framework, which
will get a new method called ``capacity``.

This method will feed the cluster state (for the complete set of node
groups, or alternatively just a subset) to the iallocator plugin (either
the specified one, or the default if none is specified), and return the
new capacity in the format currently exported by the htools suite and
known as the “tiered specs” (see :manpage:`hspace(1)`).
tspec cluster parameters
++++++++++++++++++++++++

Currently, the “tspec” calculations done in :command:`hspace` require
some additional parameters:

- maximum instance size
- type of instance storage
- maximum ratio of virtual CPUs per physical CPU
- minimum disk free

For the integration in Ganeti, there are multiple ways to pass these:

- ignored by Ganeti, leaving it the responsibility of the iallocator
  plugin whether to use these at all or not
- as input to the opcode
- as proper cluster parameters

Since the first option is not consistent with the intended changes, a
combination of the last two is proposed:

- at the cluster level, we'll have cluster-wide defaults
- at the node group level, we'll allow overriding the cluster defaults
- and if they are passed in via the opcode, they will override the
  values for the current computation
Whenever the capacity is requested via different parameters, it will
invalidate the cache, even if otherwise the cache is up-to-date.

The new parameters are:

- ``max_inst_spec``: (int, int, int), the maximum instance specification
  accepted by this cluster or node group, in the order of memory, disk,
  vcpus
- ``default_template``: string, the default disk template to use
- ``max_cpu_ratio``: double, the maximum ratio of VCPUs/PCPUs
- ``max_disk_usage``: double, the maximum disk usage (as a ratio)

These might also be used in instance creations (to be determined later,
after they are introduced).
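Since the same parameters can thus be set at three levels, the effective
set for one computation is a simple three-way override. A sketch, with
illustrative values::

  def effective_capacity_params(cluster_defaults, group_overrides,
                                opcode_params):
    """Resolve the tspec parameters for one node group.

    Cluster-wide defaults are overridden by node group settings, which
    are in turn overridden by values passed in via the opcode.
    """
    params = dict(cluster_defaults)
    params.update(group_overrides)
    params.update(opcode_params)
    return params

  cluster_defaults = {"max_inst_spec": (4096, 51200, 4),
                      "default_template": "drbd",
                      "max_cpu_ratio": 4.0,
                      "max_disk_usage": 0.9}
  print(effective_capacity_params(cluster_defaults,
                                  {"max_cpu_ratio": 2.0},  # group override
                                  {}))                     # nothing via opcode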
OpCode details
++++++++++++++

Input:

- iallocator: string (optional, otherwise uses the cluster default)
- cached: boolean, optional, defaults to true, and denotes whether we
  accept cached responses
- the above new parameters, optional; if they are passed, they will
  override all node groups' parameters

Output:

- cluster: list of tuples (memory, disk, vcpu, count), in decreasing
  order of specifications; the first three members represent the
  instance specification, the last one the count of how many instances
  of this specification can be created on the cluster
- node_groups: a dictionary keyed by node group UUID, with values a
  dictionary:

  - tspecs: a list like the cluster one
  - additionally, the new cluster parameters, denoting the input
    parameters that were used for this node group

- ctime: the date the result has been computed; this represents the
  oldest creation time amongst all node groups (so as to accurately
  represent how out-of-date the global response is)

Note that due to the way the tspecs are computed, for any given
specification, the total available count is the count for the given
entry, plus the sum of counts for higher specifications (see the example
below).

Also note that the node group information is provided for information
only, not for allocation decisions.
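To make the count semantics concrete, here is an invented result in the
shape described above, together with the “count for this entry plus all
higher entries” rule::

  # All numbers are invented for the example.
  result = {
    "cluster": [(8192, 102400, 8, 3),   # (memory, disk, vcpus, count)
                (4096, 51200, 4, 10),
                (1024, 10240, 1, 40)],
    "node_groups": {
      "uuid-1": {"tspecs": [(4096, 51200, 4, 5)],
                 "max_cpu_ratio": 4.0},  # input parameters echoed back
    },
    "ctime": 1300000000,  # oldest computation time amongst the groups
  }

  def available_count(tspecs, index):
    """Instances of at least spec ``index``: its own count plus the
    counts of all higher (earlier) specifications."""
    return sum(count for _, _, _, count in tspecs[:index + 1])

  print(available_count(result["cluster"], 1))  # 3 + 10 = 13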
Job priorities
--------------