=================
Ganeti 2.3 design
=================

This document describes the major changes in Ganeti 2.3 compared to
the 2.2 version.

.. contents:: :depth: 4

As for 2.1 and 2.2 we divide the 2.3 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered part of the
same pool for allocation purposes: DRBD instances, for example, can be
allocated on any two nodes.

This causes a problem in cases where nodes are not all equally
connected to each other. For example, if a cluster is created over two
sets of machines, each connected to its own switch, the internal bandwidth
between machines connected to the same switch might be bigger than the
bandwidth for inter-switch connections.

Moreover, some operations inside a cluster require all nodes to be locked
together for inter-node consistency, and won't scale if we increase the
number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group. Bigger clusters will be
able to have more than one group, and each node will belong to exactly
one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following new
commands and flags will be introduced::

  gnt-group add <group>                 # add a new node group
  gnt-group remove <group>              # delete an empty node group
  gnt-group list                        # list node groups
  gnt-group rename <oldname> <newname>  # rename a node group
  gnt-node {list,info} -g <group>       # list only nodes belonging to a node group
  gnt-node modify -g <group>            # assign a node to a node group

Node group attributes
+++++++++++++++++++++

In clusters with more than one node group, it may be desirable to
establish local policies regarding which groups should be preferred when
performing allocation of new instances, or inter-group instance migrations.

To help with this, we will provide an ``alloc_policy`` attribute for
node groups. This attribute will be honored by iallocator plugins when
making automatic decisions regarding instance placement; a sketch of how
a plugin might apply it follows the value list below.

The ``alloc_policy`` attribute can have the following values:

- unallocable: the node group should not be a candidate for instance
  allocations, and the operation should fail if only groups in this
  state could be found that would satisfy the requirements.

- last_resort: the node group should not be used for instance
  allocations, unless this would be the only way to have the operation
  succeed. Prioritization among groups in this state will be deferred to
  the iallocator plugin that's being used.

- preferred: the node group can be used freely for allocation of
  instances (this is the default state for newly created node
  groups). Note that prioritization among groups in this state will be
  deferred to the iallocator plugin that's being used.

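As an illustration, the following minimal sketch shows how an iallocator
plugin could partition candidate groups by ``alloc_policy``; the helper
name and the dict-based group representation are illustrative
assumptions, not the actual iallocator data format::

  # Sketch only: partition candidate groups by their ``alloc_policy``.
  # The dict-based group representation below is a hypothetical stand-in
  # for whatever the plugin actually receives.
  ALLOC_POLICY_PREFERRED = "preferred"
  ALLOC_POLICY_LAST_RESORT = "last_resort"
  ALLOC_POLICY_UNALLOCABLE = "unallocable"

  def SelectCandidateGroups(groups):
    """Return the node groups an allocation should consider.

    ``groups`` is assumed to be a list of dicts with at least the keys
    "name" and "alloc_policy".

    """
    preferred = [g for g in groups
                 if g["alloc_policy"] == ALLOC_POLICY_PREFERRED]
    if preferred:
      return preferred
    # Only fall back to last_resort groups if no preferred group exists
    last_resort = [g for g in groups
                   if g["alloc_policy"] == ALLOC_POLICY_LAST_RESORT]
    if last_resort:
      return last_resort
    # Nothing but unallocable groups left: the request must fail
    raise ValueError("no node group accepts new allocations")
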
Node group operations
+++++++++++++++++++++

One operation at the node group level will be initially provided::

  gnt-group drain <group>

The purpose of this operation is to migrate all instances in a given
node group to other groups in the cluster, e.g. to reclaim capacity if
there are enough free resources in other node groups that share a
storage pool with the evacuated group.

Instance level changes
++++++++++++++++++++++

With the introduction of node groups, instances will be required to live
in only one group at a time; this is mostly important for DRBD
instances, which will not be allowed to have their primary and secondary
nodes in different node groups. To support this, we envision the
following changes (a sketch of the cluster verify check appears after
the list):

- The iallocator interface will be augmented, and node groups exposed,
  so that plugins will be able to make a decision regarding the group
  in which to place a new instance. By default, all node groups will
  be considered, but it will be possible to include a list of groups
  in the creation job, in which case the plugin will limit itself to
  considering those; in both cases, the ``alloc_policy`` attribute
  will be honored.
- If, on the other hand, primary and secondary nodes are specified
  for a new instance, they will be required to be in the same node
  group.
- Moving an instance between groups can only happen via an explicit
  operation, which for example in the case of DRBD will work by
  performing internally a replace-disks, a migration, and a second
  replace-disks. It will be possible to clean up an interrupted
  group-move operation.
- Cluster verify will signal an error if an instance has nodes
  belonging to different groups. Additionally, changing the group of a
  given node will initially only be allowed if the node is empty, as a
  straightforward mechanism to avoid creating such a situation.
- Inter-group instance migration will have the same operation modes as
  new instance allocation, defined above: letting an iallocator plugin
  decide the target group, possibly restricting the set of node groups
  to consider, or specifying target primary and secondary nodes. In
  both cases, the target group or nodes must be able to accept the
  instance network- and storage-wise; the operation will fail
  otherwise, though in the future we may be able to allow some
  parameters to be changed together with the move (in the meantime, an
  import/export will be required in this scenario).

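To make the cluster verify check from the list above concrete, a minimal
sketch follows; the plain dicts used as inputs are illustrative
assumptions, not the real configuration objects::

  # Sketch only: flag instances whose nodes span more than one node group.
  # ``node_to_group`` maps node name -> group UUID; ``instance_nodes`` maps
  # instance name -> list of its nodes (primary first). Both are assumed
  # inputs, not the actual config objects.
  def FindSplitInstances(node_to_group, instance_nodes):
    errors = []
    for instance, nodes in sorted(instance_nodes.items()):
      groups = set(node_to_group[node] for node in nodes)
      if len(groups) > 1:
        errors.append("instance %s has nodes in %d groups: %s" %
                      (instance, len(groups), ", ".join(sorted(groups))))
    return errors
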
Internal changes
++++++++++++++++

We expect the following changes for cluster management:

- Frequent multinode operations, such as os-diagnose or cluster-verify,
  will act on one group at a time, which will have to be specified in
  all cases, except for clusters with just one group. Command line
  tools will also have a way to easily target all groups, by
  generating one job per group.
- Groups will have a human-readable name, but will internally always
  be referenced by a UUID, which will be immutable; for example, nodes
  will contain the UUID of the group they belong to. This is done
  to simplify referencing while keeping it easy to handle renames and
  movements. If we see that this works well, we'll transition other
  config objects (instances, nodes) to the same model.
- The addition of a new per-group lock will be evaluated, to see whether
  we can transition some operations now requiring the BGL to it.
- Master candidate status will be allowed to be spread among groups.
  For the first version we won't add any restriction on how this is
  done, although in the future we may have a minimum number of master
  candidates which Ganeti will try to keep in each group, for example.

Other work and future changes
+++++++++++++++++++++++++++++

Commands like ``gnt-cluster command``/``gnt-cluster copyfile`` will
continue to work on the whole cluster, but it will be possible to target
one group only by specifying it.

Commands which allow selection of sets of resources (for example
``gnt-instance start``/``gnt-instance stop``) will be able to select
them by node group as well.

Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future version
should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration diffusion,
to allow better master scalability. For example it could be possible
to change some all-nodes RPCs to contact each group once, from the
master, and make one node in the group perform internal diffusion. We
won't implement this in the first version, but we'll evaluate it for the
future, if we see scalability problems on big multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, Sheepdog, Ceph)
we expect groups to be the basis for this, allowing for example a
different Sheepdog/Ceph cluster, or a different SAN, to be connected to
each group. In some cases this will mean that inter-group move operations
will necessarily be performed with instance downtime, unless the
hypervisor has block-migrate functionality, and we implement support for
it (this would be theoretically possible, today, with KVM, for example).

Scalability issues with big clusters
------------------------------------

Current and future issues
~~~~~~~~~~~~~~~~~~~~~~~~~

Assuming the node groups feature will enable bigger clusters, other
parts of Ganeti will be impacted even more by the (in effect) bigger
clusters.

While many areas will be impacted, one is the most important: the fact
that the watcher still needs to be able to repair instance data on the
current five-minute time-frame (a shorter time-frame would be even
better). This means that the watcher itself needs to have parallelism
when dealing with node groups.

Also, the iallocator plugins are being fed data from Ganeti but also
need access to the full cluster state, and in general we still rely on
being able to compute the full cluster state somewhat “cheaply” and
on-demand. This conflicts with the goal of disconnecting the different
node groups, and with keeping the same parallelism while growing the
cluster size.

Another issue is that the current capacity calculations are done
completely outside Ganeti (and they need access to the entire cluster
state), and this prevents keeping the capacity numbers in sync with the
cluster state. While this is still acceptable for smaller clusters where
a small number of allocations/removals are presumed to occur between two
periodic capacity calculations, on bigger clusters where we aim to
parallelize heavily between node groups this is no longer true.

The main proposed change is introducing a cluster state cache (not
serialised to disk), and updating many of the LUs and cluster operations
to account for it. Furthermore, the capacity calculations will be
integrated via a new OpCode/LU, so that we have faster feedback (instead
of periodic computation).

Cluster state cache
~~~~~~~~~~~~~~~~~~~

A new cluster state cache will be introduced. The cache relies on two
main ideas:

- the total node memory and CPU count are very seldom changing; the total
  node disk space is also slow changing, but can change at runtime; the
  free memory and free disk will change significantly for some jobs, but
  on a short timescale; in general, these values will be mostly “constant”
  during the lifetime of a job
- we already have a periodic set of jobs that query the node and
  instance state, driven by the :command:`ganeti-watcher` command, and
  we're just discarding the results after acting on them

Given the above, it makes sense to cache the results of node and instance
state (with a focus on the node state) inside the master daemon.

The cache will not be serialised to disk, and will be for the most part
transparent to the outside of the master daemon.

Cache structure
+++++++++++++++

The cache will be oriented with a focus on node groups, so that it will
be easy to invalidate an entire node group, or a subset of nodes, or the
entire cache. The instances will be stored in the node group of their
primary node.

Furthermore, since the node and instance properties determine the
capacity statistics in a deterministic way, the cache will also hold, at
each node group level, the total capacity as determined by the new
capacity iallocator mode.

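A minimal sketch of what such a group-oriented cache could look like
inside the master daemon is shown below; the class and attribute names
are illustrative assumptions, not the final implementation::

  import time

  class GroupCacheEntry(object):
    """Cached state for one node group (illustrative sketch only)."""

    def __init__(self):
      self.node_state = {}      # node name -> full node state dict
      self.instance_state = {}  # instances whose primary node is in the group
      self.capacity = None      # tiered-spec result for the group, or None
      self.timestamp = None     # when the data was last refreshed

    def Update(self, node_state, instance_state):
      self.node_state = node_state
      self.instance_state = instance_state
      self.timestamp = time.time()

    def Invalidate(self):
      # Dropping the node data also drops the derived capacity numbers
      self.node_state = {}
      self.instance_state = {}
      self.capacity = None
      self.timestamp = None

  # The cache itself would simply be a dict keyed by node group UUID:
  #   cache = {"uuid-of-group-1": GroupCacheEntry(), ...}
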
Cache updates
+++++++++++++

The cache will be updated whenever a query for a node state returns
“full” node information (so as to keep the cache state for a given node
consistent). Partial results will not update the cache (see next
paragraph).

Since there will be no way to feed the cache from outside, and we
would like to have a consistent cache view when driven by the watcher,
we'll introduce a new OpCode/LU for the watcher to run, instead of the
current separate opcodes (see below in the watcher section).

Updates to a node that change a node's specs “downward” (e.g. less
memory) will invalidate the capacity data. Updates that increase the
node will not invalidate the capacity, as we're more interested in “at
least available” correctness, not “at most available”.

Cache invalidation
++++++++++++++++++

If a partial node query is done (e.g. just for the node free space), and
the returned values don't match the cache, then the entire node
state will be invalidated.

By default, all LUs will invalidate the caches for all nodes and
instances they lock. If an LU uses the BGL, then it will invalidate the
entire cache. In time, it is expected that LUs will be modified to not
invalidate, if they are not expected to change the node's and/or
instance's state (e.g. ``LUInstanceConsole``, or
``LUInstanceActivateDisks``).

Invalidation of a node's properties will also invalidate the capacity
data associated with that node.

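As a small illustration of these rules, a consumer of partial node
results could apply them roughly as follows (a sketch built on the
``GroupCacheEntry`` structure sketched earlier; the function name is
hypothetical)::

  def ApplyPartialNodeResult(entry, node, partial):
    """Check a partial node query against the cached full state.

    ``entry`` is a GroupCacheEntry as sketched above, ``partial`` a dict
    holding only the fields that were queried (e.g. free disk space).

    """
    cached = entry.node_state.get(node)
    if cached is None:
      return
    for key, value in partial.items():
      if cached.get(key) != value:
        # Any mismatch invalidates the whole node state and, with it,
        # the capacity data derived from that node
        del entry.node_state[node]
        entry.capacity = None
        break
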
Cache lifetime
++++++++++++++

The cache elements will have an upper bound on their lifetime; the
proposal is to make this an hour, which should be a high enough value to
cover the watcher being blocked by a medium-term job (e.g. 20-30
minutes).

Cache usage
+++++++++++

The cache will be used by default for most queries (e.g. a Luxi call,
without locks, for the entire cluster). Since this will be a change from
the current behaviour, we'll need to allow non-cached responses,
e.g. via a ``--cache=off`` or similar argument (which will force the
query).

The cache will also be used for the iallocator runs, so that computing
an allocation solution can proceed independently of other jobs which lock
parts of the cluster. This is important as we need to separate
allocation on one group from exclusive blocking jobs on other node
groups.

The capacity calculations will also use the cache. This is detailed in
the respective sections.

Watcher operation
~~~~~~~~~~~~~~~~~

As detailed in the cluster cache section, the watcher also needs
improvements in order to scale with the cluster size.

As a first improvement, the proposal is to introduce a new OpCode/LU
pair that runs with locks held over the entire query sequence (the
current watcher runs a job with two opcodes, which grab and release the
locks individually). The new opcode will be called
``OpUpdateNodeGroupCache`` and will do the following:

- try to acquire all node/instance locks (to examine in more depth, and
  possibly alter) in the given node group
- invalidate the cache for the node group
- acquire node and instance state (possibly via a new single RPC call
  that combines node and instance information)
- update cache
- return the needed data

The reason for the per-node group query is that we don't want a busy
node group to prevent instance maintenance in other node
groups. Therefore, the watcher will introduce parallelism across node
groups, and it will be possible to have overlapping watcher runs. The new
execution sequence will be:

- the parent watcher process acquires the global watcher lock
- query the list of node groups (lockless or very short locks only)
- fork N children, one for each node group
- release the global lock
- poll/wait for the children to finish

Each forked child will do the following (a sketch of this parent/child
flow is shown after the list):

- try to acquire the per-node group watcher lock
- if it fails to acquire it, exit with a special code telling the parent
  that the node group is already being managed by a watcher process
- otherwise, submit an OpUpdateNodeGroupCache job
- get results (possibly after a long time, due to a busy group)
- run the needed maintenance operations for the current group

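A rough sketch of this parent/child structure is given below; the helper
functions, the lock objects and the exit code are hypothetical
placeholders for illustration, not the actual watcher code::

  import os
  import sys

  EXIT_NODEGROUP_BUSY = 71  # hypothetical exit code: group already managed

  def _ChildMain(group_uuid):
    """Body of one per-node-group child (illustrative sketch)."""
    lock = TryAcquireGroupWatcherLock(group_uuid)  # hypothetical helper
    if lock is None:
      sys.exit(EXIT_NODEGROUP_BUSY)
    try:
      job_id = SubmitOpUpdateNodeGroupCache(group_uuid)  # hypothetical
      state = WaitForJobResults(job_id)       # may block on a busy group
      RunGroupMaintenance(group_uuid, state)  # hypothetical helper
    finally:
      lock.Release()

  def RunWatcher():
    """Parent process: one child per node group."""
    global_lock = AcquireGlobalWatcherLock()  # hypothetical helper
    groups = QueryNodeGroups()                # lockless or short locks only
    children = []
    for group_uuid in groups:
      pid = os.fork()
      if pid == 0:
        _ChildMain(group_uuid)
        os._exit(0)             # children never fall through to the loop
      children.append(pid)
    global_lock.Release()       # released before waiting on the children
    for pid in children:
      os.waitpid(pid, 0)
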
This new mode of execution means that the master watcher processes might
overlap in running, but not the individual per-node group child
processes.

This change allows us to keep (almost) the same parallelism when using a
bigger cluster with node groups versus two separate clusters.

Cost of periodic cache updating
+++++++++++++++++++++++++++++++

Currently the watcher only does “small” queries for the node and
instance state, and at first sight changing it to use the new OpCode
which populates the cache with the entire state might introduce
additional costs, which must be paid every five minutes.

However, the OpCodes that the watcher submits are using the so-called
dynamic fields (which need to contact the remote nodes), and the LUs are
not selective—they always grab all the node and instance state. So in the
end, we have the same cost, it just becomes explicit rather than
implicit.

This ‘grab all node state’ behaviour is what makes the cache worth
implementing.

Intra-node group scalability
++++++++++++++++++++++++++++

The design above only deals with inter-node group issues. It still makes
sense to run instance maintenance for nodes A and B if only node C is
locked (all being in the same node group).

This problem is commonly encountered in previous Ganeti versions, and it
should be handled similarly, by tweaking lock lifetime in long-duration
jobs.

TODO: add more ideas here.

State file maintenance
++++++++++++++++++++++

The splitting of node group maintenance into different children which will
run in parallel requires that the state file handling changes from
monolithic updates to partial ones.

There are two files that the watcher maintains:

- ``$LOCALSTATEDIR/lib/ganeti/watcher.data``, its internal state file,
  used for deciding internal actions
- ``$LOCALSTATEDIR/run/ganeti/instance-status``, a file designed for
  external consumption

For the first file, since it's used only internally by the watchers, we
can move to a per-node group configuration.

For the second file, even if it's used as an external interface, we will
need to make some changes to it: because the different node groups can
return results at different times, we need to either split the file into
per-group files or keep the single file and add a per-instance timestamp
(currently the file holds only the instance name and state).

The proposal is that each child process maintains its own node group
file, and the master process will, right after querying the node group
list, delete any extra per-node group state files. This leaves the
consumers to run a simple ``cat instance-status.group-*`` to obtain the
entire list of instances and their states. If needed, the modification
timestamp of each file can be used to determine the age of the results.

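A consumer of the per-group files could merge them along these lines (a
sketch only; the directory and the file-name pattern follow the proposal
above, the line format is an assumption)::

  import glob
  import os

  def ReadInstanceStatus(rundir="/var/run/ganeti"):
    """Merge the per-group instance-status files into one view.

    Returns a dict mapping instance name to (state, file_mtime); the
    mtime can be used to judge how old the information is.

    """
    status = {}
    for path in glob.glob(os.path.join(rundir, "instance-status.group-*")):
      mtime = os.path.getmtime(path)
      for line in open(path).read().splitlines():
        # Each line is assumed to hold "<instance name> <state>"
        name, state = line.split(None, 1)
        status[name] = (state.strip(), mtime)
    return status
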
Capacity calculations
~~~~~~~~~~~~~~~~~~~~~

Currently, the capacity calculations are done completely outside
Ganeti. As explained in the current problems section, this needs to
account better for the cluster state changes.

Therefore a new OpCode will be introduced, ``OpComputeCapacity``, that
will either return the current capacity numbers (if available), or
trigger a new capacity calculation, via the iallocator framework, which
will get a new method called ``capacity``.

This method will feed the cluster state (for the complete set of node
groups, or alternatively just a subset) to the iallocator plugin (either
the specified one, or the default if none is specified), and return the
new capacity in the format currently exported by the htools suite and
known as the “tiered specs” (see :manpage:`hspace(1)`).

tspec cluster parameters
++++++++++++++++++++++++

Currently, the “tspec” calculations done in :command:`hspace` require
some additional parameters:

- maximum instance size
- type of instance storage
- maximum ratio of virtual CPUs per physical CPU
- minimum disk free

For the integration in Ganeti, there are multiple ways to pass these:

- ignored by Ganeti, leaving it to the iallocator plugin whether to use
  them at all or not
- as input to the opcode
- as proper cluster parameters

Since the first option is not consistent with the intended changes, a
combination of the last two is proposed:

- at cluster level, we'll have cluster-wide defaults
- at node group level, we'll allow overriding the cluster defaults
- and if they are passed in via the opcode, they will override the
  values for the current computation

Whenever the capacity is requested via different parameters, it will
invalidate the cache, even if otherwise the cache is up-to-date.

The new parameters are:

- max_inst_spec: (int, int, int), the maximum instance specification
  accepted by this cluster or node group, in the order of memory, disk,
  vcpus;
- default_template: string, the default disk template to use
- max_cpu_ratio: double, the maximum ratio of VCPUs/PCPUs
- max_disk_usage: double, the maximum disk usage (as a ratio)

These might also be used in instance creations (to be determined later,
after they are introduced).

OpCode details
++++++++++++++

Input:

- iallocator: string (optional, otherwise uses the cluster default)
- cached: boolean, optional, defaults to true, and denotes whether we
  accept cached responses
- the above new parameters, optional; if they are passed, they will
  overwrite all node groups' parameters

Output:

- cluster: list of tuples (memory, disk, vcpu, count), in decreasing
  order of specifications; the first three members represent the
  instance specification, the last one the count of how many instances
  of this specification can be created on the cluster
- node_groups: a dictionary keyed by node group UUID, with values a
  dictionary:

  - tspecs: a list like the cluster one
  - additionally, the new cluster parameters, denoting the input
    parameters that were used for this node group

- ctime: the date the result has been computed; this represents the
  oldest creation time amongst all node groups (so as to accurately
  represent how much out-of-date the global response is)

Note that due to the way the tspecs are computed, for any given
specification, the total available count is the count for the given
entry, plus the sum of counts for higher specifications.

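To make the output format and the counting rule concrete, here is an
illustrative result with made-up numbers; following the note above, the
total number of (1024, 10240, 1)-sized instances that can still be
created on this cluster is 20 + 20 + 10 = 50, i.e. the entry's own count
plus the counts of all higher specifications::

  # Illustrative example of an OpComputeCapacity result (made-up numbers)
  result = {
    "cluster": [
      # (memory, disk, vcpus, count), in decreasing order of specifications
      (4096, 51200, 4, 10),
      (2048, 20480, 2, 20),
      (1024, 10240, 1, 20),
    ],
    "node_groups": {
      "uuid-of-group-1": {
        "tspecs": [(4096, 51200, 4, 4), (1024, 10240, 1, 12)],
        # the parameters used for this node group's computation
        "max_inst_spec": (4096, 51200, 4),
        "default_template": "drbd",
        "max_cpu_ratio": 4.0,
        "max_disk_usage": 0.9,
      },
    },
    "ctime": 1288888888.0,  # oldest per-group computation time (illustrative)
  }
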
Node flags
----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes are, from the point of view of their capabilities,
homogeneous. This means the cluster considers all nodes capable of
becoming master candidates, and of hosting instances.

This prevents some deployment scenarios: e.g. having a Ganeti instance
(in another cluster) be just a master candidate, in case all other
master candidates go down (but not, of course, host instances), or
having a node in a remote location just host instances but not become
master, etc.

Proposed changes
~~~~~~~~~~~~~~~~

Two new capability flags will be added to the node:

- master_capable, denoting whether the node can become a master
  candidate or master
- vm_capable, denoting whether the node can host instances

In terms of the other flags, a cleared master_capable flag is a stronger
version of "not master candidate", and a cleared vm_capable flag is a
stronger version of "drained".

The master_capable flag will affect the auto-promotion code and node
modifications.

The vm_capable flag will affect the iallocator protocol, capacity
calculations, node checks in cluster verify, and will interact in novel
ways with locking (unfortunately).

It is envisaged that most nodes will be both vm_capable and
master_capable, and just a few will have one of these flags
removed. Ganeti itself will allow clearing of both flags, even though
this doesn't make much sense currently.

.. _jqueue-job-priority-design:

Job priorities
--------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all jobs and opcodes have the same priority. Once a job
starts executing, its thread won't be released until all opcodes got
their locks and did their work. When a job is finished, the next job is
selected strictly by its incoming order. This does not mean jobs are run
in their incoming order—locks and other delays can cause them to be
stalled for some time.

In some situations, e.g. an emergency shutdown, one may want to run a
job as soon as possible. This is not possible currently if there are
pending jobs in the queue.

Proposed changes
~~~~~~~~~~~~~~~~

Each opcode will be assigned a priority on submission. Opcode priorities
are integers and the lower the number, the higher the opcode's priority
is. Within the same priority, jobs and opcodes are initially processed
in their incoming order.

Submitted opcodes can have one of the priorities listed below. Other
priorities are reserved for internal use. The absolute range is
-20..+19. Opcodes submitted without a priority (e.g. by older clients)
are assigned the default priority.

- High (-10)
- Normal (0, default)
- Low (+10)

As a change from the current model where executing a job blocks one
thread for the whole duration, the new job processor must return the job
to the queue after each opcode and also if it can't get all locks in a
reasonable timeframe. This will allow opcodes of higher priority
submitted in the meantime to be processed or opcodes of the same
priority to try to get their locks. When added to the job queue's
workerpool, the priority is determined by the first unprocessed opcode
in the job.

If an opcode is deferred, the job will go back to the "queued" status,
even though it's just waiting to try to acquire its locks again later.

If an opcode cannot be processed after a certain number of retries or a
certain amount of time, it should increase its priority. This will avoid
starvation.

A job's priority can never go below -20. If a job hits priority -20, it
must acquire its locks in blocking mode.

Opcode priorities are synchronised to disk in order to be restored after
a restart or crash of the master daemon.

Priorities also need to be considered inside the locking library to
ensure opcodes with higher priorities get locks first. See
:ref:`locking priorities <locking-priorities>` for more details.

Worker pool
+++++++++++

To support job priorities in the job queue, the worker pool underlying
the job queue must be enhanced to support task priorities. Currently
tasks are processed in the order they are added to the queue (but, due
to their nature, they don't necessarily finish in that order). All tasks
are equal. To support tasks with higher or lower priority, a few changes
have to be made to the queue inside a worker pool.

Each task is assigned a priority when added to the queue. This priority
cannot be changed until the task is executed (this is fine as in all
current use-cases, tasks are added to a pool and then forgotten about
until they're done).

A task's priority can be compared to Unix process priorities. The lower
the priority number, the closer to the queue's front it is. A task with
priority 0 is going to be run before one with priority 10. Tasks with
the same priority are executed in the order in which they were added.

While a task is running it can query its own priority. If it's not ready
yet for finishing, it can raise an exception to defer itself, optionally
changing its own priority. This is useful for the following cases:

- A task is trying to acquire locks, but those locks are still held by
  other tasks. By deferring itself, the task gives others a chance to
  run. This is especially useful when all workers are busy.
- If a task decides it hasn't gotten its locks in a long time, it can
  start to increase its own priority.
- Tasks waiting for long-running operations running asynchronously could
  defer themselves while waiting for a long-running operation.

With these changes, the job queue will be able to implement per-job
priorities.

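A minimal sketch of such a priority-aware task queue, ordered by
(priority, insertion counter) so that equal priorities keep their FIFO
behaviour, could look like this (illustrative only, not the actual
worker pool code)::

  import heapq
  import itertools

  class PriorityTaskQueue(object):
    """Tasks ordered by (priority, insertion counter); lower runs first."""

    def __init__(self):
      self._heap = []
      self._counter = itertools.count()

    def Add(self, task, priority=0):
      # The counter breaks ties, so equal-priority tasks stay FIFO
      heapq.heappush(self._heap, (priority, next(self._counter), task))

    def Pop(self):
      priority, _, task = heapq.heappop(self._heap)
      return priority, task

  # A deferring task would raise an exception carrying its new priority;
  # the worker catches it and re-adds the task with that priority:
  class DeferTask(Exception):
    def __init__(self, priority):
      Exception.__init__(self)
      self.priority = priority
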
.. _locking-priorities:

Locking
+++++++

In order to support priorities in Ganeti's own lock classes,
``locking.SharedLock`` and ``locking.LockSet``, the internal structure
of the former class needs to be changed. The last major change in this
area was done for Ganeti 2.1 and can be found in the respective
:doc:`design document <design-2.1>`.

The plain list (``[]``) used as a queue is replaced by a heap queue,
similar to the `worker pool`_. The heap or priority queue does automatic
sorting, thereby automatically taking care of priorities. For each
priority there's a plain list with pending acquires, like the single
queue of pending acquires before this change.

When the lock is released, the code locates the list of pending acquires
for the highest priority waiting. The first condition (index 0) is
notified. Once all waiting threads received the notification, the
condition is removed from the list. If the list of conditions is empty
it's removed from the heap queue.

Like before, shared acquires are grouped and skip ahead of exclusive
acquires if there's already an existing shared acquire for a priority.
To accomplish this, a separate dictionary of shared acquires per
priority is maintained.

To simplify the code and reduce memory consumption, the concept of the
"active" and "inactive" condition for shared acquires is abolished. The
lock can't predict what priorities the next acquires will use and even
keeping a cache can become computationally expensive for arguable
benefit (the underlying POSIX pipe, see ``pipe(2)``, needs to be
re-created for each notification anyway).

The following diagram shows a possible state of the internal queue from
a high-level view. Conditions are shown as (waiting) threads. Assuming
no modifications are made to the queue (e.g. more acquires or timeouts),
the lock would be acquired by the threads in this order (concurrent
acquires in parentheses): ``threadE1``, ``threadE2``, (``threadS1``,
``threadS2``, ``threadS3``), (``threadS4``, ``threadS5``), ``threadE3``,
``threadS6``, ``threadE4``, ``threadE5``.

::

  [
    (0, [exc/threadE1, exc/threadE2, shr/threadS1/threadS2/threadS3]),
    (2, [shr/threadS4/threadS5]),
    (10, [exc/threadE3]),
    (33, [shr/threadS6, exc/threadE4, exc/threadE5]),
  ]

IPv6 support
------------

Currently Ganeti does not support IPv6. This is true for nodes as well
as instances. Because IPv4 exhaustion is threateningly near, the need to
use IPv6 is increasing, especially given that bigger and bigger clusters
are supported.

Supported IPv6 setup
~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.3 we introduce, in addition to the ordinary pure IPv4
setup, a hybrid IPv6/IPv4 mode. The latter works as follows:

- all nodes in a cluster have a primary IPv6 address
- the master has an IPv6 address
- all nodes **must** have a secondary IPv4 address

The reason for this hybrid setup is that key components that Ganeti
depends on do not or only partially support IPv6. More precisely, Xen
does not support instance migration via IPv6 in versions 3.4 and 4.0.
Similarly, KVM supports neither instance migration nor VNC access over
IPv6 at the time of this writing.

This led to the decision of not supporting pure IPv6 Ganeti clusters, as
very important cluster operations would not have been possible. Using
IPv4 as secondary address does not affect any of the goals
of the IPv6 support: since secondary addresses do not need to be
publicly accessible, they need not be globally unique. In other words,
one can practically use private IPv4 secondary addresses just for
intra-cluster communication without propagating them across layer 3
boundaries.

netutils: Utilities for handling common network tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently common utility functions are kept in the ``utils`` module.
Since this module grows bigger and bigger, network-related functions are
moved to a separate module named *netutils*. Additionally all these
utilities will be IPv6-enabled.

Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

As mentioned above there will be two different setups in terms of IP
addressing: pure IPv4 and hybrid IPv6/IPv4 addressing. To choose between
them a new cluster init parameter *--primary-ip-version* is introduced.
This is needed as a given name can resolve to both an IPv4 and IPv6
address on a dual-stack host, effectively making it impossible to infer
that bit.

Once a cluster is initialized and the primary IP version chosen, all
nodes that join have to conform to that setup. In the case of our
IPv6/IPv4 setup all nodes *must* have a secondary IPv4 address.

Furthermore we store the primary IP version in ssconf, which is consulted
every time a daemon starts, to determine the default bind address (either
*0.0.0.0* or *::*). In an IPv6/IPv4 setup we need to bind the Ganeti
daemons listening on network sockets to the IPv6 address.

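A small sketch of how a daemon could derive its default bind address
from the stored primary IP version (the ssconf accessor in the usage
comment is a hypothetical placeholder)::

  import socket

  def GetDefaultBindAddress(primary_ip_version):
    """Map the cluster's primary IP version to a wildcard bind address."""
    if primary_ip_version == 6:
      return socket.AF_INET6, "::"
    return socket.AF_INET, "0.0.0.0"

  # Usage: family, address = GetDefaultBindAddress(ReadPrimaryIpVersion())
  # where ReadPrimaryIpVersion() is a hypothetical ssconf accessor.
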
Node addition
~~~~~~~~~~~~~

When adding a new node to an IPv6/IPv4 cluster it must have an IPv6
address to be used as primary and an IPv4 address used as secondary. As
explained above, every time a daemon is started we use the cluster
primary IP version to determine which address to bind to. The
only exception to this is when a node is added to the cluster. In this
case there is no ssconf available when noded is started and therefore
the correct address needs to be passed to it.

Name resolution
~~~~~~~~~~~~~~~

Since the gethostbyname*() functions do not support IPv6, name resolution
will be done by using the recommended getaddrinfo().

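For illustration, a family-aware resolution helper built on
getaddrinfo() might look roughly like this (a sketch, not the actual
netutils implementation)::

  import socket

  def ResolveHostname(name, family=socket.AF_UNSPEC, port=None):
    """Resolve a host name to its first address of the requested family.

    ``family`` can be AF_INET, AF_INET6 or AF_UNSPEC; getaddrinfo handles
    both IPv4 and IPv6, unlike the gethostbyname*() functions.

    """
    addrinfo = socket.getaddrinfo(name, port, family, socket.SOCK_STREAM)
    # Each entry is (family, socktype, proto, canonname, sockaddr); the
    # address itself is the first element of sockaddr
    return addrinfo[0][4][0]
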
IPv4-only components
~~~~~~~~~~~~~~~~~~~~

============================ =================== ====================
Component                    IPv6 Status         Planned Version
============================ =================== ====================
Xen instance migration       Not supported       Xen 4.1: libxenlight
KVM instance migration       Not supported       Unknown
KVM VNC access               Not supported       Unknown
============================ =================== ====================

Privilege Separation
--------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.2 we introduced privilege separation for the RAPI daemon.
This was done directly in the daemon's code in the process of
daemonizing itself. Doing so leads to several potential issues. For
example, a file could be opened while the code is still running as
``root`` and for some reason not be closed again. Even after changing
the user ID, the file descriptor can be written to.

Implementation
~~~~~~~~~~~~~~

To address these shortcomings, daemons will be started under the target
user right away. The ``start-stop-daemon`` utility used to start daemons
supports the ``--chuid`` option to change user and group ID before
starting the executable.

The intermediate solution for the RAPI daemon from Ganeti 2.2 will be
removed again.

Files written by the daemons may need to have an explicit owner and
group set (easily done through ``utils.WriteFile``).

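As an illustration of the last point, setting an explicit owner and
group when writing a group-shared file could look like this; the
``uid``/``gid``/``mode`` keyword arguments are assumed to be supported
by ``utils.WriteFile``, and the path follows the layout shown later in
this section::

  import grp
  import pwd

  from ganeti import utils

  def WriteRapiUsersFile(contents):
    # Sketch: write a file readable by the gntrapi group only; the
    # uid/gid/mode keyword arguments are assumed, not verified here.
    uid = pwd.getpwnam("gntrapi").pw_uid
    gid = grp.getgrnam("gntrapi").gr_gid
    utils.WriteFile("/var/lib/ganeti/rapi_users", data=contents,
                    uid=uid, gid=gid, mode=0o640)
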
All SSH-related code is removed from the ``ganeti.bootstrap`` module and
core components and moved to a separate script. The core code will
simply assume a working SSH setup to be in place.

Security Domains
~~~~~~~~~~~~~~~~

In order to separate the permissions of file sets we separate them
into the following 3 overall security domain chunks:

1. Public: ``0755`` respectively ``0644``
2. Ganeti wide: shared between the daemons (gntdaemons)
3. Secret files: shared among a specific set of daemons/users

So for point 3 this table shows the correlation of the sets to groups
and their users:

=== ========== ============================== ==========================
Set Group      Users                          Description
=== ========== ============================== ==========================
A   gntrapi    gntrapi, gntmasterd            Share data between
                                              gntrapi and gntmasterd
B   gntadmins  gntrapi, gntmasterd, *users*   Shared between users who
                                              need to call gntmasterd
C   gntconfd   gntconfd, gntmasterd           Share data between
                                              gntconfd and gntmasterd
D   gntmasterd gntmasterd                     masterd only; currently
                                              only to redistribute the
                                              configuration, has access
                                              to all files under
                                              ``lib/ganeti``
E   gntdaemons gntmasterd, gntrapi, gntconfd  Shared between the various
                                              Ganeti daemons to exchange
                                              data
=== ========== ============================== ==========================

Restricted commands
~~~~~~~~~~~~~~~~~~~

The following commands still need root to fulfill their functions:

::

  gnt-cluster {init|destroy|command|copyfile|rename|masterfailover|renew-crypto}
  gnt-node {add|remove}
  gnt-instance {console}

Directory structure and permissions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's how we propose to change the filesystem hierarchy and its
permissions.

Assuming it follows the defaults: ``gnt${daemon}`` for user and
the groups from the section `Security Domains`_::

  ${localstatedir}/lib/ganeti/ (0755; gntmasterd:gntmasterd)
    cluster-domain-secret (0600; gntmasterd:gntmasterd)
    config.data (0640; gntmasterd:gntconfd)
    hmac.key (0440; gntmasterd:gntconfd)
    known_host (0644; gntmasterd:gntmasterd)
    queue/ (0700; gntmasterd:gntmasterd)
      archive/ (0700; gntmasterd:gntmasterd)
        * (0600; gntmasterd:gntmasterd)
      * (0600; gntmasterd:gntmasterd)
    rapi.pem (0440; gntrapi:gntrapi)
    rapi_users (0640; gntrapi:gntrapi)
    server.pem (0440; gntmasterd:gntmasterd)
    ssconf_* (0444; root:gntmasterd)
    uidpool/ (0750; root:gntmasterd)
    watcher.data (0600; root:gntmasterd)
  ${localstatedir}/run/ganeti/ (0770; gntmasterd:gntdaemons)
    socket/ (0750; gntmasterd:gntadmins)
      ganeti-master (0770; gntmasterd:gntadmins)
  ${localstatedir}/log/ganeti/ (0770; gntmasterd:gntdaemons)
    master-daemon.log (0600; gntmasterd:gntdaemons)
    rapi-daemon.log (0600; gntrapi:gntdaemons)
    conf-daemon.log (0600; gntconfd:gntdaemons)
    node-daemon.log (0600; gntnoded:gntdaemons)

Feature changes
===============


External interface changes
==========================


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: