=================
Ganeti 2.3 design
=================

This document describes the major changes in Ganeti 2.3 compared to
the 2.2 version.

.. contents:: :depth: 4

As for 2.1 and 2.2, we divide the 2.3 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered as part of the
same pool for allocation purposes: DRBD instances for example can be
allocated on any two nodes.

This does cause a problem in cases where nodes are not all equally
connected to each other. For example if a cluster is created over two
sets of machines, each connected to its own switch, the internal
bandwidth between machines connected to the same switch might be bigger
than the bandwidth for inter-switch connections.

Moreover some operations inside a cluster require all nodes to be locked
together for inter-node consistency, and won't scale if we increase the
number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group, the default one. Bigger
clusters instead will be able to have more than one group, and each node
will belong to exactly one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following new
commands/flags will be introduced::

  gnt-node group-add <group>                # add a new node group
  gnt-node group-del <group>                # delete an empty group
  gnt-node group-list                       # list node groups
  gnt-node group-rename <oldname> <newname> # rename a group
  gnt-node list/info -g <group>             # list only nodes in a group
  gnt-node add -g <group>                   # add a node to a certain group
  gnt-node modify -g <group>                # move a node to a new group

Instance level changes
++++++++++++++++++++++

Instances will be able to live in only one group at a time. This is
mostly important for DRBD instances, in which case both their primary
and secondary nodes will need to be in the same group. To support this
we envision the following changes:

- The cluster will have a default group, which will initially be the
  only group and will contain all nodes.
- Instance allocation will happen to the cluster's default group
  (which will be changeable via ``gnt-cluster modify`` or RAPI) unless
  a group is explicitly specified in the creation job (with -g or via
  RAPI). Iallocator will only be passed the nodes belonging to that
  group.
- Moving an instance between groups can only happen via an explicit
  operation, which for example in the case of DRBD will work by
  performing internally a replace-disks, a migration, and a second
  replace-disks. It will be possible to clean up an interrupted
  group-move operation.
- Cluster verify will signal an error if an instance has been left
  mid-transition between groups.
- Inter-group instance migration/failover will check that the target
  group will be able to accept the instance network/storage wise, and
  fail otherwise. In the future we may be able to allow some parameters
  to be changed during the move, but in the first version we expect an
  import/export if this is not possible.
- From an allocation point of view, inter-group movements will be
  shown to an iallocator as a new allocation over the target group.
  Only in a future version may we add allocator extensions to decide
  which group the instance should be in. In the meantime we expect
  Ganeti administrators to either put instances in different groups by
  filling all groups first, or to have their own strategy based on the
  instance needs.

Internal changes
++++++++++++++++

We expect the following changes for cluster management:

- Frequent multinode operations, such as os-diagnose or cluster-verify,
  will act one group at a time. The default group will be used if none
  is passed. Command line tools will have a way to easily target all
  groups, by generating one job per group.
- Groups will have a human-readable name, but will internally always
  be referenced by a UUID, which will be immutable. For example the
  cluster object will contain the UUID of the default group, each node
  will contain the UUID of the group it belongs to, etc. This is done
  to simplify referencing while keeping it easy to handle renames and
  movements. If we see that this works well, we'll transition other
  config objects (instances, nodes) to the same model.
- We will evaluate adding a new per-group lock, to see whether some
  operations that currently require the BGL can be transitioned to it.
- Master candidate status will be allowed to be spread among groups.
  For the first version we won't add any restriction over how this is
  done, although in the future we may, for example, have a minimum
  number of master candidates which Ganeti will try to keep in each
  group.

Other work and future changes
+++++++++++++++++++++++++++++

Commands like ``gnt-cluster command``/``gnt-cluster copyfile`` will
continue to work on the whole cluster, but it will be possible to target
one group only by specifying it.

Commands which allow selection of sets of resources (for example
``gnt-instance start``/``gnt-instance stop``) will be able to select
them by node group as well.

Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future version
should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration diffusion,
to allow better master scalability. For example it could be possible
to change some all-nodes RPCs to contact each group once, from the
master, and make one node in the group perform internal diffusion. We
won't implement this in the first version, but we'll evaluate it for the
future, if we see scalability problems on big multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, Sheepdog, Ceph)
we expect groups to be the basis for this, allowing for example a
different Sheepdog/Ceph cluster, or a different SAN, to be connected to
each group. In some cases this will mean that inter-group move
operations will necessarily be performed with instance downtime, unless
the hypervisor has block-migrate functionality, and we implement support
for it (this would be theoretically possible, today, with KVM, for
example).

Scalability issues with big clusters
------------------------------------

Current and future issues
~~~~~~~~~~~~~~~~~~~~~~~~~

Assuming the node groups feature will enable bigger clusters, other
parts of Ganeti will be impacted even more by the (in effect) bigger
clusters.

While many areas will be impacted, one is the most important: the fact
that the watcher still needs to be able to repair instance data within
the current five-minute time-frame (a shorter time-frame would be even
better). This means that the watcher itself needs to have parallelism
when dealing with node groups.

Also, the iallocator plugins are being fed data from Ganeti but also
need access to the full cluster state, and in general we still rely on
being able to compute the full cluster state somewhat “cheaply” and
on-demand. This conflicts with the goal of disconnecting the different
node groups, and of keeping the same parallelism while growing the
cluster size.

Another issue is that the current capacity calculations are done
completely outside Ganeti (and they need access to the entire cluster
state), and this prevents keeping the capacity numbers in sync with the
cluster state. While this is still acceptable for smaller clusters,
where a small number of allocations/removals are presumed to occur
between two periodic capacity calculations, on bigger clusters where we
aim to parallelize heavily between node groups this is no longer true.

The main proposed change is introducing a cluster state cache (not
serialised to disk), and updating many of the LUs and cluster operations
to account for it. Furthermore, the capacity calculations will be
integrated via a new OpCode/LU, so that we have faster feedback (instead
of periodic computation).

Cluster state cache
~~~~~~~~~~~~~~~~~~~

A new cluster state cache will be introduced. The cache relies on two
main ideas:

- the total node memory and CPU count change very seldom; the total
  node disk space is also slow-changing, but can change at runtime; the
  free memory and free disk will change significantly for some jobs,
  but on a short timescale; in general, these values will be mostly
  “constant” during the lifetime of a job
- we already have a periodic set of jobs that query the node and
  instance state, driven by the :command:`ganeti-watcher` command, and
  we're just discarding the results after acting on them

Given the above, it makes sense to cache inside the master daemon the
results of node and instance state (with a focus on the node state).

The cache will not be serialised to disk, and will be for the most part
transparent to the outside of the master daemon.

Cache structure
+++++++++++++++

The cache will be oriented with a focus on node groups, so that it will
be easy to invalidate an entire node group, or a subset of nodes, or the
entire cache. The instances will be stored in the node group of their
primary node.

Furthermore, since the node and instance properties determine the
capacity statistics in a deterministic way, the cache will also hold, at
each node group level, the total capacity as determined by the new
capacity iallocator mode.
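
The group-oriented structure above can be sketched as a small in-memory
cache keyed by node group UUID. This is a minimal illustration under
stated assumptions, not the actual master daemon code; all names
(``GroupStateCache``, ``update_group``, the one-hour ``max_age``) are
hypothetical:

```python
import time

class GroupStateCache:
    """Sketch of the node-group-oriented state cache (illustrative names)."""

    def __init__(self, max_age=3600.0):
        self.max_age = max_age  # upper bound on entry lifetime (one hour)
        # group UUID -> {"nodes": ..., "instances": ..., "capacity": ..., "stamp": ...}
        self._groups = {}

    def update_group(self, group_uuid, nodes, instances, capacity=None):
        # Only "full" node information updates the cache; instances are
        # stored under the group of their primary node.
        self._groups[group_uuid] = {
            "nodes": dict(nodes),
            "instances": dict(instances),
            "capacity": capacity,
            "stamp": time.time(),
        }

    def invalidate_group(self, group_uuid):
        # Drop an entire node group from the cache.
        self._groups.pop(group_uuid, None)

    def invalidate_node(self, group_uuid, node_name):
        group = self._groups.get(group_uuid)
        if group is not None:
            group["nodes"].pop(node_name, None)
            # Invalidating a node's properties also invalidates the
            # capacity data associated with it.
            group["capacity"] = None

    def get_group(self, group_uuid):
        group = self._groups.get(group_uuid)
        if group is None or time.time() - group["stamp"] > self.max_age:
            return None  # missing or expired entry
        return group
```

Invalidation of a subset of nodes or of the capacity data alone then
falls out naturally from the per-group layout.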

Cache updates
+++++++++++++

The cache will be updated whenever a query for a node state returns
“full” node information (so as to keep the cache state for a given node
consistent). Partial results will not update the cache (see next
paragraph).

Since there will be no way to feed the cache from outside, and we
would like to have a consistent cache view when driven by the watcher,
we'll introduce a new OpCode/LU for the watcher to run, instead of the
current separate opcodes (see below in the watcher section).

Updates to a node that change a node's specs “downward” (e.g. less
memory) will invalidate the capacity data. Updates that increase the
node's specs will not invalidate the capacity, as we're more interested
in “at least available” correctness, not “at most available”.

Cache invalidation
++++++++++++++++++

If a partial node query is done (e.g. just for the node free space), and
the returned values don't match the cache, then the entire node state
will be invalidated.

By default, all LUs will invalidate the caches for all nodes and
instances they lock. If an LU uses the BGL, then it will invalidate the
entire cache. In time, it is expected that LUs will be modified to not
invalidate, if they are not expected to change the node's and/or
instance's state (e.g. ``LUConnectConsole``, or
``LUActivateInstanceDisks``).

Invalidation of a node's properties will also invalidate the capacity
data associated with that node.

Cache lifetime
++++++++++++++

The cache elements will have an upper bound on their lifetime; the
proposal is to make this an hour, which should be a high enough value to
cover the watcher being blocked by a medium-term job (e.g. 20-30
minutes).

Cache usage
+++++++++++

The cache will be used by default for most queries (e.g. a Luxi call,
without locks, for the entire cluster). Since this will be a change from
the current behaviour, we'll need to allow non-cached responses,
e.g. via a ``--cache=off`` or similar argument (which will force the
query).

The cache will also be used for the iallocator runs, so that computing
an allocation solution can proceed independently of other jobs which
lock parts of the cluster. This is important as we need to separate
allocation on one group from exclusive blocking jobs on other node
groups.

The capacity calculations will also use the cache; this is detailed in
the respective sections.

Watcher operation
~~~~~~~~~~~~~~~~~

As detailed in the cluster cache section, the watcher also needs
improvements in order to scale with the cluster size.

As a first improvement, the proposal is to introduce a new OpCode/LU
pair that runs with locks held over the entire query sequence (the
current watcher runs a job with two opcodes, which grab and release the
locks individually). The new opcode will be called
``OpUpdateNodeGroupCache`` and will do the following:

- try to acquire all node/instance locks (to examine in more depth, and
  possibly alter) in the given node group
- invalidate the cache for the node group
- acquire node and instance state (possibly via a new single RPC call
  that combines node and instance information)
- update the cache
- return the needed data

The reason for the per-node group query is that we don't want a busy
node group to prevent instance maintenance in other node
groups. Therefore, the watcher will introduce parallelism across node
groups, and it will be possible to have overlapping watcher runs. The
new execution sequence will be:

- the parent watcher process acquires the global watcher lock
- query the list of node groups (lockless or very short locks only)
- fork N children, one for each node group
- release the global lock
- poll/wait for the children to finish

Each forked child will do the following:

- try to acquire the per-node group watcher lock
- if it fails to acquire it, exit with a special code telling the
  parent that the node group is already being managed by a watcher
  process
- otherwise, submit an ``OpUpdateNodeGroupCache`` job
- get results (possibly after a long time, due to a busy group)
- run the needed maintenance operations for the current group

This new mode of execution means that the master watcher processes might
overlap in running, but not the individual per-node group child
processes.

This change allows us to keep (almost) the same parallelism when using a
bigger cluster with node groups versus two separate clusters.
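
The fork-per-group sequence above can be sketched as follows. This is an
illustrative sketch only: the exit code, function names, and the
lock/maintenance callbacks are made up, and the real watcher would
submit an ``OpUpdateNodeGroupCache`` job rather than call a function
directly:

```python
import os

# Hypothetical exit code a child uses when another watcher process
# already holds the per-group lock.
EXIT_GROUP_BUSY = 11

def child_run(group, try_lock, maintain):
    """Body of one forked child: lock the group, then run maintenance."""
    if not try_lock(group):
        # Group is already being managed by another watcher process.
        return EXIT_GROUP_BUSY
    # Here the real watcher would submit the cache-update job, wait for
    # its (possibly slow) result, and act on it.
    maintain(group)
    return 0

def watcher_main(groups, try_lock, maintain):
    """Fork one child per node group and collect their exit codes."""
    children = {}
    for group in groups:  # the global watcher lock would be held here
        pid = os.fork()
        if pid == 0:  # child process
            os._exit(child_run(group, try_lock, maintain))
        children[pid] = group
    # ... the parent releases the global lock here, then waits ...
    results = {}
    for pid, group in children.items():
        _, status = os.waitpid(pid, 0)
        results[group] = os.WEXITSTATUS(status)
    return results
```

Because each child exits independently, a busy group only delays its own
child, not the maintenance of other groups.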

Cost of periodic cache updating
+++++++++++++++++++++++++++++++

Currently the watcher only does “small” queries for the node and
instance state, and at first sight changing it to use the new OpCode
which populates the cache with the entire state might introduce
additional costs, which must be paid every five minutes.

However, the OpCodes that the watcher submits use the so-called
dynamic fields (which need to contact the remote nodes), and the LUs are
not selective—they always grab all the node and instance state. So in
the end, we have the same cost, it just becomes explicit rather than
implicit.

This ‘grab all node state’ behaviour is what makes the cache worth
implementing.

Intra-node group scalability
++++++++++++++++++++++++++++

The design above only deals with inter-node group issues. It still makes
sense to run instance maintenance for nodes A and B if only node C is
locked (all being in the same node group).

This problem is commonly encountered in previous Ganeti versions, and it
should be handled similarly, by tweaking lock lifetime in long-duration
jobs.

TODO: add more ideas here.

State file maintenance
++++++++++++++++++++++

The splitting of node group maintenance to different children which will
run in parallel requires that the state file handling changes from
monolithic updates to partial ones.

There are two files that the watcher maintains:

- ``$LOCALSTATEDIR/lib/ganeti/watcher.data``, its internal state file,
  used for deciding internal actions
- ``$LOCALSTATEDIR/run/ganeti/instance-status``, a file designed for
  external consumption

For the first file, since it's used only internally by the watchers, we
can move to a per-node group configuration.

For the second file, even if it's used as an external interface, we will
need to make some changes to it: because the different node groups can
return results at different times, we need to either split the file into
per-group files or keep the single file and add a per-instance timestamp
(currently the file holds only the instance name and state).

The proposal is that each child process maintains its own node group
file, and the master process will, right after querying the node group
list, delete any extra per-node group state file. This leaves the
consumers to run a simple ``cat instance-status.group-*`` to obtain the
entire list of instances and their states. If needed, the modify
timestamp of each file can be used to determine the age of the results.
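
A consumer merging the proposed per-group status files could look
roughly like this (a hypothetical sketch: the helper name and anything
beyond the "instance name and state" line format are assumptions):

```python
import glob
import os
import time

def read_instance_status(run_dir="/var/run/ganeti"):
    """Merge the per-group status files, tracking each result's age.

    The 'instance-status.group-*' naming follows the proposal above;
    each line is assumed to hold an instance name and its state.
    """
    merged = {}
    for path in sorted(glob.glob(os.path.join(run_dir, "instance-status.group-*"))):
        # Use the file's modify timestamp to estimate result age.
        age = time.time() - os.path.getmtime(path)
        with open(path) as status_file:
            for line in status_file:
                name, state = line.split(None, 1)
                merged[name] = (state.strip(), age)
    return merged
```

This keeps the ``cat``-style simplicity while exposing the per-group
staleness that a single monolithic file would hide.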

Capacity calculations
~~~~~~~~~~~~~~~~~~~~~

Currently, the capacity calculations are done completely outside
Ganeti. As explained in the current problems section, this needs to
account better for the cluster state changes.

Therefore a new OpCode will be introduced, ``OpComputeCapacity``, that
will either return the current capacity numbers (if available), or
trigger a new capacity calculation, via the iallocator framework, which
will get a new method called ``capacity``.

This method will feed the cluster state (for the complete set of node
groups, or alternatively just a subset) to the iallocator plugin (either
the specified one, or the default if none is specified), and return the
new capacity in the format currently exported by the htools suite and
known as the “tiered specs” (see :manpage:`hspace(1)`).

tspec cluster parameters
++++++++++++++++++++++++

Currently, the “tspec” calculations done in :command:`hspace` require
some additional parameters:

- maximum instance size
- type of instance storage
- maximum ratio of virtual CPUs per physical CPU
- minimum disk free

For the integration in Ganeti, there are multiple ways to pass these:

- ignored by Ganeti, leaving it to the iallocator plugin whether to use
  them at all or not
- as input to the opcode
- as proper cluster parameters

Since the first option is not consistent with the intended changes, a
combination of the last two is proposed:

- at cluster level, we'll have cluster-wide defaults
- at node group level, we'll allow overriding the cluster defaults
- and if they are passed in via the opcode, they will override the
  values for the current computation

Whenever the capacity is requested via different parameters, it will
invalidate the cache, even if otherwise the cache is up-to-date.

The new parameters are:

- max_inst_spec: (int, int, int), the maximum instance specification
  accepted by this cluster or node group, in the order of memory, disk,
  vcpus
- default_template: string, the default disk template to use
- max_cpu_ratio: double, the maximum ratio of VCPUs/PCPUs
- max_disk_usage: double, the maximum disk usage (as a ratio)

These might also be used in instance creations (to be determined later,
after they are introduced).
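
The three-level override scheme above can be illustrated with a small
helper. The parameter names come from the list above, but the default
values and the function name are made up for the example; the real
implementation would live in the cluster and node group objects:

```python
# Hypothetical cluster-wide defaults for the parameters listed above.
CLUSTER_DEFAULTS = {
    "max_inst_spec": (4096, 102400, 4),  # memory (MiB), disk (MiB), vcpus
    "default_template": "drbd",
    "max_cpu_ratio": 4.0,
    "max_disk_usage": 0.9,
}

def effective_tspec_params(cluster_defaults, group_overrides, opcode_params):
    """Resolve tspec parameters: cluster defaults, overridden at node
    group level, overridden again by values passed in via the opcode."""
    params = dict(cluster_defaults)
    params.update(group_overrides)  # node group overrides cluster
    params.update(opcode_params)    # opcode overrides both, for this run
    return params
```

Requesting the capacity with a non-empty ``opcode_params`` would then be
exactly the "different parameters" case that invalidates the cache.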
453 |
|
454 |
OpCode details |
455 |
++++++++++++++ |
456 |
|
457 |
Input: |
458 |
|
459 |
- iallocator: string (optional, otherwise uses the cluster default) |
460 |
- cached: boolean, optional, defaults to true, and denotes whether we |
461 |
accept cached responses |
462 |
- the above new parameters, optional; if they are passed, they will |
463 |
overwrite all node group's parameters |
464 |
|
465 |
Output: |
466 |
|
467 |
- cluster: list of tuples (memory, disk, vcpu, count), in decreasing |
468 |
order of specifications; the first three members represent the |
469 |
instance specification, the last one the count of how many instances |
470 |
of this specification can be created on the cluster |
471 |
- node_groups: a dictionary keyed by node group UUID, with values a |
472 |
dictionary: |
473 |
|
474 |
- tspecs: a list like the cluster one |
475 |
- additionally, the new cluster parameters, denoting the input |
476 |
parameters that were used for this node group |
477 |
|
478 |
- ctime: the date the result has been computed; this represents the |
479 |
oldest creation time amongst all node groups (so as to accurately |
480 |
represent how much out-of-date the global response is) |
481 |
|
482 |
Note that due to the way the tspecs are computed, for any given |
483 |
specification, the total available count is the count for the given |
484 |
entry, plus the sum of counts for higher specifications. |
485 |
|
486 |
Also note that the node group information is provided just |
487 |
informationally, not for allocation decisions. |
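
As a worked example of the counting rule above, using made-up tiered
specs (the helper name is illustrative, not part of the opcode):

```python
def total_available(tspecs, index):
    """Total instances of at least the spec at `index`: its own count
    plus the counts of all higher (earlier, bigger) specifications."""
    return sum(count for (_mem, _disk, _vcpu, count) in tspecs[:index + 1])

# Made-up tiered specs, in decreasing order of specification:
# (memory, disk, vcpus, count)
tspecs = [
    (8192, 204800, 8, 2),   # 2 more instances of the biggest spec fit
    (4096, 102400, 4, 5),   # 5 more of this spec, beyond the 2 above
    (1024, 25600, 1, 20),
]
```

Here the total number of (4096, 102400, 4) instances that could still be
created is 2 + 5 = 7, since any slot that fits the bigger spec also fits
this one.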

Node flags
----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes are, from the point of view of their capabilities,
homogeneous. This means the cluster considers all nodes capable of
becoming master candidates, and of hosting instances.

This prevents some deployment scenarios: e.g. having a Ganeti instance
(in another cluster) be just a master candidate, in case all other
master candidates go down (but not, of course, host instances), or
having a node in a remote location just host instances but not become
master, etc.

Proposed changes
~~~~~~~~~~~~~~~~

Two new capability flags will be added to the node:

- master_capable, denoting whether the node can become a master
  candidate or master
- vm_capable, denoting whether the node can host instances

In terms of the other flags, clearing master_capable is a stronger
version of "not master candidate", and clearing vm_capable is a
stronger version of "drained".

The master_capable flag will affect the auto-promotion code and node
modifications.

The vm_capable flag will affect the iallocator protocol, capacity
calculations, node checks in cluster verify, and will interact in novel
ways with locking (unfortunately).

It is envisaged that most nodes will be both vm_capable and
master_capable, and just a few will have one of these flags
cleared. Ganeti itself will allow clearing both flags, even though
this doesn't make much sense currently.
530 |
|
531 |
|
532 |
Job priorities |
533 |
-------------- |
534 |
|
535 |
Current state and shortcomings |
536 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
537 |
|
538 |
Currently all jobs and opcodes have the same priority. Once a job |
539 |
started executing, its thread won't be released until all opcodes got |
540 |
their locks and did their work. When a job is finished, the next job is |
541 |
selected strictly by its incoming order. This does not mean jobs are run |
542 |
in their incoming order—locks and other delays can cause them to be |
543 |
stalled for some time. |
544 |
|
545 |
In some situations, e.g. an emergency shutdown, one may want to run a |
546 |
job as soon as possible. This is not possible currently if there are |
547 |
pending jobs in the queue. |
548 |
|
549 |
Proposed changes |
550 |
~~~~~~~~~~~~~~~~ |
551 |
|
552 |
Each opcode will be assigned a priority on submission. Opcode priorities |
553 |
are integers and the lower the number, the higher the opcode's priority |
554 |
is. Within the same priority, jobs and opcodes are initially processed |
555 |
in their incoming order. |
556 |
|
557 |
Submitted opcodes can have one of the priorities listed below. Other |
558 |
priorities are reserved for internal use. The absolute range is |
559 |
-20..+19. Opcodes submitted without a priority (e.g. by older clients) |
560 |
are assigned the default priority. |
561 |
|
562 |
- High (-10) |
563 |
- Normal (0, default) |
564 |
- Low (+10) |
565 |
|
566 |
As a change from the current model where executing a job blocks one |
567 |
thread for the whole duration, the new job processor must return the job |
568 |
to the queue after each opcode and also if it can't get all locks in a |
569 |
reasonable timeframe. This will allow opcodes of higher priority |
570 |
submitted in the meantime to be processed or opcodes of the same |
571 |
priority to try to get their locks. When added to the job queue's |
572 |
workerpool, the priority is determined by the first unprocessed opcode |
573 |
in the job. |
574 |
|
575 |
If an opcode is deferred, the job will go back to the "queued" status, |
576 |
even though it's just waiting to try to acquire its locks again later. |
577 |
|
578 |
If an opcode can not be processed after a certain number of retries or a |
579 |
certain amount of time, it should increase its priority. This will avoid |
580 |
starvation. |
581 |
|
582 |
A job's priority can never go below -20. If a job hits priority -20, it |
583 |
must acquire its locks in blocking mode. |
584 |
|
585 |
Opcode priorities are synchronised to disk in order to be restored after |
586 |
a restart or crash of the master daemon. |
587 |
|
588 |
Priorities also need to be considered inside the locking library to |
589 |
ensure opcodes with higher priorities get locks first. See |
590 |
:ref:`locking priorities <locking-priorities>` for more details. |
591 |
|
592 |
Worker pool |
593 |
+++++++++++ |
594 |
|
595 |
To support job priorities in the job queue, the worker pool underlying |
596 |
the job queue must be enhanced to support task priorities. Currently |
597 |
tasks are processed in the order they are added to the queue (but, due |
598 |
to their nature, they don't necessarily finish in that order). All tasks |
599 |
are equal. To support tasks with higher or lower priority, a few changes |
600 |
have to be made to the queue inside a worker pool. |
601 |
|
602 |
Each task is assigned a priority when added to the queue. This priority |
603 |
can not be changed until the task is executed (this is fine as in all |
604 |
current use-cases, tasks are added to a pool and then forgotten about |
605 |
until they're done). |
606 |
|
607 |
A task's priority can be compared to Unix' process priorities. The lower |
608 |
the priority number, the closer to the queue's front it is. A task with |
609 |
priority 0 is going to be run before one with priority 10. Tasks with |
610 |
the same priority are executed in the order in which they were added. |
611 |
|
612 |
While a task is running it can query its own priority. If it's not ready |
613 |
yet for finishing, it can raise an exception to defer itself, optionally |
614 |
changing its own priority. This is useful for the following cases: |
615 |
|
616 |
- A task is trying to acquire locks, but those locks are still held by |
617 |
other tasks. By deferring itself, the task gives others a chance to |
618 |
run. This is especially useful when all workers are busy. |
619 |
- If a task decides it hasn't gotten its locks in a long time, it can |
620 |
start to increase its own priority. |
621 |
- Tasks waiting for long-running operations running asynchronously could |
622 |
defer themselves while waiting for a long-running operation. |
623 |
|
624 |
With these changes, the job queue will be able to implement per-job |
625 |
priorities. |
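
A minimal sketch of such a queue, e.g. using Python's ``heapq`` with an
insertion counter to keep FIFO order within a priority (the class and
method names are illustrative, not the actual worker pool API):

```python
import heapq
import itertools

class PriorityTaskQueue:
    """Sketch of a priority-aware task queue: lower numbers run first,
    and equal-priority tasks keep their insertion order."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a FIFO tie-breaker within a priority.
        self._counter = itertools.count()

    def add_task(self, task, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop_task(self):
        # Returns the highest-priority (lowest-numbered) task next.
        priority, _, task = heapq.heappop(self._heap)
        return priority, task
```

Deferral would then amount to popping a task and re-adding it, possibly
with a changed priority, which is why the heap only needs to order
entries at insertion time.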
626 |
|
627 |
.. _locking-priorities: |
628 |
|
629 |
Locking |
630 |
+++++++ |
631 |
|
632 |
In order to support priorities in Ganeti's own lock classes, |
633 |
``locking.SharedLock`` and ``locking.LockSet``, the internal structure |
634 |
of the former class needs to be changed. The last major change in this |
635 |
area was done for Ganeti 2.1 and can be found in the respective |
636 |
:doc:`design document <design-2.1>`. |
637 |
|
638 |
The plain list (``[]``) used as a queue is replaced by a heap queue, |
639 |
similar to the `worker pool`_. The heap or priority queue does automatic |
640 |
sorting, thereby automatically taking care of priorities. For each |
641 |
priority there's a plain list with pending acquires, like the single |
642 |
queue of pending acquires before this change. |
643 |
|
644 |
When the lock is released, the code locates the list of pending acquires |
645 |
for the highest priority waiting. The first condition (index 0) is |
646 |
notified. Once all waiting threads received the notification, the |
647 |
condition is removed from the list. If the list of conditions is empty |
648 |
it's removed from the heap queue. |
649 |
|
650 |
Like before, shared acquires are grouped and skip ahead of exclusive |
651 |
acquires if there's already an existing shared acquire for a priority. |
652 |
To accomplish this, a separate dictionary of shared acquires per |
653 |
priority is maintained. |
654 |
|
655 |
To simplify the code and reduce memory consumption, the concept of the
"active" and "inactive" condition for shared acquires is abolished. The
lock can't predict what priorities the next acquires will use and even
keeping a cache can become computationally expensive for arguable
benefit (the underlying POSIX pipe, see ``pipe(2)``, needs to be
re-created for each notification anyway).

The following diagram shows a possible state of the internal queue from
a high-level view. Conditions are shown as (waiting) threads. Assuming
no modifications are made to the queue (e.g. more acquires or timeouts),
the lock would be acquired by the threads in this order (concurrent
acquires in parentheses): ``threadE1``, ``threadE2``, (``threadS1``,
``threadS2``, ``threadS3``), (``threadS4``, ``threadS5``), ``threadE3``,
``threadS6``, ``threadE4``, ``threadE5``.

::

  [
    (0, [exc/threadE1, exc/threadE2, shr/threadS1/threadS2/threadS3]),
    (2, [shr/threadS4/threadS5]),
    (10, [exc/threadE3]),
    (33, [shr/threadS6, exc/threadE4, exc/threadE5]),
  ]
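
The per-priority queue handling described above can be sketched with
Python's ``heapq`` module. This is a simplified illustration only; the
class and method names are hypothetical and do not reflect Ganeti's
actual ``locking.SharedLock`` internals:

```python
import heapq

class PriorityLockQueue:
    """Sketch of a priority-ordered queue of pending lock acquires.

    Each heap entry is ``(priority, conditions)`` where ``conditions``
    is a plain list of pending acquires for that priority, kept in FIFO
    order (a lower number means a higher priority, as in the diagram).
    """

    def __init__(self):
        self._heap = []     # heap of (priority, conditions) entries
        self._lists = {}    # priority -> its conditions list

    def add_pending(self, priority, condition):
        """Queue a pending acquire at the given priority."""
        conds = self._lists.get(priority)
        if conds is None:
            conds = self._lists[priority] = []
            heapq.heappush(self._heap, (priority, conds))
        conds.append(condition)

    def notify_next(self):
        """On release: return the first condition of the best priority.

        Empty per-priority lists are dropped from the heap, mirroring
        the removal step described above. Returns None if nothing is
        waiting.
        """
        while self._heap:
            (priority, conds) = self._heap[0]
            if conds:
                return conds[0]
            heapq.heappop(self._heap)   # list is empty, drop the entry
            del self._lists[priority]
        return None
```

Because each priority is pushed onto the heap at most once, the heap
never has to compare two condition lists, only the integer priorities.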
|
IPv6 support
------------

Currently Ganeti does not support IPv6. This is true for nodes as well
as instances. Since IPv4 address exhaustion is imminent, the need for
IPv6 is increasing, especially given that bigger and bigger clusters
are supported.

Supported IPv6 setup
~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.3 we introduce, in addition to the ordinary pure IPv4
setup, a hybrid IPv6/IPv4 mode. The latter works as follows:

- all nodes in a cluster have a primary IPv6 address
- the master has an IPv6 address
- all nodes **must** have a secondary IPv4 address

The reason for this hybrid setup is that key components that Ganeti
depends on do not, or only partially, support IPv6. More precisely, Xen
does not support instance migration via IPv6 in versions 3.4 and 4.0.
Similarly, KVM supports neither instance migration nor VNC access over
IPv6 at the time of this writing.

This led to the decision not to support pure IPv6 Ganeti clusters, as
very important cluster operations would not have been possible. Using
IPv4 as the secondary address does not affect any of the goals of the
IPv6 support: since secondary addresses do not need to be publicly
accessible, they need not be globally unique. In other words, one can
practically use private IPv4 secondary addresses just for intra-cluster
communication without propagating them across layer 3 boundaries.

netutils: Utilities for handling common network tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently common utility functions are kept in the ``utils`` module.
Since this module is growing ever bigger, network-related functions are
moved to a separate module named *netutils*. Additionally, all these
utilities will be IPv6-enabled.

Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

As mentioned above there will be two different setups in terms of IP
addressing: pure IPv4 and hybrid IPv6/IPv4. To choose between them, a
new cluster init parameter *--primary-ip-version* is introduced. This
is needed because a given name can resolve to both an IPv4 and an IPv6
address on a dual-stack host, effectively making it impossible to infer
that bit.

Once a cluster is initialized and the primary IP version chosen, all
nodes that join have to conform to that setup. In the case of our
IPv6/IPv4 setup all nodes *must* have a secondary IPv4 address.

Furthermore, we store the primary IP version in ssconf, which is
consulted every time a daemon starts to determine the default bind
address (either *0.0.0.0* or *::*). In an IPv6/IPv4 setup we need to
bind the Ganeti daemons listening on network sockets to the IPv6
address.

Node addition
~~~~~~~~~~~~~

When adding a new node to an IPv6/IPv4 cluster, it must have an IPv6
address to be used as primary and an IPv4 address used as secondary. As
explained above, every time a daemon is started we use the cluster
primary IP version to determine which address to bind to. The only
exception to this is when a node is added to the cluster. In this case
there is no ssconf available when noded is started and therefore the
correct address needs to be passed to it.

Name resolution
~~~~~~~~~~~~~~~

Since the gethostbyname*() functions do not support IPv6, name
resolution will be done by using the recommended getaddrinfo().
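
A minimal, family-agnostic resolution helper built on the standard
library's ``socket.getaddrinfo`` could look like this. It is an
illustrative sketch only, not Ganeti's actual netutils code:

```python
import socket

def resolve(hostname, family=socket.AF_UNSPEC):
    """Resolve a name to a list of address strings.

    With the default AF_UNSPEC, both IPv4 and IPv6 results can be
    returned on a dual-stack host; pass socket.AF_INET or
    socket.AF_INET6 to restrict resolution to one IP version.
    """
    infos = socket.getaddrinfo(hostname, None, family, socket.SOCK_STREAM)
    return [sockaddr[0] for (_fam, _type, _proto, _canon, sockaddr) in infos]
```

Unlike gethostbyname(), this interface transparently covers both
address families, which is exactly the property the hybrid setup needs.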
|
IPv4-only components
~~~~~~~~~~~~~~~~~~~~

============================ =================== ====================
Component                    IPv6 Status         Planned Version
============================ =================== ====================
Xen instance migration       Not supported       Xen 4.1: libxenlight
KVM instance migration       Not supported       Unknown
KVM VNC access               Not supported       Unknown
============================ =================== ====================


Privilege Separation
--------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.2 we introduced privilege separation for the RAPI daemon.
This was done directly in the daemon's code in the process of
daemonizing itself. Doing so leads to several potential issues. For
example, a file could be opened while the code is still running as
``root`` and for some reason not be closed again. Even after changing
the user ID, the file descriptor can be written to.

Implementation
~~~~~~~~~~~~~~

To address these shortcomings, daemons will be started under the target
user right away. The ``start-stop-daemon`` utility used to start daemons
supports the ``--chuid`` option to change user and group ID before
starting the executable.
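
For a process that cannot rely on ``start-stop-daemon``, the same
effect can be sketched in Python. This snippet is purely illustrative
(the function name is hypothetical); it shows why the IDs must be
changed before any file is opened:

```python
import grp
import os
import pwd

def drop_privileges(username, groupname):
    """Switch to an unprivileged user and group as early as possible.

    Must run before any file is opened, so that no descriptor is ever
    created with root permissions. The group is changed first: after
    setuid() the process may no longer be allowed to call setgid().
    """
    gid = grp.getgrnam(groupname).gr_gid
    uid = pwd.getpwnam(username).pw_uid
    os.setgroups([])   # drop supplementary groups
    os.setgid(gid)
    os.setuid(uid)     # irreversible once the process is non-root
```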
|
The intermediate solution for the RAPI daemon from Ganeti 2.2 will be
removed again.

Files written by the daemons may need to have an explicit owner and
group set (easily done through ``utils.WriteFile``).
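
A stdlib-only sketch of such a write helper is shown below. Ganeti's
actual ``utils.WriteFile`` has a richer interface; this only
illustrates setting mode and ownership explicitly, and atomically:

```python
import os
import tempfile

def write_file(path, data, mode=0o640, uid=-1, gid=-1):
    """Atomically write `data` to `path` with explicit permissions.

    Mode and ownership are applied to a temporary file before the
    rename, so the final file never appears with wrong permissions;
    a uid/gid of -1 leaves the respective ID unchanged.
    """
    fd, tmpname = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        os.write(fd, data)
        os.fchmod(fd, mode)      # explicit permissions
        os.fchown(fd, uid, gid)  # explicit owner and group
    finally:
        os.close(fd)
    os.rename(tmpname, path)     # atomic replace
```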
|
All SSH-related code is removed from the ``ganeti.bootstrap`` module and
core components and moved to a separate script. The core code will
simply assume a working SSH setup to be in place.

Security Domains
~~~~~~~~~~~~~~~~

In order to separate the permissions of file sets, we group them into
the following three security domains:

1. Public: ``0755`` respectively ``0644``
2. Ganeti wide: shared between the daemons (gntdaemons)
3. Secret files: shared among a specific set of daemons/users

For point 3, this table shows how the sets map to groups and their
users:

=== ========== ============================== ==========================
Set Group      Users                          Description
=== ========== ============================== ==========================
A   gntrapi    gntrapi, gntmasterd            Share data between
                                              gntrapi and gntmasterd
B   gntadmins  gntrapi, gntmasterd, *users*   Shared between users who
                                              need to call gntmasterd
C   gntconfd   gntconfd, gntmasterd           Share data between
                                              gntconfd and gntmasterd
D   gntmasterd gntmasterd                     masterd only; currently
                                              only to redistribute the
                                              configuration, has access
                                              to all files under
                                              ``lib/ganeti``
E   gntdaemons gntmasterd, gntrapi, gntconfd  Shared between the various
                                              Ganeti daemons to exchange
                                              data
=== ========== ============================== ==========================

Restricted commands
~~~~~~~~~~~~~~~~~~~

The following commands still need root to fulfill their functions:

::

  gnt-cluster {init|destroy|command|copyfile|rename|masterfailover|renew-crypto}
  gnt-node {add|remove}
  gnt-instance {console}

Directory structure and permissions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's how we propose to change the filesystem hierarchy and its
permissions.

Assuming it follows the defaults: ``gnt${daemon}`` for the user and
the groups from the section `Security Domains`_::

  ${localstatedir}/lib/ganeti/ (0755; gntmasterd:gntmasterd)
    cluster-domain-secret (0600; gntmasterd:gntmasterd)
    config.data (0640; gntmasterd:gntconfd)
    hmac.key (0440; gntmasterd:gntconfd)
    known_host (0644; gntmasterd:gntmasterd)
    queue/ (0700; gntmasterd:gntmasterd)
      archive/ (0700; gntmasterd:gntmasterd)
        * (0600; gntmasterd:gntmasterd)
      * (0600; gntmasterd:gntmasterd)
    rapi.pem (0440; gntrapi:gntrapi)
    rapi_users (0640; gntrapi:gntrapi)
    server.pem (0440; gntmasterd:gntmasterd)
    ssconf_* (0444; root:gntmasterd)
    uidpool/ (0750; root:gntmasterd)
    watcher.data (0600; root:gntmasterd)
  ${localstatedir}/run/ganeti/ (0770; gntmasterd:gntdaemons)
    socket/ (0750; gntmasterd:gntadmins)
      ganeti-master (0770; gntmasterd:gntadmins)
  ${localstatedir}/log/ganeti/ (0770; gntmasterd:gntdaemons)
    master-daemon.log (0600; gntmasterd:gntdaemons)
    rapi-daemon.log (0600; gntrapi:gntdaemons)
    conf-daemon.log (0600; gntconfd:gntdaemons)
    node-daemon.log (0600; gntnoded:gntdaemons)
|


Feature changes
===============


External interface changes
==========================


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: