=================
Ganeti 2.3 design
=================

This document describes the major changes in Ganeti 2.3 compared to
the 2.2 version.

.. contents:: :depth: 4

As for 2.1 and 2.2, we divide the 2.3 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered part of the
same pool for allocation purposes: DRBD instances, for example, can
be allocated on any two nodes.

This causes problems when nodes are not all equally connected to each
other. For example, if a cluster is created over two sets of machines,
each connected to its own switch, the internal bandwidth between
machines connected to the same switch might be bigger than the
bandwidth of inter-switch connections.

Moreover, some operations inside a cluster require all nodes to be
locked together for inter-node consistency, and won't scale if we
increase the number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group. Bigger clusters will be
able to have more than one group, and each node will belong to exactly
one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following
new commands and flags will be introduced::

  gnt-group add <group>                 # add a new node group
  gnt-group remove <group>              # delete an empty node group
  gnt-group list                        # list node groups
  gnt-group rename <oldname> <newname>  # rename a node group
  gnt-node {list,info} -g <group>       # list only nodes in a node group
  gnt-node modify -g <group>            # assign a node to a node group

    
61
Node group attributes
62
+++++++++++++++++++++
63

    
64
In clusters with more than one node group, it may be desirable to
65
establish local policies regarding which groups should be preferred when
66
performing allocation of new instances, or inter-group instance migrations.
67

    
68
To help with this, we will provide an ``alloc_policy`` attribute for
69
node groups. Such attribute will be honored by iallocator plugins when
70
making automatic decisions regarding instance placement.
71

    
72
The ``alloc_policy`` attribute can have the following values:
73

    
74
- unallocable: the node group should not be a candidate for instance
75
  allocations, and the operation should fail if only groups in this
76
  state could be found that would satisfy the requirements.
77

    
78
- last_resort: the node group should not be used for instance
79
  allocations, unless this would be the only way to have the operation
80
  succeed.
81

    
82
- preferred: the node group can be used freely for allocation of
83
  instances (this is the default state for newly created node
84
  groups). Note that prioritization among groups in this state will be
85
  deferred to the  iallocator plugin that's being used.
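
The three policy values above could be honored by a plugin along the
following lines; this is a hypothetical sketch, not the real Ganeti
iallocator interface, and the function and data names are illustrative
only:

```python
# Hypothetical sketch of how an iallocator plugin might honor
# ``alloc_policy``; names and structures are illustrative, not the
# real Ganeti iallocator protocol.

def candidate_groups(groups):
    """Return node groups eligible for allocation, in preference order.

    ``groups`` maps group name -> alloc_policy string. Preferred
    groups are tried first; last_resort groups are considered only if
    no preferred group exists; unallocable groups are never returned.
    """
    preferred = [g for g, pol in groups.items() if pol == "preferred"]
    if preferred:
        return preferred
    return [g for g, pol in groups.items() if pol == "last_resort"]

groups = {
    "rack1": "preferred",
    "rack2": "last_resort",
    "decom": "unallocable",
}
print(candidate_groups(groups))  # ['rack1']
```

An allocation request that matches only unallocable groups would thus
get an empty candidate list and fail, as required.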

Node group operations
+++++++++++++++++++++

One operation at the node group level will be initially provided::

  gnt-group drain <group>

The purpose of this operation is to migrate all instances in a given
node group to other groups in the cluster, e.g. to reclaim capacity if
there are enough free resources in other node groups that share a
storage pool with the evacuated group.

Instance level changes
++++++++++++++++++++++

With the introduction of node groups, instances will be required to
live in only one group at a time; this is mostly important for DRBD
instances, which will not be allowed to have their primary and
secondary nodes in different node groups. To support this, we envision
the following changes:

  - The iallocator interface will be augmented, and node groups exposed,
    so that plugins will be able to make a decision regarding the group
    in which to place a new instance. By default, all node groups will
    be considered, but it will be possible to include a list of groups
    in the creation job, in which case the plugin will limit itself to
    considering those; in both cases, the ``alloc_policy`` attribute
    will be honored.
  - If, on the other hand, primary and secondary nodes are specified
    for a new instance, they will be required to be in the same node
    group.
  - Moving an instance between groups can only happen via an explicit
    operation, which for example in the case of DRBD will work by
    internally performing a replace-disks, a migration, and a second
    replace-disks. It will be possible to clean up an interrupted
    group-move operation.
  - Cluster verify will signal an error if an instance has nodes
    belonging to different groups. Additionally, changing the group of
    a given node will initially only be allowed if the node is empty,
    as a straightforward mechanism to avoid creating such a situation.
  - Inter-group instance migration will have the same operation modes
    as new instance allocation, defined above: letting an iallocator
    plugin decide the target group, possibly restricting the set of
    node groups to consider, or specifying target primary and
    secondary nodes. In both cases, the target group or nodes must be
    able to accept the instance network- and storage-wise; the
    operation will fail otherwise, though in the future we may be able
    to allow some parameters to be changed together with the move (in
    the meantime, an import/export will be required in this scenario).
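
The DRBD group-move sequence described above can be sketched as an
ordered, resumable list of steps; the step journal is a hypothetical
illustration of why an interrupted group-move can be cleaned up, not
the actual implementation:

```python
# Illustrative sketch of the DRBD inter-group move described above.
# The journal mechanism is hypothetical, not Ganeti's actual code.

DRBD_MOVE_STEPS = [
    "replace-disks",   # rebuild the secondary in the target group
    "migrate",         # make the target-group node the primary
    "replace-disks",   # rebuild the remaining old-group replica
]

def remaining_steps(journal):
    """Given the steps already completed (the journal), return what is
    left to run; this ordered, journaled structure is what makes
    cleanup of an interrupted group-move possible."""
    return DRBD_MOVE_STEPS[len(journal):]

print(remaining_steps([]))                 # the full sequence
print(remaining_steps(["replace-disks"]))  # ['migrate', 'replace-disks']
```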

Internal changes
++++++++++++++++

We expect the following changes for cluster management:

  - Frequent multinode operations, such as os-diagnose or
    cluster-verify, will act on one group at a time, which will have
    to be specified in all cases, except for clusters with just one
    group. Command line tools will also have a way to easily target
    all groups, by generating one job per group.
  - Groups will have a human-readable name, but will internally always
    be referenced by a UUID, which will be immutable; for example,
    nodes will contain the UUID of the group they belong to. This is
    done to simplify referencing while keeping it easy to handle
    renames and movements. If we see that this works well, we'll
    transition other config objects (instances, nodes) to the same
    model.
  - The addition of a new per-group lock will be evaluated, to see
    whether some operations now requiring the BGL can be transitioned
    to it.
  - Master candidate status will be allowed to be spread among groups.
    For the first version we won't add any restriction over how this
    is done, although in the future we may have a minimum number of
    master candidates which Ganeti will try to keep in each group, for
    example.

Other work and future changes
+++++++++++++++++++++++++++++

Commands like ``gnt-cluster command``/``gnt-cluster copyfile`` will
continue to work on the whole cluster, but it will be possible to
target one group only by specifying it.

Commands which allow selection of sets of resources (for example
``gnt-instance start``/``gnt-instance stop``) will be able to select
them by node group as well.

Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future
version should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration
diffusion, to allow better master scalability. For example it could be
possible to change some all-nodes RPCs to contact each group once,
from the master, and make one node in the group perform internal
diffusion. We won't implement this in the first version, but we'll
evaluate it for the future, if we see scalability problems on big
multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, Sheepdog, Ceph)
we expect groups to be the basis for this, allowing for example a
different Sheepdog/Ceph cluster, or a different SAN, to be connected
to each group. In some cases this will mean that inter-group move
operations will necessarily be performed with instance downtime,
unless the hypervisor has block-migrate functionality, and we
implement support for it (this would be theoretically possible today
with KVM, for example).

Scalability issues with big clusters
------------------------------------

Current and future issues
~~~~~~~~~~~~~~~~~~~~~~~~~

Assuming the node groups feature will enable bigger clusters, other
parts of Ganeti will be impacted even more by the (in effect) bigger
clusters.

While many areas will be impacted, one is the most important: the fact
that the watcher still needs to be able to repair instance data on the
current five-minute time-frame (a shorter time-frame would be even
better). This means that the watcher itself needs to have parallelism
when dealing with node groups.

Also, the iallocator plugins are being fed data from Ganeti but also
need access to the full cluster state, and in general we still rely on
being able to compute the full cluster state somewhat “cheaply” and
on-demand. This conflicts with the goal of disconnecting the different
node groups, and of keeping the same parallelism while growing the
cluster size.

Another issue is that the current capacity calculations are done
completely outside Ganeti (and they need access to the entire cluster
state), and this prevents keeping the capacity numbers in sync with
the cluster state. While this is still acceptable for smaller clusters
where a small number of allocations/removals are presumed to occur
between two periodic capacity calculations, on bigger clusters where
we aim to parallelize heavily between node groups this is no longer
true.

As for proposed changes, the main one is introducing a cluster state
cache (not serialised to disk), and updating many of the LUs and
cluster operations to account for it. Furthermore, the capacity
calculations will be integrated via a new OpCode/LU, so that we have
faster feedback (instead of periodic computation).

Cluster state cache
~~~~~~~~~~~~~~~~~~~

A new cluster state cache will be introduced. The cache relies on two
main ideas:

- the total node memory and CPU count are very seldom changing; the
  total node disk space is also slow-changing, but can change at
  runtime; the free memory and free disk will change significantly
  for some jobs, but on a short timescale; in general, these values
  will be mostly “constant” during the lifetime of a job
- we already have a periodic set of jobs that query the node and
  instance state, driven by the :command:`ganeti-watcher` command, and
  we're just discarding the results after acting on them

Given the above, it makes sense to cache the results of node and
instance state (with a focus on the node state) inside the master
daemon.

The cache will not be serialised to disk, and will be for the most
part transparent to the outside of the master daemon.

Cache structure
+++++++++++++++

The cache will be oriented with a focus on node groups, so that it
will be easy to invalidate an entire node group, or a subset of
nodes, or the entire cache. The instances will be stored in the node
group of their primary node.

Furthermore, since the node and instance properties determine the
capacity statistics in a deterministic way, the cache will also hold,
at each node group level, the total capacity as determined by the new
capacity iallocator mode.
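
The group-oriented layout described above might look roughly like the
following; this is a hypothetical in-memory structure with illustrative
field names, not the actual master daemon data model:

```python
# Hypothetical sketch of a group-oriented cluster state cache; field
# names are illustrative, not Ganeti's real data model.

cache = {
    "group-uuid-1": {
        "nodes": {
            "node1": {"free_mem": 4096, "free_disk": 200},
            "node2": {"free_mem": 8192, "free_disk": 500},
        },
        # instances are stored in the group of their primary node
        "instances": {"inst1": {"primary": "node1", "state": "running"}},
        # per-group capacity, derived deterministically from the above
        "capacity": {"tspecs": [(1024, 100, 1, 7)]},
    },
}

def invalidate_group(cache, group_uuid):
    """Drop a whole node group from the cache in one step -- the main
    benefit of keying the cache by group UUID."""
    cache.pop(group_uuid, None)

invalidate_group(cache, "group-uuid-1")
print(cache)  # {}
```

Keying the top level by group UUID is what makes whole-group
invalidation (and per-group capacity storage) a constant-time
operation.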

Cache updates
+++++++++++++

The cache will be updated whenever a query for a node state returns
“full” node information (so as to keep the cache state for a given
node consistent). Partial results will not update the cache (see the
next paragraph).

Since there will be no way to feed the cache from outside, and we
would like to have a consistent cache view when driven by the watcher,
we'll introduce a new OpCode/LU for the watcher to run, instead of the
current separate opcodes (see below in the watcher section).

Updates to a node that change the node's specs “downward” (e.g. less
memory) will invalidate the capacity data. Updates that increase the
node will not invalidate the capacity, as we're more interested in “at
least available” correctness, not “at most available”.

Cache invalidation
++++++++++++++++++

If a partial node query is done (e.g. just for the node free space),
and the returned values don't match the cache, then the entire node
state will be invalidated.

By default, all LUs will invalidate the caches for all nodes and
instances they lock. If an LU uses the BGL, then it will invalidate
the entire cache. In time, it is expected that LUs will be modified
not to invalidate, if they are not expected to change the node's
and/or instance's state (e.g. ``LUConnectConsole``, or
``LUActivateInstanceDisks``).

Invalidation of a node's properties will also invalidate the capacity
data associated with that node.
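
The partial-query consistency check described above could be sketched
as follows (hypothetical structures and names, not the real master
daemon code):

```python
# Sketch of the partial-query consistency check described above
# (hypothetical structures, not the real master daemon code).

def check_partial_result(node_cache, node, partial):
    """Compare a partial node query result against the cache; on any
    mismatch, invalidate the whole node entry (full state plus any
    capacity data derived from it)."""
    cached = node_cache.get(node)
    if cached is None:
        return
    if any(cached.get(key) != value for key, value in partial.items()):
        del node_cache[node]  # invalidate the entire node state

node_cache = {"node1": {"free_mem": 4096, "free_disk": 200}}
check_partial_result(node_cache, "node1", {"free_disk": 200})  # match: kept
check_partial_result(node_cache, "node1", {"free_disk": 150})  # mismatch: dropped
print("node1" in node_cache)  # False
```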

Cache lifetime
++++++++++++++

The cache elements will have an upper bound on their lifetime; the
proposal is to make this one hour, which should be a high enough value
to cover the watcher being blocked by a medium-term job (e.g. 20-30
minutes).

Cache usage
+++++++++++

The cache will be used by default for most queries (e.g. a Luxi call,
without locks, for the entire cluster). Since this will be a change
from the current behaviour, we'll need to allow non-cached responses,
e.g. via a ``--cache=off`` or similar argument (which will force the
query).

The cache will also be used for the iallocator runs, so that computing
an allocation solution can proceed independently from other jobs which
lock parts of the cluster. This is important as we need to separate
allocation on one group from exclusive blocking jobs on other node
groups.

The capacity calculations will also use the cache. This is detailed in
the respective sections.

Watcher operation
~~~~~~~~~~~~~~~~~

As detailed in the cluster cache section, the watcher also needs
improvements in order to scale with the cluster size.

As a first improvement, the proposal is to introduce a new OpCode/LU
pair that runs with locks held over the entire query sequence (the
current watcher runs a job with two opcodes, which grab and release
the locks individually). The new opcode will be called
``OpUpdateNodeGroupCache`` and will do the following:

- try to acquire all node/instance locks (to examine in more depth,
  and possibly alter) in the given node group
- invalidate the cache for the node group
- acquire node and instance state (possibly via a new single RPC call
  that combines node and instance information)
- update the cache
- return the needed data

The reason for the per-node group query is that we don't want a busy
node group to prevent instance maintenance in other node
groups. Therefore, the watcher will introduce parallelism across node
groups, and it will be possible to have overlapping watcher runs. The
new execution sequence will be:

- the parent watcher process acquires the global watcher lock
- query the list of node groups (lockless or very short locks only)
- fork N children, one for each node group
- release the global lock
- poll/wait for the children to finish

Each forked child will do the following:

- try to acquire the per-node group watcher lock
- if it fails to acquire the lock, exit with a special code telling
  the parent that the node group is already being managed by another
  watcher process
- otherwise, submit an ``OpUpdateNodeGroupCache`` job
- get the results (possibly after a long time, due to a busy group)
- run the needed maintenance operations for the current group

    
366
This new mode of execution means that the master watcher processes might
367
overlap in running, but not the individual per-node group child
368
processes.
369

    
370
This change allows us to keep (almost) the same parallelism when using a
371
bigger cluster with node groups versus two separate clusters.
372

    
373

    
374
Cost of periodic cache updating
375
+++++++++++++++++++++++++++++++
376

    
377
Currently the watcher only does “small” queries for the node and
378
instance state, and at first sight changing it to use the new OpCode
379
which populates the cache with the entire state might introduce
380
additional costs, which must be payed every five minutes.
381

    
382
However, the OpCodes that the watcher submits are using the so-called
383
dynamic fields (need to contact the remote nodes), and the LUs are not
384
selective—they always grab all the node and instance state. So in the
385
end, we have the same cost, it just becomes explicit rather than
386
implicit.
387

    
388
This ‘grab all node state’ behaviour is what makes the cache worth
389
implementing.
390

    
391
Intra-node group scalability
392
++++++++++++++++++++++++++++
393

    
394
The design above only deals with inter-node group issues. It still makes
395
sense to run instance maintenance for nodes A and B if only node C is
396
locked (all being in the same node group).
397

    
398
This problem is commonly encountered in previous Ganeti versions, and it
399
should be handled similarly, by tweaking lock lifetime in long-duration
400
jobs.
401

    
402
TODO: add more ideas here.

State file maintenance
++++++++++++++++++++++

The splitting of node group maintenance into different children which
will run in parallel requires that the state file handling changes
from monolithic updates to partial ones.

There are two files that the watcher maintains:

- ``$LOCALSTATEDIR/lib/ganeti/watcher.data``, its internal state file,
  used for deciding internal actions
- ``$LOCALSTATEDIR/run/ganeti/instance-status``, a file designed for
  external consumption

For the first file, since it's used only internally by the watchers,
we can move to a per-node group configuration.

For the second file, even if it's used as an external interface, we
will need to make some changes to it: because the different node
groups can return results at different times, we need to either split
the file into per-group files or keep the single file and add a
per-instance timestamp (currently the file holds only the instance
name and state).

The proposal is that each child process maintains its own node group
file, and the master process will, right after querying the node group
list, delete any extra per-node group state files. This leaves
consumers to run a simple ``cat instance-status.group-*`` to obtain
the entire list of instances and their states. If needed, the
modification timestamp of each file can be used to determine the age
of the results.
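
An external consumer could merge the per-group files along these
lines; the file naming follows the document, while the simple
"name state" line format is an assumption for illustration:

```python
# Sketch of how an external consumer might merge the per-group status
# files described above. The "name state" line format is an assumed
# simplification for illustration.

import glob
import os
import tempfile

def read_instance_status(directory):
    """Merge all instance-status.group-* files in a directory, tagging
    each entry with the modification time of the file it came from so
    the age of the data is visible."""
    merged = {}
    pattern = os.path.join(directory, "instance-status.group-*")
    for path in glob.glob(pattern):
        mtime = os.path.getmtime(path)
        with open(path) as status_file:
            for line in status_file:
                name, state = line.split(None, 1)
                merged[name] = (state.strip(), mtime)
    return merged

# small self-contained demo with two per-group files
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "instance-status.group-aaa"), "w") as f:
    f.write("inst1 running\n")
with open(os.path.join(demo_dir, "instance-status.group-bbb"), "w") as f:
    f.write("inst2 ERROR_down\n")
status = read_instance_status(demo_dir)
print(sorted(status))  # ['inst1', 'inst2']
```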

Capacity calculations
~~~~~~~~~~~~~~~~~~~~~

Currently, the capacity calculations are done completely outside
Ganeti. As explained in the current problems section, this needs to
account better for the cluster state changes.

Therefore a new OpCode will be introduced, ``OpComputeCapacity``, that
will either return the current capacity numbers (if available), or
trigger a new capacity calculation, via the iallocator framework,
which will get a new method called ``capacity``.

This method will feed the cluster state (for the complete set of node
groups, or alternatively just a subset) to the iallocator plugin
(either the specified one, or the default if none is specified), and
return the new capacity in the format currently exported by the htools
suite and known as the “tiered specs” (see :manpage:`hspace(1)`).

tspec cluster parameters
++++++++++++++++++++++++

Currently, the “tspec” calculations done in :command:`hspace` require
some additional parameters:

- maximum instance size
- type of instance storage
- maximum ratio of virtual CPUs per physical CPUs
- minimum disk free

For the integration in Ganeti, there are multiple ways to pass these:

- ignored by Ganeti, leaving it the responsibility of the iallocator
  plugin whether to use them at all or not
- as input to the opcode
- as proper cluster parameters

Since the first option is not consistent with the intended changes, a
combination of the last two is proposed:

- at cluster level, we'll have cluster-wide defaults
- at node group level, we'll allow overriding the cluster defaults
- and if they are passed in via the opcode, they will override the
  values for the current computation

Whenever the capacity is requested via different parameters, it will
invalidate the cache, even if otherwise the cache is up-to-date.

The new parameters are:

- max_inst_spec: (int, int, int), the maximum instance specification
  accepted by this cluster or node group, in the order of memory,
  disk, vcpus
- default_template: string, the default disk template to use
- max_cpu_ratio: double, the maximum ratio of VCPUs/PCPUs
- max_disk_usage: double, the maximum disk usage (as a ratio)

These might also be used in instance creations (to be determined
later, after they are introduced).

OpCode details
++++++++++++++

Input:

- iallocator: string (optional, otherwise uses the cluster default)
- cached: boolean, optional, defaults to true, and denotes whether we
  accept cached responses
- the above new parameters, optional; if they are passed, they will
  overwrite all node groups' parameters

Output:

- cluster: list of tuples (memory, disk, vcpu, count), in decreasing
  order of specifications; the first three members represent the
  instance specification, the last one the count of how many instances
  of this specification can be created on the cluster
- node_groups: a dictionary keyed by node group UUID, with values a
  dictionary:

  - tspecs: a list like the cluster one
  - additionally, the new cluster parameters, denoting the input
    parameters that were used for this node group

- ctime: the date the result has been computed; this represents the
  oldest creation time amongst all node groups (so as to accurately
  represent how much out-of-date the global response is)

Note that due to the way the tspecs are computed, for any given
specification, the total available count is the count for the given
entry, plus the sum of counts for higher specifications.
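
The counting rule just described can be made concrete with a short
sketch that derives the total available count at each tier from a
tspec list (the sample numbers are invented for illustration):

```python
# Sketch of the tspec counting rule described above: the total number
# of instances available at a given specification is its own count
# plus the counts of all higher (earlier) specifications.

def total_counts(tspecs):
    """tspecs: list of (memory, disk, vcpu, count) tuples in
    decreasing order of specification. Returns a parallel list of
    ((memory, disk, vcpu), total_count) with cumulative totals."""
    totals = []
    running = 0
    for mem, disk, vcpu, count in tspecs:
        running += count
        totals.append(((mem, disk, vcpu), running))
    return totals

tspecs = [(4096, 500, 4, 2), (2048, 250, 2, 5), (1024, 100, 1, 10)]
print(total_counts(tspecs))
# [((4096, 500, 4), 2), ((2048, 250, 2), 7), ((1024, 100, 1), 17)]
```

So in this example 17 instances of the smallest specification fit in
total: 10 at their own tier, plus the 7 slots of the larger tiers.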

Node flags
----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes are, from the point of view of their capabilities,
homogeneous. This means the cluster considers all nodes capable of
becoming master candidates, and of hosting instances.

This prevents some deployment scenarios: e.g. having a Ganeti instance
(in another cluster) be just a master candidate, in case all other
master candidates go down (but not, of course, host instances), or
having a node in a remote location just host instances but not become
master, etc.

Proposed changes
~~~~~~~~~~~~~~~~

Two new capability flags will be added to the node:

- master_capable, denoting whether the node can become a master
  candidate or master
- vm_capable, denoting whether the node can host instances

In terms of the other flags, "not master_capable" is a stronger
version of "not master candidate", and "not vm_capable" is a stronger
version of "drained".

The master_capable flag will affect the auto-promotion code and node
modifications.

The vm_capable flag will affect the iallocator protocol, capacity
calculations, node checks in cluster verify, and will interact in
novel ways with locking (unfortunately).

It is envisaged that most nodes will be both vm_capable and
master_capable, and just a few will have one of these flags cleared.
Ganeti itself will allow clearing both flags, even though this doesn't
make much sense currently.

Job priorities
--------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all jobs and opcodes have the same priority. Once a job
starts executing, its thread won't be released until all opcodes have
gotten their locks and done their work. When a job is finished, the
next job is selected strictly by its incoming order. This does not
mean jobs are run in their incoming order—locks and other delays can
cause them to be stalled for some time.

In some situations, e.g. an emergency shutdown, one may want to run a
job as soon as possible. This is not possible currently if there are
pending jobs in the queue.

Proposed changes
~~~~~~~~~~~~~~~~

Each opcode will be assigned a priority on submission. Opcode
priorities are integers and the lower the number, the higher the
opcode's priority is. Within the same priority, jobs and opcodes are
initially processed in their incoming order.

Submitted opcodes can have one of the priorities listed below. Other
priorities are reserved for internal use. The absolute range is
-20..+19. Opcodes submitted without a priority (e.g. by older clients)
are assigned the default priority.

  - High (-10)
  - Normal (0, default)
  - Low (+10)

As a change from the current model where executing a job blocks one
thread for the whole duration, the new job processor must return the
job to the queue after each opcode and also if it can't get all locks
in a reasonable timeframe. This will allow opcodes of higher priority
submitted in the meantime to be processed, or opcodes of the same
priority to try to get their locks. When added to the job queue's
workerpool, the priority is determined by the first unprocessed opcode
in the job.

If an opcode is deferred, the job will go back to the "queued" status,
even though it's just waiting to try to acquire its locks again later.

If an opcode can not be processed after a certain number of retries or
a certain amount of time, it should increase its priority. This will
avoid starvation.

A job's priority can never go below -20. If a job hits priority -20,
it must acquire its locks in blocking mode.
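
The retry-driven escalation described above could be sketched as
follows; only the bounds (-20..+19, blocking at -20) come from the
text, while the step size and retry threshold are illustrative
assumptions:

```python
# Sketch of the retry-driven priority escalation described above.
# The step size and retry threshold are illustrative assumptions;
# only the -20 bound and the blocking rule come from the design.

PRIO_HIGHEST = -20

def escalate(priority, retries, retries_per_step=5, step=1):
    """Raise (numerically lower) an opcode's priority after repeated
    failed lock acquisitions, never going below -20."""
    if retries and retries % retries_per_step == 0:
        priority = max(PRIO_HIGHEST, priority - step)
    return priority

def must_block(priority):
    """At priority -20 the job must acquire its locks in blocking mode."""
    return priority == PRIO_HIGHEST

prio = 0
for attempt in range(1, 101):
    prio = escalate(prio, attempt)
print(prio, must_block(prio))  # -20 True
```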

Opcode priorities are synchronised to disk in order to be restored
after a restart or crash of the master daemon.

Priorities also need to be considered inside the locking library to
ensure opcodes with higher priorities get locks first. See
:ref:`locking priorities <locking-priorities>` for more details.

Worker pool
+++++++++++

To support job priorities in the job queue, the worker pool underlying
the job queue must be enhanced to support task priorities. Currently
tasks are processed in the order they are added to the queue (but, due
to their nature, they don't necessarily finish in that order). All
tasks are equal. To support tasks with higher or lower priority, a few
changes have to be made to the queue inside a worker pool.

Each task is assigned a priority when added to the queue. This
priority can not be changed until the task is executed (this is fine,
as in all current use-cases tasks are added to a pool and then
forgotten about until they're done).

A task's priority can be compared to Unix process priorities. The
lower the priority number, the closer to the queue's front it is. A
task with priority 0 is going to be run before one with priority 10.
Tasks with the same priority are executed in the order in which they
were added.
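
The ordering rules just stated (lower number first, FIFO within a
priority) can be sketched with a heap queue and an insertion counter
as tie-breaker; this is an illustration of the technique, not the
actual Ganeti worker pool:

```python
# Sketch of a priority-ordered task queue with FIFO behaviour inside
# each priority level, as described above (illustrative, not the
# actual Ganeti worker pool).

import heapq
import itertools

class TaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add_task(self, priority, task):
        # (priority, sequence) ensures lower numbers run first and
        # equal priorities keep their insertion order
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def pop_task(self):
        return heapq.heappop(self._heap)[2]

queue = TaskQueue()
queue.add_task(10, "low")
queue.add_task(0, "normal-1")
queue.add_task(0, "normal-2")
queue.add_task(-10, "high")
order = [queue.pop_task() for _ in range(4)]
print(order)  # ['high', 'normal-1', 'normal-2', 'low']
```

The counter also prevents the heap from ever comparing task objects
themselves, which may not be orderable.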
649

    
650
While a task is running it can query its own priority. If it's not ready
651
yet for finishing, it can raise an exception to defer itself, optionally
652
changing its own priority. This is useful for the following cases:
653

    
654
- A task is trying to acquire locks, but those locks are still held by
655
  other tasks. By deferring itself, the task gives others a chance to
656
  run. This is especially useful when all workers are busy.
657
- If a task decides it hasn't gotten its locks in a long time, it can
658
  start to increase its own priority.
659
- Tasks waiting for long-running operations running asynchronously could
660
  defer themselves while waiting for a long-running operation.
661

    
662
With these changes, the job queue will be able to implement per-job
663
priorities.
664

    
665
.. _locking-priorities:
666

    
667
Locking
668
+++++++
669

    
670
In order to support priorities in Ganeti's own lock classes,
671
``locking.SharedLock`` and ``locking.LockSet``, the internal structure
672
of the former class needs to be changed. The last major change in this
673
area was done for Ganeti 2.1 and can be found in the respective
674
:doc:`design document <design-2.1>`.
675

    
676
The plain list (``[]``) used as a queue is replaced by a heap queue,
similar to the `worker pool`_. The heap or priority queue does automatic
sorting, thereby automatically taking care of priorities. For each
priority there's a plain list with pending acquires, like the single
queue of pending acquires before this change.

When the lock is released, the code locates the list of pending acquires
for the highest waiting priority. The first condition (index 0) is
notified. Once all waiting threads have received the notification, the
condition is removed from the list. If the list of conditions is empty,
it's removed from the heap queue.

Like before, shared acquires are grouped and skip ahead of exclusive
acquires if there's already an existing shared acquire for a priority.
To accomplish this, a separate dictionary of shared acquires per
priority is maintained.
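
The structure described above can be sketched as follows. This is an
illustrative model only (plain lists stand in for the condition
variables), not the real ``locking.SharedLock`` internals:

```python
import heapq

class PendingQueue:
    """Pending acquires grouped by priority (lower value = higher)."""

    def __init__(self):
        self._heap = []      # priorities that currently have waiters
        self._queues = {}    # priority -> list of pending entries
        self._shared = {}    # priority -> the shared entry, if any

    def add(self, priority, name, shared):
        if priority not in self._queues:
            heapq.heappush(self._heap, priority)
            self._queues[priority] = []
        if shared:
            # Group with an existing shared acquire at this priority,
            # thereby skipping ahead of queued exclusive acquires.
            entry = self._shared.get(priority)
            if entry is None:
                entry = ["shr"]
                self._shared[priority] = entry
                self._queues[priority].append(entry)
            entry.append(name)
        else:
            self._queues[priority].append(["exc", name])

    def pop_next(self):
        """Return the entry to notify next (highest priority, index 0).

        If its list becomes empty, the priority is dropped from the heap.
        """
        priority = self._heap[0]
        queue = self._queues[priority]
        entry = queue.pop(0)
        if entry[0] == "shr":
            del self._shared[priority]
        if not queue:
            heapq.heappop(self._heap)
            del self._queues[priority]
        return entry

q = PendingQueue()
for prio, name, shared in [(0, "threadE1", False), (0, "threadE2", False),
                           (0, "threadS1", True), (0, "threadS2", True),
                           (2, "threadS4", True), (10, "threadE3", False)]:
    q.add(prio, name, shared)

order = [q.pop_next() for _ in range(4)]
# exc/threadE1, exc/threadE2, shr/threadS1+threadS2, shr/threadS4
```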

To simplify the code and reduce memory consumption, the concept of the
"active" and "inactive" condition for shared acquires is abolished. The
lock can't predict what priorities the next acquires will use and even
keeping a cache can become computationally expensive for arguable
benefit (the underlying POSIX pipe, see ``pipe(2)``, needs to be
re-created for each notification anyway).

The following diagram shows a possible state of the internal queue from
a high-level view. Conditions are shown as (waiting) threads. Assuming
no modifications are made to the queue (e.g. more acquires or timeouts),
the lock would be acquired by the threads in this order (concurrent
acquires in parentheses): ``threadE1``, ``threadE2``, (``threadS1``,
``threadS2``, ``threadS3``), (``threadS4``, ``threadS5``), ``threadE3``,
``threadS6``, ``threadE4``, ``threadE5``.

::

  [
    (0, [exc/threadE1, exc/threadE2, shr/threadS1/threadS2/threadS3]),
    (2, [shr/threadS4/threadS5]),
    (10, [exc/threadE3]),
    (33, [shr/threadS6, exc/threadE4, exc/threadE5]),
  ]

IPv6 support
------------

Currently Ganeti does not support IPv6. This is true for nodes as well
as instances. Since IPv4 address exhaustion is threateningly near, the
need for IPv6 is increasing, especially given that bigger and bigger
clusters are supported.

Supported IPv6 setup
~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.3, in addition to the ordinary pure IPv4 setup, we
introduce a hybrid IPv6/IPv4 mode. The latter works as follows:

- all nodes in a cluster have a primary IPv6 address
- the master has an IPv6 address
- all nodes **must** have a secondary IPv4 address

The reason for this hybrid setup is that key components that Ganeti
depends on do not, or only partially, support IPv6. More precisely, Xen
does not support instance migration via IPv6 in versions 3.4 and 4.0.
Similarly, KVM supports neither instance migration nor VNC access over
IPv6 at the time of this writing.

This led to the decision of not supporting pure IPv6 Ganeti clusters, as
very important cluster operations would not have been possible. Using
IPv4 as the secondary address does not affect any of the goals of the
IPv6 support: since secondary addresses do not need to be publicly
accessible, they need not be globally unique. In other words, one can
practically use private IPv4 secondary addresses just for intra-cluster
communication without propagating them across layer 3 boundaries.

netutils: Utilities for handling common network tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently common utility functions are kept in the ``utils`` module.
Since this module keeps growing, network-related functions are moved to
a separate module named *netutils*. Additionally, all these utilities
will be IPv6-enabled.

Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

As mentioned above there will be two different setups in terms of IP
addressing: pure IPv4 and hybrid IPv6/IPv4. To choose between them, a
new cluster init parameter *--primary-ip-version* is introduced. This is
needed as a given name can resolve to both an IPv4 and IPv6 address on a
dual-stack host, effectively making it impossible to infer that bit.

Once a cluster is initialized and the primary IP version chosen, all
nodes that join have to conform to that setup. In the case of our
IPv6/IPv4 setup all nodes *must* have a secondary IPv4 address.

Furthermore we store the primary IP version in ssconf, which is
consulted every time a daemon starts to determine the default bind
address (either *0.0.0.0* or *::*). In an IPv6/IPv4 setup we need to
bind the Ganeti daemons listening on network sockets to the IPv6
address.
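
A minimal sketch of this choice, assuming the primary IP version has
already been read from ssconf (the function name is illustrative, not
the actual daemon code):

```python
import socket

def default_bind_address(primary_ip_version):
    """Map the cluster's primary IP version (4 or 6, as set via
    --primary-ip-version) to the address family and wildcard bind
    address a daemon would use on startup."""
    if primary_ip_version == 6:
        return socket.AF_INET6, "::"
    return socket.AF_INET, "0.0.0.0"

family, addr = default_bind_address(6)
print(addr)  # ::
```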

Node addition
~~~~~~~~~~~~~

When adding a new node to an IPv6/IPv4 cluster it must have an IPv6
address to be used as primary and an IPv4 address used as secondary. As
explained above, every time a daemon is started we use the cluster
primary IP version to determine which address to bind to. The only
exception to this is when a node is added to the cluster. In this case
there is no ssconf available when noded is started and therefore the
correct address needs to be passed to it.

Name resolution
~~~~~~~~~~~~~~~

Since the gethostbyname*() functions do not support IPv6, name
resolution will be done by using the recommended getaddrinfo().
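
For example, with Python's standard library the lookup might be wrapped
like this (a sketch, not the actual netutils code):

```python
import socket

def resolve(name, family=socket.AF_UNSPEC):
    """Resolve a host name via getaddrinfo(), which, unlike the
    gethostbyname*() family, handles IPv6 as well as IPv4; the family
    argument can restrict the results to one version."""
    infos = socket.getaddrinfo(name, None, family, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); the
    # address itself is the first element of sockaddr.
    return [sockaddr[0] for _, _, _, _, sockaddr in infos]

print(resolve("localhost", socket.AF_INET))   # e.g. ['127.0.0.1']
```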

IPv4-only components
~~~~~~~~~~~~~~~~~~~~

============================  ===================  ====================
Component                     IPv6 Status          Planned Version
============================  ===================  ====================
Xen instance migration        Not supported        Xen 4.1: libxenlight
KVM instance migration        Not supported        Unknown
KVM VNC access                Not supported        Unknown
============================  ===================  ====================


Privilege Separation
--------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In Ganeti 2.2 we introduced privilege separation for the RAPI daemon.
This was done directly in the daemon's code in the process of
daemonizing itself. Doing so leads to several potential issues. For
example, a file could be opened while the code is still running as
``root`` and for some reason not be closed again. Even after changing
the user ID, the file descriptor can be written to.

Implementation
~~~~~~~~~~~~~~

To address these shortcomings, daemons will be started under the target
user right away. The ``start-stop-daemon`` utility used to start daemons
supports the ``--chuid`` option to change user and group ID before
starting the executable.

The intermediate solution for the RAPI daemon from Ganeti 2.2 will be
removed again.

Files written by the daemons may need to have an explicit owner and
group set (easily done through ``utils.WriteFile``).
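
Such a helper boils down to something like the following sketch
(``write_file_secure`` is illustrative; it is not the real
``utils.WriteFile`` signature):

```python
import os
import stat
import tempfile

def write_file_secure(path, data, mode, uid=-1, gid=-1):
    """Write a file with explicit permissions and, optionally, owner.

    The mode is enforced on the open descriptor so the file is never
    visible with looser, umask-derived permissions; -1 for uid/gid
    leaves the respective ID unchanged (changing them requires root).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
    try:
        os.fchmod(fd, mode)
        os.fchown(fd, uid, gid)
        os.write(fd, data)
    finally:
        os.close(fd)

path = os.path.join(tempfile.mkdtemp(), "config.data")
write_file_secure(path, b"example", 0o640)
print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # 0o640
```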

All SSH-related code is removed from the ``ganeti.bootstrap`` module and
core components and moved to a separate script. The core code will
simply assume a working SSH setup is in place.

Security Domains
~~~~~~~~~~~~~~~~

In order to separate the permissions of file sets, we divide them into
the following three security domains:

1. Public: ``0755`` for directories, ``0644`` for files
2. Ganeti-wide: shared between the daemons (gntdaemons)
3. Secret files: shared among a specific set of daemons/users

For point 3, this table shows the correlation of the sets to groups
and their users:

=== ========== ============================== ==========================
Set Group      Users                          Description
=== ========== ============================== ==========================
A   gntrapi    gntrapi, gntmasterd            Share data between
                                              gntrapi and gntmasterd
B   gntadmins  gntrapi, gntmasterd, *users*   Shared between users who
                                              need to call gntmasterd
C   gntconfd   gntconfd, gntmasterd           Share data between
                                              gntconfd and gntmasterd
D   gntmasterd gntmasterd                     masterd only; currently
                                              only to redistribute the
                                              configuration, has access
                                              to all files under
                                              ``lib/ganeti``
E   gntdaemons gntmasterd, gntrapi, gntconfd  Shared between the various
                                              Ganeti daemons to exchange
                                              data
=== ========== ============================== ==========================

Restricted commands
~~~~~~~~~~~~~~~~~~~

The following commands still need root privileges to fulfill their
functions:

::

  gnt-cluster {init|destroy|command|copyfile|rename|masterfailover|renew-crypto}
  gnt-node {add|remove}
  gnt-instance {console}

Directory structure and permissions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here's how we propose to change the filesystem hierarchy and its
permissions.

Assuming it follows the defaults: ``gnt${daemon}`` for the user and
the groups from the section `Security Domains`_::

  ${localstatedir}/lib/ganeti/ (0755; gntmasterd:gntmasterd)
     cluster-domain-secret (0600; gntmasterd:gntmasterd)
     config.data (0640; gntmasterd:gntconfd)
     hmac.key (0440; gntmasterd:gntconfd)
     known_host (0644; gntmasterd:gntmasterd)
     queue/ (0700; gntmasterd:gntmasterd)
       archive/ (0700; gntmasterd:gntmasterd)
         * (0600; gntmasterd:gntmasterd)
       * (0600; gntmasterd:gntmasterd)
     rapi.pem (0440; gntrapi:gntrapi)
     rapi_users (0640; gntrapi:gntrapi)
     server.pem (0440; gntmasterd:gntmasterd)
     ssconf_* (0444; root:gntmasterd)
     uidpool/ (0750; root:gntmasterd)
     watcher.data (0600; root:gntmasterd)
  ${localstatedir}/run/ganeti/ (0770; gntmasterd:gntdaemons)
     socket/ (0750; gntmasterd:gntadmins)
       ganeti-master (0770; gntmasterd:gntadmins)
  ${localstatedir}/log/ganeti/ (0770; gntmasterd:gntdaemons)
     master-daemon.log (0600; gntmasterd:gntdaemons)
     rapi-daemon.log (0600; gntrapi:gntdaemons)
     conf-daemon.log (0600; gntconfd:gntdaemons)
     node-daemon.log (0600; gntnoded:gntdaemons)


Feature changes
===============


External interface changes
==========================


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: