Design for parallelized instance creations and opportunistic locking
====================================================================

.. contents:: :depth: 3


Current state and shortcomings
------------------------------

As of Ganeti 2.6, instance creations acquire all node locks when an
:doc:`instance allocator <iallocator>` (henceforth "iallocator") is
used. In situations where many instances are to be created in a short
timeframe, there is a lot of contention on node locks. Effectively all
instance creations are serialized, even on big clusters with multiple
groups.

The situation gets worse when disk wiping is enabled (see
:manpage:`gnt-cluster(8)`) as that can take, depending on disk size and
hardware performance, from minutes to hours. Not waiting for DRBD disks
to synchronize (``wait_for_sync=false``) makes instance creations
slightly faster, but there is a risk of impacting the I/O of other
instances.


Proposed changes
----------------

The goal is to speed up instance creations in combination with an
iallocator, even when the cluster's balance is sacrificed in the
process. The cluster can later be re-balanced using ``hbal``. The main
objective is to reduce the number of node locks acquired for creation
and to release unused locks as fast as possible (the latter is already
being done). To do this safely, several changes are necessary.

Locking library
~~~~~~~~~~~~~~~

Instead of forcibly acquiring all node locks for creating an instance
using an iallocator, only those currently available will be acquired.

To this end, the locking library must be extended to implement
opportunistic locking. Lock sets must be able to acquire only those
locks available at the time, ignoring and not waiting for those held by
another thread.

Locks (``SharedLock``) already support a timeout of zero. This is
different from a blocking acquisition, in which case the timeout would
be ``None``.
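
The difference between the two timeout values can be illustrated with
Python's standard ``threading.Lock``. This is only an analogy: the
helper ``try_acquire`` is hypothetical and ``SharedLock``'s actual
signature is not shown in this document.

```python
import threading


def try_acquire(lock, timeout=None):
    """Map the timeout convention from the text onto threading.Lock.

    timeout=None blocks until the lock is free; timeout=0 makes a single
    attempt and returns immediately.
    """
    if timeout is None:
        return lock.acquire()
    return lock.acquire(timeout=timeout)


busy = threading.Lock()
busy.acquire()                                  # simulate another holder
first_attempt = try_acquire(busy, timeout=0)    # lock is held elsewhere
busy.release()
second_attempt = try_acquire(busy, timeout=0)   # lock is free now
busy.release()
```

With a timeout of zero the first attempt fails without waiting, whereas
a timeout of ``None`` would have blocked until the lock was released.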

Lock sets can essentially be acquired in two different modes. One is to
acquire the whole set, which also blocks other threads from adding new
locks, and the other is to acquire specific locks by name. The function
to acquire locks in a set accepts a timeout. Unless it is ``None``
(blocking acquisition), the timeout covers the whole duration of
acquiring the lock set's internal lock, if necessary, as well as the
member locks. For opportunistic acquisitions the timeout is only
meaningful when acquiring the whole set, in which case it is only used
for acquiring the set's internal lock (used to block lock additions).
For acquiring member locks the timeout is effectively zero to make them
opportunistic.

A new and optional boolean parameter named ``opportunistic`` is added to
``LockSet.acquire`` and re-exported through
``GanetiLockManager.acquire`` for use by ``mcpu``. Internally, lock sets
do the lock acquisition using a helper function, ``__acquire_inner``. It
will be extended to support opportunistic acquisitions. The algorithm is
very similar to acquiring the whole set, with the difference that
acquisitions timing out will be ignored (the timeout in this case is
zero).
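
The behaviour described above can be sketched with a toy lock set built
on ``threading.Lock``. The class ``ToyLockSet`` is hypothetical and is
not Ganeti's ``LockSet``: it only covers acquisition by name, has no
shared mode and no internal lock for blocking additions.

```python
import threading
import time


class ToyLockSet:
    """Hypothetical stand-in for a lock set; not Ganeti's LockSet."""

    def __init__(self, names):
        self._locks = {name: threading.Lock() for name in names}

    def acquire(self, names, timeout=None, opportunistic=False):
        """Return the set of acquired names, or None on timeout.

        With opportunistic=True every member lock is tried with a timeout
        of zero and busy locks are skipped instead of failing the call.
        """
        deadline = None if timeout is None else time.monotonic() + timeout
        acquired = []
        for name in sorted(names):  # fixed order to avoid deadlocks
            lock = self._locks[name]
            if opportunistic:
                ok = lock.acquire(blocking=False)
            elif deadline is None:
                ok = lock.acquire()
            else:
                remaining = max(0.0, deadline - time.monotonic())
                ok = lock.acquire(timeout=remaining)
            if ok:
                acquired.append(name)
            elif not opportunistic:
                # Regular acquisition is all-or-nothing: undo and give up.
                for done in acquired:
                    self._locks[done].release()
                return None
        return set(acquired)

    def release(self, names):
        for name in names:
            self._locks[name].release()


locks = ToyLockSet(["node1", "node2", "node3"])
held_elsewhere = locks.acquire(["node2"])   # simulates another job
available = locks.acquire(["node1", "node2", "node3"], opportunistic=True)
strict = locks.acquire(["node2"], timeout=0)
```

In this sketch the opportunistic call returns only ``node1`` and
``node3``, while the regular call with a timeout of zero fails as a
whole because ``node2`` is busy.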


New lock level
~~~~~~~~~~~~~~

With opportunistic locking used for instance creations (controlled by a
parameter), multiple such requests can start at (essentially) the same
time and compete for node locks. Some logical units, such as
``LUClusterVerifyGroup``, need to acquire all node locks. While such a
logical unit holds them, all instance allocations would fail to get
their locks. The same applies when multiple instance creations are
started at roughly the same time.

To avoid situations where an opcode holding all or many node locks
causes allocations to fail, a new lock level must be added to control
allocations. The logical units for instance failover and migration can
only safely determine whether they need all node locks after the
instance lock has been acquired. Therefore the new lock level, named
"node-alloc" (shorthand for "node-allocation"), will be inserted after
instances (``LEVEL_INSTANCE``) and before node groups
(``LEVEL_NODEGROUP``). Similar to the "big cluster lock" ("BGL"), there
is only a single lock at this level, whose name is "node allocation
lock" ("NAL").
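
The position of the new level can be pictured as an ordered list in
which locks must be requested from left to right. The level names and
the helper ``check_order`` below are illustrative only. The placement of
"node-alloc" between instances and node groups comes from the text; the
positions of the cluster and node levels are assumptions.

```python
# Illustrative level names; only the relative order of "instance",
# "node-alloc" and "nodegroup" is taken from the design text.
LEVELS = ["cluster", "instance", "node-alloc", "nodegroup", "node"]


def check_order(requested):
    """Return True if the levels are requested in ascending level order."""
    indices = [LEVELS.index(level) for level in requested]
    return indices == sorted(indices)


# Failover/migration: the instance lock is taken first, and only then
# can the decision about the node allocation lock be made.
failover_ok = check_order(["instance", "node-alloc", "node"])

# Taking the node allocation lock before the instance lock would
# violate the ordering.
wrong_order = check_order(["node-alloc", "instance"])
```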

As a rule of thumb, the node allocation lock must be acquired in the
same mode as nodes and/or node resources. If all or a large number of
node locks are acquired, the node allocation lock should be acquired as
well. Special attention should be given to logical units started for all
node groups, such as ``LUGroupVerifyDisks``, as they also block many
nodes within a short amount of time.


iallocator
~~~~~~~~~~

The :doc:`iallocator interface <iallocator>` does not need any
modification. When an instance is created, the information for all nodes
is passed to the iallocator plugin. Nodes for which the lock couldn't be
acquired, and which therefore shouldn't be used for the instance in
question, will be shown as offline.
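
A minimal sketch of this masking step follows. The function
``mask_unlocked_nodes``, the shape of the node dictionary and the field
names are assumptions made for illustration; they are not taken from
Ganeti's code.

```python
import copy


def mask_unlocked_nodes(node_info, locked_nodes):
    """Return a copy of node_info with nodes lacking a lock marked offline.

    node_info maps node names to dictionaries of node data; locked_nodes
    is the set of node names whose locks were actually acquired.
    """
    masked = copy.deepcopy(node_info)
    for name, data in masked.items():
        if name not in locked_nodes:
            data["offline"] = True
    return masked


# Hypothetical node data; the field names are assumptions.
nodes = {
    "node1": {"offline": False, "free_memory": 4096},
    "node2": {"offline": False, "free_memory": 8192},
    "node3": {"offline": False, "free_memory": 2048},
}

# Only node1 and node3 could be locked opportunistically.
request_nodes = mask_unlocked_nodes(nodes, {"node1", "node3"})
```

The plugin still receives all three nodes, but ``node2`` appears
offline and is therefore not considered for placement.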


Opcodes
~~~~~~~

The opcodes ``OpInstanceCreate`` and ``OpInstanceMultiAlloc`` will gain
a new parameter to enable opportunistic locking. By default this mode is
disabled so as not to break backwards compatibility.

A new error type is added to describe a temporary lack of resources. Its
name will be ``ECODE_TEMP_NORES``. With opportunistic locks the opcodes
mentioned before only have a partial view of the cluster and can no
longer decide whether an instance could not be allocated due to the
locks it has been given or because the whole cluster is lacking
resources. Therefore, upon encountering the error code for a temporary
lack of resources, the job submitter is required to make this decision
by re-submitting the job or by re-directing it to another cluster.
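
On the submitter's side, this decision could look roughly like the
following sketch. Everything except the name ``ECODE_TEMP_NORES`` is
hypothetical: the exception class, the ``submit`` callable, the string
value of the error code and the retry policy are placeholders rather
than part of any Ganeti client API.

```python
import time

# Placeholder value; the actual constant's value is an assumption here.
ECODE_TEMP_NORES = "temp_insufficient_resources"


class TemporaryShortage(Exception):
    """Hypothetical error carrying the ECODE_TEMP_NORES code."""

    code = ECODE_TEMP_NORES


def submit_with_retry(submit, clusters, attempts_per_cluster=3, delay=0.0):
    """Try each cluster in turn, retrying on a temporary lack of resources.

    Returns (cluster, result) on success, or None if every attempt failed.
    """
    for cluster in clusters:
        for _ in range(attempts_per_cluster):
            try:
                return cluster, submit(cluster)
            except TemporaryShortage:
                time.sleep(delay)  # back off before re-submitting
    return None


calls = []


def fake_submit(cluster):
    """Stand-in for job submission: cluster-a never has free nodes."""
    calls.append(cluster)
    if cluster == "cluster-a":
        raise TemporaryShortage()
    return "job-1"


result = submit_with_retry(fake_submit, ["cluster-a", "cluster-b"])
```

Here the job is re-submitted to the first cluster three times and then
re-directed to the second one, where it succeeds.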

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: