Ganeti 2.0 Granular Locking
===========================

.. contents::

Objective
---------

We want to make sure that multiple operations can run in parallel on a Ganeti
cluster. In order for this to happen we need to make sure concurrently run
operations don't step on each other's toes and break the cluster.

This design addresses how we are going to deal with locking so that:

- high urgency operations are not stopped by long-running ones
- long-running operations can run in parallel
- we preserve safety (data coherency) and liveness (no deadlock, no work
  postponed indefinitely) on the cluster

Reaching the maximum possible parallelism is a non-goal. We have identified a
set of operations that are currently bottlenecks and need to be parallelised,
and have worked on those. In the future it will be possible to address other
needs, thus making the cluster more and more parallel one step at a time.

This document only talks about parallelising Ganeti-level operations, aka
Logical Units, and the locking needed for that. Any other synchronisation lock
needed internally by the code is outside its scope.

Background
----------

Ganeti 1.2 has a single global lock, which is used for all cluster operations.
This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for example
  failovers/fixes) on a cluster.

This also poses scalability problems: as clusters grow in node and instance
count it becomes a lot more likely that operations which one would expect to
run in parallel (for example because they happen on different nodes) are
actually stalling each other while waiting for the global lock, without a real
reason for that to happen.

Overview
--------

This design doc is best read in the context of the accompanying design
docs for Ganeti 2.0: Master daemon design and Job queue design.

We intend to implement a Ganeti locking library, which can be used by the
various Ganeti code components in order to easily, efficiently and correctly
grab the locks they need to perform their function.

The proposed library has these features:

- Internally managing all the locks, making the implementation transparent
  to their users
- Automatically grabbing multiple locks in the right order (avoiding deadlock)
- Ability to transparently handle conversion to more granularity
- Support asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a distributed
operation. In case of master failure, though, if some locks were held it means
some opcodes were in progress, so when recovery of the job queue is done it
will be possible to determine from the interrupted opcodes which operations
could have been left halfway through and thus which locks could have been
held. It is then the responsibility either of the master failover code, of the
cluster verification code, or of the admin to do what's necessary to make sure
that any leftover state is dealt with. This is not an issue from a locking
point of view because the fact that the previous master has failed means that
it cannot do any job.

A corollary of this is that a master-failover operation with both masters alive
needs to happen while no other locks are held.

Detailed Design
---------------

The Locks
~~~~~~~~~

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and the
node locks before the config lock. Locks will need to be acquired at the same
time for multiple instances and nodes, and internal ordering will be dealt
with within the locking library, which, for simplicity, will just use
alphabetical order.

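As an illustration of this ordering, here is a minimal, self-contained sketch;
the ``LEVEL_*`` constants and the ``acquire`` helper below are assumptions made
for this example, not the actual interface of the locking library::

  import threading

  # One plain lock per resource, grouped by level. Levels are acquired in the
  # order instance -> node -> config, and names are sorted alphabetically
  # inside a level, so every caller follows the same global ordering.
  LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = range(3)

  locks = {
      LEVEL_INSTANCE: dict((n, threading.Lock()) for n in ("inst1", "inst2")),
      LEVEL_NODE: dict((n, threading.Lock()) for n in ("node1", "node2")),
      LEVEL_CONFIG: {"config": threading.Lock()},
  }

  def acquire(level, names):
      for name in sorted(names):
          locks[level][name].acquire()

  # instance locks first, then node locks, then the config lock
  acquire(LEVEL_INSTANCE, ["inst2", "inst1"])
  acquire(LEVEL_NODE, ["node2", "node1"])
  acquire(LEVEL_CONFIG, ["config"])
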
Handling conversion to more granularity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to convert to a more granular approach transparently, each time we
split a lock into several sublocks we'll create a "metalock", which will depend
on those sublocks and live for the time necessary for all the code to convert
(or forever, in some conditions). When a metalock exists all converted code
must acquire it in shared mode, so it can run concurrently, but still be
exclusive with old code, which acquires it exclusively.

In the beginning the only such lock will be what replaces the current "command"
lock, and will acquire all the locks in the system before proceeding. This lock
will be called the "Big Ganeti Lock" because holding that one will prevent any
other concurrent Ganeti operations.

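To picture the shared-versus-exclusive behaviour described above, here is a
minimal sketch of the semantics only; the ``SharedLock`` class below is an
illustration, not the actual implementation the library will use::

  import threading

  class SharedLock(object):
      """Toy shared/exclusive lock, for illustration only (no fairness)."""

      def __init__(self):
          self._cond = threading.Condition()
          self._sharers = 0          # number of shared holders
          self._exclusive = False    # whether an exclusive holder exists

      def acquire(self, shared=False):
          with self._cond:
              if shared:
                  while self._exclusive:
                      self._cond.wait()
                  self._sharers += 1
              else:
                  while self._exclusive or self._sharers:
                      self._cond.wait()
                  self._exclusive = True

      def release(self):
          with self._cond:
              if self._exclusive:
                  self._exclusive = False
              else:
                  self._sharers -= 1
              self._cond.notify_all()

  BGL = SharedLock()   # the "Big Ganeti Lock" metalock

  def converted_operation():
      # converted code: BGL shared plus only the granular locks it needs, so
      # several converted operations can run at the same time
      BGL.acquire(shared=True)
      try:
          pass   # ... acquire per-instance/per-node locks and do the work ...
      finally:
          BGL.release()

  def old_style_operation():
      # unconverted code: BGL exclusive, which shuts out everything else
      BGL.acquire()
      try:
          pass   # ... do the work under the old "lock everything" model ...
      finally:
          BGL.release()
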
We might also want to devise more metalocks (eg. all nodes, all nodes+config)
in order to make it easier for some parts of the code to acquire what they need
without specifying it explicitly.

In the future things like the node locks could become metalocks, should we
decide to split them into an even more fine-grained approach, but this will
probably be only after the first 2.0 version has been released.

Library API
~~~~~~~~~~~

All the locking will be handled by its own class, and the locks will be created
at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time. Any attempt
to grab a lock while already holding one in the wrong order will be checked
for, and will fail.

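A minimal sketch of how such a wrong-order check could behave; the class,
method and error names below are assumptions for this example, not the final
API::

  LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = range(3)   # illustrative levels

  class LockOrderError(Exception):
      pass

  class LockManager(object):
      def __init__(self):
          self._highest_held = None   # no level held yet

      def acquire(self, level, names):
          # refuse any request that would go backwards in the locking order,
          # e.g. asking for an instance lock while already holding a node lock
          if self._highest_held is not None and level <= self._highest_held:
              raise LockOrderError("cannot acquire level %d while holding"
                                   " level %d" % (level, self._highest_held))
          self._highest_held = level
          # ... actually grab the locks named in names, in sorted order ...
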
Adding/Removing locks
~~~~~~~~~~~~~~~~~~~~~

When a new instance or a new node is created an associated lock must be added
to the list. The relevant code will need to inform the locking library of such
a change.

This needs to be compatible with every other lock in the system, especially
metalocks that guarantee to grab sets of resources without specifying them
explicitly. The implementation of this will be handled in the locking library
itself.

Of course when instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code which
removes them must own them exclusively or can queue for their ownership, and
thus deals with metalocks exactly as normal code acquiring those locks. Any
operation queueing on a removed lock will fail after its removal.

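A minimal sketch of the add/remove side, using a hypothetical container of
same-level locks; the ``LockSet`` name and its methods are assumptions for this
example::

  import threading

  class LockSet(object):
      """Toy container of same-level locks that can grow and shrink."""

      def __init__(self, names):
          self._admin_lock = threading.Lock()   # guards the dict itself
          self._locks = dict((n, threading.Lock()) for n in names)

      def add(self, name):
          # called e.g. when a new instance or node is created
          with self._admin_lock:
              self._locks[name] = threading.Lock()

      def remove(self, name):
          # the caller is expected to own the lock exclusively; anyone still
          # queueing on it must afterwards get a failure, not the lock
          with self._admin_lock:
              del self._locks[name]

  instance_locks = LockSet(["inst1", "inst2"])
  instance_locks.add("inst3")      # a new instance was created
  instance_locks.remove("inst3")   # the instance was deleted again
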
Asynchronous operations
~~~~~~~~~~~~~~~~~~~~~~~

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only fail if
the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- Try to acquire this lock set and fail if not possible
- Try to acquire one of these lock sets and return the first one you were
  able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available locks,
rather than making them just blindly queue for acquiring them. The inherent
risk, though, is that any code using the first operation, or setting a timeout
for the second one, is susceptible to starvation and thus may never be able to
get the required locks and complete certain tasks. Considering this,
providing/using these operations should not be among our first priorities.

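As a sketch of the semantics of the first, non-blocking flavour, expressed with
a plain Python lock rather than the library API::

  import threading

  lock = threading.Lock()

  # Try to get the lock without blocking; if someone else holds it, give up
  # immediately so the caller can fall back to another lock set instead of
  # queueing blindly.
  if lock.acquire(False):
      try:
          pass   # do the work that needs this lock
      finally:
          lock.release()
  else:
      pass       # lock not available: try another lock set or queue normally
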
Locking granularity
~~~~~~~~~~~~~~~~~~~

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical Unit
level. In the future we may want to split logical units into independent
"tasklets" with their own locking requirements. A different design doc (or mini
design doc) will cover the move from Logical Units to tasklets.

Lock acquisition code path
~~~~~~~~~~~~~~~~~~~~~~~~~~

In general when acquiring locks we should use a code path equivalent to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoid possible deadlocks. Of course
extra care must be taken not to leave locked structures in an unusable state,
if at all possible.

In order to avoid this extra indentation and code changes everywhere in the
Logical Units code, we decided to allow LUs to declare locks, and then execute
their code with their locks acquired. In the new world LUs are called like
this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation on how
locks are supposed to be declared.

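To make this calling convention more concrete, a converted LU could look
roughly like the following sketch; the class name, the ``needed_locks``
attribute and the example resource names are assumptions for illustration,
while the method names are the ones listed above::

  LEVEL_INSTANCE, LEVEL_NODE = range(2)   # illustrative level constants

  class LUExampleOperation(object):
      """Sketch of a converted Logical Unit (illustrative only)."""

      def ExpandNames(self):
          # the user-passed name is expanded to the internal lock name, and
          # the locks we already know we need are declared here
          self.needed_locks = {LEVEL_INSTANCE: ["inst1"]}

      def DeclareLocks(self, level):
          if level == LEVEL_NODE:
              # only now, with the instance lock held, do we know on which
              # nodes the instance lives, so node locks are declared late
              self.needed_locks[LEVEL_NODE] = ["node1"]

      def CheckPrereq(self):
          pass   # runs with the declared locks held

      def Exec(self):
          pass   # runs with the declared locks held
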
Caveats
-------

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock everything"
lock to the new system, though, needs to be carefully scrutinised to be sure it
is really acquiring all the necessary locks, and that none has been overlooked
or forgotten.

The code can contain other locks outside of this library, to synchronise other
threaded code (e.g. for the job queue) but in general these should be leaf
locks or carefully structured non-leaf ones, to avoid deadlock race conditions.