.. doc/design-2.0-locking.rst @ 6e4f6dfa

Ganeti 2.0 Granular Locking
===========================

Objective
---------

We want to make sure that multiple operations can run in parallel on a Ganeti
cluster. In order for this to happen we need to make sure concurrently run
operations don't step on each other's toes and break the cluster.

This design addresses how we are going to deal with locking so that:

- high-urgency operations are not stopped by long-running ones
- long-running operations can run in parallel
- we preserve safety (data coherency) and liveness (no deadlock, no work
  postponed indefinitely) on the cluster

Reaching the maximum possible parallelism is a non-goal. We have identified a
set of operations that are currently bottlenecks and need to be parallelised,
and have worked on those. In the future it will be possible to address other
needs, thus making the cluster more and more parallel one step at a time.

This document only talks about parallelising Ganeti-level operations, aka
Logical Units, and the locking needed for that. Any other synchronisation lock
needed internally by the code is outside its scope.

Background
----------

Ganeti 1.2 has a single global lock, which is used for all cluster operations.
This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for example
  failovers/fixes) on a cluster.

This also poses scalability problems: as clusters grow in number of nodes and
instances, it becomes much more likely that operations which should
conceivably run in parallel (for example because they happen on different
nodes) end up stalling each other while waiting for the global lock, without a
real reason for that to happen.

Overview
--------

This design doc is best read in the context of the accompanying design
docs for Ganeti 2.0: Master daemon design and Job queue design.

We intend to implement a Ganeti locking library, which can be used by the
various Ganeti code components in order to easily, efficiently and correctly
grab the locks they need to perform their function.

The proposed library has these features:

- Internally managing all the locks, making the implementation transparent
  to their users
- Automatically grabbing multiple locks in the right order (to avoid
  deadlock)
- Ability to transparently handle conversion to more granularity
- Support for asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a distributed
operation. In case of master failure, though, if some locks were held it means
some opcodes were in progress, so when recovery of the job queue is done it
will be possible to determine from the interrupted opcodes which operations
could have been left half-finished, and thus which locks could have been held.
It is then the responsibility of the master failover code, of the cluster
verification code, or of the admin to do what's necessary to make sure that
any leftover state is dealt with. This is not an issue from a locking point of
view because the fact that the previous master has failed means that it cannot
do any job.

A corollary of this is that a master failover operation with both masters
alive needs to happen while no other locks are held.

Detailed Design
---------------

The Locks
~~~~~~~~~

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and the
node locks before the config lock. Locks will need to be acquired at the same
time for multiple instances and nodes, and the internal ordering will be
handled within the locking library, which, for simplicity, will just use
alphabetical order.
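
The ordering rule above can be sketched as a plain sort: locks are grouped
into levels (instances before nodes before the config lock), and names within
a level are taken alphabetically. This is an illustrative sketch only; the
constant and function names are made up and are not the Ganeti library's API.

```python
# Lower level number = acquired earlier. Illustrative names, not Ganeti's.
LEVEL_INSTANCE = 0
LEVEL_NODE = 1
LEVEL_CONFIG = 2


def acquisition_order(requests):
    """Return (level, name) pairs in the order they must be acquired.

    requests: iterable of (level, name) tuples, in any order.
    """
    # Sort by level first, then alphabetically within each level.
    return sorted(requests, key=lambda req: (req[0], req[1]))


wanted = [
    (LEVEL_NODE, "node2.example.com"),
    (LEVEL_INSTANCE, "web1"),
    (LEVEL_NODE, "node1.example.com"),
    (LEVEL_INSTANCE, "db1"),
    (LEVEL_CONFIG, "config"),
]

# Instance locks come first (alphabetically), then node locks, then config.
print(acquisition_order(wanted))
```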

Handling conversion to more granularity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to convert to a more granular approach transparently, each time we
split a lock into more locks we'll create a "metalock", which will depend on
those sublocks and live for the time necessary for all the code to convert
(or forever, in some conditions). When a metalock exists all converted code
must acquire it in shared mode, so it can run concurrently, but still be
exclusive with old code, which acquires it exclusively.

In the beginning the only such lock will be the replacement for the current
"command" lock, which will acquire all the locks in the system before
proceeding. This lock will be called the "Big Ganeti Lock" because holding it
prevents any other concurrent Ganeti operation.

We might also want to devise more metalocks (eg. all nodes, all nodes+config)
in order to make it easier for some parts of the code to acquire what they
need without specifying it explicitly.

In the future things like the node locks could become metalocks, should we
decide to split them into an even more fine-grained approach, but this will
probably happen only after the first 2.0 version has been released.
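
The shared/exclusive behaviour of a metalock such as the Big Ganeti Lock can
be illustrated with a minimal reader/writer-style lock. This is a sketch
under stated assumptions, not Ganeti's implementation: converted code takes
the metalock shared (so several pieces of it run concurrently), while
unconverted code takes it exclusively and therefore excludes everyone.

```python
import threading


class SharedLock:
    """A minimal shared/exclusive lock illustrating the metalock idea."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0        # number of shared holders
        self._exclusive = False  # True while held exclusively

    def acquire(self, shared=False, blocking=True):
        with self._cond:
            if shared:
                # Shared holders only wait for an exclusive holder.
                while self._exclusive:
                    if not blocking:
                        return False
                    self._cond.wait()
                self._sharers += 1
            else:
                # An exclusive holder waits for everyone.
                while self._exclusive or self._sharers:
                    if not blocking:
                        return False
                    self._cond.wait()
                self._exclusive = True
            return True

    def release(self):
        with self._cond:
            if self._exclusive:
                self._exclusive = False
            else:
                self._sharers -= 1
            self._cond.notify_all()


bgl = SharedLock()
# Two pieces of converted code share the metalock and run concurrently...
assert bgl.acquire(shared=True)
assert bgl.acquire(shared=True)
# ...while unconverted code, wanting it exclusively, has to wait.
assert not bgl.acquire(shared=False, blocking=False)
bgl.release()
bgl.release()
assert bgl.acquire(shared=False, blocking=False)
```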

Library API
~~~~~~~~~~~

All the locking will live in its own class, and the locks will be created at
initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time. Any
attempt to grab a lock while already holding one in the wrong order will be
checked for, and will fail.
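
The ordering check could look like the following sketch: the manager
remembers the innermost level already held and refuses any acquisition at the
same or a lower level. All class and method names here are hypothetical,
chosen for illustration; they are not the actual Ganeti library interface.

```python
# Illustrative level constants; lower level = acquired earlier.
LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = 0, 1, 2


class LockManager:
    def __init__(self, instance_names, node_names):
        # Locks are created once, at initialisation time, from the config.
        self._locks = {
            LEVEL_INSTANCE: set(instance_names),
            LEVEL_NODE: set(node_names),
            LEVEL_CONFIG: {"config"},
        }
        self._owned = []  # stack of levels currently held

    def acquire(self, level, names):
        # Out-of-order acquisition is detected and fails immediately.
        if self._owned and level <= self._owned[-1]:
            raise ValueError("out-of-order acquisition at level %d" % level)
        unknown = set(names) - self._locks[level]
        if unknown:
            raise KeyError("unknown locks: %s" % sorted(unknown))
        self._owned.append(level)
        # Names within one level are (conceptually) taken alphabetically.
        return sorted(names)

    def release(self, level):
        assert self._owned and self._owned[-1] == level
        self._owned.pop()


mgr = LockManager(["web1", "db1"], ["node1", "node2"])
mgr.acquire(LEVEL_INSTANCE, ["web1"])
mgr.acquire(LEVEL_NODE, ["node2", "node1"])  # fine: strictly higher level
try:
    mgr.acquire(LEVEL_INSTANCE, ["db1"])     # wrong order: rejected
except ValueError:
    pass
```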

Adding/Removing locks
~~~~~~~~~~~~~~~~~~~~~

When a new instance or a new node is created an associated lock must be added
to the list. The relevant code will need to inform the locking library of
such a change.

This needs to be compatible with every other lock in the system, especially
metalocks that guarantee to grab sets of resources without specifying them
explicitly. The implementation of this will be handled in the locking library
itself.

Of course when instances or nodes disappear from the cluster the relevant
locks must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively or can queue for their
ownership, and thus deals with metalocks exactly as normal code acquiring
those locks. Any operation queueing on a removed lock will fail after its
removal.
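
A minimal sketch of the add/remove behaviour, with hypothetical names: locks
are registered when an instance or node appears, and once a lock is deleted
any late or queued acquisition of it must fail rather than block forever.

```python
class LockSet:
    """Illustrative dynamic set of named locks (not Ganeti's class)."""

    def __init__(self, names=()):
        self._names = set(names)

    def add(self, name):
        # Called when e.g. a new instance or node is created.
        self._names.add(name)

    def remove(self, name):
        # The caller must hold `name` exclusively (not modelled here);
        # afterwards, any queued acquisition of `name` must fail.
        self._names.discard(name)

    def acquire(self, names):
        missing = set(names) - self._names
        if missing:
            raise KeyError("lock(s) removed or unknown: %s" % sorted(missing))
        return sorted(names)


locks = LockSet(["web1"])
locks.add("db1")                  # new instance created
assert locks.acquire(["db1", "web1"]) == ["db1", "web1"]
locks.remove("web1")              # instance destroyed
try:
    locks.acquire(["web1"])       # late acquirers fail, they do not queue
except KeyError:
    pass
```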

Asynchronous operations
~~~~~~~~~~~~~~~~~~~~~~~

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only fail
if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- Try to acquire this lock set and fail if not possible
- Try to acquire one of these lock sets and return the first one you were
  able to get (or after a timeout) (select/poll like)

These operations can be used to prioritise operations based on available
locks, rather than making them just blindly queue for acquiring them. The
inherent risk, though, is that any code using the first operation, or setting
a timeout for the second one, is susceptible to starvation and thus may never
be able to get the required locks and complete certain tasks. Considering
this, providing and using these operations should not be among our first
priorities.
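
The first primitive ("try to acquire this lock set and fail if not possible")
can be sketched with the timeout support in Python's own ``threading``
module; the function name is made up for illustration. On failure everything
already taken is rolled back, so the caller never queues indefinitely.

```python
import threading


def try_acquire_set(locks, timeout=0):
    """Try to grab every lock in `locks`; on any failure, release what
    was already taken and return False instead of queueing."""
    taken = []
    for lock in locks:
        if lock.acquire(timeout=timeout):
            taken.append(lock)
        else:
            # Roll back in reverse order, then report failure.
            for held in reversed(taken):
                held.release()
            return False
    return True


a, b = threading.Lock(), threading.Lock()
assert try_acquire_set([a, b]) is True   # both free: succeeds
a.release()
b.release()

b.acquire()                              # simulate another holder of b
assert try_acquire_set([a, b], timeout=0.01) is False
assert not a.locked()                    # a was rolled back, not leaked
```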

Locking granularity
~~~~~~~~~~~~~~~~~~~

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical Unit
level. In the future we may want to split logical units into independent
"tasklets" with their own locking requirements. A different design doc (or
mini design doc) will cover the move from Logical Units to tasklets.

Caveats
-------

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guard against
many common errors. Code switching from the old "lock everything" lock to the
new system, though, needs to be carefully scrutinised to be sure it is really
acquiring all the necessary locks, and that none has been overlooked or
forgotten.

The code can contain other locks outside of this library, to synchronise
other threaded code (eg for the job queue), but in general these should be
leaf locks or carefully structured non-leaf ones, to avoid deadlocks.