Statistics
| Branch: | Tag: | Revision:

root / doc / design-2.0-locking.rst @ cd55576a

History | View | Annotate | Download (9.4 kB)

1 040408a3 Guido Trotter
Ganeti 2.0 Granular Locking
2 040408a3 Guido Trotter
===========================
3 040408a3 Guido Trotter
4 47eb4b45 Guido Trotter
.. contents::
5 47eb4b45 Guido Trotter
6 040408a3 Guido Trotter
Objective
7 040408a3 Guido Trotter
---------
8 040408a3 Guido Trotter
9 040408a3 Guido Trotter
We want to make sure that multiple operations can run in parallel on a Ganeti
10 040408a3 Guido Trotter
Cluster. In order for this to happen we need to make sure concurrently run
11 040408a3 Guido Trotter
operations don't step on each other toes and break the cluster.
12 040408a3 Guido Trotter
13 040408a3 Guido Trotter
This design addresses how we are going to deal with locking so that:
14 040408a3 Guido Trotter
15 040408a3 Guido Trotter
- high urgency operations are not stopped by long length ones
16 040408a3 Guido Trotter
- long length operations can run in parallel
17 040408a3 Guido Trotter
- we preserve safety (data coherency) and liveness (no deadlock, no work
18 040408a3 Guido Trotter
  postponed indefinitely) on the cluster
19 040408a3 Guido Trotter
20 040408a3 Guido Trotter
Reaching the maximum possible parallelism is a Non-Goal. We have identified a
21 040408a3 Guido Trotter
set of operations that are currently bottlenecks and need to be parallelised
22 040408a3 Guido Trotter
and have worked on those. In the future it will be possible to address other
23 040408a3 Guido Trotter
needs, thus making the cluster more and more parallel one step at a time.
24 040408a3 Guido Trotter
25 040408a3 Guido Trotter
This document only talks about parallelising Ganeti level operations, aka
26 040408a3 Guido Trotter
Logical Units, and the locking needed for that. Any other synchronisation lock
27 040408a3 Guido Trotter
needed internally by the code is outside its scope.
28 040408a3 Guido Trotter
29 040408a3 Guido Trotter
Background
30 040408a3 Guido Trotter
----------
31 040408a3 Guido Trotter
32 040408a3 Guido Trotter
Ganeti 1.2 has a single global lock, which is used for all cluster operations.
33 040408a3 Guido Trotter
This has been painful at various times, for example:
34 040408a3 Guido Trotter
35 040408a3 Guido Trotter
- It is impossible for two people to efficiently interact with a cluster
36 040408a3 Guido Trotter
  (for example for debugging) at the same time.
37 040408a3 Guido Trotter
- When batch jobs are running it's impossible to do other work (for example
38 040408a3 Guido Trotter
  failovers/fixes) on a cluster.
39 040408a3 Guido Trotter
40 040408a3 Guido Trotter
This also poses scalability problems: as clusters grow in node and instance
41 040408a3 Guido Trotter
size it's a lot more likely that operations which one could conceive should run
42 040408a3 Guido Trotter
in parallel (for example because they happen on different nodes) are actually
43 040408a3 Guido Trotter
stalling each other while waiting for the global lock, without a real reason
44 040408a3 Guido Trotter
for that to happen.
45 040408a3 Guido Trotter
46 040408a3 Guido Trotter
Overview
47 040408a3 Guido Trotter
--------
48 040408a3 Guido Trotter
49 040408a3 Guido Trotter
This design doc is best read in the context of the accompanying design
50 040408a3 Guido Trotter
docs for Ganeti 2.0: Master daemon design and Job queue design.
51 040408a3 Guido Trotter
52 040408a3 Guido Trotter
We intend to implement a Ganeti locking library, which can be used by the
53 040408a3 Guido Trotter
various ganeti code components in order to easily, efficiently and correctly
54 040408a3 Guido Trotter
grab the locks they need to perform their function.
55 040408a3 Guido Trotter
56 040408a3 Guido Trotter
The proposed library has these features:
57 040408a3 Guido Trotter
58 040408a3 Guido Trotter
- Internally managing all the locks, making the implementation transparent
59 040408a3 Guido Trotter
  from their usage
60 040408a3 Guido Trotter
- Automatically grabbing multiple locks in the right order (avoid deadlock)
61 040408a3 Guido Trotter
- Ability to transparently handle conversion to more granularity
62 040408a3 Guido Trotter
- Support asynchronous operation (future goal)
63 040408a3 Guido Trotter
64 040408a3 Guido Trotter
Locking will be valid only on the master node and will not be a distributed
65 040408a3 Guido Trotter
operation. In case of master failure, though, if some locks were held it means
66 040408a3 Guido Trotter
some opcodes were in progress, so when recovery of the job queue is done it
67 040408a3 Guido Trotter
will be possible to determine by the interrupted opcodes which operations could
68 040408a3 Guido Trotter
have been left half way through and thus which locks could have been held. It
69 040408a3 Guido Trotter
is then the responsibility either of the master failover code, of the cluster
70 040408a3 Guido Trotter
verification code, or of the admin to do what's necessary to make sure that any
71 040408a3 Guido Trotter
leftover state is dealt with. This is not an issue from a locking point of view
72 040408a3 Guido Trotter
because the fact that the previous master has failed means that it cannot do
73 040408a3 Guido Trotter
any job.
74 040408a3 Guido Trotter
75 040408a3 Guido Trotter
A corollary of this is that a master-failover operation with both masters alive
76 040408a3 Guido Trotter
needs to happen while no other locks are held.
77 040408a3 Guido Trotter
78 040408a3 Guido Trotter
Detailed Design
79 040408a3 Guido Trotter
---------------
80 040408a3 Guido Trotter
81 040408a3 Guido Trotter
The Locks
82 040408a3 Guido Trotter
~~~~~~~~~
83 040408a3 Guido Trotter
At the first stage we have decided to provide the following locks:
84 040408a3 Guido Trotter
85 040408a3 Guido Trotter
- One "config file" lock
86 040408a3 Guido Trotter
- One lock per node in the cluster
87 040408a3 Guido Trotter
- One lock per instance in the cluster
88 040408a3 Guido Trotter
89 040408a3 Guido Trotter
All the instance locks will need to be taken before the node locks, and the
90 040408a3 Guido Trotter
node locks before the config lock. Locks will need to be acquired at the same
91 040408a3 Guido Trotter
time for multiple instances and nodes, and internal ordering will be dealt
92 040408a3 Guido Trotter
within the locking library, which, for simplicity, will just use alphabetical
93 040408a3 Guido Trotter
order.
94 040408a3 Guido Trotter
95 040408a3 Guido Trotter
Handling conversion to more granularity
96 040408a3 Guido Trotter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
97 040408a3 Guido Trotter
98 040408a3 Guido Trotter
In order to convert to a more granular approach transparently each time we
99 040408a3 Guido Trotter
split a lock into more we'll create a "metalock", which will depend on those
100 040408a3 Guido Trotter
sublocks and live for the time necessary for all the code to convert (or
101 040408a3 Guido Trotter
forever, in some conditions). When a metalock exists all converted code must
102 040408a3 Guido Trotter
acquire it in shared mode, so it can run concurrently, but still be exclusive
103 040408a3 Guido Trotter
with old code, which acquires it exclusively.
104 040408a3 Guido Trotter
105 040408a3 Guido Trotter
In the beginning the only such lock will be what replaces the current "command"
106 040408a3 Guido Trotter
lock, and will acquire all the locks in the system, before proceeding. This
107 040408a3 Guido Trotter
lock will be called the "Big Ganeti Lock" because holding that one will avoid
108 040408a3 Guido Trotter
any other concurrent ganeti operations.
109 040408a3 Guido Trotter
110 040408a3 Guido Trotter
We might also want to devise more metalocks (eg. all nodes, all nodes+config)
111 040408a3 Guido Trotter
in order to make it easier for some parts of the code to acquire what it needs
112 040408a3 Guido Trotter
without specifying it explicitly.
113 040408a3 Guido Trotter
114 040408a3 Guido Trotter
In the future things like the node locks could become metalocks, should we
115 040408a3 Guido Trotter
decide to split them into an even more fine grained approach, but this will
116 040408a3 Guido Trotter
probably be only after the first 2.0 version has been released.
117 040408a3 Guido Trotter
118 040408a3 Guido Trotter
Library API
119 040408a3 Guido Trotter
~~~~~~~~~~~
120 040408a3 Guido Trotter
121 040408a3 Guido Trotter
All the locking will be its own class, and the locks will be created at
122 040408a3 Guido Trotter
initialisation time, from the config file.
123 040408a3 Guido Trotter
124 040408a3 Guido Trotter
The API will have a way to grab one or more than one locks at the same time.
125 040408a3 Guido Trotter
Any attempt to grab a lock while already holding one in the wrong order will be
126 040408a3 Guido Trotter
checked for, and fail.
127 040408a3 Guido Trotter
128 164a5bcb Guido Trotter
Adding/Removing locks
129 164a5bcb Guido Trotter
~~~~~~~~~~~~~~~~~~~~~
130 040408a3 Guido Trotter
131 040408a3 Guido Trotter
When a new instance or a new node is created an associated lock must be added
132 040408a3 Guido Trotter
to the list. The relevant code will need to inform the locking library of such
133 040408a3 Guido Trotter
a change.
134 040408a3 Guido Trotter
135 040408a3 Guido Trotter
This needs to be compatible with every other lock in the system, especially
136 040408a3 Guido Trotter
metalocks that guarantee to grab sets of resources without specifying them
137 164a5bcb Guido Trotter
explicitly. The implementation of this will be handled in the locking library
138 164a5bcb Guido Trotter
itself.
139 164a5bcb Guido Trotter
140 164a5bcb Guido Trotter
Of course when instances or nodes disappear from the cluster the relevant locks
141 164a5bcb Guido Trotter
must be removed. This is easier than adding new elements, as the code which
142 164a5bcb Guido Trotter
removes them must own them exclusively or can queue for their ownership, and
143 164a5bcb Guido Trotter
thus deals with metalocks exactly as normal code acquiring those locks. Any
144 164a5bcb Guido Trotter
operation queueing on a removed lock will fail after its removal.
145 040408a3 Guido Trotter
146 040408a3 Guido Trotter
Asynchronous operations
147 040408a3 Guido Trotter
~~~~~~~~~~~~~~~~~~~~~~~
148 040408a3 Guido Trotter
149 040408a3 Guido Trotter
For the first version the locking library will only export synchronous
150 040408a3 Guido Trotter
operations, which will block till the needed lock are held, and only fail if
151 040408a3 Guido Trotter
the request is impossible or somehow erroneous.
152 040408a3 Guido Trotter
153 040408a3 Guido Trotter
In the future we may want to implement different types of asynchronous
154 040408a3 Guido Trotter
operations such as:
155 040408a3 Guido Trotter
156 040408a3 Guido Trotter
- Try to acquire this lock set and fail if not possible
157 040408a3 Guido Trotter
- Try to acquire one of these lock sets and return the first one you were
158 040408a3 Guido Trotter
  able to get (or after a timeout) (select/poll like)
159 040408a3 Guido Trotter
160 6e4f6dfa Guido Trotter
These operations can be used to prioritize operations based on available locks,
161 6e4f6dfa Guido Trotter
rather than making them just blindly queue for acquiring them. The inherent
162 6e4f6dfa Guido Trotter
risk, though, is that any code using the first operation, or setting a timeout
163 6e4f6dfa Guido Trotter
for the second one, is susceptible to starvation and thus may never be able to
164 6e4f6dfa Guido Trotter
get the required locks and complete certain tasks. Considering this
165 6e4f6dfa Guido Trotter
providing/using these operations should not be among our first priorities.
166 040408a3 Guido Trotter
167 040408a3 Guido Trotter
Locking granularity
168 040408a3 Guido Trotter
~~~~~~~~~~~~~~~~~~~
169 040408a3 Guido Trotter
170 040408a3 Guido Trotter
For the first version of this code we'll convert each Logical Unit to
171 040408a3 Guido Trotter
acquire/release the locks it needs, so locking will be at the Logical Unit
172 040408a3 Guido Trotter
level.  In the future we may want to split logical units in independent
173 040408a3 Guido Trotter
"tasklets" with their own locking requirements. A different design doc (or mini
174 040408a3 Guido Trotter
design doc) will cover the move from Logical Units to tasklets.
175 040408a3 Guido Trotter
176 4e8d0685 Guido Trotter
Lock acquisition code path
177 4e8d0685 Guido Trotter
~~~~~~~~~~~~~~~~~~~~~~~~~~
178 4e8d0685 Guido Trotter
179 4e8d0685 Guido Trotter
In general when acquiring locks we should use a code path equivalent to::
180 4e8d0685 Guido Trotter
181 4e8d0685 Guido Trotter
  lock.acquire()
182 4e8d0685 Guido Trotter
  try:
183 4e8d0685 Guido Trotter
    ...
184 4e8d0685 Guido Trotter
    # other code
185 4e8d0685 Guido Trotter
  finally:
186 4e8d0685 Guido Trotter
    lock.release()
187 4e8d0685 Guido Trotter
188 4e8d0685 Guido Trotter
This makes sure we release all locks, and avoid possible deadlocks. Of course
189 4e8d0685 Guido Trotter
extra care must be used not to leave, if possible locked structures in an
190 4e8d0685 Guido Trotter
unusable state.
191 4e8d0685 Guido Trotter
192 4e8d0685 Guido Trotter
In order to avoid this extra indentation and code changes everywhere in the
193 4e8d0685 Guido Trotter
Logical Units code, we decided to allow LUs to declare locks, and then execute
194 4e8d0685 Guido Trotter
their code with their locks acquired. In the new world LUs are called like
195 4e8d0685 Guido Trotter
this::
196 4e8d0685 Guido Trotter
197 4e8d0685 Guido Trotter
  # user passed names are expanded to the internal lock/resource name,
198 4e8d0685 Guido Trotter
  # then known needed locks are declared
199 4e8d0685 Guido Trotter
  lu.ExpandNames()
200 4e8d0685 Guido Trotter
  ... some locking/adding of locks may happen ...
201 4e8d0685 Guido Trotter
  # late declaration of locks for one level: this is useful because sometimes
202 4e8d0685 Guido Trotter
  # we can't know which resource we need before locking the previous level
203 4e8d0685 Guido Trotter
  lu.DeclareLocks() # for each level (cluster, instance, node)
204 4e8d0685 Guido Trotter
  ... more locking/adding of locks can happen ...
205 4e8d0685 Guido Trotter
  # these functions are called with the proper locks held
206 4e8d0685 Guido Trotter
  lu.CheckPrereq()
207 4e8d0685 Guido Trotter
  lu.Exec()
208 4e8d0685 Guido Trotter
  ... locks declared for removal are removed, all acquired locks released ...
209 4e8d0685 Guido Trotter
210 4e8d0685 Guido Trotter
The Processor and the LogicalUnit class will contain exact documentation on how
211 4e8d0685 Guido Trotter
locks are supposed to be declared.
212 4e8d0685 Guido Trotter
213 040408a3 Guido Trotter
Caveats
214 040408a3 Guido Trotter
-------
215 040408a3 Guido Trotter
216 040408a3 Guido Trotter
This library will provide an easy upgrade path to bring all the code to
217 040408a3 Guido Trotter
granular locking without breaking everything, and it will also guarantee
218 040408a3 Guido Trotter
against a lot of common errors. Code switching from the old "lock everything"
219 040408a3 Guido Trotter
lock to the new system, though, needs to be carefully scrutinised to be sure it
220 040408a3 Guido Trotter
is really acquiring all the necessary locks, and none has been overlooked or
221 040408a3 Guido Trotter
forgotten.
222 040408a3 Guido Trotter
223 040408a3 Guido Trotter
The code can contain other locks outside of this library, to synchronise other
224 040408a3 Guido Trotter
threaded code (eg for the job queue) but in general these should be leaf locks
225 040408a3 Guido Trotter
or carefully structured non-leaf ones, to avoid deadlock race conditions.