Ganeti 2.0 Granular Locking
===========================

.. contents::

Objective
---------

We want to make sure that multiple operations can run in parallel on a Ganeti
cluster. In order for this to happen we need to make sure that concurrently
running operations don't step on each other's toes and break the cluster.

This design addresses how we are going to deal with locking so that:

- high-urgency operations are not stalled by long-running ones
- long-running operations can run in parallel
- we preserve safety (data coherency) and liveness (no deadlock, no work
  postponed indefinitely) on the cluster

Reaching the maximum possible parallelism is a non-goal. We have identified a
set of operations that are currently bottlenecks and need to be parallelised,
and have worked on those. In the future it will be possible to address other
needs, thus making the cluster more and more parallel one step at a time.

This document only talks about parallelising Ganeti-level operations, a.k.a.
Logical Units, and the locking needed for that. Any other synchronisation lock
needed internally by the code is outside its scope.

Background
----------

Ganeti 1.2 has a single global lock, which is used for all cluster operations.
This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for example
  failovers/fixes) on a cluster.

This also poses scalability problems: as clusters grow in node and instance
count it becomes much more likely that operations which should conceivably run
in parallel (for example because they happen on different nodes) are actually
stalling each other while waiting for the global lock, without a real reason
for that to happen.

Overview
--------

This design doc is best read in the context of the accompanying design
docs for Ganeti 2.0: the master daemon design and the job queue design.

We intend to implement a Ganeti locking library, which can be used by the
various Ganeti code components in order to easily, efficiently and correctly
grab the locks they need to perform their function.

The proposed library has these features:

- Internally managing all the locks, keeping the implementation transparent
  to their users
- Automatically grabbing multiple locks in the right order (avoiding deadlock)
- Ability to transparently handle conversion to more granularity
- Support for asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a distributed
operation. In case of master failure, though, if some locks were held it means
some opcodes were in progress, so when recovery of the job queue is done it
will be possible to determine from the interrupted opcodes which operations
could have been left halfway through and thus which locks could have been
held. It is then the responsibility either of the master failover code, of the
cluster verification code, or of the admin to do what's necessary to make sure
that any leftover state is dealt with. This is not an issue from a locking
point of view because the fact that the previous master has failed means that
it cannot do any job.

A corollary of this is that a master-failover operation with both masters alive
needs to happen while no other locks are held.

Detailed Design
---------------

The Locks
~~~~~~~~~

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and the
node locks before the config lock. Locks will need to be acquired at the same
time for multiple instances and nodes, and the internal ordering will be dealt
with within the locking library, which, for simplicity, will just use
alphabetical order.
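
As an illustration only (the helper and the level constants below are made up
for this sketch, not part of the proposed library), enforcing such an ordering
can be as simple as sorting the requested locks by level and then by name
before acquiring them::

  # Lock levels, in the order in which they must be acquired.
  LEVEL_INSTANCE = 0
  LEVEL_NODE = 1
  LEVEL_CONFIG = 2

  def SortForAcquisition(wanted):
    """Return (level, name) pairs in a deadlock-safe acquisition order.

    Sorting first by level and then alphabetically by name means every
    caller acquires any common subset of locks in the same global order.
    """
    return sorted(wanted, key=lambda entry: (entry[0], entry[1]))

  # Example: an operation touching two instances and one node.
  wanted = [(LEVEL_NODE, "node2.example.com"),
            (LEVEL_INSTANCE, "inst2"),
            (LEVEL_INSTANCE, "inst1")]
  for level, name in SortForAcquisition(wanted):
    # the lock registered for (level, name) would be acquired here
    pass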

    
Handling conversion to more granularity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to convert to a more granular approach transparently, each time we
split a lock into several we'll create a "metalock", which will depend on
those sublocks and live for the time necessary for all the code to convert (or
forever, in some conditions). When a metalock exists all converted code must
acquire it in shared mode, so it can run concurrently, but still be exclusive
with old code, which acquires it exclusively.

In the beginning the only such lock will be the one that replaces the current
"command" lock, and it will acquire all the locks in the system before
proceeding. This lock will be called the "Big Ganeti Lock" because holding it
prevents any other concurrent Ganeti operations.
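
The shared-versus-exclusive relationship can be illustrated with a small
sketch; the ``SharedLock`` class below is a minimal stand-in written only for
this example, not the library's final interface::

  import threading

  class SharedLock(object):
    """Minimal lock allowing many shared holders or one exclusive holder."""

    def __init__(self):
      self._cond = threading.Condition()
      self._shared = 0          # current number of shared holders
      self._exclusive = False   # True while held exclusively

    def acquire(self, shared=True):
      self._cond.acquire()
      try:
        if shared:
          while self._exclusive:
            self._cond.wait()
          self._shared += 1
        else:
          while self._exclusive or self._shared:
            self._cond.wait()
          self._exclusive = True
      finally:
        self._cond.release()

    def release(self):
      self._cond.acquire()
      try:
        if self._exclusive:
          self._exclusive = False
        else:
          self._shared -= 1
        self._cond.notify_all()
      finally:
        self._cond.release()

  big_ganeti_lock = SharedLock()

  # Old, unconverted code: takes the metalock exclusively and therefore
  # still behaves as if there were a single cluster-wide lock.
  big_ganeti_lock.acquire(shared=False)
  try:
    pass  # run the unconverted operation here
  finally:
    big_ganeti_lock.release()

  # Converted code: holds the metalock in shared mode (so several such
  # operations can run at once) plus the specific sublocks it needs.
  big_ganeti_lock.acquire(shared=True)
  try:
    pass  # acquire the needed instance/node sublocks, then do the work
  finally:
    big_ganeti_lock.release()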

    
We might also want to devise more metalocks (e.g. all nodes, all nodes+config)
in order to make it easier for some parts of the code to acquire what they need
without specifying it explicitly.

In the future things like the node locks could become metalocks, should we
decide to split them into an even more fine-grained approach, but this will
probably happen only after the first 2.0 version has been released.

Library API
~~~~~~~~~~~

All the locking will be implemented in its own class, and the locks will be
created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time. Any
attempt to grab a lock while already holding one in the wrong order will be
checked for, and will fail.
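
As a sketch of what such an interface could look like (the class and method
names here are hypothetical, not the final API)::

  import threading

  # Lock levels, in acquisition order (see "The Locks" above).
  LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = range(3)

  class LockManager(object):
    """Hypothetical library front-end: owns all locks, checks ordering."""

    def __init__(self, instance_names, node_names):
      # All locks are created up front, at initialisation time.
      self._locks = {(LEVEL_CONFIG, "config"): threading.Lock()}
      for name in instance_names:
        self._locks[(LEVEL_INSTANCE, name)] = threading.Lock()
      for name in node_names:
        self._locks[(LEVEL_NODE, name)] = threading.Lock()
      # Locks currently held, in acquisition order (simplified: one
      # caller at a time; the real library would track this per thread).
      self._owned = []

    def acquire(self, level, names):
      """Acquire several locks of one level, in alphabetical order."""
      if self._owned and max(lvl for lvl, _ in self._owned) >= level:
        # Going back to an earlier (or the same) level could deadlock.
        raise ValueError("out-of-order lock acquisition")
      for name in sorted(names):
        self._locks[(level, name)].acquire()
        self._owned.append((level, name))

    def release(self):
      """Release everything held, in reverse acquisition order."""
      for level, name in reversed(self._owned):
        self._locks[(level, name)].release()
      del self._owned[:]

A caller would then acquire what it needs level by level and call
``release()`` in a ``finally`` clause, as shown in the code path section
below.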

    
Adding/Removing locks
~~~~~~~~~~~~~~~~~~~~~

When a new instance or a new node is created, an associated lock must be added
to the list. The relevant code will need to inform the locking library of such
a change.

This needs to be compatible with every other lock in the system, especially
metalocks that guarantee to grab sets of resources without specifying them
explicitly. The implementation of this will be handled in the locking library
itself.

Of course when instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code which
removes them must own them exclusively or can queue for their ownership, and
thus deals with metalocks exactly as normal code acquiring those locks. Any
operation queueing on a removed lock will fail after its removal.
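
Sticking with the hypothetical ``LockManager`` sketched above, the bookkeeping
side of this could be as simple as two extra methods::

  def add(self, level, name):
    """Register the lock of a newly created instance or node."""
    self._locks[(level, name)] = threading.Lock()

  def remove(self, level, name):
    """Drop the lock of a deleted instance or node.

    The caller is expected to already own the lock exclusively; waiters
    queued on it are not handled in this simplified sketch.
    """
    del self._locks[(level, name)]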

    
Asynchronous operations
~~~~~~~~~~~~~~~~~~~~~~~

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only fail if
the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations, such as:

- Try to acquire this lock set and fail if not possible
- Try to acquire one of these lock sets and return the first one you were
  able to get (or give up after a timeout), in a select/poll-like fashion

These operations can be used to prioritise operations based on available locks,
rather than making them just blindly queue to acquire them. The inherent
risk, though, is that any code using the first operation, or setting a timeout
for the second one, is susceptible to starvation and thus may never be able to
get the required locks and complete certain tasks. Considering this,
providing/using these operations should not be among our first priorities.
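
As an illustration, the first of these two could look roughly like the sketch
below (again using the hypothetical ``LockManager``, and ignoring the
level-ordering checks for brevity)::

  def try_acquire(self, level, names):
    """Try to acquire a lock set without blocking.

    Returns True if every lock was obtained; otherwise any partially
    acquired locks are rolled back and False is returned.
    """
    got = []
    for name in sorted(names):
      if not self._locks[(level, name)].acquire(False):  # non-blocking
        for lvl, owned in reversed(got):
          self._locks[(lvl, owned)].release()
        return False
      got.append((level, name))
    self._owned.extend(got)
    return True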

    
Locking granularity
~~~~~~~~~~~~~~~~~~~

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical Unit
level. In the future we may want to split logical units into independent
"tasklets" with their own locking requirements. A different design doc (or mini
design doc) will cover the move from Logical Units to tasklets.

Lock acquisition code path
~~~~~~~~~~~~~~~~~~~~~~~~~~

In general when acquiring locks we should use a code path equivalent to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoid possible deadlocks. Of course
extra care must be used not to leave, if possible, locked structures in an
unusable state.

In order to avoid this extra indentation and code changes everywhere in the
Logical Units code, we decided to allow LUs to declare locks, and then execute
their code with their locks acquired. In the new world LUs are called like
this::

  # user-passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation on how
locks are supposed to be declared.
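
To make this sequence more concrete, a much simplified version of what the
Processor could do is sketched here; the ``needed_locks`` attribute and the
``lock_manager`` object are assumptions of this sketch (reusing the
hypothetical manager from the Library API section), not the documented
interface::

  def ExecLU(lu, lock_manager):
    """Run one Logical Unit with its declared locks held (sketch only)."""
    # Expand user-passed names into internal lock/resource names and
    # declare the locks that are already known to be needed.
    lu.ExpandNames()
    try:
      # Walk the levels in their acquisition order (see "The Locks").
      for level in (LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG):
        # Late declaration: the LU may only know what it needs at this
        # level once the previous levels are already locked.
        lu.DeclareLocks(level)
        lock_manager.acquire(level, lu.needed_locks.get(level, []))
      # These run with all the declared locks held.
      lu.CheckPrereq()
      return lu.Exec()
    finally:
      # All acquired locks are released, even on failure.
      lock_manager.release()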

    
Caveats
-------

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guard
against a lot of common errors. Code switching from the old "lock everything"
lock to the new system, though, needs to be carefully scrutinised to be sure it
is really acquiring all the necessary locks, and that none have been overlooked
or forgotten.

The code can contain other locks outside of this library, to synchronise other
threaded code (e.g. for the job queue), but in general these should be leaf
locks or carefully structured non-leaf ones, to avoid deadlocks and race
conditions.