Ganeti 2.0 Granular Locking
===========================

Objective
---------

We want to make sure that multiple operations can run in parallel on a Ganeti
cluster. In order for this to happen we need to make sure concurrently run
operations don't step on each other's toes and break the cluster.

This design addresses how we are going to deal with locking so that:

- high-urgency operations are not stopped by long-running ones
- long-running operations can run in parallel
- we preserve safety (data coherency) and liveness (no deadlock, no work
  postponed indefinitely) on the cluster

Reaching the maximum possible parallelism is a Non-Goal. We have identified a
set of operations that are currently bottlenecks and need to be parallelised
and have worked on those. In the future it will be possible to address other
needs, thus making the cluster more and more parallel one step at a time.

This document only talks about parallelising Ganeti-level operations, aka
Logical Units, and the locking needed for that. Any other synchronisation lock
needed internally by the code is outside its scope.

Background
----------

Ganeti 1.2 has a single global lock, which is used for all cluster operations.
This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for example
  failovers/fixes) on a cluster.

This also poses scalability problems: as clusters grow in node and instance
count it becomes much more likely that operations which could conceivably run
in parallel (for example because they happen on different nodes) are actually
stalling each other while waiting for the global lock, without a real reason
for that to happen.

Overview
--------

This design doc is best read in the context of the accompanying design
docs for Ganeti 2.0: Master daemon design and Job queue design.

We intend to implement a Ganeti locking library, which can be used by the
various Ganeti code components in order to easily, efficiently and correctly
grab the locks they need to perform their function.

The proposed library has these features:

- Internally managing all the locks, making the implementation transparent
  to their usage
- Automatically grabbing multiple locks in the right order (avoid deadlock)
- Ability to transparently handle conversion to more granularity
- Support asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a distributed
operation. In case of master failure, though, if some locks were held it means
some opcodes were in progress, so when recovery of the job queue is done it
will be possible to determine from the interrupted opcodes which operations
could have been left halfway through and thus which locks could have been
held. It is then the responsibility of the master failover code, the cluster
verification code, or the admin to do what's necessary to make sure that any
leftover state is dealt with. This is not an issue from a locking point of
view because the fact that the previous master has failed means that it
cannot do any job.

A corollary of this is that a master-failover operation with both masters alive
needs to happen while no other locks are held.

Detailed Design
---------------

The Locks
~~~~~~~~~

At the first stage we have decided to provide the following locks:
- One "config file" lock
84
- One lock per node in the cluster
85
- One lock per instance in the cluster
86

    
87

All the instance locks will need to be taken before the node locks, and the
node locks before the config lock. Locks will need to be acquired at the same
time for multiple instances and nodes, and the internal ordering will be
handled within the locking library, which, for simplicity, will just use
alphabetical order.
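
As an illustration, here is a minimal sketch (in Python, with hypothetical
names, not the final library API) of how these two rules could be enforced: a
fixed traversal order across the lock levels, and alphabetical order within
each level::

  import threading

  # The fixed acquisition order: instances first, then nodes, then the config.
  LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = range(3)

  class SimpleLockManager(object):
    def __init__(self, instance_names, node_names):
      self._locks = {
        LEVEL_INSTANCE: dict((name, threading.Lock()) for name in instance_names),
        LEVEL_NODE: dict((name, threading.Lock()) for name in node_names),
        LEVEL_CONFIG: {"config": threading.Lock()},
        }

    def acquire(self, wanted):
      """Acquire {level: [names]}, respecting level and alphabetical order."""
      for level in sorted(wanted):          # instances, then nodes, then config
        for name in sorted(wanted[level]):  # alphabetical order within a level
          self._locks[level][name].acquire()

    def release(self, wanted):
      """Release everything, in reverse order of acquisition."""
      for level in sorted(wanted, reverse=True):
        for name in sorted(wanted[level], reverse=True):
          self._locks[level][name].release()

  # Example: an operation touching two instances and one node.
  mgr = SimpleLockManager(["inst1", "inst2"], ["node1", "node2"])
  mgr.acquire({LEVEL_INSTANCE: ["inst2", "inst1"], LEVEL_NODE: ["node1"]})
  # ... do the actual work here ...
  mgr.release({LEVEL_INSTANCE: ["inst2", "inst1"], LEVEL_NODE: ["node1"]})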

Handling conversion to more granularity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to convert to a more granular approach transparently, each time we
split a lock into several sub-locks we'll create a "metalock", which will
depend on those sub-locks and live for the time necessary for all the code to
convert (or forever, in some conditions). When a metalock exists all converted
code must acquire it in shared mode, so it can run concurrently, but still be
exclusive with old code, which acquires it exclusively.

In the beginning the only such lock will be what replaces the current "command"
lock, and will acquire all the locks in the system, before proceeding. This
lock will be called the "Big Ganeti Lock" because holding that one will avoid
any other concurrent Ganeti operations.
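
The shared/exclusive semantics could look roughly like this sketch (simplified
and hypothetical, with no fairness guarantees), where converted code takes the
Big Ganeti Lock in shared mode while unconverted code takes it exclusively::

  import threading

  class SharedLock(object):
    """Many concurrent shared holders, or a single exclusive holder."""

    def __init__(self):
      self._cond = threading.Condition(threading.Lock())
      self._shared = 0
      self._exclusive = False

    def acquire(self, shared=False):
      self._cond.acquire()
      try:
        if shared:
          while self._exclusive:
            self._cond.wait()
          self._shared += 1
        else:
          while self._exclusive or self._shared > 0:
            self._cond.wait()
          self._exclusive = True
      finally:
        self._cond.release()

    def release(self):
      self._cond.acquire()
      try:
        if self._exclusive:
          self._exclusive = False
        else:
          self._shared -= 1
        self._cond.notifyAll()
      finally:
        self._cond.release()

  BGL = SharedLock()

  # Converted code: shared acquisition, runs in parallel with other converted
  # code and then grabs its own granular locks.
  BGL.acquire(shared=True)
  # ... granular locking and actual work ...
  BGL.release()

  # Unconverted code: exclusive acquisition, mutually exclusive with everything.
  BGL.acquire()
  # ... old-style "lock everything" operation ...
  BGL.release()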

We might also want to devise more metalocks (e.g. all nodes, all nodes+config)
in order to make it easier for some parts of the code to acquire what they need
without specifying it explicitly.

In the future things like the node locks could become metalocks, should we
decide to split them into an even more fine-grained approach, but this will
probably be only after the first 2.0 version has been released.

Library API
~~~~~~~~~~~

All the locking will be its own class, and the locks will be created at
initialisation time, from the config file.

The API will have a way to grab one or more than one lock at the same time.
Any attempt to grab a lock while already holding one in the wrong order will
be checked for, and will fail.
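
For example, the wrong-order check could be as simple as remembering the
highest lock level already owned by the current operation and refusing to go
backwards (again only an illustrative sketch, not the final API)::

  # Levels in their mandatory acquisition order.
  LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = range(3)

  class LockOrderError(Exception):
    pass

  class OrderChecker(object):
    """Remembers the highest level owned and rejects out-of-order requests."""

    def __init__(self):
      self._owned = None

    def check(self, level):
      if self._owned is not None and level <= self._owned:
        raise LockOrderError("cannot acquire level %s while holding level %s" %
                             (level, self._owned))
      self._owned = level

  checker = OrderChecker()
  checker.check(LEVEL_INSTANCE)   # fine: instance locks come first
  checker.check(LEVEL_NODE)       # fine: node locks after instance locks
  checker.check(LEVEL_INSTANCE)   # raises LockOrderError: wrong order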

Adding/Removing locks
~~~~~~~~~~~~~~~~~~~~~

When a new instance or a new node is created an associated lock must be added
to the list. The relevant code will need to inform the locking library of such
a change.

This needs to be compatible with every other lock in the system, especially
metalocks that guarantee to grab sets of resources without specifying them
explicitly. The implementation of this will be handled in the locking library
itself.

Of course when instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code which
removes them must own them exclusively or can queue for their ownership, and
thus deals with metalocks exactly as normal code acquiring those locks. Any
operation queueing on a removed lock will fail after its removal.

Asynchronous operations
~~~~~~~~~~~~~~~~~~~~~~~

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only fail if
the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- Try to acquire this lock set and fail if not possible
- Try to acquire one of these lock sets and return the first one you were
  able to get (or after a timeout) (select/poll like)

These operations can be used to prioritise operations based on available locks,
rather than making them just blindly queue to acquire them. The inherent risk,
though, is that any code using the first operation, or setting a timeout for
the second one, is susceptible to starvation and thus may never be able to get
the required locks and complete certain tasks. Considering this,
providing/using these operations should not be among our first priorities.
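
The first of these operations could be sketched as an all-or-nothing,
non-blocking attempt over an already ordered list of locks (again purely
illustrative)::

  import threading

  def try_acquire_all(locks):
    """Try to grab every lock in the (already ordered) list without blocking.

    On failure, roll back the partial acquisition and report it instead of
    queueing for the remaining locks.
    """
    taken = []
    for lock in locks:
      if lock.acquire(False):         # non-blocking attempt
        taken.append(lock)
      else:
        for owned in reversed(taken): # roll back what was already acquired
          owned.release()
        return False
    return True

  locks = [threading.Lock(), threading.Lock()]
  if try_acquire_all(locks):
    # ... do the high-priority work ...
    for lock in reversed(locks):
      lock.release()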

Locking granularity
~~~~~~~~~~~~~~~~~~~

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical Unit
level. In the future we may want to split logical units into independent
"tasklets" with their own locking requirements. A different design doc (or mini
design doc) will cover the move from Logical Units to tasklets.
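
In practice this means each Logical Unit would declare the locks it needs and
the surrounding machinery would wrap its execution, along these lines (a sketch
only; the attribute and method names are illustrative, and ``manager`` stands
for any object with the acquire/release interface sketched earlier)::

  class LUExample(object):
    """A Logical Unit declaring its locking needs up front."""

    # Locks needed by this LU, per level (hypothetical format).
    needed_locks = {
      "instance": ["inst1.example.com"],
      "node": ["node1.example.com"],
      }

    def Exec(self):
      # The actual work; runs only while the declared locks are held.
      pass

  def RunLU(manager, lu):
    """Acquire the LU's locks, run it, and always release them afterwards."""
    manager.acquire(lu.needed_locks)
    try:
      lu.Exec()
    finally:
      manager.release(lu.needed_locks)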

Caveats
-------

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock everything"
lock to the new system, though, needs to be carefully scrutinised to be sure it
is really acquiring all the necessary locks, and none has been overlooked or
forgotten.

The code can contain other locks outside of this library, to synchronise other
threaded code (e.g. for the job queue), but in general these should be leaf
locks or carefully structured non-leaf ones, to avoid deadlock race conditions.