Statistics
| Branch: | Tag: | Revision:

root / doc / design-2.0-locking.rst @ 6e4f6dfa

History | View | Annotate | Download (8.1 kB)

1 040408a3 Guido Trotter
Ganeti 2.0 Granular Locking
2 040408a3 Guido Trotter
===========================
3 040408a3 Guido Trotter
4 040408a3 Guido Trotter
Objective
5 040408a3 Guido Trotter
---------
6 040408a3 Guido Trotter
7 040408a3 Guido Trotter
We want to make sure that multiple operations can run in parallel on a Ganeti
8 040408a3 Guido Trotter
Cluster. In order for this to happen we need to make sure concurrently run
9 040408a3 Guido Trotter
operations don't step on each other toes and break the cluster.
10 040408a3 Guido Trotter
11 040408a3 Guido Trotter
This design addresses how we are going to deal with locking so that:
12 040408a3 Guido Trotter
13 040408a3 Guido Trotter
- high urgency operations are not stopped by long length ones
14 040408a3 Guido Trotter
- long length operations can run in parallel
15 040408a3 Guido Trotter
- we preserve safety (data coherency) and liveness (no deadlock, no work
16 040408a3 Guido Trotter
  postponed indefinitely) on the cluster
17 040408a3 Guido Trotter
18 040408a3 Guido Trotter
Reaching the maximum possible parallelism is a Non-Goal. We have identified a
19 040408a3 Guido Trotter
set of operations that are currently bottlenecks and need to be parallelised
20 040408a3 Guido Trotter
and have worked on those. In the future it will be possible to address other
21 040408a3 Guido Trotter
needs, thus making the cluster more and more parallel one step at a time.
22 040408a3 Guido Trotter
23 040408a3 Guido Trotter
This document only talks about parallelising Ganeti level operations, aka
24 040408a3 Guido Trotter
Logical Units, and the locking needed for that. Any other synchronisation lock
25 040408a3 Guido Trotter
needed internally by the code is outside its scope.
26 040408a3 Guido Trotter
27 040408a3 Guido Trotter
Background
28 040408a3 Guido Trotter
----------
29 040408a3 Guido Trotter
30 040408a3 Guido Trotter
Ganeti 1.2 has a single global lock, which is used for all cluster operations.
31 040408a3 Guido Trotter
This has been painful at various times, for example:
32 040408a3 Guido Trotter
33 040408a3 Guido Trotter
- It is impossible for two people to efficiently interact with a cluster
34 040408a3 Guido Trotter
  (for example for debugging) at the same time.
35 040408a3 Guido Trotter
- When batch jobs are running it's impossible to do other work (for example
36 040408a3 Guido Trotter
  failovers/fixes) on a cluster.
37 040408a3 Guido Trotter
38 040408a3 Guido Trotter
This also poses scalability problems: as clusters grow in node and instance
39 040408a3 Guido Trotter
size it's a lot more likely that operations which one could conceive should run
40 040408a3 Guido Trotter
in parallel (for example because they happen on different nodes) are actually
41 040408a3 Guido Trotter
stalling each other while waiting for the global lock, without a real reason
42 040408a3 Guido Trotter
for that to happen.
43 040408a3 Guido Trotter
44 040408a3 Guido Trotter
Overview
45 040408a3 Guido Trotter
--------
46 040408a3 Guido Trotter
47 040408a3 Guido Trotter
This design doc is best read in the context of the accompanying design
48 040408a3 Guido Trotter
docs for Ganeti 2.0: Master daemon design and Job queue design.
49 040408a3 Guido Trotter
50 040408a3 Guido Trotter
We intend to implement a Ganeti locking library, which can be used by the
51 040408a3 Guido Trotter
various ganeti code components in order to easily, efficiently and correctly
52 040408a3 Guido Trotter
grab the locks they need to perform their function.
53 040408a3 Guido Trotter
54 040408a3 Guido Trotter
The proposed library has these features:
55 040408a3 Guido Trotter
56 040408a3 Guido Trotter
- Internally managing all the locks, making the implementation transparent
57 040408a3 Guido Trotter
  from their usage
58 040408a3 Guido Trotter
- Automatically grabbing multiple locks in the right order (avoid deadlock)
59 040408a3 Guido Trotter
- Ability to transparently handle conversion to more granularity
60 040408a3 Guido Trotter
- Support asynchronous operation (future goal)
61 040408a3 Guido Trotter
62 040408a3 Guido Trotter
Locking will be valid only on the master node and will not be a distributed
63 040408a3 Guido Trotter
operation. In case of master failure, though, if some locks were held it means
64 040408a3 Guido Trotter
some opcodes were in progress, so when recovery of the job queue is done it
65 040408a3 Guido Trotter
will be possible to determine by the interrupted opcodes which operations could
66 040408a3 Guido Trotter
have been left half way through and thus which locks could have been held. It
67 040408a3 Guido Trotter
is then the responsibility either of the master failover code, of the cluster
68 040408a3 Guido Trotter
verification code, or of the admin to do what's necessary to make sure that any
69 040408a3 Guido Trotter
leftover state is dealt with. This is not an issue from a locking point of view
70 040408a3 Guido Trotter
because the fact that the previous master has failed means that it cannot do
71 040408a3 Guido Trotter
any job.
72 040408a3 Guido Trotter
73 040408a3 Guido Trotter
A corollary of this is that a master-failover operation with both masters alive
74 040408a3 Guido Trotter
needs to happen while no other locks are held.
75 040408a3 Guido Trotter
76 040408a3 Guido Trotter
Detailed Design
77 040408a3 Guido Trotter
---------------
78 040408a3 Guido Trotter
79 040408a3 Guido Trotter
The Locks
80 040408a3 Guido Trotter
~~~~~~~~~
81 040408a3 Guido Trotter
At the first stage we have decided to provide the following locks:
82 040408a3 Guido Trotter
83 040408a3 Guido Trotter
- One "config file" lock
84 040408a3 Guido Trotter
- One lock per node in the cluster
85 040408a3 Guido Trotter
- One lock per instance in the cluster
86 040408a3 Guido Trotter
87 040408a3 Guido Trotter
All the instance locks will need to be taken before the node locks, and the
88 040408a3 Guido Trotter
node locks before the config lock. Locks will need to be acquired at the same
89 040408a3 Guido Trotter
time for multiple instances and nodes, and internal ordering will be dealt
90 040408a3 Guido Trotter
within the locking library, which, for simplicity, will just use alphabetical
91 040408a3 Guido Trotter
order.
92 040408a3 Guido Trotter
93 040408a3 Guido Trotter
Handling conversion to more granularity
94 040408a3 Guido Trotter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
95 040408a3 Guido Trotter
96 040408a3 Guido Trotter
In order to convert to a more granular approach transparently each time we
97 040408a3 Guido Trotter
split a lock into more we'll create a "metalock", which will depend on those
98 040408a3 Guido Trotter
sublocks and live for the time necessary for all the code to convert (or
99 040408a3 Guido Trotter
forever, in some conditions). When a metalock exists all converted code must
100 040408a3 Guido Trotter
acquire it in shared mode, so it can run concurrently, but still be exclusive
101 040408a3 Guido Trotter
with old code, which acquires it exclusively.
102 040408a3 Guido Trotter
103 040408a3 Guido Trotter
In the beginning the only such lock will be what replaces the current "command"
104 040408a3 Guido Trotter
lock, and will acquire all the locks in the system, before proceeding. This
105 040408a3 Guido Trotter
lock will be called the "Big Ganeti Lock" because holding that one will avoid
106 040408a3 Guido Trotter
any other concurrent ganeti operations.
107 040408a3 Guido Trotter
108 040408a3 Guido Trotter
We might also want to devise more metalocks (eg. all nodes, all nodes+config)
109 040408a3 Guido Trotter
in order to make it easier for some parts of the code to acquire what it needs
110 040408a3 Guido Trotter
without specifying it explicitly.
111 040408a3 Guido Trotter
112 040408a3 Guido Trotter
In the future things like the node locks could become metalocks, should we
113 040408a3 Guido Trotter
decide to split them into an even more fine grained approach, but this will
114 040408a3 Guido Trotter
probably be only after the first 2.0 version has been released.
115 040408a3 Guido Trotter
116 040408a3 Guido Trotter
Library API
117 040408a3 Guido Trotter
~~~~~~~~~~~
118 040408a3 Guido Trotter
119 040408a3 Guido Trotter
All the locking will be its own class, and the locks will be created at
120 040408a3 Guido Trotter
initialisation time, from the config file.
121 040408a3 Guido Trotter
122 040408a3 Guido Trotter
The API will have a way to grab one or more than one locks at the same time.
123 040408a3 Guido Trotter
Any attempt to grab a lock while already holding one in the wrong order will be
124 040408a3 Guido Trotter
checked for, and fail.
125 040408a3 Guido Trotter
126 164a5bcb Guido Trotter
Adding/Removing locks
127 164a5bcb Guido Trotter
~~~~~~~~~~~~~~~~~~~~~
128 040408a3 Guido Trotter
129 040408a3 Guido Trotter
When a new instance or a new node is created an associated lock must be added
130 040408a3 Guido Trotter
to the list. The relevant code will need to inform the locking library of such
131 040408a3 Guido Trotter
a change.
132 040408a3 Guido Trotter
133 040408a3 Guido Trotter
This needs to be compatible with every other lock in the system, especially
134 040408a3 Guido Trotter
metalocks that guarantee to grab sets of resources without specifying them
135 164a5bcb Guido Trotter
explicitly. The implementation of this will be handled in the locking library
136 164a5bcb Guido Trotter
itself.
137 164a5bcb Guido Trotter
138 164a5bcb Guido Trotter
Of course when instances or nodes disappear from the cluster the relevant locks
139 164a5bcb Guido Trotter
must be removed. This is easier than adding new elements, as the code which
140 164a5bcb Guido Trotter
removes them must own them exclusively or can queue for their ownership, and
141 164a5bcb Guido Trotter
thus deals with metalocks exactly as normal code acquiring those locks. Any
142 164a5bcb Guido Trotter
operation queueing on a removed lock will fail after its removal.
143 040408a3 Guido Trotter
144 040408a3 Guido Trotter
Asynchronous operations
145 040408a3 Guido Trotter
~~~~~~~~~~~~~~~~~~~~~~~
146 040408a3 Guido Trotter
147 040408a3 Guido Trotter
For the first version the locking library will only export synchronous
148 040408a3 Guido Trotter
operations, which will block till the needed lock are held, and only fail if
149 040408a3 Guido Trotter
the request is impossible or somehow erroneous.
150 040408a3 Guido Trotter
151 040408a3 Guido Trotter
In the future we may want to implement different types of asynchronous
152 040408a3 Guido Trotter
operations such as:
153 040408a3 Guido Trotter
154 040408a3 Guido Trotter
- Try to acquire this lock set and fail if not possible
155 040408a3 Guido Trotter
- Try to acquire one of these lock sets and return the first one you were
156 040408a3 Guido Trotter
  able to get (or after a timeout) (select/poll like)
157 040408a3 Guido Trotter
158 6e4f6dfa Guido Trotter
These operations can be used to prioritize operations based on available locks,
159 6e4f6dfa Guido Trotter
rather than making them just blindly queue for acquiring them. The inherent
160 6e4f6dfa Guido Trotter
risk, though, is that any code using the first operation, or setting a timeout
161 6e4f6dfa Guido Trotter
for the second one, is susceptible to starvation and thus may never be able to
162 6e4f6dfa Guido Trotter
get the required locks and complete certain tasks. Considering this
163 6e4f6dfa Guido Trotter
providing/using these operations should not be among our first priorities.
164 040408a3 Guido Trotter
165 040408a3 Guido Trotter
Locking granularity
166 040408a3 Guido Trotter
~~~~~~~~~~~~~~~~~~~
167 040408a3 Guido Trotter
168 040408a3 Guido Trotter
For the first version of this code we'll convert each Logical Unit to
169 040408a3 Guido Trotter
acquire/release the locks it needs, so locking will be at the Logical Unit
170 040408a3 Guido Trotter
level.  In the future we may want to split logical units in independent
171 040408a3 Guido Trotter
"tasklets" with their own locking requirements. A different design doc (or mini
172 040408a3 Guido Trotter
design doc) will cover the move from Logical Units to tasklets.
173 040408a3 Guido Trotter
174 040408a3 Guido Trotter
Caveats
175 040408a3 Guido Trotter
-------
176 040408a3 Guido Trotter
177 040408a3 Guido Trotter
This library will provide an easy upgrade path to bring all the code to
178 040408a3 Guido Trotter
granular locking without breaking everything, and it will also guarantee
179 040408a3 Guido Trotter
against a lot of common errors. Code switching from the old "lock everything"
180 040408a3 Guido Trotter
lock to the new system, though, needs to be carefully scrutinised to be sure it
181 040408a3 Guido Trotter
is really acquiring all the necessary locks, and none has been overlooked or
182 040408a3 Guido Trotter
forgotten.
183 040408a3 Guido Trotter
184 040408a3 Guido Trotter
The code can contain other locks outside of this library, to synchronise other
185 040408a3 Guido Trotter
threaded code (eg for the job queue) but in general these should be leaf locks
186 040408a3 Guido Trotter
or carefully structured non-leaf ones, to avoid deadlock race conditions.