root / doc / design-2.0-locking.rst @ cd55576a
History | View | Annotate | Download (9.4 kB)
1 | 040408a3 | Guido Trotter | Ganeti 2.0 Granular Locking |
---|---|---|---|
2 | 040408a3 | Guido Trotter | =========================== |
3 | 040408a3 | Guido Trotter | |
4 | 47eb4b45 | Guido Trotter | .. contents:: |
5 | 47eb4b45 | Guido Trotter | |
6 | 040408a3 | Guido Trotter | Objective |
7 | 040408a3 | Guido Trotter | --------- |
8 | 040408a3 | Guido Trotter | |
9 | 040408a3 | Guido Trotter | We want to make sure that multiple operations can run in parallel on a Ganeti |
10 | 040408a3 | Guido Trotter | Cluster. In order for this to happen we need to make sure concurrently run |
11 | 040408a3 | Guido Trotter | operations don't step on each other toes and break the cluster. |
12 | 040408a3 | Guido Trotter | |
13 | 040408a3 | Guido Trotter | This design addresses how we are going to deal with locking so that: |
14 | 040408a3 | Guido Trotter | |
15 | 040408a3 | Guido Trotter | - high urgency operations are not stopped by long length ones |
16 | 040408a3 | Guido Trotter | - long length operations can run in parallel |
17 | 040408a3 | Guido Trotter | - we preserve safety (data coherency) and liveness (no deadlock, no work |
18 | 040408a3 | Guido Trotter | postponed indefinitely) on the cluster |
19 | 040408a3 | Guido Trotter | |
20 | 040408a3 | Guido Trotter | Reaching the maximum possible parallelism is a Non-Goal. We have identified a |
21 | 040408a3 | Guido Trotter | set of operations that are currently bottlenecks and need to be parallelised |
22 | 040408a3 | Guido Trotter | and have worked on those. In the future it will be possible to address other |
23 | 040408a3 | Guido Trotter | needs, thus making the cluster more and more parallel one step at a time. |
24 | 040408a3 | Guido Trotter | |
25 | 040408a3 | Guido Trotter | This document only talks about parallelising Ganeti level operations, aka |
26 | 040408a3 | Guido Trotter | Logical Units, and the locking needed for that. Any other synchronisation lock |
27 | 040408a3 | Guido Trotter | needed internally by the code is outside its scope. |
28 | 040408a3 | Guido Trotter | |
29 | 040408a3 | Guido Trotter | Background |
30 | 040408a3 | Guido Trotter | ---------- |
31 | 040408a3 | Guido Trotter | |
32 | 040408a3 | Guido Trotter | Ganeti 1.2 has a single global lock, which is used for all cluster operations. |
33 | 040408a3 | Guido Trotter | This has been painful at various times, for example: |
34 | 040408a3 | Guido Trotter | |
35 | 040408a3 | Guido Trotter | - It is impossible for two people to efficiently interact with a cluster |
36 | 040408a3 | Guido Trotter | (for example for debugging) at the same time. |
37 | 040408a3 | Guido Trotter | - When batch jobs are running it's impossible to do other work (for example |
38 | 040408a3 | Guido Trotter | failovers/fixes) on a cluster. |
39 | 040408a3 | Guido Trotter | |
40 | 040408a3 | Guido Trotter | This also poses scalability problems: as clusters grow in node and instance |
41 | 040408a3 | Guido Trotter | size it's a lot more likely that operations which one could conceive should run |
42 | 040408a3 | Guido Trotter | in parallel (for example because they happen on different nodes) are actually |
43 | 040408a3 | Guido Trotter | stalling each other while waiting for the global lock, without a real reason |
44 | 040408a3 | Guido Trotter | for that to happen. |
45 | 040408a3 | Guido Trotter | |
46 | 040408a3 | Guido Trotter | Overview |
47 | 040408a3 | Guido Trotter | -------- |
48 | 040408a3 | Guido Trotter | |
49 | 040408a3 | Guido Trotter | This design doc is best read in the context of the accompanying design |
50 | 040408a3 | Guido Trotter | docs for Ganeti 2.0: Master daemon design and Job queue design. |
51 | 040408a3 | Guido Trotter | |
52 | 040408a3 | Guido Trotter | We intend to implement a Ganeti locking library, which can be used by the |
53 | 040408a3 | Guido Trotter | various ganeti code components in order to easily, efficiently and correctly |
54 | 040408a3 | Guido Trotter | grab the locks they need to perform their function. |
55 | 040408a3 | Guido Trotter | |
56 | 040408a3 | Guido Trotter | The proposed library has these features: |
57 | 040408a3 | Guido Trotter | |
58 | 040408a3 | Guido Trotter | - Internally managing all the locks, making the implementation transparent |
59 | 040408a3 | Guido Trotter | from their usage |
60 | 040408a3 | Guido Trotter | - Automatically grabbing multiple locks in the right order (avoid deadlock) |
61 | 040408a3 | Guido Trotter | - Ability to transparently handle conversion to more granularity |
62 | 040408a3 | Guido Trotter | - Support asynchronous operation (future goal) |
63 | 040408a3 | Guido Trotter | |
64 | 040408a3 | Guido Trotter | Locking will be valid only on the master node and will not be a distributed |
65 | 040408a3 | Guido Trotter | operation. In case of master failure, though, if some locks were held it means |
66 | 040408a3 | Guido Trotter | some opcodes were in progress, so when recovery of the job queue is done it |
67 | 040408a3 | Guido Trotter | will be possible to determine by the interrupted opcodes which operations could |
68 | 040408a3 | Guido Trotter | have been left half way through and thus which locks could have been held. It |
69 | 040408a3 | Guido Trotter | is then the responsibility either of the master failover code, of the cluster |
70 | 040408a3 | Guido Trotter | verification code, or of the admin to do what's necessary to make sure that any |
71 | 040408a3 | Guido Trotter | leftover state is dealt with. This is not an issue from a locking point of view |
72 | 040408a3 | Guido Trotter | because the fact that the previous master has failed means that it cannot do |
73 | 040408a3 | Guido Trotter | any job. |
74 | 040408a3 | Guido Trotter | |
75 | 040408a3 | Guido Trotter | A corollary of this is that a master-failover operation with both masters alive |
76 | 040408a3 | Guido Trotter | needs to happen while no other locks are held. |
77 | 040408a3 | Guido Trotter | |
78 | 040408a3 | Guido Trotter | Detailed Design |
79 | 040408a3 | Guido Trotter | --------------- |
80 | 040408a3 | Guido Trotter | |
81 | 040408a3 | Guido Trotter | The Locks |
82 | 040408a3 | Guido Trotter | ~~~~~~~~~ |
83 | 040408a3 | Guido Trotter | At the first stage we have decided to provide the following locks: |
84 | 040408a3 | Guido Trotter | |
85 | 040408a3 | Guido Trotter | - One "config file" lock |
86 | 040408a3 | Guido Trotter | - One lock per node in the cluster |
87 | 040408a3 | Guido Trotter | - One lock per instance in the cluster |
88 | 040408a3 | Guido Trotter | |
89 | 040408a3 | Guido Trotter | All the instance locks will need to be taken before the node locks, and the |
90 | 040408a3 | Guido Trotter | node locks before the config lock. Locks will need to be acquired at the same |
91 | 040408a3 | Guido Trotter | time for multiple instances and nodes, and internal ordering will be dealt |
92 | 040408a3 | Guido Trotter | within the locking library, which, for simplicity, will just use alphabetical |
93 | 040408a3 | Guido Trotter | order. |
94 | 040408a3 | Guido Trotter | |
95 | 040408a3 | Guido Trotter | Handling conversion to more granularity |
96 | 040408a3 | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
97 | 040408a3 | Guido Trotter | |
98 | 040408a3 | Guido Trotter | In order to convert to a more granular approach transparently each time we |
99 | 040408a3 | Guido Trotter | split a lock into more we'll create a "metalock", which will depend on those |
100 | 040408a3 | Guido Trotter | sublocks and live for the time necessary for all the code to convert (or |
101 | 040408a3 | Guido Trotter | forever, in some conditions). When a metalock exists all converted code must |
102 | 040408a3 | Guido Trotter | acquire it in shared mode, so it can run concurrently, but still be exclusive |
103 | 040408a3 | Guido Trotter | with old code, which acquires it exclusively. |
104 | 040408a3 | Guido Trotter | |
105 | 040408a3 | Guido Trotter | In the beginning the only such lock will be what replaces the current "command" |
106 | 040408a3 | Guido Trotter | lock, and will acquire all the locks in the system, before proceeding. This |
107 | 040408a3 | Guido Trotter | lock will be called the "Big Ganeti Lock" because holding that one will avoid |
108 | 040408a3 | Guido Trotter | any other concurrent ganeti operations. |
109 | 040408a3 | Guido Trotter | |
110 | 040408a3 | Guido Trotter | We might also want to devise more metalocks (eg. all nodes, all nodes+config) |
111 | 040408a3 | Guido Trotter | in order to make it easier for some parts of the code to acquire what it needs |
112 | 040408a3 | Guido Trotter | without specifying it explicitly. |
113 | 040408a3 | Guido Trotter | |
114 | 040408a3 | Guido Trotter | In the future things like the node locks could become metalocks, should we |
115 | 040408a3 | Guido Trotter | decide to split them into an even more fine grained approach, but this will |
116 | 040408a3 | Guido Trotter | probably be only after the first 2.0 version has been released. |
117 | 040408a3 | Guido Trotter | |
118 | 040408a3 | Guido Trotter | Library API |
119 | 040408a3 | Guido Trotter | ~~~~~~~~~~~ |
120 | 040408a3 | Guido Trotter | |
121 | 040408a3 | Guido Trotter | All the locking will be its own class, and the locks will be created at |
122 | 040408a3 | Guido Trotter | initialisation time, from the config file. |
123 | 040408a3 | Guido Trotter | |
124 | 040408a3 | Guido Trotter | The API will have a way to grab one or more than one locks at the same time. |
125 | 040408a3 | Guido Trotter | Any attempt to grab a lock while already holding one in the wrong order will be |
126 | 040408a3 | Guido Trotter | checked for, and fail. |
127 | 040408a3 | Guido Trotter | |
128 | 164a5bcb | Guido Trotter | Adding/Removing locks |
129 | 164a5bcb | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~~ |
130 | 040408a3 | Guido Trotter | |
131 | 040408a3 | Guido Trotter | When a new instance or a new node is created an associated lock must be added |
132 | 040408a3 | Guido Trotter | to the list. The relevant code will need to inform the locking library of such |
133 | 040408a3 | Guido Trotter | a change. |
134 | 040408a3 | Guido Trotter | |
135 | 040408a3 | Guido Trotter | This needs to be compatible with every other lock in the system, especially |
136 | 040408a3 | Guido Trotter | metalocks that guarantee to grab sets of resources without specifying them |
137 | 164a5bcb | Guido Trotter | explicitly. The implementation of this will be handled in the locking library |
138 | 164a5bcb | Guido Trotter | itself. |
139 | 164a5bcb | Guido Trotter | |
140 | 164a5bcb | Guido Trotter | Of course when instances or nodes disappear from the cluster the relevant locks |
141 | 164a5bcb | Guido Trotter | must be removed. This is easier than adding new elements, as the code which |
142 | 164a5bcb | Guido Trotter | removes them must own them exclusively or can queue for their ownership, and |
143 | 164a5bcb | Guido Trotter | thus deals with metalocks exactly as normal code acquiring those locks. Any |
144 | 164a5bcb | Guido Trotter | operation queueing on a removed lock will fail after its removal. |
145 | 040408a3 | Guido Trotter | |
146 | 040408a3 | Guido Trotter | Asynchronous operations |
147 | 040408a3 | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~~~~ |
148 | 040408a3 | Guido Trotter | |
149 | 040408a3 | Guido Trotter | For the first version the locking library will only export synchronous |
150 | 040408a3 | Guido Trotter | operations, which will block till the needed lock are held, and only fail if |
151 | 040408a3 | Guido Trotter | the request is impossible or somehow erroneous. |
152 | 040408a3 | Guido Trotter | |
153 | 040408a3 | Guido Trotter | In the future we may want to implement different types of asynchronous |
154 | 040408a3 | Guido Trotter | operations such as: |
155 | 040408a3 | Guido Trotter | |
156 | 040408a3 | Guido Trotter | - Try to acquire this lock set and fail if not possible |
157 | 040408a3 | Guido Trotter | - Try to acquire one of these lock sets and return the first one you were |
158 | 040408a3 | Guido Trotter | able to get (or after a timeout) (select/poll like) |
159 | 040408a3 | Guido Trotter | |
160 | 6e4f6dfa | Guido Trotter | These operations can be used to prioritize operations based on available locks, |
161 | 6e4f6dfa | Guido Trotter | rather than making them just blindly queue for acquiring them. The inherent |
162 | 6e4f6dfa | Guido Trotter | risk, though, is that any code using the first operation, or setting a timeout |
163 | 6e4f6dfa | Guido Trotter | for the second one, is susceptible to starvation and thus may never be able to |
164 | 6e4f6dfa | Guido Trotter | get the required locks and complete certain tasks. Considering this |
165 | 6e4f6dfa | Guido Trotter | providing/using these operations should not be among our first priorities. |
166 | 040408a3 | Guido Trotter | |
167 | 040408a3 | Guido Trotter | Locking granularity |
168 | 040408a3 | Guido Trotter | ~~~~~~~~~~~~~~~~~~~ |
169 | 040408a3 | Guido Trotter | |
170 | 040408a3 | Guido Trotter | For the first version of this code we'll convert each Logical Unit to |
171 | 040408a3 | Guido Trotter | acquire/release the locks it needs, so locking will be at the Logical Unit |
172 | 040408a3 | Guido Trotter | level. In the future we may want to split logical units in independent |
173 | 040408a3 | Guido Trotter | "tasklets" with their own locking requirements. A different design doc (or mini |
174 | 040408a3 | Guido Trotter | design doc) will cover the move from Logical Units to tasklets. |
175 | 040408a3 | Guido Trotter | |
176 | 4e8d0685 | Guido Trotter | Lock acquisition code path |
177 | 4e8d0685 | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
178 | 4e8d0685 | Guido Trotter | |
179 | 4e8d0685 | Guido Trotter | In general when acquiring locks we should use a code path equivalent to:: |
180 | 4e8d0685 | Guido Trotter | |
181 | 4e8d0685 | Guido Trotter | lock.acquire() |
182 | 4e8d0685 | Guido Trotter | try: |
183 | 4e8d0685 | Guido Trotter | ... |
184 | 4e8d0685 | Guido Trotter | # other code |
185 | 4e8d0685 | Guido Trotter | finally: |
186 | 4e8d0685 | Guido Trotter | lock.release() |
187 | 4e8d0685 | Guido Trotter | |
188 | 4e8d0685 | Guido Trotter | This makes sure we release all locks, and avoid possible deadlocks. Of course |
189 | 4e8d0685 | Guido Trotter | extra care must be used not to leave, if possible locked structures in an |
190 | 4e8d0685 | Guido Trotter | unusable state. |
191 | 4e8d0685 | Guido Trotter | |
192 | 4e8d0685 | Guido Trotter | In order to avoid this extra indentation and code changes everywhere in the |
193 | 4e8d0685 | Guido Trotter | Logical Units code, we decided to allow LUs to declare locks, and then execute |
194 | 4e8d0685 | Guido Trotter | their code with their locks acquired. In the new world LUs are called like |
195 | 4e8d0685 | Guido Trotter | this:: |
196 | 4e8d0685 | Guido Trotter | |
197 | 4e8d0685 | Guido Trotter | # user passed names are expanded to the internal lock/resource name, |
198 | 4e8d0685 | Guido Trotter | # then known needed locks are declared |
199 | 4e8d0685 | Guido Trotter | lu.ExpandNames() |
200 | 4e8d0685 | Guido Trotter | ... some locking/adding of locks may happen ... |
201 | 4e8d0685 | Guido Trotter | # late declaration of locks for one level: this is useful because sometimes |
202 | 4e8d0685 | Guido Trotter | # we can't know which resource we need before locking the previous level |
203 | 4e8d0685 | Guido Trotter | lu.DeclareLocks() # for each level (cluster, instance, node) |
204 | 4e8d0685 | Guido Trotter | ... more locking/adding of locks can happen ... |
205 | 4e8d0685 | Guido Trotter | # these functions are called with the proper locks held |
206 | 4e8d0685 | Guido Trotter | lu.CheckPrereq() |
207 | 4e8d0685 | Guido Trotter | lu.Exec() |
208 | 4e8d0685 | Guido Trotter | ... locks declared for removal are removed, all acquired locks released ... |
209 | 4e8d0685 | Guido Trotter | |
210 | 4e8d0685 | Guido Trotter | The Processor and the LogicalUnit class will contain exact documentation on how |
211 | 4e8d0685 | Guido Trotter | locks are supposed to be declared. |
212 | 4e8d0685 | Guido Trotter | |
213 | 040408a3 | Guido Trotter | Caveats |
214 | 040408a3 | Guido Trotter | ------- |
215 | 040408a3 | Guido Trotter | |
216 | 040408a3 | Guido Trotter | This library will provide an easy upgrade path to bring all the code to |
217 | 040408a3 | Guido Trotter | granular locking without breaking everything, and it will also guarantee |
218 | 040408a3 | Guido Trotter | against a lot of common errors. Code switching from the old "lock everything" |
219 | 040408a3 | Guido Trotter | lock to the new system, though, needs to be carefully scrutinised to be sure it |
220 | 040408a3 | Guido Trotter | is really acquiring all the necessary locks, and none has been overlooked or |
221 | 040408a3 | Guido Trotter | forgotten. |
222 | 040408a3 | Guido Trotter | |
223 | 040408a3 | Guido Trotter | The code can contain other locks outside of this library, to synchronise other |
224 | 040408a3 | Guido Trotter | threaded code (eg for the job queue) but in general these should be leaf locks |
225 | 040408a3 | Guido Trotter | or carefully structured non-leaf ones, to avoid deadlock race conditions. |