root / doc / design-daemons.rst @ bb3011ad
History | View | Annotate | Download (23.7 kB)
1 | ffedf64d | Michele Tartara | ========================== |
---|---|---|---|
2 | ffedf64d | Michele Tartara | Ganeti daemons refactoring |
3 | ffedf64d | Michele Tartara | ========================== |
4 | ffedf64d | Michele Tartara | |
5 | ffedf64d | Michele Tartara | .. contents:: :depth: 2 |
6 | ffedf64d | Michele Tartara | |
7 | ffedf64d | Michele Tartara | This is a design document detailing the plan for refactoring the internal |
8 | ffedf64d | Michele Tartara | structure of Ganeti, and particularly the set of daemons it is divided into. |
9 | ffedf64d | Michele Tartara | |
10 | ffedf64d | Michele Tartara | |
11 | ffedf64d | Michele Tartara | Current state and shortcomings |
12 | ffedf64d | Michele Tartara | ============================== |
13 | ffedf64d | Michele Tartara | |
14 | ffedf64d | Michele Tartara | Ganeti is comprised of a growing number of daemons, each dealing with part of |
15 | ffedf64d | Michele Tartara | the tasks the cluster has to face, and communicating with the other daemons |
16 | ffedf64d | Michele Tartara | using a variety of protocols. |
17 | ffedf64d | Michele Tartara | |
18 | ffedf64d | Michele Tartara | Specifically, as of Ganeti 2.8, the situation is as follows: |
19 | ffedf64d | Michele Tartara | |
20 | ffedf64d | Michele Tartara | ``Master daemon (MasterD)`` |
21 | ffedf64d | Michele Tartara | It is responsible for managing the entire cluster, and it's written in Python. |
22 | ffedf64d | Michele Tartara | It is executed on a single node (the master node). It receives the commands |
23 | ffedf64d | Michele Tartara | given by the cluster administrator (through the remote API daemon or the |
24 | ffedf64d | Michele Tartara | command line tools) over the LUXI protocol. The master daemon is responsible |
25 | ffedf64d | Michele Tartara | for creating and managing the jobs that will execute such commands, and for |
26 | ffedf64d | Michele Tartara | managing the locks that ensure the cluster will not incur in race conditions. |
27 | ffedf64d | Michele Tartara | |
28 | ffedf64d | Michele Tartara | Each job is managed by a separate Python thread, that interacts with the node |
29 | ffedf64d | Michele Tartara | daemons via RPC calls. |
30 | ffedf64d | Michele Tartara | |
31 | ffedf64d | Michele Tartara | The master daemon is also responsible for managing the configuration of the |
32 | ffedf64d | Michele Tartara | cluster, changing it when required by some job. It is also responsible for |
33 | ffedf64d | Michele Tartara | copying the configuration to the other master candidates after updating it. |
34 | ffedf64d | Michele Tartara | |
35 | ffedf64d | Michele Tartara | ``RAPI daemon (RapiD)`` |
36 | ffedf64d | Michele Tartara | It is written in Python and runs on the master node only. It waits for |
37 | ffedf64d | Michele Tartara | requests issued remotely through the remote API protocol. Then, it forwards |
38 | ffedf64d | Michele Tartara | them, using the LUXI protocol, to the master daemon (if they are commands) or |
39 | ffedf64d | Michele Tartara | to the query daemon if they are queries about the configuration (including |
40 | ffedf64d | Michele Tartara | live status) of the cluster. |
41 | ffedf64d | Michele Tartara | |
42 | ffedf64d | Michele Tartara | ``Node daemon (NodeD)`` |
43 | ffedf64d | Michele Tartara | It is written in Python. It runs on all the nodes. It is responsible for |
44 | ffedf64d | Michele Tartara | receiving the master requests over RPC and execute them, using the appropriate |
45 | ffedf64d | Michele Tartara | backend (hypervisors, DRBD, LVM, etc.). It also receives requests over RPC for |
46 | ffedf64d | Michele Tartara | the execution of queries gathering live data on behalf of the query daemon. |
47 | ffedf64d | Michele Tartara | |
48 | ffedf64d | Michele Tartara | ``Configuration daemon (ConfD)`` |
49 | ffedf64d | Michele Tartara | It is written in Haskell. It runs on all the master candidates. Since the |
50 | ffedf64d | Michele Tartara | configuration is replicated only on the master node, this daemon exists in |
51 | ffedf64d | Michele Tartara | order to provide information about the configuration to nodes needing them. |
52 | ffedf64d | Michele Tartara | The requests are done through ConfD's own protocol, HMAC signed, |
53 | ffedf64d | Michele Tartara | implemented over UDP, and meant to be used by parallely querying all the |
54 | ffedf64d | Michele Tartara | master candidates (or a subset thereof) and getting the most up to date |
55 | ffedf64d | Michele Tartara | answer. This is meant as a way to provide a robust service even in case master |
56 | ffedf64d | Michele Tartara | is temporarily unavailable. |
57 | ffedf64d | Michele Tartara | |
58 | ffedf64d | Michele Tartara | ``Query daemon (QueryD)`` |
59 | ffedf64d | Michele Tartara | It is written in Haskell. It runs on all the master candidates. It replies |
60 | ffedf64d | Michele Tartara | to Luxi queries about the current status of the system, including live data it |
61 | ffedf64d | Michele Tartara | obtains by querying the node daemons through RPCs. |
62 | ffedf64d | Michele Tartara | |
63 | ffedf64d | Michele Tartara | ``Monitoring daemon (MonD)`` |
64 | ffedf64d | Michele Tartara | It is written in Haskell. It runs on all nodes, including the ones that are |
65 | ffedf64d | Michele Tartara | not vm-capable. It is meant to provide information on the status of the |
66 | ffedf64d | Michele Tartara | system. Such information is related only to the specific node the daemon is |
67 | ffedf64d | Michele Tartara | running on, and it is provided as JSON encoded data over HTTP, to be easily |
68 | ffedf64d | Michele Tartara | readable by external tools. |
69 | ffedf64d | Michele Tartara | The monitoring daemon communicates with ConfD to get information about the |
70 | ffedf64d | Michele Tartara | configuration of the cluster. The choice of communicating with ConfD instead |
71 | ffedf64d | Michele Tartara | of MasterD allows it to obtain configuration information even when the cluster |
72 | ffedf64d | Michele Tartara | is heavily degraded (e.g.: when master and some, but not all, of the master |
73 | ffedf64d | Michele Tartara | candidates are unreachable). |
74 | ffedf64d | Michele Tartara | |
75 | ffedf64d | Michele Tartara | The current structure of the Ganeti daemons is inefficient because there are |
76 | ffedf64d | Michele Tartara | many different protocols involved, and each daemon needs to be able to use |
77 | ffedf64d | Michele Tartara | multiple ones, and has to deal with doing different things, thus making |
78 | ffedf64d | Michele Tartara | sometimes unclear which daemon is responsible for performing a specific task. |
79 | ffedf64d | Michele Tartara | |
80 | ffedf64d | Michele Tartara | Also, with the current configuration, jobs are managed by the master daemon |
81 | ffedf64d | Michele Tartara | using python threads. This makes terminating a job after it has started a |
82 | ffedf64d | Michele Tartara | difficult operation, and it is the main reason why this is not possible yet. |
83 | ffedf64d | Michele Tartara | |
84 | ffedf64d | Michele Tartara | The master daemon currently has too many different tasks, that could be handled |
85 | ffedf64d | Michele Tartara | better if split among different daemons. |
86 | ffedf64d | Michele Tartara | |
87 | ffedf64d | Michele Tartara | |
88 | ffedf64d | Michele Tartara | Proposed changes |
89 | ffedf64d | Michele Tartara | ================ |
90 | ffedf64d | Michele Tartara | |
91 | ffedf64d | Michele Tartara | In order to improve on the current situation, a new daemon subdivision is |
92 | ffedf64d | Michele Tartara | proposed, and presented hereafter. |
93 | ffedf64d | Michele Tartara | |
94 | ffedf64d | Michele Tartara | .. digraph:: "new-daemons-structure" |
95 | ffedf64d | Michele Tartara | |
96 | ffedf64d | Michele Tartara | {rank=same; RConfD LuxiD;} |
97 | ffedf64d | Michele Tartara | {rank=same; Jobs rconfigdata;} |
98 | ffedf64d | Michele Tartara | node [shape=box] |
99 | ffedf64d | Michele Tartara | RapiD [label="RapiD [M]"] |
100 | ffedf64d | Michele Tartara | LuxiD [label="LuxiD [M]"] |
101 | ffedf64d | Michele Tartara | WConfD [label="WConfD [M]"] |
102 | ffedf64d | Michele Tartara | Jobs [label="Jobs [M]"] |
103 | ffedf64d | Michele Tartara | RConfD [label="RConfD [MC]"] |
104 | ffedf64d | Michele Tartara | MonD [label="MonD [All]"] |
105 | ffedf64d | Michele Tartara | NodeD [label="NodeD [All]"] |
106 | ffedf64d | Michele Tartara | Clients [label="gnt-*\nclients [M]"] |
107 | ffedf64d | Michele Tartara | p1 [shape=none, label=""] |
108 | ffedf64d | Michele Tartara | p2 [shape=none, label=""] |
109 | ffedf64d | Michele Tartara | p3 [shape=none, label=""] |
110 | ffedf64d | Michele Tartara | p4 [shape=none, label=""] |
111 | ffedf64d | Michele Tartara | configdata [shape=none, label="config.data"] |
112 | ffedf64d | Michele Tartara | rconfigdata [shape=none, label="config.data\n[MC copy]"] |
113 | ffedf64d | Michele Tartara | locksdata [shape=none, label="locks.data"] |
114 | ffedf64d | Michele Tartara | |
115 | ffedf64d | Michele Tartara | RapiD -> LuxiD [label="LUXI"] |
116 | ffedf64d | Michele Tartara | LuxiD -> WConfD [label="WConfD\nproto"] |
117 | ffedf64d | Michele Tartara | LuxiD -> Jobs [label="fork/exec"] |
118 | ffedf64d | Michele Tartara | Jobs -> WConfD [label="WConfD\nproto"] |
119 | ffedf64d | Michele Tartara | Jobs -> NodeD [label="RPC"] |
120 | ffedf64d | Michele Tartara | LuxiD -> NodeD [label="RPC"] |
121 | ffedf64d | Michele Tartara | rconfigdata -> RConfD |
122 | ffedf64d | Michele Tartara | configdata -> rconfigdata [label="sync via\nNodeD RPC"] |
123 | ffedf64d | Michele Tartara | WConfD -> NodeD [label="RPC"] |
124 | ffedf64d | Michele Tartara | WConfD -> configdata |
125 | ffedf64d | Michele Tartara | WConfD -> locksdata |
126 | ffedf64d | Michele Tartara | MonD -> RConfD [label="RConfD\nproto"] |
127 | ffedf64d | Michele Tartara | Clients -> LuxiD [label="LUXI"] |
128 | ffedf64d | Michele Tartara | p1 -> MonD [label="MonD proto"] |
129 | ffedf64d | Michele Tartara | p2 -> RapiD [label="RAPI"] |
130 | ffedf64d | Michele Tartara | p3 -> RConfD [label="RConfD\nproto"] |
131 | ffedf64d | Michele Tartara | p4 -> Clients [label="CLI"] |
132 | ffedf64d | Michele Tartara | |
133 | ffedf64d | Michele Tartara | ``LUXI daemon (LuxiD)`` |
134 | ffedf64d | Michele Tartara | It will be written in Haskell. It will run on the master node and it will be |
135 | ffedf64d | Michele Tartara | the only LUXI server, replying to all the LUXI queries. These includes both |
136 | ffedf64d | Michele Tartara | the queries about the live configuration of the cluster, previously served by |
137 | ffedf64d | Michele Tartara | QueryD, and the commands actually changing the status of the cluster by |
138 | ffedf64d | Michele Tartara | submitting jobs. Therefore, this daemon will also be the one responsible with |
139 | ffedf64d | Michele Tartara | managing the job queue. When a job needs to be executed, the LuxiD will spawn |
140 | ffedf64d | Michele Tartara | a separate process tasked with the execution of that specific job, thus making |
141 | ffedf64d | Michele Tartara | it easier to terminate the job itself, if needeed. When a job requires locks, |
142 | ffedf64d | Michele Tartara | LuxiD will request them from WConfD. |
143 | ffedf64d | Michele Tartara | In order to keep availability of the cluster in case of failure of the master |
144 | ffedf64d | Michele Tartara | node, LuxiD will replicate the job queue to the other master candidates, by |
145 | ffedf64d | Michele Tartara | RPCs to the NodeD running there (the choice of RPCs for this task might be |
146 | ffedf64d | Michele Tartara | reviewed at a second time, after implementing this design). |
147 | ffedf64d | Michele Tartara | |
148 | ffedf64d | Michele Tartara | ``Configuration management daemon (WConfD)`` |
149 | ffedf64d | Michele Tartara | It will run on the master node and it will be responsible for the management |
150 | ffedf64d | Michele Tartara | of the authoritative copy of the cluster configuration (that is, it will be |
151 | ffedf64d | Michele Tartara | the daemon actually modifying the ``config.data`` file). All the requests of |
152 | ffedf64d | Michele Tartara | configuration changes will have to pass through this daemon, and will be |
153 | ffedf64d | Michele Tartara | performed using a LUXI-like protocol ("WConfD proto" in the graph. The exact |
154 | ffedf64d | Michele Tartara | protocol will be defined in the separate design document that will detail the |
155 | ffedf64d | Michele Tartara | WConfD separation). Having a single point of configuration management will |
156 | ffedf64d | Michele Tartara | also allow Ganeti to get rid of possible race conditions due to concurrent |
157 | ffedf64d | Michele Tartara | modifications of the configuration. When the configuration is updated, it |
158 | ffedf64d | Michele Tartara | will have to push the received changes to the other master candidates, via |
159 | ffedf64d | Michele Tartara | RPCs, so that RConfD daemons and (in case of a failure on the master node) |
160 | ffedf64d | Michele Tartara | the WConfD daemon on the new master can access an up-to-date version of it |
161 | ffedf64d | Michele Tartara | (the choice of RPCs for this task might be reviewed at a second time). This |
162 | ffedf64d | Michele Tartara | daemon will also be the one responsible for managing the locks, granting them |
163 | ffedf64d | Michele Tartara | to the jobs requesting them, and taking care of freeing them up if the jobs |
164 | ffedf64d | Michele Tartara | holding them crash or are terminated before releasing them. In order to do |
165 | ffedf64d | Michele Tartara | this, each job, after being spawned by LuxiD, will open a local unix socket |
166 | ffedf64d | Michele Tartara | that will be used to communicate with it, and will be destroyed when the job |
167 | ffedf64d | Michele Tartara | terminates. LuxiD will be able to check, after a timeout, whether the job is |
168 | ffedf64d | Michele Tartara | still running by connecting here, and to ask WConfD to forcefully remove the |
169 | ffedf64d | Michele Tartara | locks if the socket is closed. |
170 | ffedf64d | Michele Tartara | Also, WConfD should hold a serialized list of the locks and their owners in a |
171 | ffedf64d | Michele Tartara | file (``locks.data``), so that it can keep track of their status in case it |
172 | ffedf64d | Michele Tartara | crashes and needs to be restarted (by asking LuxiD which of them are still |
173 | ffedf64d | Michele Tartara | running). |
174 | ffedf64d | Michele Tartara | Interaction with this daemon will be performed using Unix sockets. |
175 | ffedf64d | Michele Tartara | |
176 | ffedf64d | Michele Tartara | ``Configuration query daemon (RConfD)`` |
177 | ffedf64d | Michele Tartara | It is written in Haskell, and it corresponds to the old ConfD. It will run on |
178 | ffedf64d | Michele Tartara | all the master candidates and it will serve information about the the static |
179 | ffedf64d | Michele Tartara | configuration of the cluster (the one contained in ``config.data``). The |
180 | ffedf64d | Michele Tartara | provided information will be highly available (as in: a response will be |
181 | ffedf64d | Michele Tartara | available as long as a stable-enough connection between the client and at |
182 | ffedf64d | Michele Tartara | least one working master candidate is available) and its freshness will be |
183 | ffedf64d | Michele Tartara | best effort (the most recent reply from any of the master candidates will be |
184 | ffedf64d | Michele Tartara | returned, but it might still be older than the one available through WConfD). |
185 | ffedf64d | Michele Tartara | The information will be served through the ConfD protocol. |
186 | ffedf64d | Michele Tartara | |
187 | ffedf64d | Michele Tartara | ``Rapi daemon (RapiD)`` |
188 | ffedf64d | Michele Tartara | It remains basically unchanged, with the only difference that all of its LUXI |
189 | ffedf64d | Michele Tartara | query are directed towards LuxiD instead of being split between MasterD and |
190 | ffedf64d | Michele Tartara | QueryD. |
191 | ffedf64d | Michele Tartara | |
192 | ffedf64d | Michele Tartara | ``Monitoring daemon (MonD)`` |
193 | ffedf64d | Michele Tartara | It remains unaffected by the changes in this design document. It will just get |
194 | ffedf64d | Michele Tartara | some of the data it needs from RConfD instead of the old ConfD, but the |
195 | ffedf64d | Michele Tartara | interfaces of the two are identical. |
196 | ffedf64d | Michele Tartara | |
197 | ffedf64d | Michele Tartara | ``Node daemon (NodeD)`` |
198 | ffedf64d | Michele Tartara | It remains unaffected by the changes proposed in the design document. The only |
199 | ffedf64d | Michele Tartara | difference being that it will receive its RPCs from LuxiD (for job queue |
200 | ffedf64d | Michele Tartara | replication), from WConfD (for configuration replication) and for the |
201 | ffedf64d | Michele Tartara | processes executing single jobs (for all the operations to be performed by |
202 | ffedf64d | Michele Tartara | nodes) instead of receiving them just from MasterD. |
203 | ffedf64d | Michele Tartara | |
204 | ffedf64d | Michele Tartara | This restructuring will allow us to reorganize and improve the codebase, |
205 | ffedf64d | Michele Tartara | introducing cleaner interfaces and giving well defined and more restricted tasks |
206 | ffedf64d | Michele Tartara | to each daemon. |
207 | ffedf64d | Michele Tartara | |
208 | ffedf64d | Michele Tartara | Furthermore, having more well-defined interfaces will allow us to have easier |
209 | ffedf64d | Michele Tartara | upgrade procedures, and to work towards the possibility of upgrading single |
210 | ffedf64d | Michele Tartara | components of a cluster one at a time, without the need for immediately |
211 | ffedf64d | Michele Tartara | upgrading the entire cluster in a single step. |
212 | ffedf64d | Michele Tartara | |
213 | ffedf64d | Michele Tartara | |
214 | ffedf64d | Michele Tartara | Implementation |
215 | ffedf64d | Michele Tartara | ============== |
216 | ffedf64d | Michele Tartara | |
217 | ffedf64d | Michele Tartara | While performing this refactoring, we aim to increase the amount of |
218 | ffedf64d | Michele Tartara | Haskell code, thus benefiting from the additional type safety provided by its |
219 | ffedf64d | Michele Tartara | wide compile-time checks. In particular, all the job queue management and the |
220 | ffedf64d | Michele Tartara | configuration management daemon will be written in Haskell, taking over the role |
221 | ffedf64d | Michele Tartara | currently fulfilled by Python code executed as part of MasterD. |
222 | ffedf64d | Michele Tartara | |
223 | ffedf64d | Michele Tartara | The changes describe by this design document are quite extensive, therefore they |
224 | ffedf64d | Michele Tartara | will not be implemented all at the same time, but through a sequence of steps, |
225 | ffedf64d | Michele Tartara | leaving the codebase in a consistent and usable state. |
226 | ffedf64d | Michele Tartara | |
227 | ffedf64d | Michele Tartara | #. Rename QueryD to LuxiD. |
228 | ffedf64d | Michele Tartara | A part of LuxiD, the one replying to configuration |
229 | ffedf64d | Michele Tartara | queries including live information about the system, already exists in the |
230 | ffedf64d | Michele Tartara | form of QueryD. This is being renamed to LuxiD, and will form the first part |
231 | ffedf64d | Michele Tartara | of the new daemon. NB: this is happening starting from Ganeti 2.8. At the |
232 | ffedf64d | Michele Tartara | beginning, only the already existing queries will be replied to by LuxiD. |
233 | ffedf64d | Michele Tartara | More queries will be implemented in the next versions. |
234 | ffedf64d | Michele Tartara | |
235 | ffedf64d | Michele Tartara | #. Let LuxiD be the interface for the queries and MasterD be their executor. |
236 | ffedf64d | Michele Tartara | Currently, MasterD is the only responsible for receiving and executing LUXI |
237 | ffedf64d | Michele Tartara | queries, and for managing the jobs they create. |
238 | ffedf64d | Michele Tartara | Receiving the queries and managing the job queue will be extracted from |
239 | ffedf64d | Michele Tartara | MasterD into LuxiD. |
240 | ffedf64d | Michele Tartara | Actually executing jobs will still be done by MasterD, that contains all the |
241 | ffedf64d | Michele Tartara | logic for doing that and for properly managing locks and the configuration. |
242 | ce10eb31 | Klaus Aehlig | At this stage, scheduling will simply consist in starting jobs until a fixed |
243 | ce10eb31 | Klaus Aehlig | maximum number of simultaneously running jobs is reached. |
244 | ffedf64d | Michele Tartara | |
245 | ffedf64d | Michele Tartara | #. Extract WConfD from MasterD. |
246 | ffedf64d | Michele Tartara | The logic for managing the configuration file is factored out to the |
247 | ffedf64d | Michele Tartara | dedicated WConfD daemon. All configuration changes, currently executed |
248 | ffedf64d | Michele Tartara | directly by MasterD, will be changed to be IPC requests sent to the new |
249 | ffedf64d | Michele Tartara | daemon. |
250 | ffedf64d | Michele Tartara | |
251 | ffedf64d | Michele Tartara | #. Extract locking management from MasterD. |
252 | ffedf64d | Michele Tartara | The logic for managing and granting locks is extracted to WConfD as well. |
253 | ffedf64d | Michele Tartara | Locks will not be taken directly anymore, but asked via IPC to WConfD. |
254 | ffedf64d | Michele Tartara | This step can be executed on its own or at the same time as the previous one. |
255 | ffedf64d | Michele Tartara | |
256 | ffedf64d | Michele Tartara | #. Jobs are executed as processes. |
257 | ffedf64d | Michele Tartara | The logic for running jobs is rewritten so that each job can be managed by an |
258 | ffedf64d | Michele Tartara | independent process. LuxiD will spawn a new (Python) process for every single |
259 | ffedf64d | Michele Tartara | job. The RPCs will remain unchanged, and the LU code will stay as is as much |
260 | ffedf64d | Michele Tartara | as possible. |
261 | ffedf64d | Michele Tartara | MasterD will cease to exist as a deamon on its own at this point, but not |
262 | ffedf64d | Michele Tartara | before. |
263 | ffedf64d | Michele Tartara | |
264 | ce10eb31 | Klaus Aehlig | #. Improve job scheduling algorithm. |
265 | ce10eb31 | Klaus Aehlig | The simple algorithm for scheduling jobs will be replaced by a more |
266 | ce10eb31 | Klaus Aehlig | intelligent one. Also, the implementation of :doc:`design-optables` can be |
267 | ce10eb31 | Klaus Aehlig | started. |
268 | ce10eb31 | Klaus Aehlig | |
269 | 2de55c83 | Petr Pudlak | Job death detection |
270 | 2de55c83 | Petr Pudlak | ------------------- |
271 | 2de55c83 | Petr Pudlak | |
272 | 2de55c83 | Petr Pudlak | **Requirements:** |
273 | 2de55c83 | Petr Pudlak | |
274 | 2de55c83 | Petr Pudlak | - It must be possible to reliably detect a death of a process even under |
275 | 2de55c83 | Petr Pudlak | uncommon conditions such as very heavy system load. |
276 | 2de55c83 | Petr Pudlak | - A daemon must be able to detect a death of a process even if the |
277 | 2de55c83 | Petr Pudlak | daemon is restarted while the process is running. |
278 | 2de55c83 | Petr Pudlak | - The solution must not rely on being able to communicate with |
279 | 2de55c83 | Petr Pudlak | a process. |
280 | 2de55c83 | Petr Pudlak | - The solution must work for the current situation where multiple jobs |
281 | 2de55c83 | Petr Pudlak | run in a single process. |
282 | 2de55c83 | Petr Pudlak | - It must be POSIX compliant. |
283 | 2de55c83 | Petr Pudlak | |
284 | 2de55c83 | Petr Pudlak | These conditions rule out simple solutions like checking a process ID |
285 | 2de55c83 | Petr Pudlak | (because the process might be eventually replaced by another process |
286 | 2de55c83 | Petr Pudlak | with the same ID) or keeping an open connection to a process. |
287 | 2de55c83 | Petr Pudlak | |
288 | 2de55c83 | Petr Pudlak | **Solution:** As a job process is spawned, before attempting to |
289 | 2de55c83 | Petr Pudlak | communicate with any other process, it will create a designated empty |
290 | 2de55c83 | Petr Pudlak | lock file, open it, acquire an *exclusive* lock on it, and keep it open. |
291 | 2de55c83 | Petr Pudlak | When connecting to a daemon, the job process will provide it with the |
292 | 2de55c83 | Petr Pudlak | path of the file. If the process dies unexpectedly, the operating system |
293 | 2de55c83 | Petr Pudlak | kernel automatically cleans up the lock. |
294 | 2de55c83 | Petr Pudlak | |
295 | 2de55c83 | Petr Pudlak | Therefore, daemons can check if a process is dead by trying to acquire |
296 | 2de55c83 | Petr Pudlak | a *shared* lock on the lock file in a non-blocking mode: |
297 | 2de55c83 | Petr Pudlak | |
298 | 2de55c83 | Petr Pudlak | - If the locking operation succeeds, it means that the exclusive lock is |
299 | 2de55c83 | Petr Pudlak | missing, therefore the process has died, but the lock |
300 | 2de55c83 | Petr Pudlak | file hasn't been cleaned up yet. The daemon should release the lock |
301 | 2de55c83 | Petr Pudlak | immediately. Optionally, the daemon may delete the lock file. |
302 | 2de55c83 | Petr Pudlak | - If the file is missing, the process has died and the lock file has |
303 | 2de55c83 | Petr Pudlak | been cleaned up. |
304 | 2de55c83 | Petr Pudlak | - If the locking operation fails due to a lock conflict, it means |
305 | 2de55c83 | Petr Pudlak | the process is alive. |
306 | 2de55c83 | Petr Pudlak | |
307 | 2de55c83 | Petr Pudlak | Using shared locks for querying lock files ensures that the detection |
308 | 2de55c83 | Petr Pudlak | works correctly even if multiple daemons query a file at the same time. |
309 | 2de55c83 | Petr Pudlak | |
310 | 2de55c83 | Petr Pudlak | A job should close and remove its lock file when completely finishes. |
311 | 2de55c83 | Petr Pudlak | The WConfD daemon will be responsible for removing stale lock files of |
312 | 2de55c83 | Petr Pudlak | jobs that didn't remove its lock files themselves. |
313 | 2de55c83 | Petr Pudlak | |
314 | 2de55c83 | Petr Pudlak | **Considered alternatives:** An alternative to creating a separate lock |
315 | 2de55c83 | Petr Pudlak | file would be to lock the job status file. However, file locks are kept |
316 | 2de55c83 | Petr Pudlak | only as long as the file is open. Therefore any operation followed by |
317 | 2de55c83 | Petr Pudlak | closing the file would cause the process to release the lock. In |
318 | 2de55c83 | Petr Pudlak | particular, with jobs as threads, the master daemon wouldn't be able to |
319 | 2de55c83 | Petr Pudlak | keep locks and operate on job files at the same time. |
320 | 2de55c83 | Petr Pudlak | |
321 | 5eeb7168 | Petr Pudlak | WConfD details |
322 | 5eeb7168 | Petr Pudlak | -------------- |
323 | 5eeb7168 | Petr Pudlak | |
324 | 5eeb7168 | Petr Pudlak | WConfD will communicate with its clients through a Unix domain socket for both |
325 | 5eeb7168 | Petr Pudlak | configuration management and locking. Clients can issue multiple RPC calls |
326 | 5eeb7168 | Petr Pudlak | through one socket. For each such a call the client sends a JSON request |
327 | 5eeb7168 | Petr Pudlak | document with a remote function name and data for its arguments. The server |
328 | 5eeb7168 | Petr Pudlak | replies with a JSON response document containing either the result of |
329 | 5eeb7168 | Petr Pudlak | signalling a failure. |
330 | 5eeb7168 | Petr Pudlak | |
331 | 5eeb7168 | Petr Pudlak | There will be a special RPC call for identifying a client when connecting to |
332 | 5eeb7168 | Petr Pudlak | WConfD. The client will tell WConfD it's job number and process ID. WConfD will |
333 | 5eeb7168 | Petr Pudlak | fail any other RPC calls before a client identifies this way. |
334 | 5eeb7168 | Petr Pudlak | |
335 | 5eeb7168 | Petr Pudlak | Any state associated with client processes will be mirrored on persistent |
336 | 5eeb7168 | Petr Pudlak | storage and linked to the identity of processes so that the WConfD daemon will |
337 | 5eeb7168 | Petr Pudlak | be able to resume its operation at any point after a restart or a crash. WConfD |
338 | 5eeb7168 | Petr Pudlak | will track each client's process start time along with its process ID to be |
339 | 5eeb7168 | Petr Pudlak | able detect if a process dies and it's process ID is reused. WConfD will clear |
340 | 5eeb7168 | Petr Pudlak | all locks and other state associated with a client if it detects it's process |
341 | 5eeb7168 | Petr Pudlak | no longer exists. |
342 | 5eeb7168 | Petr Pudlak | |
343 | 5eeb7168 | Petr Pudlak | Configuration management |
344 | 5eeb7168 | Petr Pudlak | ++++++++++++++++++++++++ |
345 | 5eeb7168 | Petr Pudlak | |
346 | 5eeb7168 | Petr Pudlak | The new configuration management protocol will be implemented in the following |
347 | 5eeb7168 | Petr Pudlak | steps: |
348 | 5eeb7168 | Petr Pudlak | |
349 | b0159850 | Petr Pudlak | Step 1: |
350 | b0159850 | Petr Pudlak | #. Implement the following functions in WConfD and export them through |
351 | b0159850 | Petr Pudlak | RPC: |
352 | b0159850 | Petr Pudlak | |
353 | b0159850 | Petr Pudlak | - Obtain a single internal lock, either in shared or |
354 | b0159850 | Petr Pudlak | exclusive mode. This lock will substitute the current lock |
355 | b0159850 | Petr Pudlak | ``_config_lock`` in config.py. |
356 | b0159850 | Petr Pudlak | - Release the lock. |
357 | b0159850 | Petr Pudlak | - Return the whole configuration data to a client. |
358 | b0159850 | Petr Pudlak | - Receive the whole configuration data from a client and replace the |
359 | b0159850 | Petr Pudlak | current configuration with it. Distribute it to master candidates |
360 | b0159850 | Petr Pudlak | and distribute the corresponding *ssconf*. |
361 | b0159850 | Petr Pudlak | |
362 | b0159850 | Petr Pudlak | WConfD must detect deaths of its clients (see `Job death |
363 | b0159850 | Petr Pudlak | detection`_) and release locks automatically. |
364 | b0159850 | Petr Pudlak | |
365 | b0159850 | Petr Pudlak | #. In config.py modify public methods that access configuration: |
366 | b0159850 | Petr Pudlak | |
367 | b0159850 | Petr Pudlak | - Instead of acquiring a local lock, obtain a lock from WConfD |
368 | b0159850 | Petr Pudlak | using the above functions |
369 | b0159850 | Petr Pudlak | - Fetch the current configuration from WConfD. |
370 | b0159850 | Petr Pudlak | - Use it to perform the method's task. |
371 | b0159850 | Petr Pudlak | - If the configuration was modified, send it to WConfD at the end. |
372 | b0159850 | Petr Pudlak | - Release the lock to WConfD. |
373 | b0159850 | Petr Pudlak | |
374 | b0159850 | Petr Pudlak | This will decouple the configuration management from the master daemon, |
375 | b0159850 | Petr Pudlak | even though the specific configuration tasks will still performed by |
376 | b0159850 | Petr Pudlak | individual jobs. |
377 | b0159850 | Petr Pudlak | |
378 | b0159850 | Petr Pudlak | After this step it'll be possible access the configuration from separate |
379 | b0159850 | Petr Pudlak | processes. |
380 | b0159850 | Petr Pudlak | |
381 | b0159850 | Petr Pudlak | Step 2: |
382 | b0159850 | Petr Pudlak | #. Reimplement all current methods of ``ConfigWriter`` for reading and |
383 | b0159850 | Petr Pudlak | writing the configuration of a cluster in WConfD. |
384 | b0159850 | Petr Pudlak | #. Expose each of those functions in WConfD as a separate RPC function. |
385 | b0159850 | Petr Pudlak | This will allow easy future extensions or modifications. |
386 | b0159850 | Petr Pudlak | #. Replace ``ConfigWriter`` with a stub (preferably automatically |
387 | b0159850 | Petr Pudlak | generated from the Haskell code) that will contain the same methods |
388 | b0159850 | Petr Pudlak | as the current ``ConfigWriter`` and delegate all calls to its |
389 | b0159850 | Petr Pudlak | methods to WConfD. |
390 | b0159850 | Petr Pudlak | |
391 | b0159850 | Petr Pudlak | Step 3: |
392 | b0159850 | Petr Pudlak | #. Remove WConfD's RPC functions for obtaining/releasing the single |
393 | b0159850 | Petr Pudlak | internal lock from Step 1. |
394 | b0159850 | Petr Pudlak | #. Remove WConfD's RPC functions for sending/receiving the whole |
395 | b0159850 | Petr Pudlak | configuration from Step 1. |
396 | 5eeb7168 | Petr Pudlak | |
397 | 5eeb7168 | Petr Pudlak | Future aims: |
398 | 5eeb7168 | Petr Pudlak | |
399 | 5eeb7168 | Petr Pudlak | - Optionally refactor the RPC calls to reduce their number or improve their |
400 | 5eeb7168 | Petr Pudlak | efficiency (for example by obtaining a larger set of data instead of |
401 | 5eeb7168 | Petr Pudlak | querying items one by one). |
402 | 5eeb7168 | Petr Pudlak | |
403 | 5eeb7168 | Petr Pudlak | Locking |
404 | 5eeb7168 | Petr Pudlak | +++++++ |
405 | 5eeb7168 | Petr Pudlak | |
406 | 5eeb7168 | Petr Pudlak | The new locking protocol will be implemented as follows: |
407 | 5eeb7168 | Petr Pudlak | |
408 | 5eeb7168 | Petr Pudlak | Re-implement the current locking mechanism in WConfD and expose it for RPC |
409 | 5eeb7168 | Petr Pudlak | calls. All current locks will be mapped into a data structure that will |
410 | 5eeb7168 | Petr Pudlak | uniquely identify them (storing lock's level together with it's name). |
411 | 5eeb7168 | Petr Pudlak | |
412 | 5eeb7168 | Petr Pudlak | WConfD will impose a linear order on locks. The order will be compatible |
413 | 5eeb7168 | Petr Pudlak | with the current ordering of lock levels so that existing code will work |
414 | 5eeb7168 | Petr Pudlak | without changes. |
415 | 5eeb7168 | Petr Pudlak | |
416 | 5eeb7168 | Petr Pudlak | WConfD will keep the set of currently held locks for each client. The |
417 | 5eeb7168 | Petr Pudlak | protocol will allow the following operations on the set: |
418 | 5eeb7168 | Petr Pudlak | |
419 | 5eeb7168 | Petr Pudlak | *Update:* |
420 | 5eeb7168 | Petr Pudlak | Update the current set of locks according to a given list. The list contains |
421 | 5eeb7168 | Petr Pudlak | locks and their desired level (release / shared / exclusive). To prevent |
422 | 5eeb7168 | Petr Pudlak | deadlocks, WConfD will check that all newly requested locks (or already held |
423 | 5eeb7168 | Petr Pudlak | locks requested to be upgraded to *exclusive*) are greater in the sense of |
424 | 5eeb7168 | Petr Pudlak | the linear order than all currently held locks, and fail the operation if |
425 | 5eeb7168 | Petr Pudlak | not. Only the locks in the list will be updated, other locks already held |
426 | 5eeb7168 | Petr Pudlak | will be left intact. If the operation fails, the client's lock set will be |
427 | 5eeb7168 | Petr Pudlak | left intact. |
428 | 5eeb7168 | Petr Pudlak | *Opportunistic union:* |
429 | 5eeb7168 | Petr Pudlak | Add as much as possible locks from a given set to the current set within a |
430 | 5eeb7168 | Petr Pudlak | given timeout. WConfD will again check the proper order of locks and |
431 | 5eeb7168 | Petr Pudlak | acquire only the ones that are allowed wrt. the current set. Returns the |
432 | 5eeb7168 | Petr Pudlak | set of acquired locks, possibly empty. Immediate. Never fails. (It would also |
433 | 5eeb7168 | Petr Pudlak | be possible to extend the operation to try to wait until a given number of |
434 | 5eeb7168 | Petr Pudlak | locks is available, or a given timeout elapses.) |
435 | 5eeb7168 | Petr Pudlak | *List:* |
436 | 5eeb7168 | Petr Pudlak | List the current set of held locks. Immediate, never fails. |
437 | 5eeb7168 | Petr Pudlak | *Intersection:* |
438 | 5eeb7168 | Petr Pudlak | Retain only a given set of locks in the current one. This function is |
439 | 5eeb7168 | Petr Pudlak | provided for convenience, it's redundant wrt. *list* and *update*. Immediate, |
440 | 5eeb7168 | Petr Pudlak | never fails. |
441 | 5eeb7168 | Petr Pudlak | |
442 | 5eeb7168 | Petr Pudlak | After this step it'll be possible to use locks from jobs as separate processes. |
443 | 5eeb7168 | Petr Pudlak | |
444 | 5eeb7168 | Petr Pudlak | The above set of operations allows the clients to use various work-flows. In particular: |
445 | 5eeb7168 | Petr Pudlak | |
446 | 5eeb7168 | Petr Pudlak | Pessimistic strategy: |
447 | 5eeb7168 | Petr Pudlak | Lock all potentially relevant resources (for example all nodes), determine |
448 | 5eeb7168 | Petr Pudlak | which will be needed, and release all the others. |
449 | 5eeb7168 | Petr Pudlak | Optimistic strategy: |
450 | 5eeb7168 | Petr Pudlak | Determine what locks need to be acquired without holding any. Lock the |
451 | 5eeb7168 | Petr Pudlak | required set of locks. Determine the set of required locks again and check if |
452 | 5eeb7168 | Petr Pudlak | they are all held. If not, release everything and restart. |
453 | 5eeb7168 | Petr Pudlak | |
454 | 5eeb7168 | Petr Pudlak | .. COMMENTED OUT: |
455 | 5eeb7168 | Petr Pudlak | Start with the smallest set of locks and when determining what more |
456 | 5eeb7168 | Petr Pudlak | relevant resources will be needed, expand the set. If an *union* operation |
457 | 5eeb7168 | Petr Pudlak | fails, release all locks, acquire the desired union and restart the |
458 | 5eeb7168 | Petr Pudlak | operation so that all preconditions and possible concurrent changes are |
459 | 5eeb7168 | Petr Pudlak | checked again. |
460 | 5eeb7168 | Petr Pudlak | |
461 | 5eeb7168 | Petr Pudlak | Future aims: |
462 | 5eeb7168 | Petr Pudlak | |
463 | 5eeb7168 | Petr Pudlak | - Add more fine-grained locks to prevent unnecessary blocking of jobs. This |
464 | 5eeb7168 | Petr Pudlak | could include locks on parameters of entities or locks on their states (so that |
465 | 5eeb7168 | Petr Pudlak | a node remains online, but otherwise can change, etc.). In particular, |
466 | 5eeb7168 | Petr Pudlak | adding, moving and removing instances currently blocks the whole node. |
467 | 5eeb7168 | Petr Pudlak | - Add checks that all modified configuration parameters belong to entities |
468 | 5eeb7168 | Petr Pudlak | the client has locked and log violations. |
469 | 5eeb7168 | Petr Pudlak | - Make the above checks mandatory. |
470 | 5eeb7168 | Petr Pudlak | - Automate optimistic locking and checking the locks in logical units. |
471 | 5eeb7168 | Petr Pudlak | For example, this could be accomplished by allowing some of the initial |
472 | 5eeb7168 | Petr Pudlak | phases of `LogicalUnit` (such as `ExpandNames` and `DeclareLocks`) to be run |
473 | 5eeb7168 | Petr Pudlak | repeatedly, checking if the set of locks requested the second time is |
474 | 5eeb7168 | Petr Pudlak | contained in the set acquired after the first pass. |
475 | 5eeb7168 | Petr Pudlak | - Add the possibility for a job to reserve hardware resources such as disk |
476 | 5eeb7168 | Petr Pudlak | space or memory on nodes. Most likely as a new, special kind of instances |
477 | 5eeb7168 | Petr Pudlak | that would only block its resources and allow to be converted to a regular |
478 | 5eeb7168 | Petr Pudlak | instance. This would allow long-running jobs such as instance creation or |
479 | 5eeb7168 | Petr Pudlak | move to lock the corresponding nodes, acquire the resources and turn the |
480 | 5eeb7168 | Petr Pudlak | locks into shared ones, keeping an exclusive lock only on the instance. |
481 | 5eeb7168 | Petr Pudlak | - Use more sophisticated algorithm for preventing deadlocks such as a |
482 | 5eeb7168 | Petr Pudlak | `wait-for graph`_. This would allow less *union* failures and allow more |
483 | 5eeb7168 | Petr Pudlak | optimistic, scalable acquisition of locks. |
484 | 5eeb7168 | Petr Pudlak | |
485 | 5eeb7168 | Petr Pudlak | .. _`wait-for graph`: http://en.wikipedia.org/wiki/Wait-for_graph |
486 | 5eeb7168 | Petr Pudlak | |
487 | 5eeb7168 | Petr Pudlak | |
488 | ffedf64d | Michele Tartara | Further considerations |
489 | ffedf64d | Michele Tartara | ====================== |
490 | ffedf64d | Michele Tartara | |
491 | ffedf64d | Michele Tartara | There is a possibility that a job will finish performing its task while LuxiD |
492 | ffedf64d | Michele Tartara | and/or WConfD will not be available. |
493 | 9269d118 | Klaus Aehlig | In order to deal with this situation, each job will update its job file |
494 | 9269d118 | Klaus Aehlig | in the queue. This is race free, as LuxiD will no longer touch the job file, |
495 | 9269d118 | Klaus Aehlig | once the job is started; a corollary of this is that the job also has to |
496 | 9269d118 | Klaus Aehlig | take care of replicating updates to the job file. LuxiD will watch job files for |
497 | 9269d118 | Klaus Aehlig | changes to determine when a job as cleanly finished. To determine jobs |
498 | 9269d118 | Klaus Aehlig | that died without having the chance of updating the job file, the `Job death |
499 | 9269d118 | Klaus Aehlig | detection`_ mechanism will be used. |
500 | ffedf64d | Michele Tartara | |
501 | ffedf64d | Michele Tartara | .. vim: set textwidth=72 : |
502 | ffedf64d | Michele Tartara | .. Local Variables: |
503 | ffedf64d | Michele Tartara | .. mode: rst |
504 | ffedf64d | Michele Tartara | .. fill-column: 72 |
505 | ffedf64d | Michele Tartara | .. End: |