Statistics
| Branch: | Tag: | Revision:

root / doc / design-daemons.rst @ bb3011ad

History | View | Annotate | Download (23.7 kB)

1 ffedf64d Michele Tartara
==========================
2 ffedf64d Michele Tartara
Ganeti daemons refactoring
3 ffedf64d Michele Tartara
==========================
4 ffedf64d Michele Tartara
5 ffedf64d Michele Tartara
.. contents:: :depth: 2
6 ffedf64d Michele Tartara
7 ffedf64d Michele Tartara
This is a design document detailing the plan for refactoring the internal
8 ffedf64d Michele Tartara
structure of Ganeti, and particularly the set of daemons it is divided into.
9 ffedf64d Michele Tartara
10 ffedf64d Michele Tartara
11 ffedf64d Michele Tartara
Current state and shortcomings
12 ffedf64d Michele Tartara
==============================
13 ffedf64d Michele Tartara
14 ffedf64d Michele Tartara
Ganeti is comprised of a growing number of daemons, each dealing with part of
15 ffedf64d Michele Tartara
the tasks the cluster has to face, and communicating with the other daemons
16 ffedf64d Michele Tartara
using a variety of protocols.
17 ffedf64d Michele Tartara
18 ffedf64d Michele Tartara
Specifically, as of Ganeti 2.8, the situation is as follows:
19 ffedf64d Michele Tartara
20 ffedf64d Michele Tartara
``Master daemon (MasterD)``
21 ffedf64d Michele Tartara
  It is responsible for managing the entire cluster, and it's written in Python.
22 ffedf64d Michele Tartara
  It is executed on a single node (the master node). It receives the commands
23 ffedf64d Michele Tartara
  given by the cluster administrator (through the remote API daemon or the
24 ffedf64d Michele Tartara
  command line tools) over the LUXI protocol.  The master daemon is responsible
25 ffedf64d Michele Tartara
  for creating and managing the jobs that will execute such commands, and for
26 ffedf64d Michele Tartara
  managing the locks that ensure the cluster will not incur in race conditions.
27 ffedf64d Michele Tartara
28 ffedf64d Michele Tartara
  Each job is managed by a separate Python thread, that interacts with the node
29 ffedf64d Michele Tartara
  daemons via RPC calls.
30 ffedf64d Michele Tartara
31 ffedf64d Michele Tartara
  The master daemon is also responsible for managing the configuration of the
32 ffedf64d Michele Tartara
  cluster, changing it when required by some job. It is also responsible for
33 ffedf64d Michele Tartara
  copying the configuration to the other master candidates after updating it.
34 ffedf64d Michele Tartara
35 ffedf64d Michele Tartara
``RAPI daemon (RapiD)``
36 ffedf64d Michele Tartara
  It is written in Python and runs on the master node only. It waits for
37 ffedf64d Michele Tartara
  requests issued remotely through the remote API protocol. Then, it forwards
38 ffedf64d Michele Tartara
  them, using the LUXI protocol, to the master daemon (if they are commands) or
39 ffedf64d Michele Tartara
  to the query daemon if they are queries about the configuration (including
40 ffedf64d Michele Tartara
  live status) of the cluster.
41 ffedf64d Michele Tartara
42 ffedf64d Michele Tartara
``Node daemon (NodeD)``
43 ffedf64d Michele Tartara
  It is written in Python. It runs on all the nodes. It is responsible for
44 ffedf64d Michele Tartara
  receiving the master requests over RPC and execute them, using the appropriate
45 ffedf64d Michele Tartara
  backend (hypervisors, DRBD, LVM, etc.). It also receives requests over RPC for
46 ffedf64d Michele Tartara
  the execution of queries gathering live data on behalf of the query daemon.
47 ffedf64d Michele Tartara
48 ffedf64d Michele Tartara
``Configuration daemon (ConfD)``
49 ffedf64d Michele Tartara
  It is written in Haskell. It runs on all the master candidates. Since the
50 ffedf64d Michele Tartara
  configuration is replicated only on the master node, this daemon exists in
51 ffedf64d Michele Tartara
  order to provide information about the configuration to nodes needing them.
52 ffedf64d Michele Tartara
  The requests are done through ConfD's own protocol, HMAC signed,
53 ffedf64d Michele Tartara
  implemented over UDP, and meant to be used by parallely querying all the
54 ffedf64d Michele Tartara
  master candidates (or a subset thereof) and getting the most up to date
55 ffedf64d Michele Tartara
  answer. This is meant as a way to provide a robust service even in case master
56 ffedf64d Michele Tartara
  is temporarily unavailable.
57 ffedf64d Michele Tartara
58 ffedf64d Michele Tartara
``Query daemon (QueryD)``
59 ffedf64d Michele Tartara
  It is written in Haskell. It runs on all the master candidates. It replies
60 ffedf64d Michele Tartara
  to Luxi queries about the current status of the system, including live data it
61 ffedf64d Michele Tartara
  obtains by querying the node daemons through RPCs.
62 ffedf64d Michele Tartara
63 ffedf64d Michele Tartara
``Monitoring daemon (MonD)``
64 ffedf64d Michele Tartara
  It is written in Haskell. It runs on all nodes, including the ones that are
65 ffedf64d Michele Tartara
  not vm-capable. It is meant to provide information on the status of the
66 ffedf64d Michele Tartara
  system. Such information is related only to the specific node the daemon is
67 ffedf64d Michele Tartara
  running on, and it is provided as JSON encoded data over HTTP, to be easily
68 ffedf64d Michele Tartara
  readable by external tools.
69 ffedf64d Michele Tartara
  The monitoring daemon communicates with ConfD to get information about the
70 ffedf64d Michele Tartara
  configuration of the cluster. The choice of communicating with ConfD instead
71 ffedf64d Michele Tartara
  of MasterD allows it to obtain configuration information even when the cluster
72 ffedf64d Michele Tartara
  is heavily degraded (e.g.: when master and some, but not all, of the master
73 ffedf64d Michele Tartara
  candidates are unreachable).
74 ffedf64d Michele Tartara
75 ffedf64d Michele Tartara
The current structure of the Ganeti daemons is inefficient because there are
76 ffedf64d Michele Tartara
many different protocols involved, and each daemon needs to be able to use
77 ffedf64d Michele Tartara
multiple ones, and has to deal with doing different things, thus making
78 ffedf64d Michele Tartara
sometimes unclear which daemon is responsible for performing a specific task.
79 ffedf64d Michele Tartara
80 ffedf64d Michele Tartara
Also, with the current configuration, jobs are managed by the master daemon
81 ffedf64d Michele Tartara
using python threads. This makes terminating a job after it has started a
82 ffedf64d Michele Tartara
difficult operation, and it is the main reason why this is not possible yet.
83 ffedf64d Michele Tartara
84 ffedf64d Michele Tartara
The master daemon currently has too many different tasks, that could be handled
85 ffedf64d Michele Tartara
better if split among different daemons.
86 ffedf64d Michele Tartara
87 ffedf64d Michele Tartara
88 ffedf64d Michele Tartara
Proposed changes
89 ffedf64d Michele Tartara
================
90 ffedf64d Michele Tartara
91 ffedf64d Michele Tartara
In order to improve on the current situation, a new daemon subdivision is
92 ffedf64d Michele Tartara
proposed, and presented hereafter.
93 ffedf64d Michele Tartara
94 ffedf64d Michele Tartara
.. digraph:: "new-daemons-structure"
95 ffedf64d Michele Tartara
96 ffedf64d Michele Tartara
  {rank=same; RConfD LuxiD;}
97 ffedf64d Michele Tartara
  {rank=same; Jobs rconfigdata;}
98 ffedf64d Michele Tartara
  node [shape=box]
99 ffedf64d Michele Tartara
  RapiD [label="RapiD [M]"]
100 ffedf64d Michele Tartara
  LuxiD [label="LuxiD [M]"]
101 ffedf64d Michele Tartara
  WConfD [label="WConfD [M]"]
102 ffedf64d Michele Tartara
  Jobs [label="Jobs [M]"]
103 ffedf64d Michele Tartara
  RConfD [label="RConfD [MC]"]
104 ffedf64d Michele Tartara
  MonD [label="MonD [All]"]
105 ffedf64d Michele Tartara
  NodeD [label="NodeD [All]"]
106 ffedf64d Michele Tartara
  Clients [label="gnt-*\nclients [M]"]
107 ffedf64d Michele Tartara
  p1 [shape=none, label=""]
108 ffedf64d Michele Tartara
  p2 [shape=none, label=""]
109 ffedf64d Michele Tartara
  p3 [shape=none, label=""]
110 ffedf64d Michele Tartara
  p4 [shape=none, label=""]
111 ffedf64d Michele Tartara
  configdata [shape=none, label="config.data"]
112 ffedf64d Michele Tartara
  rconfigdata [shape=none, label="config.data\n[MC copy]"]
113 ffedf64d Michele Tartara
  locksdata [shape=none, label="locks.data"]
114 ffedf64d Michele Tartara
115 ffedf64d Michele Tartara
  RapiD -> LuxiD [label="LUXI"]
116 ffedf64d Michele Tartara
  LuxiD -> WConfD [label="WConfD\nproto"]
117 ffedf64d Michele Tartara
  LuxiD -> Jobs [label="fork/exec"]
118 ffedf64d Michele Tartara
  Jobs -> WConfD [label="WConfD\nproto"]
119 ffedf64d Michele Tartara
  Jobs -> NodeD [label="RPC"]
120 ffedf64d Michele Tartara
  LuxiD -> NodeD [label="RPC"]
121 ffedf64d Michele Tartara
  rconfigdata -> RConfD
122 ffedf64d Michele Tartara
  configdata -> rconfigdata [label="sync via\nNodeD RPC"]
123 ffedf64d Michele Tartara
  WConfD -> NodeD [label="RPC"]
124 ffedf64d Michele Tartara
  WConfD -> configdata
125 ffedf64d Michele Tartara
  WConfD -> locksdata
126 ffedf64d Michele Tartara
  MonD -> RConfD [label="RConfD\nproto"]
127 ffedf64d Michele Tartara
  Clients -> LuxiD [label="LUXI"]
128 ffedf64d Michele Tartara
  p1 -> MonD [label="MonD proto"]
129 ffedf64d Michele Tartara
  p2 -> RapiD [label="RAPI"]
130 ffedf64d Michele Tartara
  p3 -> RConfD [label="RConfD\nproto"]
131 ffedf64d Michele Tartara
  p4 -> Clients [label="CLI"]
132 ffedf64d Michele Tartara
133 ffedf64d Michele Tartara
``LUXI daemon (LuxiD)``
134 ffedf64d Michele Tartara
  It will be written in Haskell. It will run on the master node and it will be
135 ffedf64d Michele Tartara
  the only LUXI server, replying to all the LUXI queries. These includes both
136 ffedf64d Michele Tartara
  the queries about the live configuration of the cluster, previously served by
137 ffedf64d Michele Tartara
  QueryD, and the commands actually changing the status of the cluster by
138 ffedf64d Michele Tartara
  submitting jobs. Therefore, this daemon will also be the one responsible with
139 ffedf64d Michele Tartara
  managing the job queue. When a job needs to be executed, the LuxiD will spawn
140 ffedf64d Michele Tartara
  a separate process tasked with the execution of that specific job, thus making
141 ffedf64d Michele Tartara
  it easier to terminate the job itself, if needeed.  When a job requires locks,
142 ffedf64d Michele Tartara
  LuxiD will request them from WConfD.
143 ffedf64d Michele Tartara
  In order to keep availability of the cluster in case of failure of the master
144 ffedf64d Michele Tartara
  node, LuxiD will replicate the job queue to the other master candidates, by
145 ffedf64d Michele Tartara
  RPCs to the NodeD running there (the choice of RPCs for this task might be
146 ffedf64d Michele Tartara
  reviewed at a second time, after implementing this design).
147 ffedf64d Michele Tartara
148 ffedf64d Michele Tartara
``Configuration management daemon (WConfD)``
149 ffedf64d Michele Tartara
  It will run on the master node and it will be responsible for the management
150 ffedf64d Michele Tartara
  of the authoritative copy of the cluster configuration (that is, it will be
151 ffedf64d Michele Tartara
  the daemon actually modifying the ``config.data`` file). All the requests of
152 ffedf64d Michele Tartara
  configuration changes will have to pass through this daemon, and will be
153 ffedf64d Michele Tartara
  performed using a LUXI-like protocol ("WConfD proto" in the graph. The exact
154 ffedf64d Michele Tartara
  protocol will be defined in the separate design document that will detail the
155 ffedf64d Michele Tartara
  WConfD separation).  Having a single point of configuration management will
156 ffedf64d Michele Tartara
  also allow Ganeti to get rid of possible race conditions due to concurrent
157 ffedf64d Michele Tartara
  modifications of the configuration.  When the configuration is updated, it
158 ffedf64d Michele Tartara
  will have to push the received changes to the other master candidates, via
159 ffedf64d Michele Tartara
  RPCs, so that RConfD daemons and (in case of a failure on the master node)
160 ffedf64d Michele Tartara
  the WConfD daemon on the new master can access an up-to-date version of it
161 ffedf64d Michele Tartara
  (the choice of RPCs for this task might be reviewed at a second time). This
162 ffedf64d Michele Tartara
  daemon will also be the one responsible for managing the locks, granting them
163 ffedf64d Michele Tartara
  to the jobs requesting them, and taking care of freeing them up if the jobs
164 ffedf64d Michele Tartara
  holding them crash or are terminated before releasing them.  In order to do
165 ffedf64d Michele Tartara
  this, each job, after being spawned by LuxiD, will open a local unix socket
166 ffedf64d Michele Tartara
  that will be used to communicate with it, and will be destroyed when the job
167 ffedf64d Michele Tartara
  terminates.  LuxiD will be able to check, after a timeout, whether the job is
168 ffedf64d Michele Tartara
  still running by connecting here, and to ask WConfD to forcefully remove the
169 ffedf64d Michele Tartara
  locks if the socket is closed.
170 ffedf64d Michele Tartara
  Also, WConfD should hold a serialized list of the locks and their owners in a
171 ffedf64d Michele Tartara
  file (``locks.data``), so that it can keep track of their status in case it
172 ffedf64d Michele Tartara
  crashes and needs to be restarted (by asking LuxiD which of them are still
173 ffedf64d Michele Tartara
  running).
174 ffedf64d Michele Tartara
  Interaction with this daemon will be performed using Unix sockets.
175 ffedf64d Michele Tartara
176 ffedf64d Michele Tartara
``Configuration query daemon (RConfD)``
177 ffedf64d Michele Tartara
  It is written in Haskell, and it corresponds to the old ConfD. It will run on
178 ffedf64d Michele Tartara
  all the master candidates and it will serve information about the the static
179 ffedf64d Michele Tartara
  configuration of the cluster (the one contained in ``config.data``). The
180 ffedf64d Michele Tartara
  provided information will be highly available (as in: a response will be
181 ffedf64d Michele Tartara
  available as long as a stable-enough connection between the client and at
182 ffedf64d Michele Tartara
  least one working master candidate is available) and its freshness will be
183 ffedf64d Michele Tartara
  best effort (the most recent reply from any of the master candidates will be
184 ffedf64d Michele Tartara
  returned, but it might still be older than the one available through WConfD).
185 ffedf64d Michele Tartara
  The information will be served through the ConfD protocol.
186 ffedf64d Michele Tartara
187 ffedf64d Michele Tartara
``Rapi daemon (RapiD)``
188 ffedf64d Michele Tartara
  It remains basically unchanged, with the only difference that all of its LUXI
189 ffedf64d Michele Tartara
  query are directed towards LuxiD instead of being split between MasterD and
190 ffedf64d Michele Tartara
  QueryD.
191 ffedf64d Michele Tartara
192 ffedf64d Michele Tartara
``Monitoring daemon (MonD)``
193 ffedf64d Michele Tartara
  It remains unaffected by the changes in this design document. It will just get
194 ffedf64d Michele Tartara
  some of the data it needs from RConfD instead of the old ConfD, but the
195 ffedf64d Michele Tartara
  interfaces of the two are identical.
196 ffedf64d Michele Tartara
197 ffedf64d Michele Tartara
``Node daemon (NodeD)``
198 ffedf64d Michele Tartara
  It remains unaffected by the changes proposed in the design document. The only
199 ffedf64d Michele Tartara
  difference being that it will receive its RPCs from LuxiD (for job queue
200 ffedf64d Michele Tartara
  replication), from WConfD (for configuration replication) and for the
201 ffedf64d Michele Tartara
  processes executing single jobs (for all the operations to be performed by
202 ffedf64d Michele Tartara
  nodes) instead of receiving them just from MasterD.
203 ffedf64d Michele Tartara
204 ffedf64d Michele Tartara
This restructuring will allow us to reorganize and improve the codebase,
205 ffedf64d Michele Tartara
introducing cleaner interfaces and giving well defined and more restricted tasks
206 ffedf64d Michele Tartara
to each daemon.
207 ffedf64d Michele Tartara
208 ffedf64d Michele Tartara
Furthermore, having more well-defined interfaces will allow us to have easier
209 ffedf64d Michele Tartara
upgrade procedures, and to work towards the possibility of upgrading single
210 ffedf64d Michele Tartara
components of a cluster one at a time, without the need for immediately
211 ffedf64d Michele Tartara
upgrading the entire cluster in a single step.
212 ffedf64d Michele Tartara
213 ffedf64d Michele Tartara
214 ffedf64d Michele Tartara
Implementation
215 ffedf64d Michele Tartara
==============
216 ffedf64d Michele Tartara
217 ffedf64d Michele Tartara
While performing this refactoring, we aim to increase the amount of
218 ffedf64d Michele Tartara
Haskell code, thus benefiting from the additional type safety provided by its
219 ffedf64d Michele Tartara
wide compile-time checks. In particular, all the job queue management and the
220 ffedf64d Michele Tartara
configuration management daemon will be written in Haskell, taking over the role
221 ffedf64d Michele Tartara
currently fulfilled by Python code executed as part of MasterD.
222 ffedf64d Michele Tartara
223 ffedf64d Michele Tartara
The changes describe by this design document are quite extensive, therefore they
224 ffedf64d Michele Tartara
will not be implemented all at the same time, but through a sequence of steps,
225 ffedf64d Michele Tartara
leaving the codebase in a consistent and usable state.
226 ffedf64d Michele Tartara
227 ffedf64d Michele Tartara
#. Rename QueryD to LuxiD.
228 ffedf64d Michele Tartara
   A part of LuxiD, the one replying to configuration
229 ffedf64d Michele Tartara
   queries including live information about the system, already exists in the
230 ffedf64d Michele Tartara
   form of QueryD. This is being renamed to LuxiD, and will form the first part
231 ffedf64d Michele Tartara
   of the new daemon. NB: this is happening starting from Ganeti 2.8. At the
232 ffedf64d Michele Tartara
   beginning, only the already existing queries will be replied to by LuxiD.
233 ffedf64d Michele Tartara
   More queries will be implemented in the next versions.
234 ffedf64d Michele Tartara
235 ffedf64d Michele Tartara
#. Let LuxiD be the interface for the queries and MasterD be their executor.
236 ffedf64d Michele Tartara
   Currently, MasterD is the only responsible for receiving and executing LUXI
237 ffedf64d Michele Tartara
   queries, and for managing the jobs they create.
238 ffedf64d Michele Tartara
   Receiving the queries and managing the job queue will be extracted from
239 ffedf64d Michele Tartara
   MasterD into LuxiD.
240 ffedf64d Michele Tartara
   Actually executing jobs will still be done by MasterD, that contains all the
241 ffedf64d Michele Tartara
   logic for doing that and for properly managing locks and the configuration.
242 ce10eb31 Klaus Aehlig
   At this stage, scheduling will simply consist in starting jobs until a fixed
243 ce10eb31 Klaus Aehlig
   maximum number of simultaneously running jobs is reached.
244 ffedf64d Michele Tartara
245 ffedf64d Michele Tartara
#. Extract WConfD from MasterD.
246 ffedf64d Michele Tartara
   The logic for managing the configuration file is factored out to the
247 ffedf64d Michele Tartara
   dedicated WConfD daemon. All configuration changes, currently executed
248 ffedf64d Michele Tartara
   directly by MasterD, will be changed to be IPC requests sent to the new
249 ffedf64d Michele Tartara
   daemon.
250 ffedf64d Michele Tartara
251 ffedf64d Michele Tartara
#. Extract locking management from MasterD.
252 ffedf64d Michele Tartara
   The logic for managing and granting locks is extracted to WConfD as well.
253 ffedf64d Michele Tartara
   Locks will not be taken directly anymore, but asked via IPC to WConfD.
254 ffedf64d Michele Tartara
   This step can be executed on its own or at the same time as the previous one.
255 ffedf64d Michele Tartara
256 ffedf64d Michele Tartara
#. Jobs are executed as processes.
257 ffedf64d Michele Tartara
   The logic for running jobs is rewritten so that each job can be managed by an
258 ffedf64d Michele Tartara
   independent process. LuxiD will spawn a new (Python) process for every single
259 ffedf64d Michele Tartara
   job. The RPCs will remain unchanged, and the LU code will stay as is as much
260 ffedf64d Michele Tartara
   as possible.
261 ffedf64d Michele Tartara
   MasterD will cease to exist as a deamon on its own at this point, but not
262 ffedf64d Michele Tartara
   before.
263 ffedf64d Michele Tartara
264 ce10eb31 Klaus Aehlig
#. Improve job scheduling algorithm.
265 ce10eb31 Klaus Aehlig
   The simple algorithm for scheduling jobs will be replaced by a more
266 ce10eb31 Klaus Aehlig
   intelligent one. Also, the implementation of :doc:`design-optables` can be
267 ce10eb31 Klaus Aehlig
   started.
268 ce10eb31 Klaus Aehlig
269 2de55c83 Petr Pudlak
Job death detection
270 2de55c83 Petr Pudlak
-------------------
271 2de55c83 Petr Pudlak
272 2de55c83 Petr Pudlak
**Requirements:**
273 2de55c83 Petr Pudlak
274 2de55c83 Petr Pudlak
- It must be possible to reliably detect a death of a process even under
275 2de55c83 Petr Pudlak
  uncommon conditions such as very heavy system load.
276 2de55c83 Petr Pudlak
- A daemon must be able to detect a death of a process even if the
277 2de55c83 Petr Pudlak
  daemon is restarted while the process is running.
278 2de55c83 Petr Pudlak
- The solution must not rely on being able to communicate with
279 2de55c83 Petr Pudlak
  a process.
280 2de55c83 Petr Pudlak
- The solution must work for the current situation where multiple jobs
281 2de55c83 Petr Pudlak
  run in a single process.
282 2de55c83 Petr Pudlak
- It must be POSIX compliant.
283 2de55c83 Petr Pudlak
284 2de55c83 Petr Pudlak
These conditions rule out simple solutions like checking a process ID
285 2de55c83 Petr Pudlak
(because the process might be eventually replaced by another process
286 2de55c83 Petr Pudlak
with the same ID) or keeping an open connection to a process.
287 2de55c83 Petr Pudlak
288 2de55c83 Petr Pudlak
**Solution:** As a job process is spawned, before attempting to
289 2de55c83 Petr Pudlak
communicate with any other process, it will create a designated empty
290 2de55c83 Petr Pudlak
lock file, open it, acquire an *exclusive* lock on it, and keep it open.
291 2de55c83 Petr Pudlak
When connecting to a daemon, the job process will provide it with the
292 2de55c83 Petr Pudlak
path of the file. If the process dies unexpectedly, the operating system
293 2de55c83 Petr Pudlak
kernel automatically cleans up the lock.
294 2de55c83 Petr Pudlak
295 2de55c83 Petr Pudlak
Therefore, daemons can check if a process is dead by trying to acquire
296 2de55c83 Petr Pudlak
a *shared* lock on the lock file in a non-blocking mode:
297 2de55c83 Petr Pudlak
298 2de55c83 Petr Pudlak
- If the locking operation succeeds, it means that the exclusive lock is
299 2de55c83 Petr Pudlak
  missing, therefore the process has died, but the lock
300 2de55c83 Petr Pudlak
  file hasn't been cleaned up yet. The daemon should release the lock
301 2de55c83 Petr Pudlak
  immediately. Optionally, the daemon may delete the lock file.
302 2de55c83 Petr Pudlak
- If the file is missing, the process has died and the lock file has
303 2de55c83 Petr Pudlak
  been cleaned up.
304 2de55c83 Petr Pudlak
- If the locking operation fails due to a lock conflict, it means
305 2de55c83 Petr Pudlak
  the process is alive.
306 2de55c83 Petr Pudlak
307 2de55c83 Petr Pudlak
Using shared locks for querying lock files ensures that the detection
308 2de55c83 Petr Pudlak
works correctly even if multiple daemons query a file at the same time.
309 2de55c83 Petr Pudlak
310 2de55c83 Petr Pudlak
A job should close and remove its lock file when completely finishes.
311 2de55c83 Petr Pudlak
The WConfD daemon will be responsible for removing stale lock files of
312 2de55c83 Petr Pudlak
jobs that didn't remove its lock files themselves.
313 2de55c83 Petr Pudlak
314 2de55c83 Petr Pudlak
**Considered alternatives:** An alternative to creating a separate lock
315 2de55c83 Petr Pudlak
file would be to lock the job status file. However, file locks are kept
316 2de55c83 Petr Pudlak
only as long as the file is open. Therefore any operation followed by
317 2de55c83 Petr Pudlak
closing the file would cause the process to release the lock. In
318 2de55c83 Petr Pudlak
particular, with jobs as threads, the master daemon wouldn't be able to
319 2de55c83 Petr Pudlak
keep locks and operate on job files at the same time.
320 2de55c83 Petr Pudlak
321 5eeb7168 Petr Pudlak
WConfD details
322 5eeb7168 Petr Pudlak
--------------
323 5eeb7168 Petr Pudlak
324 5eeb7168 Petr Pudlak
WConfD will communicate with its clients through a Unix domain socket for both
325 5eeb7168 Petr Pudlak
configuration management and locking. Clients can issue multiple RPC calls
326 5eeb7168 Petr Pudlak
through one socket. For each such a call the client sends a JSON request
327 5eeb7168 Petr Pudlak
document with a remote function name and data for its arguments. The server
328 5eeb7168 Petr Pudlak
replies with a JSON response document containing either the result of
329 5eeb7168 Petr Pudlak
signalling a failure.
330 5eeb7168 Petr Pudlak
331 5eeb7168 Petr Pudlak
There will be a special RPC call for identifying a client when connecting to
332 5eeb7168 Petr Pudlak
WConfD. The client will tell WConfD it's job number and process ID. WConfD will
333 5eeb7168 Petr Pudlak
fail any other RPC calls before a client identifies this way.
334 5eeb7168 Petr Pudlak
335 5eeb7168 Petr Pudlak
Any state associated with client processes will be mirrored on persistent
336 5eeb7168 Petr Pudlak
storage and linked to the identity of processes so that the WConfD daemon will
337 5eeb7168 Petr Pudlak
be able to resume its operation at any point after a restart or a crash. WConfD
338 5eeb7168 Petr Pudlak
will track each client's process start time along with its process ID to be
339 5eeb7168 Petr Pudlak
able detect if a process dies and it's process ID is reused.  WConfD will clear
340 5eeb7168 Petr Pudlak
all locks and other state associated with a client if it detects it's process
341 5eeb7168 Petr Pudlak
no longer exists.
342 5eeb7168 Petr Pudlak
343 5eeb7168 Petr Pudlak
Configuration management
344 5eeb7168 Petr Pudlak
++++++++++++++++++++++++
345 5eeb7168 Petr Pudlak
346 5eeb7168 Petr Pudlak
The new configuration management protocol will be implemented in the following
347 5eeb7168 Petr Pudlak
steps:
348 5eeb7168 Petr Pudlak
349 b0159850 Petr Pudlak
Step 1:
350 b0159850 Petr Pudlak
  #. Implement the following functions in WConfD and export them through
351 b0159850 Petr Pudlak
     RPC:
352 b0159850 Petr Pudlak
353 b0159850 Petr Pudlak
     - Obtain a single internal lock, either in shared or
354 b0159850 Petr Pudlak
       exclusive mode. This lock will substitute the current lock
355 b0159850 Petr Pudlak
       ``_config_lock`` in config.py.
356 b0159850 Petr Pudlak
     - Release the lock.
357 b0159850 Petr Pudlak
     - Return the whole configuration data to a client.
358 b0159850 Petr Pudlak
     - Receive the whole configuration data from a client and replace the
359 b0159850 Petr Pudlak
       current configuration with it. Distribute it to master candidates
360 b0159850 Petr Pudlak
       and distribute the corresponding *ssconf*.
361 b0159850 Petr Pudlak
362 b0159850 Petr Pudlak
     WConfD must detect deaths of its clients (see `Job death
363 b0159850 Petr Pudlak
     detection`_) and release locks automatically.
364 b0159850 Petr Pudlak
365 b0159850 Petr Pudlak
  #. In config.py modify public methods that access configuration:
366 b0159850 Petr Pudlak
367 b0159850 Petr Pudlak
     - Instead of acquiring a local lock, obtain a lock from WConfD
368 b0159850 Petr Pudlak
       using the above functions
369 b0159850 Petr Pudlak
     - Fetch the current configuration from WConfD.
370 b0159850 Petr Pudlak
     - Use it to perform the method's task.
371 b0159850 Petr Pudlak
     - If the configuration was modified, send it to WConfD at the end.
372 b0159850 Petr Pudlak
     - Release the lock to WConfD.
373 b0159850 Petr Pudlak
374 b0159850 Petr Pudlak
  This will decouple the configuration management from the master daemon,
375 b0159850 Petr Pudlak
  even though the specific configuration tasks will still performed by
376 b0159850 Petr Pudlak
  individual jobs.
377 b0159850 Petr Pudlak
378 b0159850 Petr Pudlak
  After this step it'll be possible access the configuration from separate
379 b0159850 Petr Pudlak
  processes.
380 b0159850 Petr Pudlak
381 b0159850 Petr Pudlak
Step 2:
382 b0159850 Petr Pudlak
  #. Reimplement all current methods of ``ConfigWriter`` for reading and
383 b0159850 Petr Pudlak
     writing the configuration of a cluster in WConfD.
384 b0159850 Petr Pudlak
  #. Expose each of those functions in WConfD as a separate RPC function.
385 b0159850 Petr Pudlak
     This will allow easy future extensions or modifications.
386 b0159850 Petr Pudlak
  #. Replace ``ConfigWriter`` with a stub (preferably automatically
387 b0159850 Petr Pudlak
     generated from the Haskell code) that will contain the same methods
388 b0159850 Petr Pudlak
     as the current ``ConfigWriter`` and delegate all calls to its
389 b0159850 Petr Pudlak
     methods to WConfD.
390 b0159850 Petr Pudlak
391 b0159850 Petr Pudlak
Step 3:
392 b0159850 Petr Pudlak
  #. Remove WConfD's RPC functions for obtaining/releasing the single
393 b0159850 Petr Pudlak
     internal lock from Step 1.
394 b0159850 Petr Pudlak
  #. Remove WConfD's RPC functions for sending/receiving the whole
395 b0159850 Petr Pudlak
     configuration from Step 1.
396 5eeb7168 Petr Pudlak
397 5eeb7168 Petr Pudlak
Future aims:
398 5eeb7168 Petr Pudlak
399 5eeb7168 Petr Pudlak
-  Optionally refactor the RPC calls to reduce their number or improve their
400 5eeb7168 Petr Pudlak
   efficiency (for example by obtaining a larger set of data instead of
401 5eeb7168 Petr Pudlak
   querying items one by one).
402 5eeb7168 Petr Pudlak
403 5eeb7168 Petr Pudlak
Locking
404 5eeb7168 Petr Pudlak
+++++++
405 5eeb7168 Petr Pudlak
406 5eeb7168 Petr Pudlak
The new locking protocol will be implemented as follows:
407 5eeb7168 Petr Pudlak
408 5eeb7168 Petr Pudlak
Re-implement the current locking mechanism in WConfD and expose it for RPC
409 5eeb7168 Petr Pudlak
calls. All current locks will be mapped into a data structure that will
410 5eeb7168 Petr Pudlak
uniquely identify them (storing lock's level together with it's name).
411 5eeb7168 Petr Pudlak
412 5eeb7168 Petr Pudlak
WConfD will impose a linear order on locks. The order will be compatible
413 5eeb7168 Petr Pudlak
with the current ordering of lock levels so that existing code will work
414 5eeb7168 Petr Pudlak
without changes.
415 5eeb7168 Petr Pudlak
416 5eeb7168 Petr Pudlak
WConfD will keep the set of currently held locks for each client. The
417 5eeb7168 Petr Pudlak
protocol will allow the following operations on the set:
418 5eeb7168 Petr Pudlak
419 5eeb7168 Petr Pudlak
*Update:*
420 5eeb7168 Petr Pudlak
  Update the current set of locks according to a given list. The list contains
421 5eeb7168 Petr Pudlak
  locks and their desired level (release / shared / exclusive). To prevent
422 5eeb7168 Petr Pudlak
  deadlocks, WConfD will check that all newly requested locks (or already held
423 5eeb7168 Petr Pudlak
  locks requested to be upgraded to *exclusive*) are greater in the sense of
424 5eeb7168 Petr Pudlak
  the linear order than all currently held locks, and fail the operation if
425 5eeb7168 Petr Pudlak
  not. Only the locks in the list will be updated, other locks already held
426 5eeb7168 Petr Pudlak
  will be left intact. If the operation fails, the client's lock set will be
427 5eeb7168 Petr Pudlak
  left intact.
428 5eeb7168 Petr Pudlak
*Opportunistic union:*
429 5eeb7168 Petr Pudlak
  Add as much as possible locks from a given set to the current set within a
430 5eeb7168 Petr Pudlak
  given timeout. WConfD will again check the proper order of locks and
431 5eeb7168 Petr Pudlak
  acquire only the ones that are allowed wrt. the current set.  Returns the
432 5eeb7168 Petr Pudlak
  set of acquired locks, possibly empty. Immediate. Never fails. (It would also
433 5eeb7168 Petr Pudlak
  be possible to extend the operation to try to wait until a given number of
434 5eeb7168 Petr Pudlak
  locks is available, or a given timeout elapses.)
435 5eeb7168 Petr Pudlak
*List:*
436 5eeb7168 Petr Pudlak
  List the current set of held locks. Immediate, never fails.
437 5eeb7168 Petr Pudlak
*Intersection:*
438 5eeb7168 Petr Pudlak
  Retain only a given set of locks in the current one. This function is
439 5eeb7168 Petr Pudlak
  provided for convenience, it's redundant wrt. *list* and *update*. Immediate,
440 5eeb7168 Petr Pudlak
  never fails.
441 5eeb7168 Petr Pudlak
442 5eeb7168 Petr Pudlak
After this step it'll be possible to use locks from jobs as separate processes.
443 5eeb7168 Petr Pudlak
444 5eeb7168 Petr Pudlak
The above set of operations allows the clients to use various work-flows. In particular:
445 5eeb7168 Petr Pudlak
446 5eeb7168 Petr Pudlak
Pessimistic strategy:
447 5eeb7168 Petr Pudlak
  Lock all potentially relevant resources (for example all nodes), determine
448 5eeb7168 Petr Pudlak
  which will be needed, and release all the others.
449 5eeb7168 Petr Pudlak
Optimistic strategy:
450 5eeb7168 Petr Pudlak
  Determine what locks need to be acquired without holding any. Lock the
451 5eeb7168 Petr Pudlak
  required set of locks. Determine the set of required locks again and check if
452 5eeb7168 Petr Pudlak
  they are all held. If not, release everything and restart.
453 5eeb7168 Petr Pudlak
454 5eeb7168 Petr Pudlak
.. COMMENTED OUT:
455 5eeb7168 Petr Pudlak
  Start with the smallest set of locks and when determining what more
456 5eeb7168 Petr Pudlak
  relevant resources will be needed, expand the set. If an *union* operation
457 5eeb7168 Petr Pudlak
  fails, release all locks, acquire the desired union and restart the
458 5eeb7168 Petr Pudlak
  operation so that all preconditions and possible concurrent changes are
459 5eeb7168 Petr Pudlak
  checked again.
460 5eeb7168 Petr Pudlak
461 5eeb7168 Petr Pudlak
Future aims:
462 5eeb7168 Petr Pudlak
463 5eeb7168 Petr Pudlak
-  Add more fine-grained locks to prevent unnecessary blocking of jobs. This
464 5eeb7168 Petr Pudlak
   could include locks on parameters of entities or locks on their states (so that
465 5eeb7168 Petr Pudlak
   a node remains online, but otherwise can change, etc.). In particular,
466 5eeb7168 Petr Pudlak
   adding, moving and removing instances currently blocks the whole node.
467 5eeb7168 Petr Pudlak
-  Add checks that all modified configuration parameters belong to entities
468 5eeb7168 Petr Pudlak
   the client has locked and log violations.
469 5eeb7168 Petr Pudlak
-  Make the above checks mandatory.
470 5eeb7168 Petr Pudlak
-  Automate optimistic locking and checking the locks in logical units.
471 5eeb7168 Petr Pudlak
   For example, this could be accomplished by allowing some of the initial
472 5eeb7168 Petr Pudlak
   phases of `LogicalUnit` (such as `ExpandNames` and `DeclareLocks`) to be run
473 5eeb7168 Petr Pudlak
   repeatedly, checking if the set of locks requested the second time is
474 5eeb7168 Petr Pudlak
   contained in the set acquired after the first pass.
475 5eeb7168 Petr Pudlak
-  Add the possibility for a job to reserve hardware resources such as disk
476 5eeb7168 Petr Pudlak
   space or memory on nodes. Most likely as a new, special kind of instances
477 5eeb7168 Petr Pudlak
   that would only block its resources and allow to be converted to a regular
478 5eeb7168 Petr Pudlak
   instance. This would allow long-running jobs such as instance creation or
479 5eeb7168 Petr Pudlak
   move to lock the corresponding nodes, acquire the resources and turn the
480 5eeb7168 Petr Pudlak
   locks into shared ones, keeping an exclusive lock only on the instance.
481 5eeb7168 Petr Pudlak
-  Use more sophisticated algorithm for preventing deadlocks such as a
482 5eeb7168 Petr Pudlak
   `wait-for graph`_. This would allow less *union* failures and allow more
483 5eeb7168 Petr Pudlak
   optimistic, scalable acquisition of locks.
484 5eeb7168 Petr Pudlak
485 5eeb7168 Petr Pudlak
.. _`wait-for graph`: http://en.wikipedia.org/wiki/Wait-for_graph
486 5eeb7168 Petr Pudlak
487 5eeb7168 Petr Pudlak
488 ffedf64d Michele Tartara
Further considerations
489 ffedf64d Michele Tartara
======================
490 ffedf64d Michele Tartara
491 ffedf64d Michele Tartara
There is a possibility that a job will finish performing its task while LuxiD
492 ffedf64d Michele Tartara
and/or WConfD will not be available.
493 9269d118 Klaus Aehlig
In order to deal with this situation, each job will update its job file
494 9269d118 Klaus Aehlig
in the queue. This is race free, as LuxiD will no longer touch the job file,
495 9269d118 Klaus Aehlig
once the job is started; a corollary of this is that the job also has to
496 9269d118 Klaus Aehlig
take care of replicating updates to the job file. LuxiD will watch job files for
497 9269d118 Klaus Aehlig
changes to determine when a job as cleanly finished. To determine jobs
498 9269d118 Klaus Aehlig
that died without having the chance of updating the job file, the `Job death
499 9269d118 Klaus Aehlig
detection`_ mechanism will be used.
500 ffedf64d Michele Tartara
501 ffedf64d Michele Tartara
.. vim: set textwidth=72 :
502 ffedf64d Michele Tartara
.. Local Variables:
503 ffedf64d Michele Tartara
.. mode: rst
504 ffedf64d Michele Tartara
.. fill-column: 72
505 ffedf64d Michele Tartara
.. End: