Revision 2de55c83
b/doc/design-daemons.rst | ||
---|---|---|
266 | 266 |
intelligent one. Also, the implementation of :doc:`design-optables` can be |
267 | 267 |
started. |
268 | 268 |
|
269 |
Job death detection |
|
270 |
------------------- |
|
271 |
|
|
272 |
**Requirements:** |
|
273 |
|
|
274 |
- It must be possible to reliably detect the death of a process even under |
|
275 |
uncommon conditions such as very heavy system load. |
|
276 |
- A daemon must be able to detect the death of a process even if the |
|
277 |
daemon is restarted while the process is running. |
|
278 |
- The solution must not rely on being able to communicate with |
|
279 |
a process. |
|
280 |
- The solution must work for the current situation where multiple jobs |
|
281 |
run in a single process. |
|
282 |
- It must be POSIX compliant. |
|
283 |
|
|
284 |
These conditions rule out simple solutions like checking a process ID |
|
285 |
(because the process might eventually be replaced by another process |
|
286 |
with the same ID) or keeping an open connection to a process. |
|
287 |
|
|
288 |
**Solution:** As a job process is spawned, before attempting to |
|
289 |
communicate with any other process, it will create a designated empty |
|
290 |
lock file, open it, acquire an *exclusive* lock on it, and keep it open. |
|
291 |
When connecting to a daemon, the job process will provide it with the |
|
292 |
path of the file. If the process dies unexpectedly, the operating system |
|
293 |
kernel automatically cleans up the lock. |
|
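The job-side steps above could look roughly like the following Python 3 sketch. The function name ``acquire_job_lock`` and the file permissions are assumptions made for illustration, not part of the design; ``fcntl.lockf`` is used because it maps to POSIX ``fcntl`` record locks.

```python
import fcntl
import os


def acquire_job_lock(path):
    """Create the lock file if needed and take an exclusive lock on it.

    Returns the open file descriptor. The caller must keep it open for
    the whole lifetime of the job: the kernel drops the lock when the
    process exits or the descriptor is closed.
    (Hypothetical helper, not an existing Ganeti function.)
    """
    # O_RDWR is needed because an exclusive (write) lock requires a
    # descriptor opened for writing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Non-blocking: fail immediately if someone else holds a lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd
```

The job would then pass ``path`` to the daemon it connects to and simply never close the returned descriptor until it finishes.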
294 |
|
|
295 |
Therefore, daemons can check if a process is dead by trying to acquire |
|
296 |
a *shared* lock on the lock file in a non-blocking mode: |
|
297 |
|
|
298 |
- If the locking operation succeeds, it means that the exclusive lock is |
|
299 |
missing, therefore the process has died, but the lock |
|
300 |
file hasn't been cleaned up yet. The daemon should release the lock |
|
301 |
immediately. Optionally, the daemon may delete the lock file. |
|
302 |
- If the file is missing, the process has died and the lock file has |
|
303 |
been cleaned up. |
|
304 |
- If the locking operation fails due to a lock conflict, it means |
|
305 |
the process is alive. |
|
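The three cases listed above can be sketched as a single check on the daemon side. This is a minimal Python 3 illustration under the assumption that the job holds a POSIX ``fcntl`` lock; the name ``job_is_dead`` is hypothetical.

```python
import errno
import fcntl
import os


def job_is_dead(path):
    """Return True if the job owning the lock file at `path` has died.

    (Hypothetical helper illustrating the check, not Ganeti code.)
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        # File is missing: the job died and the file was cleaned up.
        return True
    try:
        # Try to take a shared lock without blocking.
        fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    except OSError as err:
        if err.errno in (errno.EACCES, errno.EAGAIN):
            # Lock conflict: the exclusive lock is held, the job is alive.
            return False
        raise
    else:
        # Lock acquired: no exclusive lock exists, so the job is dead.
        # Release the shared lock immediately.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

Because the query only ever takes a shared lock, two daemons running this check at the same time do not conflict with each other.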
306 |
|
|
307 |
Using shared locks for querying lock files ensures that the detection |
|
308 |
works correctly even if multiple daemons query a file at the same time. |
|
309 |
|
|
310 |
A job should close and remove its lock file when it has completely finished. |
|
311 |
The WConfD daemon will be responsible for removing stale lock files of |
|
312 |
jobs that didn't remove their lock files themselves. |
|
313 |
|
|
314 |
**Considered alternatives:** An alternative to creating a separate lock |
|
315 |
file would be to lock the job status file. However, file locks are kept |
|
316 |
only as long as the file is open. Therefore any operation followed by |
|
317 |
closing the file would cause the process to release the lock. In |
|
318 |
particular, with jobs as threads, the master daemon wouldn't be able to |
|
319 |
keep locks and operate on job files at the same time. |
|
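The problem with locking the job status file can be reproduced with a small Python 3 experiment. It assumes POSIX ``fcntl`` record-lock semantics, under which closing *any* descriptor for a file releases all locks the process holds on that file. All names here (``PROBE``, ``seen_as_locked``, ``demo_lock_lost_on_close``) are hypothetical.

```python
import fcntl
import os
import subprocess
import sys

# Code run in a separate process: exit status 1 if a shared lock
# cannot be taken (someone holds an exclusive lock), 0 otherwise.
PROBE = (
    "import fcntl, os, sys\n"
    "fd = os.open(sys.argv[1], os.O_RDONLY)\n"
    "try:\n"
    "    fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
    "sys.exit(0)\n"
)


def seen_as_locked(path):
    """Ask another process whether `path` appears exclusively locked."""
    return subprocess.call([sys.executable, "-c", PROBE, path]) == 1


def demo_lock_lost_on_close(path):
    """Lock a file, then do an unrelated open/read/close on it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.lockf(fd, fcntl.LOCK_EX)
    before = seen_as_locked(path)
    # An ordinary read of the same file through a second descriptor.
    with open(path) as status_file:
        status_file.read()
    # Closing that second descriptor has released our lock as well.
    after = seen_as_locked(path)
    os.close(fd)
    return before, after
```

On a system with these semantics, ``demo_lock_lost_on_close`` is expected to return ``(True, False)``: the lock is visible before the unrelated read and gone after it, even though the original descriptor is still open. A dedicated lock file avoids this because nothing else in the process opens it.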
320 |
|
|
269 | 321 |
WConfD details |
270 | 322 |
-------------- |
271 | 323 |
|