Revision 2de55c83

b/doc/design-daemons.rst
   intelligent one. Also, the implementation of :doc:`design-optables` can be
   started.

  
Job death detection
-------------------

  
**Requirements:**

  
- It must be possible to reliably detect the death of a process even under
  uncommon conditions such as very heavy system load.
- A daemon must be able to detect the death of a process even if the
  daemon is restarted while the process is running.
- The solution must not rely on being able to communicate with
  a process.
- The solution must work for the current situation where multiple jobs
  run in a single process.
- It must be POSIX compliant.

  
These conditions rule out simple solutions like checking a process ID
(because the process might eventually be replaced by another process
with the same ID) or keeping an open connection to a process.

  
**Solution:** As a job process is spawned, before attempting to
communicate with any other process, it will create a designated empty
lock file, open it, acquire an *exclusive* lock on it, and keep it open.
When connecting to a daemon, the job process will provide it with the
path of the file. If the process dies unexpectedly, the operating system
kernel automatically releases the lock.
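
The job-side steps described above could look roughly like the following
Python sketch. It assumes POSIX ``fcntl`` record locks (through
``fcntl.lockf``); the function name ``acquire_job_lock`` and the file
permissions are hypothetical and are not taken from the actual code base.

.. code-block:: python

  import fcntl
  import os


  def acquire_job_lock(path):
      """Create the lock file, lock it exclusively, return the open fd.

      The caller must keep the returned descriptor open for the whole
      lifetime of the job process; closing it releases the lock.
      """
      # O_CREAT creates the empty file if needed. O_RDWR is used because
      # an exclusive fcntl lock needs a descriptor opened for writing.
      fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
      try:
          # Non-blocking: fail right away if another process holds a lock.
          fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
      except OSError:
          os.close(fd)
          raise
      return fd

A job would call ``acquire_job_lock`` once at startup, before talking to
any daemon, and then hand the same path to the daemon it connects to.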

  
Therefore, daemons can check if a process is dead by trying to acquire
a *shared* lock on the lock file in non-blocking mode:

  
- If the locking operation succeeds, the exclusive lock is missing:
  the process has died, but the lock file hasn't been cleaned up yet.
  The daemon should release the lock immediately. Optionally, the
  daemon may delete the lock file.
- If the file is missing, the process has died and the lock file has
  been cleaned up.
- If the locking operation fails due to a lock conflict, the process
  is alive.
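
The three cases above can be sketched as a single daemon-side check. This
is only an illustration, assuming POSIX ``fcntl`` locks as exposed by
Python's ``fcntl.lockf``; the function name ``is_job_process_dead`` is
hypothetical.

.. code-block:: python

  import errno
  import fcntl
  import os


  def is_job_process_dead(lock_path):
      """Return True if the job process owning lock_path is dead."""
      try:
          # A shared fcntl lock only needs a read-only descriptor.
          fd = os.open(lock_path, os.O_RDONLY)
      except FileNotFoundError:
          # File missing: the process died and the file was cleaned up.
          return True
      try:
          try:
              fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
          except OSError as err:
              if err.errno in (errno.EACCES, errno.EAGAIN):
                  # Lock conflict: the exclusive lock is held, so the
                  # process is alive.
                  return False
              raise
          # Lock acquired: nobody holds the exclusive lock, so the
          # process is dead. Release the shared lock immediately.
          fcntl.lockf(fd, fcntl.LOCK_UN)
          return True
      finally:
          os.close(fd)

The sketch leaves the optional deletion of the lock file to the caller,
and it re-raises any error other than a lock conflict instead of guessing
the state of the process.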

  
Using shared locks for querying lock files ensures that the detection
works correctly even if multiple daemons query a file at the same time,
because shared locks do not conflict with one another.

  
A job should close and remove its lock file when it has completely
finished. The WConfD daemon will be responsible for removing stale lock
files of jobs that didn't remove their lock files themselves.

  
**Considered alternatives:** An alternative to creating a separate lock
file would be to lock the job status file. However, file locks are kept
only as long as the file is open. Therefore any operation followed by
closing the file would cause the process to release the lock. In
particular, with jobs as threads, the master daemon wouldn't be able to
keep locks and operate on job files at the same time.
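
The limitation described above can be reproduced with a short experiment.
The sketch below assumes POSIX ``fcntl`` record locks, which belong to the
process rather than to a single descriptor (``flock``-style locks behave
differently); the helper names ``lock_is_held_by_other`` and ``demo`` are
hypothetical.

.. code-block:: python

  import fcntl
  import os


  def lock_is_held_by_other(path):
      """Probe from a forked child whether the parent locks path."""
      pid = os.fork()
      if pid == 0:
          # Child: fcntl locks are not inherited across fork, so this
          # probe conflicts with a lock held by the parent.
          status = 0
          try:
              fd = os.open(path, os.O_RDONLY)
              fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
          except OSError:
              status = 1  # conflict: the parent still holds the lock
          os._exit(status)
      _, wait_status = os.waitpid(pid, 0)
      return os.WEXITSTATUS(wait_status) == 1


  def demo(path):
      """Return (held_before, held_after) around an unrelated read."""
      lock_fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
      fcntl.lockf(lock_fd, fcntl.LOCK_EX)
      held_before = lock_is_held_by_other(path)
      # An unrelated read of the same file through a second descriptor.
      with open(path) as status_file:
          status_file.read()
      # Closing that second descriptor dropped the process's lock,
      # although lock_fd itself is still open.
      held_after = lock_is_held_by_other(path)
      os.close(lock_fd)
      return held_before, held_after

On a POSIX system ``demo`` is expected to return ``(True, False)``, which
is why a dedicated lock file that nothing else opens is the safer choice.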

  
WConfD details
--------------

  
