Job Queue
=========

.. contents::

    
Overview
--------

In Ganeti 1.2, operations in a cluster have to be done in a serialized way.
Virtually any operation locks the whole cluster by grabbing the global lock.
Other commands can't return before all work has been done.

By implementing a job queue and granular locking, we can lower the latency of
command execution inside a Ganeti cluster.

    
Detailed Design
---------------

    
Job execution—“Life of a Ganeti job”
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    
#. A job gets submitted by the client. A new job identifier is generated and
   assigned to the job. The job is then automatically replicated to all nodes
   in the cluster. The identifier is returned to the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job has
   to wait and the first worker finishing its work will grab it. Otherwise any
   of the waiting threads will pick up the new job.
#. The client waits for job status updates by calling a waiting RPC function.
   Log messages may be shown to the user. Until the job is started, it can
   also be canceled.
#. As soon as the job is finished, its final result and status can be retrieved
   from the server.
#. If the client archives the job, it gets moved to a history directory.
   This could also be done regularly using a cron script.
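
The first two steps can be sketched roughly as follows. This is a minimal
illustration built on Python's standard ``queue`` and ``threading`` modules,
not Ganeti's actual implementation. ``submit_job``, ``NUM_WORKERS``,
``results`` and the job dictionary layout are names assumed for the example,
and replication to the other nodes is left out::

  import itertools
  import queue
  import threading

  NUM_WORKERS = 2  # assumed pool size, for illustration only

  _pending = queue.Queue()      # jobs waiting for a free worker
  _serial = itertools.count(1)  # source of new job identifiers
  _serial_lock = threading.Lock()
  results = {}                  # job id -> list of opcode results


  def submit_job(ops):
      """Assign a new job identifier, enqueue the job and return the id."""
      with _serial_lock:
          job_id = next(_serial)
      _pending.put({"id": job_id, "ops": ops})
      return job_id


  def _worker():
      """Wait for jobs; whichever idle worker gets there first runs the job."""
      while True:
          job = _pending.get()
          if job is None:  # sentinel: shut down this worker
              break
          results[job["id"]] = [op() for op in job["ops"]]
          _pending.task_done()


  def run_demo():
      """Start the pool, submit five one-opcode jobs and wait for them."""
      workers = [threading.Thread(target=_worker) for _ in range(NUM_WORKERS)]
      for worker in workers:
          worker.start()
      ids = [submit_job([lambda n=n: n * 2]) for n in range(5)]
      _pending.join()  # block until every submitted job has been processed
      for _ in workers:
          _pending.put(None)
      for worker in workers:
          worker.join()
      return ids

With more jobs than workers, the surplus jobs simply stay in ``_pending``
until a worker becomes free, which mirrors the behaviour described in the
second step.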

    
Queue structure
~~~~~~~~~~~~~~~

    
All file operations have to be done atomically by writing to a temporary file
and then renaming it into place. Except for log messages, every change in a
job is stored and replicated to other nodes.

    
::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)
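
The write-then-rename pattern required above could look roughly like this. It
is a sketch using only the Python standard library; ``write_atomically`` and
``save_job`` are hypothetical helper names, not functions taken from Ganeti,
and the ``job-<id>`` file name follows the layout shown in the listing::

  import json
  import os
  import tempfile


  def write_atomically(path, data):
      """Write ``data`` to ``path`` so readers never see a partial file."""
      directory = os.path.dirname(path) or "."
      # The temporary file must live in the same directory (same filesystem),
      # otherwise the final rename is not atomic.
      fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
      try:
          with os.fdopen(fd, "w") as tmp:
              tmp.write(data)
              tmp.flush()
              os.fsync(tmp.fileno())
          os.replace(tmp_path, path)  # atomic rename on POSIX
      except BaseException:
          os.unlink(tmp_path)  # do not leave stale temporary files behind
          raise


  def save_job(queue_dir, job_id, job):
      """Store a JSON-encoded job as ``job-<id>`` inside ``queue_dir``."""
      write_atomically(os.path.join(queue_dir, "job-%d" % job_id),
                       json.dumps(job))

A reader that opens ``job-1`` at any moment sees either the previous complete
version or the new complete version, never a half-written file.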

    
Locking
~~~~~~~

    
Locking in the job queue is a complicated topic. The queue is called from more
than one thread and must be thread-safe. For simplicity, a single lock is used
for the whole job queue.

A more detailed description can be found in doc/locking.txt.
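
One common way to get a single queue-wide lock is a decorator that wraps every
public method. The sketch below illustrates that idea only; ``JobQueue``,
``locked``, ``submit`` and ``count`` are names invented for this example and
say nothing about how doc/locking.txt structures the real code::

  import functools
  import threading


  def locked(method):
      """Run ``method`` while holding the queue-wide lock."""
      @functools.wraps(method)
      def wrapper(self, *args, **kwargs):
          with self._lock:
              return method(self, *args, **kwargs)
      return wrapper


  class JobQueue:
      """Toy queue whose public methods all share one lock."""

      def __init__(self):
          self._lock = threading.Lock()
          self._jobs = {}
          self._last_id = 0

      @locked
      def submit(self, ops):
          """Store the opcodes under a fresh identifier and return it."""
          self._last_id += 1
          self._jobs[self._last_id] = list(ops)
          return self._last_id

      @locked
      def count(self):
          """Return the number of jobs currently stored."""
          return len(self._jobs)

Because ``threading.Lock`` is not reentrant, a decorated method in this sketch
must not call another decorated method, or it would deadlock.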

    
Internal RPC
~~~~~~~~~~~~

    
RPC calls available between Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.

    
Client RPC
~~~~~~~~~~

    
RPC between Ganeti clients and the Ganeti master daemon supports the following
operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The identifier is
  guaranteed to be unique during the lifetime of a cluster.
WaitForJobChange(job_id, fields, […], timeout)
  Waits until a job changes or a timeout expires. Whether a job has changed
  is determined by the fields passed and the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if the job
  is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if the
  job has not been canceled or finished.
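
A waiting call of this kind is commonly built on a condition variable. The
sketch below uses ``threading.Condition`` to block until the watched fields
differ from the values the caller last saw, or until the timeout expires.
``WatchedJob``, ``update`` and ``wait_for_change`` are hypothetical names. The
sketch compares field values only and ignores log messages and the elided
arguments of the real call::

  import threading


  class WatchedJob:
      """Toy job whose field changes wake up waiting callers."""

      def __init__(self, **fields):
          self._fields = dict(fields)
          self._changed = threading.Condition()

      def update(self, **fields):
          """Change some fields and notify everyone who is waiting."""
          with self._changed:
              self._fields.update(fields)
              self._changed.notify_all()

      def wait_for_change(self, names, previous, timeout):
          """Return the new values of ``names``, or None on timeout."""
          def current():
              return [self._fields.get(name) for name in names]

          with self._changed:
              changed = self._changed.wait_for(
                  lambda: current() != previous, timeout=timeout)
              return current() if changed else None

A client would call ``wait_for_change`` in a loop, passing the values it
received last time, until the job reaches a final state.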

    
103

    
104
Job and opcode status
105
~~~~~~~~~~~~~~~~~~~~~
106

    
Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but has not started yet.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to the
Error status once the master has started again.
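
The states and the rules attached to them can be modelled as a small
enumeration. This is an illustrative sketch. ``Status``, ``can_cancel``,
``can_archive`` and ``recover_after_restart`` are names assumed here. Treating
only Queued jobs as cancelable and leaving Waiting jobs untouched after a
restart are assumptions as well, since the text does not settle either
point::

  import enum


  class Status(enum.Enum):
      QUEUED = "queued"
      WAITING = "waiting"
      RUNNING = "running"
      CANCELED = "canceled"
      SUCCESS = "success"
      ERROR = "error"


  # States after which a job no longer changes.
  FINAL = {Status.CANCELED, Status.SUCCESS, Status.ERROR}


  def can_cancel(status):
      """Assumption: a job can be canceled only while it is still queued."""
      return status is Status.QUEUED


  def can_archive(status):
      """Only canceled or finished jobs may be archived."""
      return status in FINAL


  def recover_after_restart(jobs):
      """Mark jobs that were running when the master was aborted as Error."""
      return {job_id: Status.ERROR if status is Status.RUNNING else status
              for job_id, status in jobs.items()}

``recover_after_restart`` takes and returns a mapping from job identifier to
``Status``, so it can be applied to whatever the master loads from the queue
directory on startup.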

    
History
~~~~~~~

    
Archived jobs are kept in a separate directory, /var/lib/ganeti/queue/archive/.
Keeping them out of the main queue directory is meant to speed up queue
handling.

    
Ganeti updates
~~~~~~~~~~~~~~

    
The queue has to be completely empty before any Ganeti update that changes the
job queue structure.