Job Queue
=========

.. contents::

Overview
--------

In Ganeti 1.2, operations in a cluster have to be done in a serialized way.
Virtually any operation locks the whole cluster by grabbing the global lock.
Other commands can't return before all work has been done.

By implementing a job queue and granular locking, we can lower the latency of
command execution inside a Ganeti cluster.


Detailed Design
---------------

Job execution: “Life of a Ganeti job”
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. The client submits a job. A new job identifier is generated and assigned
   to the job. The job is then automatically replicated to all nodes in the
   cluster. The identifier is returned to the client.
#. A pool of worker threads waits for new jobs. If a worker is idle, it picks
   up the new job immediately. If all workers are busy, the job waits and the
   first worker to finish its current work picks it up.
#. The client waits for job status updates by calling a waiting RPC function.
   Log messages may be shown to the user. Until the job is started, it can
   also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
   This could also be done regularly using a cron script.

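The worker pool described in the second step can be illustrated with a minimal
Python sketch. It assumes a fixed number of threads blocking on a shared
``queue.Queue``. The names ``run_pool`` and ``process_job`` and the pool size
are hypothetical and are not taken from the Ganeti code base.

```python
import queue
import threading


def run_pool(jobs, process_job, num_workers=3):
    """Run every job ID in `jobs` on a fixed pool of worker threads.

    Hypothetical helper: idle workers block on the shared queue, so a new
    job is taken by whichever worker becomes free first.
    """
    pending = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            job_id = pending.get()
            if job_id is None:          # sentinel: no more work for this worker
                break
            outcome = process_job(job_id)
            with results_lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job_id in jobs:
        pending.put(job_id)
    for _ in threads:
        pending.put(None)               # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

With two workers and five jobs, three jobs sit in the queue until a worker
finishes, which mirrors the "first worker to finish picks it up" behaviour.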

Queue structure
~~~~~~~~~~~~~~~

All file operations have to be done atomically by writing to a temporary file
and subsequently renaming it. Except for log messages, every change in a job is
stored and replicated to other nodes.

::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)

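The write-then-rename rule and the role of the ``serial`` file can be sketched
as follows. This is only an illustration under stated assumptions: the helper
names ``atomic_write`` and ``new_job``, the JSON keys, and starting the serial
at 0 for an empty queue are all hypothetical. Replication to other nodes is
left out.

```python
import json
import os
import tempfile


def atomic_write(path, data):
    """Write the string `data` to `path` via a temporary file and a rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the new one, never a partial
        # write (atomic on POSIX when both paths are on one filesystem).
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if the rename did not happen.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def new_job(queue_dir, ops):
    """Allocate the next job ID from `serial` and store the job as JSON."""
    serial_path = os.path.join(queue_dir, "serial")
    try:
        with open(serial_path) as serial_file:
            last_id = int(serial_file.read().strip())
    except FileNotFoundError:
        last_id = 0                     # assumption: an empty queue starts at 0
    job_id = last_id + 1
    atomic_write(serial_path, "%d\n" % job_id)
    # Assumed JSON layout, for illustration only.
    job = {"id": job_id, "ops": ops, "status": "queued"}
    atomic_write(os.path.join(queue_dir, "job-%d" % job_id), json.dumps(job))
    return job_id
```

The temporary file is created in the queue directory itself so that the final
rename never crosses a filesystem boundary.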

Locking
~~~~~~~

Locking in the job queue is a complicated topic. The job queue is called from
more than one thread and must be thread-safe. For simplicity, a single lock is
used for the whole job queue.

A more detailed description can be found in doc/locking.txt.

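As a rough illustration of the single-lock approach, the sketch below guards
every public method of a toy in-memory queue with the same ``threading.Lock``.
The class ``SimpleJobQueue`` and its methods are hypothetical and much simpler
than the real queue. They only show that one coarse lock is enough to keep
concurrent callers from corrupting shared state.

```python
import threading


class SimpleJobQueue:
    """Toy in-memory queue guarded by one lock (hypothetical class)."""

    def __init__(self):
        self._lock = threading.Lock()   # the single queue-wide lock
        self._last_id = 0
        self._jobs = {}

    def submit(self, ops):
        """Store a new job and return its identifier."""
        with self._lock:
            self._last_id += 1
            self._jobs[self._last_id] = {"ops": list(ops), "status": "queued"}
            return self._last_id

    def status(self, job_id):
        """Return the status string of a stored job."""
        with self._lock:
            return self._jobs[job_id]["status"]
```

The trade-off is that all queue operations are serialized, which is acceptable
as long as each one holds the lock only briefly.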

Internal RPC
~~~~~~~~~~~~

RPC calls available between Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
~~~~~~~~~~

RPC between Ganeti clients and the Ganeti master daemon supports the following
operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The identifier is
  guaranteed to be unique during the lifetime of a cluster.
WaitForJobChange(job_id, fields, […], timeout)
  Waits until a job changes or a timeout expires. Whether a job counts as
  changed is determined by the fields passed and the last log message
  received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if the job
  is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if the
  job has not been canceled or finished.

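The blocking behaviour of a call like WaitForJobChange can be modelled with a
condition variable. The sketch below is a toy in-memory model, not the Ganeti
RPC itself: the class ``JobTable``, its methods, and the ``previous`` argument
are hypothetical, and log messages are ignored. A waiter is woken when any of
the requested fields differs from the values it last saw, or gives up once the
timeout expires.

```python
import threading


class JobTable:
    """Toy job store with a blocking wait (hypothetical, in-memory only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._jobs = {}

    def update(self, job_id, **values):
        """Change fields of a job and wake every waiting client."""
        with self._cond:
            self._jobs.setdefault(job_id, {}).update(values)
            self._cond.notify_all()

    def _snapshot(self, job_id, fields):
        job = self._jobs.get(job_id, {})
        return [job.get(name) for name in fields]

    def wait_for_change(self, job_id, fields, previous, timeout):
        """Return the new field values, or None if `timeout` seconds pass."""
        with self._cond:
            changed = self._cond.wait_for(
                lambda: self._snapshot(job_id, fields) != previous, timeout)
            return self._snapshot(job_id, fields) if changed else None
```

A client would typically call ``wait_for_change`` in a loop, passing the values
returned by the previous call, until the job reaches a final status.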

Job and opcode status
~~~~~~~~~~~~~~~~~~~~~

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but has not started yet.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to the
Error status once the master has started again.

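The states above, together with the rules for canceling, archiving and master
restarts, can be written down as a small Python sketch. Several points are
assumptions made for illustration: the function names are hypothetical, only
Queued jobs are treated as cancelable, Error is counted as "finished" for
archiving, and only Running jobs are changed after a master restart.

```python
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    WAITING = "waiting"
    RUNNING = "running"
    CANCELED = "canceled"
    SUCCESS = "success"
    ERROR = "error"


# Assumption: a job is "finished" once it ended in Success or Error.
FINISHED = {Status.SUCCESS, Status.ERROR}


def can_cancel(status):
    """Assumption: only a job that has not started yet can be canceled."""
    return status is Status.QUEUED


def can_archive(status):
    """Archiving fails unless the job was canceled or has finished."""
    return status is Status.CANCELED or status in FINISHED


def status_after_master_restart(status):
    """A job that was running when the master was aborted becomes Error."""
    return Status.ERROR if status is Status.RUNNING else status
```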

History
~~~~~~~

Archived jobs are kept in a separate directory, /var/lib/ganeti/queue/archive/.
Keeping them out of the main queue directory is meant to speed up queue
handling.


Ganeti updates
~~~~~~~~~~~~~~

The queue has to be completely empty before a Ganeti update that changes the
job queue structure.