============
Chained jobs
============

.. contents:: :depth: 4

This is a design document about the innards of Ganeti's job processing.
Readers are advised to study previous design documents on the topic:

- :ref:`Original job queue <jqueue-original-design>`
- :ref:`Job priorities <jqueue-job-priority-design>`
- :doc:`LU-generated jobs <design-lu-generated-jobs>`


Current state and shortcomings
==============================

Ever since the introduction of the job queue with Ganeti 2.0 there have
been situations where we wanted to run several jobs in a specific
order. Due to the job queue's current design, such a guarantee cannot
be given. Jobs are run according to their priority, their ability to
acquire all necessary locks and other factors.

One way to work around this limitation is to do some kind of job
grouping in the client code. Once all jobs of a group have finished,
the next group is submitted and waited for. There are different kinds
of clients for Ganeti, some of which don't share code (e.g. Python
clients vs. htools). This design proposes a solution which would be
implemented as part of the job queue in the master daemon.


Proposed changes
================

With the implementation of :ref:`job priorities
<jqueue-job-priority-design>` the processing code was re-architected
and became a lot more versatile. It now returns jobs to the queue in
case the locks for an opcode can't be acquired, allowing other
jobs/opcodes to be run in the meantime.

The proposal is to add a new, optional property to opcodes to define
dependencies on other jobs. Job X could define opcodes with a
dependency on the success of job Y and would only be run once job Y is
finished. If there's a dependency on success and job Y failed, job X
would fail as well. Since such dependencies would use job IDs, the jobs
still need to be submitted in the right order.

.. pyassert::

   # Update description below if finalized job statuses change
   constants.JOBS_FINALIZED == frozenset([
     constants.JOB_STATUS_CANCELED,
     constants.JOB_STATUS_SUCCESS,
     constants.JOB_STATUS_ERROR,
     ])

The new attribute's value would be a list of two-valued tuples. Each
tuple contains a job ID and a list of requested statuses for the job
depended upon. Only final statuses are accepted
(:pyeval:`utils.CommaJoin(constants.JOBS_FINALIZED)`). An empty list is
equivalent to specifying all final statuses (except
:pyeval:`constants.JOB_STATUS_CANCELED`, which is treated specially).
An opcode runs only once all its dependency requirements have been
fulfilled.

Any job referring to a cancelled job is also cancelled unless it
explicitly lists :pyeval:`constants.JOB_STATUS_CANCELED` as a requested
status.

In case a referenced job cannot be found in the normal queue or the
archive, referring jobs fail because the status of the referenced job
cannot be determined.
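
The dependency rules described above can be sketched in Python. The
constant values mirror the finalized job statuses named in this
document, but the function and result names are illustrative, not
Ganeti's actual implementation::

  # Sketch of evaluating one dependency entry; names are illustrative.
  JOB_STATUS_SUCCESS = "success"
  JOB_STATUS_ERROR = "error"
  JOB_STATUS_CANCELED = "canceled"
  JOBS_FINALIZED = frozenset([
    JOB_STATUS_CANCELED,
    JOB_STATUS_SUCCESS,
    JOB_STATUS_ERROR,
    ])

  # Possible results of evaluating a single dependency
  (DEP_WAIT, DEP_CONTINUE, DEP_ERROR, DEP_CANCEL) = range(4)

  def EvaluateDependency(dep_status, requested):
    """Decide what to do with an opcode depending on another job.

    @param dep_status: status of the referenced job, or None if that
      job can be found neither in the queue nor in the archive
    @param requested: list of requested statuses; an empty list means
      all final statuses except "canceled"

    """
    if dep_status is None:
      # Referenced job is gone, its status can't be determined
      return DEP_ERROR

    if dep_status not in JOBS_FINALIZED:
      # Referenced job hasn't finalized yet, keep waiting
      return DEP_WAIT

    if (dep_status == JOB_STATUS_CANCELED and
        JOB_STATUS_CANCELED not in requested):
      # Cancellation propagates unless explicitly requested
      return DEP_CANCEL

    if requested and dep_status not in requested:
      # Finalized, but not with one of the requested statuses
      return DEP_ERROR

    return DEP_CONTINUE

An opcode would proceed only once every entry in its dependency list
evaluates to "continue".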

With this change, clients can submit all wanted jobs in the right order
and proceed to wait for changes on all these jobs (see
``cli.JobExecutor``). The master daemon will take care of executing
them in the right order, while still presenting the client with a
simple interface.

Clients using the ``SubmitManyJobs`` interface can use relative job IDs
(negative integers) to refer to jobs in the same submission.
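
Resolving such relative IDs could look as follows, assuming (as a
simplification) that a relative ID ``-N`` in the I-th submitted job
refers to the job N positions earlier in the same submission; the
function name and data layout are illustrative::

  # Sketch of replacing relative job IDs with absolute ones; names and
  # structure are illustrative, not Ganeti's actual implementation.
  def ResolveJobDependencies(jobs, first_job_id):
    """Replace relative job IDs with absolute IDs.

    @param jobs: list of jobs, each a list of opcodes; an opcode is a
      dict which may contain a "depend" entry
    @param first_job_id: absolute ID assigned to the first job

    """
    resolved = []
    for (idx, ops) in enumerate(jobs):
      new_ops = []
      for op in ops:
        op = dict(op)
        if "depend" in op:
          depend = []
          for (job_id, statuses) in op["depend"]:
            if job_id < 0:
              # Relative reference to an earlier job in this submission
              job_id = first_job_id + idx + job_id
            depend.append([job_id, statuses])
          op["depend"] = depend
        new_ops.append(op)
      resolved.append(new_ops)
    return resolved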

.. highlight:: javascript

Example data structures::

  # First job
  {
    "job_id": "6151",
    "ops": [
      { "OP_ID": "OP_INSTANCE_REPLACE_DISKS", ..., },
      { "OP_ID": "OP_INSTANCE_FAILOVER", ..., },
      ],
  }

  # Second job, runs in parallel with first job
  {
    "job_id": "7687",
    "ops": [
      { "OP_ID": "OP_INSTANCE_MIGRATE", ..., },
      ],
  }

  # Third job, depending on success of previous jobs
  {
    "job_id": "9218",
    "ops": [
      { "OP_ID": "OP_NODE_SET_PARAMS",
        "depend": [
          [6151, ["success"]],
          [7687, ["success"]],
          ],
        "offline": True, },
      ],
  }


Implementation details
----------------------

Status while waiting for dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Jobs waiting for dependencies are certainly not in the queue anymore
and therefore need to change their status from "queued". While waiting
for opcode locks the job is in the "waiting" status (the constant is
named ``JOB_STATUS_WAITLOCK``, but the actual value is ``waiting``).
There are the following possibilities:

#. Introduce a new status, e.g. "waitdeps".

   Pro:

   - Clients know for sure a job is waiting for dependencies, not locks

   Con:

   - Code and tests would have to be updated/extended for the new status
   - List of possible state transitions certainly wouldn't get simpler
   - Breaks backwards compatibility, older clients might get confused

#. Use the existing "waiting" status.

   Pro:

   - No client changes necessary, less code churn (note that there are
     clients which don't live in Ganeti core)
   - Clients don't need to know the difference between waiting for a job
     and waiting for a lock; it doesn't make a difference
   - Fewer state transitions (see commit ``5fd6b69479c0``, which removed
     many state transitions and disk writes)

   Con:

   - Not immediately visible what a job is waiting for, but it's the
     same issue with locks; this is the reason why the lock monitor
     (``gnt-debug locks``) was introduced; job dependencies can be shown
     as "locks" in the monitor

Based on these arguments, the proposal is to do the following:

- Rename the ``JOB_STATUS_WAITLOCK`` constant to ``JOB_STATUS_WAITING``
  to reflect its actual meaning: the job is waiting for something
- While waiting for dependencies and locks, jobs are in the "waiting"
  status
- Export dependency information in the lock monitor; example output::

    Name      Mode Owner Pending
    job/27491 -    -     success:job/34709,job/21459
    job/21459 -    -     success,error:job/14513

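
Producing the "Pending" column shown above could be sketched as
follows; this is an illustrative helper, not Ganeti's actual lock
monitor code::

  # Group referenced jobs by their requested statuses and format them
  # as "status,...:job/ID,job/ID" entries (illustrative helper).
  def FormatPending(dependencies):
    """Format (job_id, requested_statuses) pairs for the lock monitor."""
    by_status = {}
    for (job_id, statuses) in dependencies:
      by_status.setdefault(tuple(statuses), []).append(job_id)
    return " ".join(
      "%s:%s" % (",".join(statuses),
                 ",".join("job/%s" % i for i in job_ids))
      for (statuses, job_ids) in sorted(by_status.items()))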

Cost of deserialization
~~~~~~~~~~~~~~~~~~~~~~~

To determine the status of a dependency job the job queue must have
access to its data structure. Other queue operations already do this,
e.g. archiving, watching a job's progress and querying jobs.

Initially (Ganeti 2.0/2.1) the job queue shared the job objects in
memory and protected them using locks. Ganeti 2.2 (see :doc:`design
document <design-2.2>`) changed the queue to read and deserialize jobs
from disk. This significantly reduced locking and code complexity.
Nowadays inotify is used to wait for changes on job files when watching
a job's progress.

Reading from disk and deserializing certainly has some cost associated
with it, but it's a significantly simpler architecture than
synchronizing in memory with locks. At the stage where dependencies are
evaluated the queue lock is held in shared mode, so different workers
can read at the same time (deliberately ignoring CPython's interpreter
lock).

It is expected that the majority of executed jobs won't use
dependencies and therefore won't be affected.
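
The lookup itself amounts to reading and deserializing one file, along
these lines; the file name scheme and JSON layout here are simplified
assumptions for illustration, not Ganeti's actual on-disk format::

  import json
  import os.path

  def GetJobStatus(queue_dir, job_id):
    """Return the deserialized status of a job file, or None if the
    file can't be found (hypothetical layout for illustration).

    """
    path = os.path.join(queue_dir, "job-%s" % job_id)
    if not os.path.exists(path):
      # Referenced job is in neither the queue nor the archive
      return None
    with open(path) as fd:
      return json.load(fd).get("status")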


Other discussed solutions
=========================

Job-level attribute
-------------------

At first glance it might seem better to put dependencies on previous
jobs at the job level. However, it turns out that having the option of
defining only a single opcode in a job as having such a dependency can
be useful as well. The code complexity in the job queue is equivalent,
if not simpler.

Since opcodes are guaranteed to run in order, clients can just define
the dependency on the first opcode.

Another reason for the choice of an opcode-level attribute is that the
current LUXI interface for submitting jobs is a bit restricted and
would need to be changed to allow the addition of job-level attributes,
potentially requiring changes in all LUXI clients and/or breaking
backwards compatibility.


Client-side logic
-----------------

There's at least one implementation of a batched job executor twisted
into the ``burnin`` tool's code. While certainly possible, a
client-side solution should be avoided due to the different clients
already in use. For one, the :doc:`remote API <rapi>` client shouldn't
import non-standard modules. htools are written in Haskell and can't
use Python modules. A batched job executor contains quite some logic.
Even if cleanly abstracted in a (Python) library, sharing code between
different clients is difficult if not impossible.


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: