==================================
Submitting jobs from logical units
==================================

.. contents:: :depth: 4

This is a design document about the innards of Ganeti's job processing.
Readers are advised to study previous design documents on the topic:

- :ref:`Original job queue <jqueue-original-design>`
- :ref:`Job priorities <jqueue-job-priority-design>`


Current state and shortcomings
==============================

Some Ganeti operations want to execute as many operations in parallel as
possible. Examples are evacuating or failing over a node (``gnt-node
evacuate``/``gnt-node failover``). Without changing large parts of the
code, e.g. the RPC layer, to be asynchronous, or using threads inside a
logical unit, only a single operation can be executed at a time per job.

Currently clients work around this limitation by retrieving the list of
desired targets and then re-submitting a number of jobs. This requires
logic to be kept in the client, in some cases leading to duplication
(e.g. CLI and RAPI).


Proposed changes
================

The job queue lock is guaranteed to be released while executing an
opcode/logical unit. This means an opcode can talk to the job queue and
submit more jobs. It then receives the job IDs, like any job submitter
using the LUXI interface would. These job IDs are returned to the
client, which then proceeds to wait for the jobs to finish.

Technically, the job queue already passes a number of callbacks to the
opcode processor. These are used for giving user feedback, notifying the
job queue of an opcode having gotten its locks, and checking whether the
opcode has been cancelled. A new callback function is added to submit
jobs. Its signature and result will be equivalent to the job queue's
existing ``SubmitManyJobs`` function.
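
As a rough illustration of the new callback, the following sketch models
the callbacks handed to the opcode processor as a small class. Apart
from ``SubmitManyJobs``, every name here (``FakeJobQueue``,
``ProcessorCallbacks``, ``Feedback``) is a hypothetical stand-in and not
taken from the Ganeti code base. The only assumption carried over from
this document is that the function takes a list of jobs and returns one
``(status, data)`` pair per job.

.. code-block:: python

  import itertools


  class FakeJobQueue(object):
      """In-memory stand-in for the job queue (hypothetical)."""

      def __init__(self, drained=False):
          self._ids = itertools.count(1)
          self._drained = drained
          self.jobs = {}

      def SubmitManyJobs(self, jobs):
          """Submits each job on its own; returns (status, data) pairs."""
          results = []
          for ops in jobs:
              if self._drained:
                  # A failed submission carries an error message
                  results.append((False, "Job queue is drained"))
                  continue
              job_id = str(next(self._ids))
              self.jobs[job_id] = list(ops)
              # A successful submission carries the new job ID
              results.append((True, job_id))
          return results


  class ProcessorCallbacks(object):
      """Callbacks passed from the job queue to the opcode processor."""

      def __init__(self, queue):
          self._queue = queue
          self.messages = []

      def Feedback(self, msg):
          # Existing kind of callback: collect user feedback
          self.messages.append(msg)

      def SubmitManyJobs(self, jobs):
          # New callback: same signature and result as the queue's function
          return self._queue.SubmitManyJobs(jobs)

Because the callback simply forwards to the queue, the opcode processor
needs no knowledge of how job IDs are allocated or why a submission
might be refused.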

Logical units can submit jobs by returning an instance of a special
container class with a list of jobs, each of which is a list of opcodes
(e.g. ``[[op1, op2], [op3]]``). The opcode processor will recognize
instances of the special class when used as a return value and will
submit the contained jobs. The submission status and job IDs returned by
the submission callback are used as the opcode's result. The result
should be encapsulated in a dictionary allowing for future extensions.

.. highlight:: javascript

Example::

  {
    "jobs": [
      (True, "8149"),
      (True, "21019"),
      (False, "Submission failed"),
      (True, "31594"),
      ],
  }
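
The mechanism described above could be sketched roughly as follows.
``ResultWithJobs``, ``ProcessResult`` and the ``submit_fn`` parameter
are names chosen for this illustration only; the design fixes neither
the class name nor the processor's internals.

.. code-block:: python

  class ResultWithJobs(object):
      """Container for jobs a logical unit wants submitted (hypothetical)."""

      def __init__(self, jobs):
          # Each job is a list of opcodes, e.g. [[op1, op2], [op3]]
          self.jobs = jobs


  def ProcessResult(result, submit_fn):
      """Post-processes the value returned by a logical unit.

      submit_fn stands in for the submission callback; it is assumed to
      accept a list of jobs and return one (status, data) pair per job.
      """
      if isinstance(result, ResultWithJobs):
          # Submit the contained jobs and wrap the outcome in a dictionary
          return {"jobs": submit_fn(result.jobs)}
      # Any other return value is passed through unchanged
      return result

Checking the type of the return value means logical units that do not
submit jobs keep working without modification.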

Job submissions can fail for a variety of reasons, e.g. a full or
drained job queue. Lists of jobs cannot be submitted atomically, meaning
some might fail while others succeed. The client is responsible for
handling such cases.
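
One way a client might cope with partial failures is to split the
opcode result into submitted job IDs and error messages before waiting
on anything. The helper below is a sketch under that assumption;
``SplitSubmissionResult`` is a hypothetical name, and the input follows
the dictionary layout of the example result shown earlier.

.. code-block:: python

  def SplitSubmissionResult(op_result):
      """Separates submitted job IDs from submission errors."""
      job_ids = []
      errors = []
      for (status, data) in op_result.get("jobs", []):
          if status:
              # data is the ID of a successfully submitted job
              job_ids.append(data)
          else:
              # data is the error message for a failed submission
              errors.append(data)
      return (job_ids, errors)


  (job_ids, errors) = SplitSubmissionResult({
      "jobs": [
          (True, "8149"),
          (True, "21019"),
          (False, "Submission failed"),
          (True, "31594"),
      ],
  })
  # job_ids == ["8149", "21019", "31594"]
  # errors == ["Submission failed"]

The client can then wait for the jobs in ``job_ids`` and decide
separately whether to report or retry the entries in ``errors``.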


Other discussed solutions
=========================

Instead of requiring the client to wait for the returned jobs, another
idea was to do so from within the submitting opcode in the master
daemon. While technically possible, doing so would have two major
drawbacks:

- Opcodes waiting for other jobs to finish block one job queue worker
  thread
- All locks must be released before starting the waiting process;
  failure to do so can lead to deadlocks

Instead of returning the job IDs as part of the normal opcode result,
introducing a new opcode field, e.g. ``op_jobids``, was discussed and
dismissed. A new field would touch many areas and possibly break some
assumptions. There were also questions about the semantics.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: