.. Source: doc/design-lu-generated-jobs.rst at revision 5d94c034; every
   line was last changed in commit ed9fda24 by Michael Hanselmann.

==================================
Submitting jobs from logical units
==================================

.. contents:: :depth: 4

This is a design document about the innards of Ganeti's job processing.
Readers are advised to study previous design documents on the topic:

- :ref:`Original job queue <jqueue-original-design>`
- :ref:`Job priorities <jqueue-job-priority-design>`


Current state and shortcomings
==============================

Some Ganeti operations want to execute as many operations in parallel as
possible. Examples are evacuating or failing over a node (``gnt-node
evacuate``/``gnt-node failover``). Without changing large parts of the
code, e.g. the RPC layer, to be asynchronous, or using threads inside a
logical unit, only a single operation can be executed at a time per job.

Currently clients work around this limitation by retrieving the list of
desired targets and then re-submitting a number of jobs. This requires
logic to be kept in the client, in some cases leading to duplication
(e.g. CLI and RAPI).


Proposed changes
================

The job queue lock is guaranteed to be released while executing an
opcode/logical unit. This means an opcode can talk to the job queue and
submit more jobs. It then receives the job IDs, like any job submitter
using the LUXI interface would. These job IDs are returned to the
client, which then proceeds to wait for the jobs to finish.

Technically, the job queue already passes a number of callbacks to the
opcode processor. These are used for giving user feedback, notifying the
job queue that an opcode has acquired its locks, and checking whether
the opcode has been cancelled. A new callback function is added to
submit jobs. Its signature and result will be equivalent to the job
queue's existing ``SubmitManyJobs`` function.
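
As a rough illustration of such a callback, the sketch below forwards
job submissions from a callbacks object to a job queue. The classes
``FakeJobQueue`` and ``OpExecCallbacks``, the ``max_jobs`` limit and the
error message are hypothetical names invented for this example; only the
function name ``SubmitManyJobs`` and the idea of one status/ID pair per
job are taken from this document.

.. code-block:: python

   import itertools


   class FakeJobQueue(object):
       """Hypothetical stand-in for the job queue, not Ganeti's class."""

       def __init__(self, max_jobs):
           self._max_jobs = max_jobs
           self._next_id = itertools.count(1)
           self._jobs = {}

       def SubmitManyJobs(self, jobs):
           """Submit each job on its own; return one pair per job."""
           results = []
           for ops in jobs:
               if len(self._jobs) >= self._max_jobs:
                   # Assumed failure mode: the queue is full
                   results.append((False, "Job queue is full"))
                   continue
               job_id = str(next(self._next_id))
               self._jobs[job_id] = list(ops)
               results.append((True, job_id))
           return results


   class OpExecCallbacks(object):
       """Hypothetical callbacks object given to the opcode processor."""

       def __init__(self, queue):
           self._queue = queue

       def SubmitManyJobs(self, jobs):
           """New callback with the queue function's signature and result."""
           return self._queue.SubmitManyJobs(jobs)

Keeping the callback a thin wrapper means the opcode processor sees the
same result format as any other job submitter.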

Logical units can submit jobs by returning an instance of a special
container class with a list of jobs, each of which is a list of opcodes
(e.g. ``[[op1, op2], [op3]]``). The opcode processor will recognize
instances of the special class when used as a return value and will
submit the contained jobs. The submission status and job IDs returned
by the submission callback are used as the opcode's result. This result
should be encapsulated in a dictionary, allowing for future extensions.

.. highlight:: javascript

Example::

  {
    "jobs": [
      (True, "8149"),
      (True, "21019"),
      (False, "Submission failed"),
      (True, "31594"),
    ],
  }
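
The interplay between the container class and the opcode processor could
look roughly like the following sketch. The names ``ResultWithJobs``,
``process_lu_result`` and ``fake_submit`` are assumptions made for
illustration and are not prescribed by this design; ``fake_submit``
merely stands in for the submission callback.

.. code-block:: python

   class ResultWithJobs(object):
       """Hypothetical container returned by a logical unit."""

       def __init__(self, jobs):
           # A list of jobs, each of which is a list of opcodes
           self.jobs = jobs


   def process_lu_result(result, submit_many_jobs):
       """Submit contained jobs if the LU returned the container class.

       Any other return value is passed through unchanged.
       """
       if isinstance(result, ResultWithJobs):
           # Wrap the (status, data) pairs in a dictionary so that
           # further keys can be added later
           return {"jobs": submit_many_jobs(result.jobs)}
       return result


   def fake_submit(jobs):
       """Stand-in submission callback; rejects jobs without opcodes."""
       results = []
       for idx, ops in enumerate(jobs):
           if ops:
               results.append((True, str(100 + idx)))
           else:
               results.append((False, "Submission failed"))
       return results


   outcome = process_lu_result(
       ResultWithJobs([["op1", "op2"], [], ["op3"]]), fake_submit)

Logical units returning ordinary values are unaffected, since only
instances of the container class trigger a submission.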

Job submissions can fail for a variety of reasons, e.g. a full or
drained job queue. Lists of jobs cannot be submitted atomically, meaning
some might fail while others succeed. The client is responsible for
handling such cases.
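
A client could separate accepted from rejected jobs along the lines of
the sketch below. The helper name ``split_submission_results`` is
hypothetical; the input mirrors the example result shown earlier in this
section. What the client does with the failures (retrying, reporting,
aborting) is left open here.

.. code-block:: python

   def split_submission_results(opcode_result):
       """Return (job_ids, errors) from an opcode result dictionary."""
       job_ids = []
       errors = []
       # Each entry is a (status, data) pair; unpacking also works if
       # the pairs arrive as two-element lists
       for status, data in opcode_result["jobs"]:
           if status:
               job_ids.append(data)
           else:
               errors.append(data)
       return job_ids, errors


   result = {
       "jobs": [
           (True, "8149"),
           (True, "21019"),
           (False, "Submission failed"),
           (True, "31594"),
       ],
   }
   job_ids, errors = split_submission_results(result)

The client would then wait only for the jobs in ``job_ids`` and decide
separately how to deal with the entries in ``errors``.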


Other discussed solutions
=========================

Instead of requiring the client to wait for the returned jobs, another
idea was to do so from within the submitting opcode in the master
daemon. While technically possible, doing so would have two major
drawbacks:

- Opcodes waiting for other jobs to finish block one job queue worker
  thread
- All locks must be released before starting the waiting process;
  failure to do so can lead to deadlocks

Instead of returning the job IDs as part of the normal opcode result,
introducing a new opcode field, e.g. ``op_jobids``, was discussed and
dismissed. A new field would touch many areas and possibly break some
assumptions. There were also questions about the semantics.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: