============
Chained jobs
============

    
.. contents:: :depth: 4

    
This is a design document about the innards of Ganeti's job processing.
Readers are advised to study previous design documents on the topic:

    
- :ref:`Original job queue <jqueue-original-design>`
- :ref:`Job priorities <jqueue-job-priority-design>`
- :doc:`LU-generated jobs <design-lu-generated-jobs>`

    
Current state and shortcomings
==============================

    
Ever since the introduction of the job queue with Ganeti 2.0 there have
been situations where we wanted to run several jobs in a specific order.
Due to the job queue's current design, such a guarantee cannot be
given. Jobs are run according to their priority, their ability to
acquire all necessary locks and other factors.

    
One way to work around this limitation is to do some kind of job
grouping in the client code. Once all jobs of a group have finished, the
next group is submitted and waited for. However, there are different
kinds of clients for Ganeti, some of which don't share code (e.g. Python
clients vs. htools), so each would need its own implementation. This
design therefore proposes a solution which would be implemented as part
of the job queue in the master daemon.

    
Proposed changes
================

    
With the implementation of :ref:`job priorities
<jqueue-job-priority-design>` the processing code was re-architected
and became a lot more versatile. It now returns jobs to the queue in
case the locks for an opcode can't be acquired, allowing other
jobs/opcodes to be run in the meantime.

    
The proposal is to add a new, optional property to opcodes to define
dependencies on other jobs. Job X could define opcodes with a dependency
on the success of job Y and would only be run once job Y is finished. If
there's a dependency on success and job Y failed, job X would fail as
well. Since such dependencies would use job IDs, the jobs still need to
be submitted in the right order.

    
.. pyassert::

   # Update description below if finalized job status change
   constants.JOBS_FINALIZED == frozenset([
     constants.JOB_STATUS_CANCELED,
     constants.JOB_STATUS_SUCCESS,
     constants.JOB_STATUS_ERROR,
     ])

    
The new attribute's value would be a list of two-valued tuples. Each
tuple contains a job ID and a list of requested statuses for the job
depended upon. Only final statuses are accepted
(:pyeval:`utils.CommaJoin(constants.JOBS_FINALIZED)`). An empty list is
equivalent to specifying all final statuses (except
:pyeval:`constants.JOB_STATUS_CANCELED`, which is treated specially).
An opcode runs only once all its dependency requirements have been
fulfilled.

    
Any job referring to a cancelled job is also cancelled unless it
explicitly lists :pyeval:`constants.JOB_STATUS_CANCELED` as a requested
status.

    
In case a referenced job cannot be found in the normal queue or the
archive, referring jobs fail as the status of the referenced job can't
be determined.
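
As a rough illustration of the rules above, the check for a single
dependency entry could look like the following Python sketch. It is not
the job queue's actual code: the function ``check_dependency``, the
outcome labels, the ``lookup_status`` callback (assumed to return
``None`` when a job is in neither the queue nor the archive) and the
status strings other than ``success`` are assumptions made for this
example. The sketch also assumes that any final status which was not
requested, other than a cancellation, makes the referring job fail.

```python
# Illustrative sketch only; every name and status string here is an
# assumption, not Ganeti's actual implementation.
CANCELED = "canceled"
SUCCESS = "success"
ERROR = "error"
FINALIZED = frozenset([CANCELED, SUCCESS, ERROR])

# Possible outcomes for the job that declares the dependency
WAIT, RUN, CANCEL, FAIL = "wait", "run", "cancel", "fail"


def check_dependency(dep_job_id, requested, lookup_status):
    """Evaluate one (job ID, requested statuses) dependency entry."""
    requested = frozenset(requested)
    if not requested <= FINALIZED:
        raise ValueError("only final statuses may be requested")
    if not requested:
        # An empty list means every final status except "canceled"
        requested = FINALIZED - {CANCELED}

    status = lookup_status(dep_job_id)
    if status is None:
        # Job found neither in the queue nor in the archive
        return FAIL
    if status not in FINALIZED:
        # Referenced job has not finished yet
        return WAIT
    if status in requested:
        return RUN
    if status == CANCELED:
        # Cancellation propagates unless it was explicitly requested
        return CANCEL
    # Assumption: any other unrequested final status is a failure
    return FAIL
```

With ``lookup_status`` backed by a plain dictionary,
``check_dependency(6151, ["success"], {6151: "success"}.get)`` yields
``"run"``, while a dependency on an unknown job ID yields ``"fail"``.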

    
With this change, clients can submit all wanted jobs in the right order
and proceed to wait for changes on all these jobs (see
``cli.JobExecutor``). The master daemon will take care of executing them
in the right order, while still presenting the client with a simple
interface.

    
Clients using the ``SubmitManyJobs`` interface can use relative job IDs
(negative integers) to refer to jobs in the same submission.
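
How relative IDs might be turned into absolute ones can be sketched as
follows. The exact meaning of a given negative value is not spelled out
here, so this example assumes that ``-1`` refers to the job submitted
immediately before the referring job, ``-2`` to the one before that, and
so on. The function name ``resolve_relative_ids`` and its parameters are
hypothetical.

```python
def resolve_relative_ids(depends, earlier_job_ids):
    """Replace negative (relative) job IDs with absolute ones.

    ``earlier_job_ids`` holds the IDs already assigned to the jobs that
    precede the current one in the same submission, in submission order.
    Assumption: -1 means the immediately preceding job.
    """
    resolved = []
    for (job_id, statuses) in depends:
        if job_id < 0:
            if -job_id > len(earlier_job_ids):
                raise ValueError("relative job ID %d out of range" % job_id)
            # Negative indexing counts backwards from the latest job
            job_id = earlier_job_ids[job_id]
        resolved.append((job_id, statuses))
    return resolved
```

For a submission whose first two jobs received the IDs 6151 and 7687,
``resolve_relative_ids([(-1, ["success"])], [6151, 7687])`` would return
``[(7687, ["success"])]`` under this assumption.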

    
.. highlight:: javascript

    
Example data structures::

  # First job
  {
    "job_id": "6151",
    "ops": [
      { "OP_ID": "OP_INSTANCE_REPLACE_DISKS", ..., },
      { "OP_ID": "OP_INSTANCE_FAILOVER", ..., },
      ],
  }

  # Second job, runs in parallel with first job
  {
    "job_id": "7687",
    "ops": [
      { "OP_ID": "OP_INSTANCE_MIGRATE", ..., },
      ],
  }

  # Third job, depending on success of previous jobs
  {
    "job_id": "9218",
    "ops": [
      { "OP_ID": "OP_NODE_SET_PARAMS",
        "depend": [
          [6151, ["success"]],
          [7687, ["success"]],
          ],
        "offline": true, },
      ],
  }

    
Other discussed solutions
=========================

    
Job-level attribute
-------------------

    
At first glance it might seem better to put dependencies on previous
jobs at the job level. However, it turns out that having the option of
defining only a single opcode in a job as having such a dependency can
be useful as well. The resulting code complexity in the job queue is
comparable, if not lower.

    
Since opcodes are guaranteed to run in order, clients can just define
the dependency on the first opcode.
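
A client wanting job-level behaviour could therefore use a small helper
along these lines. This is only a sketch: the helper name
``add_job_dependency`` is hypothetical, and opcodes are represented as
plain dictionaries with a ``depend`` key purely for illustration.

```python
def add_job_dependency(ops, depends):
    """Return a copy of ``ops`` whose first opcode carries ``depends``.

    Because opcodes of a job run in order, a dependency on the first
    opcode effectively delays the whole job.
    """
    if not ops:
        raise ValueError("a job needs at least one opcode")
    # Copy the opcodes so the caller's list is left untouched
    first = dict(ops[0], depend=list(depends))
    return [first] + [dict(op) for op in ops[1:]]
```

Only the first opcode is modified; the remaining opcodes are copied
unchanged.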

    
Another reason for the choice of an opcode-level attribute is that the
current LUXI interface for submitting jobs is a bit restricted and would
need to be changed to allow the addition of job-level attributes,
potentially requiring changes in all LUXI clients and/or breaking
backwards compatibility.

    
Client-side logic
-----------------

    
There's at least one implementation of a batched job executor woven
into the ``burnin`` tool's code. While certainly possible, a client-side
solution should be avoided due to the different clients already in use.
For one, the :doc:`remote API <rapi>` client shouldn't import
non-standard modules. htools are written in Haskell and can't use Python
modules. A batched job executor contains quite a bit of logic. Even if
cleanly abstracted in a (Python) library, sharing code between different
clients is difficult if not impossible.

    
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: