============
Chained jobs
============

.. contents:: :depth: 4

This is a design document about the innards of Ganeti's job processing.
Readers are advised to study previous design documents on the topic:

- :ref:`Original job queue <jqueue-original-design>`
- :ref:`Job priorities <jqueue-job-priority-design>`
- :doc:`LU-generated jobs <design-lu-generated-jobs>`


Current state and shortcomings
==============================

Ever since the introduction of the job queue with Ganeti 2.0 there have
been situations where we wanted to run several jobs in a specific order.
Due to the job queue's current design, such a guarantee cannot be
given. Jobs are run according to their priority, their ability to
acquire all necessary locks and other factors.

One way to work around this limitation is to do some kind of job
grouping in the client code. Once all jobs of a group have finished, the
next group is submitted and waited for. However, there are different
kinds of clients for Ganeti, some of which don't share code (e.g. Python
clients vs. htools), so such logic would have to be duplicated in each
of them. This design therefore proposes a solution which would be
implemented as part of the job queue in the master daemon.


Proposed changes
================

With the implementation of :ref:`job priorities
<jqueue-job-priority-design>` the processing code was re-architected
and became a lot more versatile. It now returns jobs to the queue in
case the locks for an opcode can't be acquired, allowing other
jobs/opcodes to be run in the meantime.

The proposal is to add a new, optional property to opcodes to define
dependencies on other jobs. Job X could define opcodes with a dependency
on the success of job Y and would only be run once job Y is finished. If
there's a dependency on success and job Y failed, job X would fail as
well. Since such dependencies would use job IDs, the jobs still need to
be submitted in the right order.

.. pyassert::

   # Update description below if finalized job statuses change
   constants.JOBS_FINALIZED == frozenset([
     constants.JOB_STATUS_CANCELED,
     constants.JOB_STATUS_SUCCESS,
     constants.JOB_STATUS_ERROR,
     ])

The new attribute's value would be a list of two-valued tuples. Each
tuple contains a job ID and a list of requested statuses for the job
depended upon. Only final statuses are accepted
(:pyeval:`utils.CommaJoin(constants.JOBS_FINALIZED)`). An empty list is
equivalent to specifying all final statuses (except
:pyeval:`constants.JOB_STATUS_CANCELED`, which is treated specially).
An opcode runs only once all its dependency requirements have been
fulfilled.

Any job referring to a cancelled job is also cancelled unless it
explicitly lists :pyeval:`constants.JOB_STATUS_CANCELED` as a requested
status.

In case a referenced job cannot be found in the normal queue or the
archive, referring jobs fail as the status of the referenced job can't
be determined.

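The rules in the preceding paragraphs can be summarized in a short
Python sketch. It is an illustration only, not the job queue's actual
code: the function names, the outcome constants and the ``get_status``
callback are hypothetical, and the status strings are assumed values of
the corresponding ``constants.JOB_STATUS_*`` constants.

.. code-block:: python

   # Assumed values of the JOB_STATUS_* constants (illustrative only)
   JOB_STATUS_CANCELED = "canceled"
   JOB_STATUS_SUCCESS = "success"
   JOB_STATUS_ERROR = "error"
   JOBS_FINALIZED = frozenset([
       JOB_STATUS_CANCELED,
       JOB_STATUS_SUCCESS,
       JOB_STATUS_ERROR,
   ])

   # Hypothetical outcomes of checking a dependency
   CONTINUE = "continue"  # requirement fulfilled, opcode may run
   WAIT = "wait"          # referenced job has not reached a final status
   CANCEL = "cancel"      # referenced job was cancelled
   FAIL = "fail"          # requirement can no longer be fulfilled


   def check_dependency(get_status, dep_job_id, wanted):
       """Evaluate one (job ID, requested statuses) tuple.

       get_status is a hypothetical callable returning a job's status,
       or None if the job is in neither the queue nor the archive.
       """
       wanted = frozenset(wanted)
       if not wanted <= JOBS_FINALIZED:
           raise ValueError("Only final statuses can be requested")

       status = get_status(dep_job_id)
       if status is None:
           # Status of the referenced job can't be determined
           return FAIL
       if status not in JOBS_FINALIZED:
           return WAIT

       if not wanted:
           # Empty list: all final statuses except "canceled"
           wanted = JOBS_FINALIZED - {JOB_STATUS_CANCELED}

       if status in wanted:
           return CONTINUE
       if status == JOB_STATUS_CANCELED:
           # Cancellation propagates unless explicitly requested
           return CANCEL
       return FAIL


   def check_dependencies(get_status, depends):
       """Return the first outcome that prevents the opcode from running."""
       for dep_job_id, wanted in depends:
           outcome = check_dependency(get_status, dep_job_id, wanted)
           if outcome != CONTINUE:
               return outcome
       return CONTINUE

Returning the first outcome that is not ``CONTINUE`` is a choice made
for this sketch; the design only requires that all dependencies are
fulfilled before the opcode runs.
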
With this change, clients can submit all wanted jobs in the right order
and proceed to wait for changes on all these jobs (see
``cli.JobExecutor``). The master daemon will take care of executing them
in the right order, while still presenting the client with a simple
interface.

Clients using the ``SubmitManyJobs`` interface can use relative job IDs
(negative integers) to refer to jobs in the same submission.

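The following Python sketch shows how relative job IDs might be turned
into absolute ones while a batch is being submitted. The text above does
not define what each negative value refers to; the sketch assumes that
``-1`` is the job submitted immediately before the current one within
the same submission. ``resolve_relative_id``, ``resolve_batch`` and
``first_id`` are hypothetical names, not part of the LUXI interface.

.. code-block:: python

   def resolve_relative_id(submitted_ids, job_id):
       """Turn a relative (negative) job ID into an absolute one.

       submitted_ids holds the absolute IDs of the jobs already
       submitted in this batch, in submission order. Assumption: -1
       refers to the most recently submitted job of the batch.
       """
       if job_id >= 0:
           # Already an absolute job ID
           return job_id
       if -job_id > len(submitted_ids):
           raise ValueError("Relative job ID %d points outside the"
                            " submission" % job_id)
       return submitted_ids[job_id]


   def resolve_batch(jobs, first_id):
       """Assign IDs to a batch and resolve "depend" entries.

       jobs is a list of opcode lists (one list per job); first_id is
       the ID given to the first job, later jobs count up from there.
       """
       submitted_ids = []
       resolved = []
       for offset, ops in enumerate(jobs):
           new_ops = []
           for op in ops:
               op = dict(op)  # don't modify the caller's opcode
               if "depend" in op:
                   op["depend"] = [
                       [resolve_relative_id(submitted_ids, dep_id),
                        list(wanted)]
                       for dep_id, wanted in op["depend"]
                   ]
               new_ops.append(op)
           resolved.append(new_ops)
           submitted_ids.append(first_id + offset)
       return submitted_ids, resolved

Because only earlier jobs of the batch are known when an opcode is
resolved, a job in this sketch can refer neither to itself nor to jobs
submitted after it.
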
.. highlight:: javascript

Example data structures::

  # First job
  {
    "job_id": "6151",
    "ops": [
      { "OP_ID": "OP_INSTANCE_REPLACE_DISKS", ..., },
      { "OP_ID": "OP_INSTANCE_FAILOVER", ..., },
      ],
  }

  # Second job, runs in parallel with first job
  {
    "job_id": "7687",
    "ops": [
      { "OP_ID": "OP_INSTANCE_MIGRATE", ..., },
      ],
  }

  # Third job, depending on success of previous jobs
  {
    "job_id": "9218",
    "ops": [
      { "OP_ID": "OP_NODE_SET_PARAMS",
        "depend": [
          [6151, ["success"]],
          [7687, ["success"]],
          ],
        "offline": True, },
      ],
  }


Implementation details
----------------------

Status while waiting for dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Jobs waiting for dependencies are certainly not in the queue anymore and
therefore need to change their status from "queued". While waiting for
opcode locks the job is in the "waiting" status (the constant is named
``JOB_STATUS_WAITLOCK``, but the actual value is ``waiting``). There
are the following possibilities:

#. Introduce a new status, e.g. "waitdeps".

   Pro:

   - Clients know for sure a job is waiting for dependencies, not locks

   Con:

   - Code and tests would have to be updated/extended for the new status
   - List of possible state transitions certainly wouldn't get simpler
   - Breaks backwards compatibility, older clients might get confused

#. Use existing "waiting" status.

   Pro:

   - No client changes necessary, less code churn (note that there are
     clients which don't live in Ganeti core)
   - Clients don't need to know the difference between waiting for a job
     and waiting for a lock; it doesn't make a difference
   - Fewer state transitions (see commit ``5fd6b69479c0``, which removed
     many state transitions and disk writes)

   Con:

   - Not immediately visible what a job is waiting for, but it's the
     same issue with locks; this is the reason why the lock monitor
     (``gnt-debug locks``) was introduced; job dependencies can be shown
     as "locks" in the monitor

Based on these arguments, the proposal is to do the following:

- Rename ``JOB_STATUS_WAITLOCK`` constant to ``JOB_STATUS_WAITING`` to
  reflect its actual meaning: the job is waiting for something
- While waiting for dependencies and locks, jobs are in the "waiting"
  status
- Export dependency information in lock monitor; example output::

    Name      Mode Owner Pending
    job/27491 -    -     success:job/34709,job/21459
    job/21459 -    -     success,error:job/14513


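One way to produce the "Pending" column of the example output is to
group the waiting jobs by the statuses they requested. The Python sketch
below reads the first example row as "job/34709 and job/21459 both wait
for job/27491 to succeed"; that reading, the function name and the input
layout are assumptions made for illustration.

.. code-block:: python

   def format_pending(waiters):
       """Build lock monitor rows from dependency information.

       waiters maps a job ID to a list of (waiting job ID, requested
       statuses) pairs. Each returned row is a tuple of name, mode,
       owner and a list of pending entries.
       """
       rows = []
       for job_id, deps in waiters.items():
           # Group waiting jobs by the statuses they asked for
           by_status = {}
           for waiting_id, wanted in deps:
               key = ",".join(wanted)
               by_status.setdefault(key, []).append("job/%d" % waiting_id)

           pending = ["%s:%s" % (key, ",".join(names))
                      for key, names in by_status.items()]
           # Dependencies have neither a mode nor an owner
           rows.append(("job/%d" % job_id, "-", "-", pending))
       return rows

How several groups with different requested statuses would be separated
within one "Pending" cell is not specified here, which is why the sketch
returns them as a list.
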
Cost of deserialization
~~~~~~~~~~~~~~~~~~~~~~~

To determine the status of a dependency job the job queue must have
access to its data structure. Other queue operations already do this,
e.g. archiving, watching a job's progress and querying jobs.

Initially (Ganeti 2.0/2.1) the job queue shared the job objects
in memory and protected them using locks. Ganeti 2.2 (see :doc:`design
document <design-2.2>`) changed the queue to read and deserialize jobs
from disk. This significantly reduced locking and code complexity.
Nowadays inotify is used to wait for changes on job files when watching
a job's progress.

Reading from disk and deserializing certainly has some cost associated
with it, but it's a significantly simpler architecture than
synchronizing in memory with locks. At the stage where dependencies are
evaluated the queue lock is held in shared mode, so different workers
can read at the same time (deliberately ignoring CPython's interpreter
lock).

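To give a rough idea of the work involved, the Python sketch below reads
a job file and derives a status from it. Everything about the on-disk
layout is an assumption made for illustration: a JSON file named
``job-<id>`` containing an ``ops`` list whose entries carry a
``status`` field. The status calculation is deliberately simplified and
the function name is hypothetical.

.. code-block:: python

   import json
   import os


   def read_job_status(search_dirs, job_id):
       """Look up a job on disk and return a simplified status.

       search_dirs lists the directories to try in order, e.g. the
       queue directory followed by an archive directory. Returns None
       if no job file is found in any of them.
       """
       for directory in search_dirs:
           path = os.path.join(directory, "job-%d" % job_id)
           try:
               with open(path) as handle:
                   job = json.load(handle)  # the deserialization cost
           except FileNotFoundError:
               continue

           op_statuses = [op["status"] for op in job["ops"]]
           if "error" in op_statuses:
               return "error"
           if "canceled" in op_statuses:
               return "canceled"
           if all(status == "success" for status in op_statuses):
               return "success"
           # Anything else is treated as "not finalized yet"
           return "pending"
       return None

A ``None`` result corresponds to the case where the referenced job can
be found in neither the queue nor the archive.
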
It is expected that the majority of executed jobs won't use
dependencies and therefore won't be affected.


Other discussed solutions
=========================

Job-level attribute
-------------------

At first glance it might seem better to put dependencies on previous
jobs at the job level. However, it turns out that having the option of
defining only a single opcode in a job as having such a dependency can
be useful as well. The code complexity in the job queue is equivalent if
not simpler.

Since opcodes are guaranteed to run in order, clients can just define
the dependency on the first opcode to make the whole job wait.

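A client-side helper for this pattern could look like the following
Python sketch. ``depend_on`` is a hypothetical name, and opcodes are
represented as plain dictionaries in the style of the example data
structures rather than as real opcode objects.

.. code-block:: python

   def depend_on(ops, depends):
       """Return a copy of ops whose first opcode carries depends.

       ops is a job's list of opcode dictionaries; depends is a list of
       (job ID, requested statuses) pairs. As opcodes run in order, the
       remaining opcodes wait implicitly.
       """
       if not ops:
           raise ValueError("A job needs at least one opcode")

       first = dict(ops[0])
       first["depend"] = [[job_id, list(wanted)]
                          for job_id, wanted in depends]
       # Copy the other opcodes so the input is left untouched
       return [first] + [dict(op) for op in ops[1:]]
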
Another reason for the choice of an opcode-level attribute is that the
current LUXI interface for submitting jobs is a bit restricted and would
need to be changed to allow the addition of job-level attributes,
potentially requiring changes in all LUXI clients and/or breaking
backwards compatibility.


Client-side logic
-----------------

There's at least one implementation of a batched job executor embedded
in the ``burnin`` tool's code. While certainly possible, a client-side
solution should be avoided due to the different clients already in use.
For one, the :doc:`remote API <rapi>` client shouldn't import
non-standard modules. htools are written in Haskell and can't use Python
modules. A batched job executor contains a fair amount of logic. Even if
cleanly abstracted in a (Python) library, sharing code between different
clients is difficult if not impossible.


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: