root / doc / design-2.0-job-queue.rst @ f6bd6e98

Job Queue
=========

.. contents::

Overview
--------

In Ganeti 1.2, operations in a cluster have to be done in a serialized way.
Virtually any operation locks the whole cluster by grabbing the global lock,
so other commands cannot return until all earlier work has been done.

By implementing a job queue and granular locking, we can lower the latency of
command execution inside a Ganeti cluster.


Detailed Design
---------------
Job execution—“Life of a Ganeti job”
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. The client submits a job. A new job identifier is generated and assigned
   to the job. The job is then automatically replicated to all nodes in the
   cluster, and the identifier is returned to the client.
#. A pool of worker threads waits for new jobs. If a worker is idle, it picks
   up the new job immediately. If all workers are busy, the job waits until
   the first worker to finish its current work grabs it.
#. The client waits for job status updates by calling a waiting RPC function.
   Log messages may be shown to the user. Until the job is started, it can
   also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it is moved to a history directory. This
   could also be done regularly by a cron script.
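
The life cycle above can be sketched in a few lines of Python. This is a
minimal in-memory illustration, not Ganeti code: the class ``JobQueueSketch``,
its method names and the lowercase status strings are all hypothetical, and
replication to other nodes, cancellation and archiving are left out.

```python
import itertools
import queue
import threading


class JobQueueSketch:
    """Toy in-memory job queue served by a fixed pool of worker threads."""

    def __init__(self, num_workers=2):
        self._ids = itertools.count(1)   # source of new job identifiers
        self._pending = queue.Queue()    # jobs waiting for a free worker
        self._lock = threading.Lock()    # guards the two dicts below
        self.status = {}                 # job id -> status string
        self.result = {}                 # job id -> return value or error text
        for _ in range(num_workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, func):
        """Register a job and hand its identifier back to the caller."""
        with self._lock:
            job_id = next(self._ids)
            self.status[job_id] = "queued"
        self._pending.put((job_id, func))
        return job_id

    def _work(self):
        while True:
            # Blocks until a job is available. An idle worker picks it up
            # at once; otherwise the job waits in the queue.
            job_id, func = self._pending.get()
            with self._lock:
                self.status[job_id] = "running"
            try:
                outcome, final = func(), "success"
            except Exception as err:
                outcome, final = str(err), "error"
            with self._lock:
                self.result[job_id] = outcome
                self.status[job_id] = final
            self._pending.task_done()

    def wait_all(self):
        """Block until every submitted job has reached a final status."""
        self._pending.join()
```

In this sketch ``submit`` returns as soon as the identifier is assigned, so
the caller only blocks if it chooses to wait for the result.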


Queue structure
~~~~~~~~~~~~~~~

All file operations have to be done atomically by writing to a temporary file
and then renaming it over the target. Except for log messages, every change
to a job is stored and replicated to the other nodes.

::

  /var/lib/ganeti/queue/
    job-1 (JSON-encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)
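
The write-then-rename rule and the ``serial`` file can be illustrated as
follows. This is a sketch under assumptions: the helper names
``write_atomic``, ``new_job_id`` and ``store_job`` are hypothetical, and the
``serial`` file is assumed to hold a single decimal integer, which the text
above does not specify.

```python
import json
import os
import tempfile


def write_atomic(path, data):
    """Write text to path via a temporary file in the same directory."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # On POSIX the rename is atomic when source and target share a
        # filesystem, so readers see either the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def new_job_id(queue_dir):
    """Return the next job ID and record it in the serial file.

    The caller is expected to hold the queue lock; this function does
    no locking of its own.
    """
    serial_path = os.path.join(queue_dir, "serial")
    try:
        with open(serial_path) as handle:
            last = int(handle.read().strip())
    except FileNotFoundError:
        last = 0  # assumption: a missing serial file means an empty queue
    job_id = last + 1
    write_atomic(serial_path, "%d\n" % job_id)
    return job_id


def store_job(queue_dir, job_id, job):
    """Serialize a job as JSON into its job-<id> file."""
    write_atomic(os.path.join(queue_dir, "job-%d" % job_id), json.dumps(job))
```

Creating the temporary file in the target directory keeps both names on one
filesystem, which is what makes the final rename atomic.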


Locking
~~~~~~~

Locking in the job queue is a complicated topic. The queue is called from more
than one thread and must be thread-safe. For simplicity, a single lock is used
for the whole job queue.

A more detailed description can be found in doc/locking.txt.
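
One way to apply a single lock to the whole queue is a decorator that wraps
every public method. The sketch below only illustrates that idea; the names
``locked`` and ``LockedQueue`` are hypothetical and not taken from Ganeti.

```python
import functools
import threading


def locked(method):
    """Run the wrapped method while holding the object's single lock."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._lock:
            return method(self, *args, **kwargs)
    return wrapper


class LockedQueue:
    """Queue whose public methods all serialize on one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}
        self._last_id = 0

    @locked
    def add(self, ops):
        # Incrementing the counter and storing the job happen under the
        # same lock, so two threads can never receive the same ID.
        self._last_id += 1
        self._jobs[self._last_id] = list(ops)
        return self._last_id

    @locked
    def count(self):
        return len(self._jobs)
```

Because ``threading.Lock`` is not reentrant, decorated methods in this sketch
must not call each other. That restriction is part of the price of the
single-lock design.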


Internal RPC
~~~~~~~~~~~~

RPC calls available between Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.
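
On the node side, the three calls map onto plain file operations. The
following is a rough sketch, not the node daemon's code. The class
``NodeQueueHandlers``, the path check in ``_path`` and the choice to let
``jobqueue_purge`` remove every entry in the directory are assumptions made
for illustration.

```python
import os
import shutil
import tempfile


class NodeQueueHandlers:
    """File-level handlers for the three queue calls (illustrative only)."""

    def __init__(self, queue_dir):
        self.queue_dir = os.path.abspath(queue_dir)

    def _path(self, name):
        """Resolve name inside the queue directory, rejecting escapes."""
        path = os.path.abspath(os.path.join(self.queue_dir, name))
        if os.path.commonpath([self.queue_dir, path]) != self.queue_dir:
            raise ValueError("path outside queue directory: %r" % name)
        return path

    def jobqueue_update(self, file_name, content):
        path = self._path(file_name)
        # Write to a temporary file first, then rename it into place.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.replace(tmp, path)

    def jobqueue_purge(self):
        # Removes every entry, including an archive subdirectory.
        for entry in os.listdir(self.queue_dir):
            path = os.path.join(self.queue_dir, entry)
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.unlink(path)

    def jobqueue_rename(self, old, new):
        target = self._path(new)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.rename(self._path(old), target)
```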


Client RPC
~~~~~~~~~~

RPC between Ganeti clients and the Ganeti master daemon supports the following
operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The identifier is
  guaranteed to be unique during the lifetime of a cluster.
WaitForJobChange(job_id, fields, […], timeout)
  Waits until a job changes or a timeout expires. Whether a job counts as
  changed is determined by the fields passed and the last log message
  received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if the job
  is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if the
  job has not been canceled or finished.
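
The semantics of these operations can be tried out against an in-memory
stand-in for the master. Everything in this sketch is hypothetical:
``MasterStub`` is not a Ganeti class, ``set_status`` stands in for whatever a
worker does internally, and the ``previous`` argument of
``wait_for_job_change`` is a simplification that does not reproduce the
parameters elided in the signature above. Log messages are not modeled.

```python
import threading


class MasterStub:
    """In-memory stand-in for the master daemon (illustrative only)."""

    FINAL = ("canceled", "success", "error")

    def __init__(self):
        self._cond = threading.Condition()
        self._jobs = {}
        self._archive = {}
        self._serial = 0

    def submit_job(self, ops):
        with self._cond:
            self._serial += 1
            self._jobs[self._serial] = {
                "id": self._serial, "ops": list(ops), "status": "queued"}
            return self._serial

    def set_status(self, job_id, status):
        """Hypothetical worker hook; wakes up all waiting clients."""
        with self._cond:
            self._jobs[job_id]["status"] = status
            self._cond.notify_all()

    def query_jobs(self, job_ids, fields):
        with self._cond:
            return [[self._jobs[j][f] for f in fields] for j in job_ids]

    def wait_for_job_change(self, job_id, fields, previous, timeout):
        """Return the new field values, or None if the timeout expired."""
        def current():
            return [self._jobs[job_id][f] for f in fields]
        with self._cond:
            changed = self._cond.wait_for(
                lambda: current() != previous, timeout)
            return current() if changed else None

    def cancel_job(self, job_id):
        with self._cond:
            job = self._jobs[job_id]
            if job["status"] != "queued":
                return False  # already running, canceled or finished
            job["status"] = "canceled"
            self._cond.notify_all()
            return True

    def archive_job(self, job_id):
        with self._cond:
            if self._jobs[job_id]["status"] not in self.FINAL:
                return False  # only canceled or finished jobs may move
            self._archive[job_id] = self._jobs.pop(job_id)
            return True
```

A client built on this stub would call ``submit_job`` once and then loop on
``wait_for_job_change``, passing in the values it saw last, until the status
is final.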


Job and opcode status
~~~~~~~~~~~~~~~~~~~~~

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but has not yet started.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to the
Error status once the master has started again.
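
The states and the restart rule can be written down as a small state table.
This is one reading of the descriptions above, not code from Ganeti: the
enum values, the ``ALLOWED`` table and both function names are assumptions.

```python
import enum


class Status(enum.Enum):
    QUEUED = "queued"
    RUNNING = "running"
    CANCELED = "canceled"
    SUCCESS = "success"
    ERROR = "error"


# Transitions implied by the state descriptions: a job can only be
# canceled before it starts, and a running job ends in success or error.
ALLOWED = {
    Status.QUEUED: {Status.RUNNING, Status.CANCELED},
    Status.RUNNING: {Status.SUCCESS, Status.ERROR},
    Status.CANCELED: set(),
    Status.SUCCESS: set(),
    Status.ERROR: set(),
}


def transition(current, new):
    """Return the new status, or raise if the change is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError("cannot go from %s to %s" % (current, new))
    return new


def recover_after_restart(jobs):
    """Mark jobs that were running when the master stopped as errors."""
    return {
        job_id: Status.ERROR if status is Status.RUNNING else status
        for job_id, status in jobs.items()
    }
```

Queued jobs are left untouched by ``recover_after_restart`` in this sketch,
since the text only specifies what happens to jobs that were running.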


History
~~~~~~~

Archived jobs are kept in a separate directory, /var/lib/ganeti/queue/archive/.
Keeping them out of the main queue directory is meant to speed up queue
handling.


Ganeti updates
~~~~~~~~~~~~~~

The queue has to be completely empty before any Ganeti update that changes the
job queue structure.