Ganeti 2.0 Master daemon
========================

.. contents::

Objective
---------

Many of the important features of Ganeti 2.0 (the job queue, granular
locking, the external API, etc.) will be integrated via a master
daemon. While not absolutely necessary, it is the best way to
integrate all these components.

Background
----------

Currently there is no "master" daemon in Ganeti (1.2). Each command
tries to acquire the so-called *cmd* lock and, when it succeeds,
takes complete ownership of the cluster configuration and state. The
improvements scheduled for Ganeti either require or can use a daemon
that coordinates activities, scheduled jobs, and so on.

Overview
--------

The master daemon will be the central point of the cluster; command
line tools and the external API will interact with the cluster via
this daemon, and it will be the one coordinating the node daemons.

This design doc is best read in the context of the accompanying design
docs for Ganeti 2.0: Granular locking design and Job queue design.


Detailed Design
---------------

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on master node)
- the node daemon (all nodes)
- the command line tools (master node)
- the RAPI daemon (master node)

Interaction paths are between:

- (CLI tools/RAPI daemon) and the master daemon, via the so-called *luxi* API
- the master daemon and the node daemons, via the node RPC

The protocol between the master daemon and the node daemons will be
changed to HTTP(S), using a simple PUT/GET of JSON-encoded
messages. This is done due to difficulties in working with the Twisted
framework and its protocols in a multithreaded environment, which we can
overcome by using a simpler stack (see the Caveats section). The protocol
between the CLI/RAPI and the master daemon will be a custom one: on a UNIX
socket on the master node, with rights restricted by filesystem
permissions, the CLI/RAPI will talk to the master daemon using JSON-encoded
messages.

The operations supported over this internal protocol will be encoded
via a Python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.
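
To make the encode/decode step concrete, here is a minimal sketch of
JSON-encoded messages exchanged over a UNIX socket pair. The
newline-terminated framing, the ``method``/``args`` fields, and all
function names are assumptions made for this illustration only; the
design does not specify the wire format at this level.

```python
import json
import socket


def send_message(sock, message):
    """Serialize a dict to JSON and send it as one newline-terminated line.

    The newline framing is an assumption of this sketch; the design only
    states that messages are JSON-encoded.
    """
    sock.sendall(json.dumps(message).encode("utf-8") + b"\n")


def recv_message(sock):
    """Read bytes until a newline arrives, then decode the JSON payload."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return json.loads(buf.decode("utf-8"))


def roundtrip(request):
    """Send one request from a 'client' end to a 'daemon' end and back."""
    # socketpair() stands in for a client connected to the daemon's socket.
    client, daemon = socket.socketpair()
    try:
        send_message(client, request)
        received = recv_message(daemon)
        # Hypothetical reply shape, chosen only for the example.
        send_message(daemon, {"success": True, "echo": received["method"]})
        return received, recv_message(client)
    finally:
        client.close()
        daemon.close()
```

Calling ``roundtrip({"method": "QueryInstances", "args": ["name"]})``
returns the request as seen by the daemon side and the reply as seen by
the client side; the method name here is a placeholder, not a documented
call.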

The LUXI protocol
~~~~~~~~~~~~~~~~~

We will have two main classes of operations over the master daemon API:

- cluster query functions
- job-related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (internally,
they are still implemented with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized in the query-functions)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details, see the job queue design document.
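
The four job-related functions, and in particular waiting without
polling, can be sketched with a condition variable. This is an
in-memory illustration under stated assumptions: the class name, method
names, and status strings are hypothetical and are not taken from the
job queue design.

```python
import threading


class JobRegistry:
    """In-memory stand-in for the job-related calls (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._jobs = {}       # job id -> {"ops": ..., "status": ...}
        self._archive = {}    # archived jobs, no longer active
        self._next_id = 1

    def submit_job(self, ops):
        with self._cond:
            job_id = self._next_id
            self._next_id += 1
            self._jobs[job_id] = {"ops": ops, "status": "queued"}
            return job_id

    def query_job(self, job_id):
        with self._cond:
            return self._jobs[job_id]["status"]

    def set_status(self, job_id, status):
        """Called by whoever processes the job; wakes up all waiters."""
        with self._cond:
            self._jobs[job_id]["status"] = status
            self._cond.notify_all()

    def archive_job(self, job_id):
        with self._cond:
            self._archive[job_id] = self._jobs.pop(job_id)
            self._cond.notify_all()

    def wait_for_job_change(self, job_id, last_status, timeout=None):
        """Block until the status differs from last_status.

        Returns the new status, "archived" if the job was archived in the
        meantime, or None if the timeout expired first.
        """
        def changed():
            job = self._jobs.get(job_id)
            return job is None or job["status"] != last_status

        with self._cond:
            if not self._cond.wait_for(changed, timeout):
                return None
            job = self._jobs.get(job_id)
            return job["status"] if job is not None else "archived"
```

The waiting client sleeps inside ``wait_for`` and is woken only when a
status actually changes, which is the property the "wait for job
change" call is meant to provide.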

Daemon implementation
~~~~~~~~~~~~~~~~~~~~~

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other threads (and thread pools).

There will be two other classes of threads in the daemon:

- job processing threads, part of a thread pool, and which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  to the clients
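
A pool of long-lived job processing threads, started once and stopped
only at shutdown, might look like the following sketch. The class name,
the ``handler`` callback, and the use of a ``None`` sentinel to stop
workers are assumptions for illustration, not the planned
implementation.

```python
import queue
import threading


class JobWorkerPool:
    """Fixed set of worker threads that live for the daemon's lifetime."""

    def __init__(self, num_workers, handler):
        self._queue = queue.Queue()
        self._handler = handler
        self._threads = [
            threading.Thread(target=self._run, name="job-worker-%d" % i)
            for i in range(num_workers)
        ]
        # Workers are started once, at "daemon startup".
        for thread in self._threads:
            thread.start()

    def _run(self):
        while True:
            job = self._queue.get()
            if job is None:
                # Sentinel queued by shutdown(): leave the loop.
                break
            # A real daemon would need error handling around this call.
            self._handler(job)

    def submit(self, job):
        self._queue.put(job)

    def shutdown(self):
        """Queue one sentinel per worker, then wait for all of them."""
        for _ in self._threads:
            self._queue.put(None)
        for thread in self._threads:
            thread.join()
```

Because the sentinels are queued behind any pending jobs, every job
submitted before ``shutdown()`` is handed to a worker before the pool
stops.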

Master startup/failover
~~~~~~~~~~~~~~~~~~~~~~~

In Ganeti 1.x there is no protection against failing over the master
to a node with stale configuration. In effect, the responsibility of
correct failovers falls on the admin. This is true both for the new
master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will run whenever a node is ready to
transition to the master role, either at startup time or at node
failover:

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one of the nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      highest job ID on the job queue)

    - there is not even a single node having a newer
      configuration file

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)
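
The quorum check in step 2 can be sketched as a pure function. The
function name, the shape of the ``votes`` tuples, and the choice to
count the candidate's own answer among the votes are all assumptions of
this sketch; the design only fixes the three conditions and the "half
plus one" threshold.

```python
def may_become_master(my_serial, my_last_job_id, my_name, votes,
                      total_nodes, failing_over):
    """Decide whether this node may take the master role.

    votes is a list of (config_serial, highest_job_id, designated_master)
    tuples, one per node that answered the query (hypothetical format).
    """
    quorum = total_nodes // 2 + 1  # "at least half plus one" of the nodes

    # Not even a single answering node may hold a newer configuration.
    if any(serial > my_serial for serial, _, _ in votes):
        return False

    agreeing = 0
    for serial, job_id, master in votes:
        # The node agrees that our configuration and job list are the latest.
        up_to_date = serial <= my_serial and job_id <= my_last_job_id
        # On plain startup the node must also name us as designated master.
        master_ok = failing_over or master == my_name
        if up_to_date and master_ok:
            agreeing += 1

    return agreeing >= quorum
```

For a five-node cluster the threshold is three: three agreeing answers
allow the transition, two do not, and a single answer with a higher
configuration serial blocks it regardless of the count.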


Logging
~~~~~~~

The logging system will be switched completely to the logging module;
currently it's logging-based, but exposes a different API, which is
just overhead. As such, the code will be switched over to standard
logging calls, and only the setup will be custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one logfile per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools
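
A custom setup step with standard logging calls everywhere else could
be as small as the sketch below. The function name, the ``log_dir``
parameter, and the format string are assumptions for illustration; only
the one-file-per-daemon naming comes from the list above.

```python
import logging
import os


def setup_daemon_logging(daemon_name, log_dir, debug=False):
    """Attach a single file handler writing to <log_dir>/<daemon_name>.log."""
    logger = logging.getLogger(daemon_name)
    # One file for all levels; the threshold decides what is recorded.
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    handler = logging.FileHandler(
        os.path.join(log_dir, daemon_name + ".log"))
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

After ``log = setup_daemon_logging("master-daemon", some_dir)``, the
rest of the code only uses standard calls such as ``log.info(...)`` and
``log.error(...)``, and everything lands in ``master-daemon.log``.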

Since the watcher will only submit jobs to the master for startup of
the instances, its log file will contain less information than before:
mainly that it will start the instance, but not the results.

Caveats
-------

A discussed alternative is to keep the current individual processes
touching the cluster configuration model. The reasons we have not
chosen this approach are:

- the time needed to read and deserialize the cluster state is
  currently not small enough to ignore; the addition of
  the job queue will make the startup cost even higher. While this
  runtime cost is low in absolute terms, it can be on the order of a
  few seconds on bigger clusters, which for very quick commands is
  comparable to the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines of "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized'; we were able
  to keep the 1.x code clean by hacking around Twisted in an
  unsupported, unrecommended way, and the only alternative would have
  been to make all the code be written for Twisted
- it has some weaknesses in working with multiple threads, since its base
  model is designed to replace thread usage by using deferred calls, so while
  it can use threads, it's less flexible in doing so

And, since we already have an HTTP server library for the RAPI, we
can just reuse that for inter-node communication.