==========================
Ganeti daemons refactoring
==========================

.. contents:: :depth: 2

This is a design document detailing the plan for refactoring the internal
structure of Ganeti, and particularly the set of daemons it is divided into.


Current state and shortcomings
==============================

Ganeti is comprised of a growing number of daemons, each dealing with part of
the tasks the cluster has to face, and communicating with the other daemons
using a variety of protocols.

Specifically, as of Ganeti 2.8, the situation is as follows:

``Master daemon (MasterD)``
  It is responsible for managing the entire cluster, and it is written in
  Python. It is executed on a single node (the master node). It receives the
  commands given by the cluster administrator (through the remote API daemon
  or the command line tools) over the LUXI protocol. The master daemon is
  responsible for creating and managing the jobs that will execute such
  commands, and for managing the locks that prevent the cluster from running
  into race conditions.

  Each job is managed by a separate Python thread, which interacts with the
  node daemons via RPC calls.

  The master daemon is also responsible for managing the configuration of the
  cluster, changing it when required by some job. It is also responsible for
  copying the configuration to the other master candidates after updating it.

``RAPI daemon (RapiD)``
  It is written in Python and runs on the master node only. It waits for
  requests issued remotely through the remote API protocol. Then, it forwards
  them, using the LUXI protocol, to the master daemon (if they are commands)
  or to the query daemon if they are queries about the configuration
  (including live status) of the cluster.

``Node daemon (NodeD)``
  It is written in Python. It runs on all the nodes. It is responsible for
  receiving the master's requests over RPC and executing them, using the
  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives
  requests over RPC for the execution of queries gathering live data on
  behalf of the query daemon.

``Configuration daemon (ConfD)``
  It is written in Haskell. It runs on all the master candidates. Since the
  configuration is replicated only on the master node, this daemon exists in
  order to provide information about the configuration to nodes needing it.
  The requests are made through ConfD's own protocol, HMAC signed,
  implemented over UDP, and meant to be used by querying all the master
  candidates (or a subset thereof) in parallel and keeping the most
  up-to-date answer. This is meant as a way to provide a robust service even
  in case the master node is temporarily unavailable.

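To illustrate this query pattern, the sketch below fans a request out to all
master candidates over UDP, checks the HMAC signature of each reply and keeps
the answer carrying the highest serial number; the payload layout, key
handling and port used here are simplified assumptions, not the actual ConfD
wire format::

  # Simplified sketch of a ConfD-style lookup: fan the query out over
  # UDP to all master candidates, verify the HMAC signature of each
  # reply and keep the answer with the highest serial number.  The
  # payload layout, key handling and port are illustrative only.
  import hashlib
  import hmac
  import json
  import select
  import socket
  import time

  HMAC_KEY = b"cluster-hmac-key"      # assumption: shared cluster key
  CONFD_PORT = 1814                   # assumption: ConfD UDP port

  def sign(payload):
      body = json.dumps(payload)
      mac = hmac.new(HMAC_KEY, body.encode("utf-8"), hashlib.sha1)
      return json.dumps({"msg": body,
                         "hmac": mac.hexdigest()}).encode("utf-8")

  def query_candidates(candidates, payload, timeout=2.0):
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.setblocking(False)
      request = sign(payload)
      for host in candidates:
          sock.sendto(request, (host, CONFD_PORT))
      best = None
      deadline = time.time() + timeout
      while time.time() < deadline:
          wait = max(0.0, deadline - time.time())
          ready, _, _ = select.select([sock], [], [], wait)
          if not ready:
              break
          data, _ = sock.recvfrom(4096)
          wrapper = json.loads(data.decode("utf-8"))
          mac = hmac.new(HMAC_KEY, wrapper["msg"].encode("utf-8"),
                         hashlib.sha1)
          if not hmac.compare_digest(mac.hexdigest(), wrapper["hmac"]):
              continue                # drop replies with a bad signature
          answer = json.loads(wrapper["msg"])
          # assumption: every reply carries the config serial it was built on
          if best is None or answer["serial"] > best["serial"]:
              best = answer           # keep the most up-to-date reply
      return best
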
``Query daemon (QueryD)``
  It is written in Haskell. It runs on all the master candidates. It replies
  to LUXI queries about the current status of the system, including live data
  it obtains by querying the node daemons through RPCs.

``Monitoring daemon (MonD)``
  It is written in Haskell. It runs on all nodes, including the ones that are
  not vm-capable. It is meant to provide information on the status of the
  system. Such information is related only to the specific node the daemon is
  running on, and it is provided as JSON encoded data over HTTP, to be easily
  readable by external tools.

  The monitoring daemon communicates with ConfD to get information about the
  configuration of the cluster. The choice of communicating with ConfD
  instead of MasterD allows it to obtain configuration information even when
  the cluster is heavily degraded (e.g.: when the master and some, but not
  all, of the master candidates are unreachable).

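As an illustration, an external tool could consume these reports roughly as
follows; the port, URL path and report fields used here are illustrative
assumptions rather than a guaranteed interface::

  # Minimal sketch of an external tool reading MonD's JSON output over
  # HTTP.  Port, URL path and report fields are illustrative only.
  import json
  from urllib.request import urlopen

  MOND_PORT = 1815                    # assumed MonD listening port
  NODE = "node1.example.com"          # any node of the cluster

  url = "http://%s:%d/1/report/all" % (NODE, MOND_PORT)
  with urlopen(url) as response:
      reports = json.load(response)

  for report in reports:
      # assumed field: each data collector reports at least its own name
      print(report.get("name"))
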
The current structure of the Ganeti daemons is inefficient because there are
many different protocols involved, and each daemon needs to be able to use
multiple ones, and has to deal with a number of different tasks, which
sometimes makes it unclear which daemon is responsible for performing a
specific task.

Also, with the current structure, jobs are managed by the master daemon using
Python threads. This makes terminating a job after it has started a difficult
operation, and it is the main reason why this is not possible yet.

The master daemon currently has too many different tasks, which could be
handled better if split among different daemons.


Proposed changes
================

In order to improve on the current situation, a new daemon subdivision is
proposed, and presented hereafter.

.. digraph:: "new-daemons-structure"

  {rank=same; RConfD LuxiD;}
  {rank=same; Jobs rconfigdata;}
  node [shape=box]
  RapiD [label="RapiD [M]"]
  LuxiD [label="LuxiD [M]"]
  WConfD [label="WConfD [M]"]
  Jobs [label="Jobs [M]"]
  RConfD [label="RConfD [MC]"]
  MonD [label="MonD [All]"]
  NodeD [label="NodeD [All]"]
  Clients [label="gnt-*\nclients [M]"]
  p1 [shape=none, label=""]
  p2 [shape=none, label=""]
  p3 [shape=none, label=""]
  p4 [shape=none, label=""]
  configdata [shape=none, label="config.data"]
  rconfigdata [shape=none, label="config.data\n[MC copy]"]
  locksdata [shape=none, label="locks.data"]

  RapiD -> LuxiD [label="LUXI"]
  LuxiD -> WConfD [label="WConfD\nproto"]
  LuxiD -> Jobs [label="fork/exec"]
  Jobs -> WConfD [label="WConfD\nproto"]
  Jobs -> NodeD [label="RPC"]
  LuxiD -> NodeD [label="RPC"]
  rconfigdata -> RConfD
  configdata -> rconfigdata [label="sync via\nNodeD RPC"]
  WConfD -> NodeD [label="RPC"]
  WConfD -> configdata
  WConfD -> locksdata
  MonD -> RConfD [label="RConfD\nproto"]
  Clients -> LuxiD [label="LUXI"]
  p1 -> MonD [label="MonD proto"]
  p2 -> RapiD [label="RAPI"]
  p3 -> RConfD [label="RConfD\nproto"]
  p4 -> Clients [label="CLI"]

``LUXI daemon (LuxiD)``
  It will be written in Haskell. It will run on the master node and it will
  be the only LUXI server, replying to all the LUXI queries. These include
  both the queries about the live configuration of the cluster, previously
  served by QueryD, and the commands actually changing the status of the
  cluster by submitting jobs. Therefore, this daemon will also be the one
  responsible for managing the job queue. When a job needs to be executed,
  LuxiD will spawn a separate process tasked with the execution of that
  specific job, thus making it easier to terminate the job itself, if needed.
  When a job requires locks, LuxiD will request them from WConfD.

  In order to keep the cluster available in case of failure of the master
  node, LuxiD will replicate the job queue to the other master candidates,
  by RPCs to the NodeD running there (the choice of RPCs for this task might
  be reviewed at a later stage, after implementing this design).

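The spawn-and-terminate pattern could look roughly like the sketch below; the
job runner executable and its command line are hypothetical placeholders::

  # Sketch of per-job processes: LuxiD spawns one process per job, so a
  # single job can later be terminated without touching the others.  The
  # job runner executable and its arguments are hypothetical.
  import signal
  import subprocess

  def start_job(job_id, queue_dir):
      return subprocess.Popen(["ganeti-job-runner",
                               "--job-id", str(job_id),
                               "--queue-dir", queue_dir])

  def cancel_job(proc, grace=30):
      proc.send_signal(signal.SIGTERM)    # ask the job to clean up
      try:
          proc.wait(timeout=grace)
      except subprocess.TimeoutExpired:
          proc.kill()                     # force termination if it hangs
          proc.wait()
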
``Configuration management daemon (WConfD)``
  It will run on the master node and it will be responsible for the
  management of the authoritative copy of the cluster configuration (that
  is, it will be the daemon actually modifying the ``config.data`` file).
  All requests for configuration changes will have to pass through this
  daemon, and will be performed using a LUXI-like protocol ("WConfD proto"
  in the graph; the exact protocol will be defined in the separate design
  document that will detail the WConfD separation). Having a single point of
  configuration management will also allow Ganeti to get rid of possible
  race conditions due to concurrent modifications of the configuration.
  When the configuration is updated, it will have to push the received
  changes to the other master candidates, via RPCs, so that RConfD daemons
  and (in case of a failure on the master node) the WConfD daemon on the new
  master can access an up-to-date version of it (the choice of RPCs for this
  task might be reviewed at a later stage). This daemon will also be the one
  responsible for managing the locks, granting them to the jobs requesting
  them, and taking care of freeing them up if the jobs holding them crash or
  are terminated before releasing them. In order to do this, each job, after
  being spawned by LuxiD, will open a local Unix socket that will be used to
  communicate with it, and that will be destroyed when the job terminates.
  LuxiD will be able to check, after a timeout, whether the job is still
  running by connecting to that socket, and to ask WConfD to forcefully
  remove the locks if the socket is closed.

  Also, WConfD should hold a serialized list of the locks and their owners
  in a file (``locks.data``), so that it can keep track of their status in
  case it crashes and needs to be restarted (by asking LuxiD which of them
  are still running).

  Interaction with this daemon will be performed using Unix sockets.

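A minimal sketch of that liveness check, with a hypothetical socket path and
a placeholder call into WConfD, could look like this::

  # Sketch of the liveness check: LuxiD connects to the job's per-job
  # Unix socket; if the connection fails, the job is assumed to be dead
  # and WConfD can be asked to drop its locks.  The socket path and the
  # WConfD client call are hypothetical placeholders.
  import socket

  def job_is_alive(job_socket_path, timeout=5.0):
      sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      sock.settimeout(timeout)
      try:
          sock.connect(job_socket_path)
          return True
      except (OSError, socket.timeout):
          return False
      finally:
          sock.close()

  def reap_job_if_dead(job_id, job_socket_path, wconfd_client):
      if not job_is_alive(job_socket_path):
          # hypothetical call asking WConfD to forcefully release the locks
          wconfd_client.drop_locks(job_id)
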
``Configuration query daemon (RConfD)``
  It is written in Haskell, and it corresponds to the old ConfD. It will run
  on all the master candidates and it will serve information about the
  static configuration of the cluster (the one contained in
  ``config.data``). The provided information will be highly available (as
  in: a response will be available as long as a stable-enough connection
  between the client and at least one working master candidate is available)
  and its freshness will be best effort (the most recent reply from any of
  the master candidates will be returned, but it might still be older than
  the one available through WConfD). The information will be served through
  the ConfD protocol.

``RAPI daemon (RapiD)``
  It remains basically unchanged, with the only difference that all of its
  LUXI queries are directed towards LuxiD instead of being split between
  MasterD and QueryD.

``Monitoring daemon (MonD)``
  It remains unaffected by the changes in this design document. It will just
  get some of the data it needs from RConfD instead of the old ConfD, but the
  interfaces of the two are identical.

``Node daemon (NodeD)``
  It remains unaffected by the changes proposed in this design document. The
  only difference is that it will receive its RPCs from LuxiD (for job queue
  replication), from WConfD (for configuration replication) and from the
  processes executing single jobs (for all the operations to be performed by
  nodes) instead of receiving them just from MasterD.

This restructuring will allow us to reorganize and improve the codebase,
introducing cleaner interfaces and giving well defined and more restricted
tasks to each daemon.

Furthermore, having better-defined interfaces will make upgrade procedures
easier, and will let us work towards the possibility of upgrading single
components of a cluster one at a time, without the need to immediately
upgrade the entire cluster in a single step.


Implementation
==============

While performing this refactoring, we aim to increase the amount of Haskell
code, thus benefiting from the additional type safety provided by its
extensive compile-time checks. In particular, all the job queue management
and the configuration management daemon will be written in Haskell, taking
over the role currently fulfilled by Python code executed as part of MasterD.

The changes described by this design document are quite extensive, therefore
they will not be implemented all at the same time, but through a sequence of
steps, each leaving the codebase in a consistent and usable state.

#. Rename QueryD to LuxiD.
   A part of LuxiD, the one replying to configuration queries including live
   information about the system, already exists in the form of QueryD. This
   is being renamed to LuxiD, and will form the first part of the new daemon.
   NB: this is happening starting from Ganeti 2.8. At the beginning, only the
   already existing queries will be replied to by LuxiD. More queries will be
   implemented in the next versions.

#. Let LuxiD be the interface for the queries and MasterD be their executor.
   Currently, MasterD is the only component responsible for receiving and
   executing LUXI queries, and for managing the jobs they create.
   Receiving the queries and managing the job queue will be extracted from
   MasterD into LuxiD.
   Actually executing jobs will still be done by MasterD, which contains all
   the logic for doing that and for properly managing locks and the
   configuration. A separate design document will detail how the system will
   decide which jobs to send over for execution, and how to rate-limit them.

#. Extract WConfD from MasterD.
   The logic for managing the configuration file is factored out to the
   dedicated WConfD daemon. All configuration changes, currently executed
   directly by MasterD, will be changed to be IPC requests sent to the new
   daemon (a rough sketch of such an exchange is given after this list).

#. Extract locking management from MasterD.
   The logic for managing and granting locks is extracted to WConfD as well.
   Locks will not be taken directly anymore, but asked via IPC to WConfD.
   This step can be executed on its own or at the same time as the previous
   one.

#. Jobs are executed as processes.
   The logic for running jobs is rewritten so that each job can be managed by
   an independent process. LuxiD will spawn a new (Python) process for every
   single job. The RPCs will remain unchanged, and the LU code will stay as
   is as much as possible.
   MasterD will cease to exist as a daemon on its own at this point, but not
   before.

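As referenced in the third step above, an IPC request to WConfD could look
roughly like the following sketch, assuming a LUXI-like JSON encoding over a
Unix socket; the socket path, method names and newline framing are
placeholders to be fixed by the WConfD design document::

  # Rough sketch of a LUXI-like IPC call to WConfD over a Unix socket.
  # The socket path, method name and newline framing are assumptions;
  # the real protocol will be defined in the WConfD design document.
  import json
  import socket

  WCONFD_SOCKET = "/var/run/ganeti/wconfd.sock"   # placeholder path

  def wconfd_call(method, args):
      sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      sock.connect(WCONFD_SOCKET)
      try:
          sock.sendall(json.dumps({"method": method,
                                   "args": args}).encode("utf-8") + b"\n")
          reply = b""
          while not reply.endswith(b"\n"):
              chunk = sock.recv(4096)
              if not chunk:
                  break
              reply += chunk
          return json.loads(reply.decode("utf-8"))
      finally:
          sock.close()

  # Illustrative use only: a job asking WConfD for a set of locks.
  # wconfd_call("LockAcquire", {"job_id": 1234, "locks": ["cluster"]})
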
Further considerations
======================

There is a possibility that a job will finish performing its task while LuxiD
and/or WConfD are not available.
In order to deal with this situation, each job will write the results of its
execution to a file. The name of this file will be known to LuxiD before
starting the job, and will be stored together with the job ID and the name
of the job-unique socket.

The job, upon ending its execution, will signal LuxiD (through the socket),
so that it can read the result of the execution and release the locks as
needed.

In case LuxiD is not available at that time, the job will just terminate
without signalling it, writing the results to the file as usual. When a new
LuxiD becomes available, it will have the most up-to-date list of running
jobs (received via replication from the former LuxiD), and will go through
it, cleaning up all the terminated jobs.

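The job-side finalization could look roughly like the sketch below, assuming
the result is published atomically before the best-effort notification to
LuxiD; file layout, socket handling and the notification message are
illustrative assumptions::

  # Sketch of job finalization: persist the result first, then try to
  # notify LuxiD over the job's socket; if LuxiD is down the notification
  # is skipped and the result file is picked up later.  File layout,
  # socket handling and the notification message are illustrative.
  import json
  import os
  import socket

  def finalize_job(result_path, luxid_notify_path, result):
      tmp_path = result_path + ".tmp"
      with open(tmp_path, "w") as out:
          json.dump(result, out)
      os.rename(tmp_path, result_path)    # atomically publish the result

      try:
          sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
          sock.connect(luxid_notify_path)
          sock.sendall(b"done\n")         # placeholder notification
          sock.close()
      except OSError:
          # LuxiD unavailable: the result file will be found on restart
          pass
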
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: