==========================
Ganeti daemons refactoring
==========================

.. contents:: :depth: 2

This is a design document detailing the plan for refactoring the internal
structure of Ganeti, and particularly the set of daemons it is divided into.


Current state and shortcomings
==============================

Ganeti is composed of a growing number of daemons, each dealing with part of
the tasks the cluster has to face, and communicating with the other daemons
using a variety of protocols.

Specifically, as of Ganeti 2.8, the situation is as follows:

``Master daemon (MasterD)``
  It is responsible for managing the entire cluster, and it is written in
  Python. It is executed on a single node (the master node). It receives the
  commands given by the cluster administrator (through the remote API daemon
  or the command line tools) over the LUXI protocol. The master daemon is
  responsible for creating and managing the jobs that will execute such
  commands, and for managing the locks that ensure the cluster will not run
  into race conditions.

  Each job is managed by a separate Python thread that interacts with the node
  daemons via RPC calls.

  The master daemon is also responsible for managing the configuration of the
  cluster, changing it when required by some job. It is also responsible for
  copying the configuration to the other master candidates after updating it.

``RAPI daemon (RapiD)``
  It is written in Python and runs on the master node only. It waits for
  requests issued remotely through the remote API protocol. Then, it forwards
  them, using the LUXI protocol, to the master daemon (if they are commands) or
  to the query daemon if they are queries about the configuration (including
  live status) of the cluster.

``Node daemon (NodeD)``
  It is written in Python. It runs on all the nodes. It is responsible for
  receiving the master's requests over RPC and executing them, using the
  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives requests
  over RPC for the execution of queries gathering live data on behalf of the
  query daemon.

``Configuration daemon (ConfD)``
  It is written in Haskell. It runs on all the master candidates. Since the
  configuration is replicated only on the master candidates, this daemon exists
  in order to provide information about the configuration to the nodes needing
  it. The requests are made through ConfD's own protocol, which is HMAC-signed
  and implemented over UDP. It is meant to be used by querying all the master
  candidates (or a subset thereof) in parallel and taking the most up-to-date
  answer. This is meant as a way to provide a robust service even in case the
  master is temporarily unavailable.

``Query daemon (QueryD)``
  It is written in Haskell. It runs on all the master candidates. It replies
  to LUXI queries about the current status of the system, including live data it
  obtains by querying the node daemons through RPCs.

``Monitoring daemon (MonD)``
  It is written in Haskell. It runs on all nodes, including the ones that are
  not vm-capable. It is meant to provide information on the status of the
  system. Such information is related only to the specific node the daemon is
  running on, and it is provided as JSON-encoded data over HTTP, to be easily
  readable by external tools.
  The monitoring daemon communicates with ConfD to get information about the
  configuration of the cluster. The choice of communicating with ConfD instead
  of MasterD allows it to obtain configuration information even when the cluster
  is heavily degraded (e.g. when the master and some, but not all, of the master
  candidates are unreachable).

The current structure of the Ganeti daemons is inefficient because many
different protocols are involved, and each daemon needs to be able to use
several of them. Each daemon also has to deal with several different things,
which sometimes makes it unclear which daemon is responsible for performing a
specific task.

Also, with the current configuration, jobs are managed by the master daemon
using Python threads. This makes terminating a job after it has started a
difficult operation, and it is the main reason why this is not possible yet.

The master daemon currently has too many different tasks that could be handled
better if split among different daemons.


Proposed changes
================

In order to improve on the current situation, a new daemon subdivision is
proposed, and presented hereafter.

.. digraph:: "new-daemons-structure"

  {rank=same; RConfD LuxiD;}
  {rank=same; Jobs rconfigdata;}
  node [shape=box]
  RapiD [label="RapiD [M]"]
  LuxiD [label="LuxiD [M]"]
  WConfD [label="WConfD [M]"]
  Jobs [label="Jobs [M]"]
  RConfD [label="RConfD [MC]"]
  MonD [label="MonD [All]"]
  NodeD [label="NodeD [All]"]
  Clients [label="gnt-*\nclients [M]"]
  p1 [shape=none, label=""]
  p2 [shape=none, label=""]
  p3 [shape=none, label=""]
  p4 [shape=none, label=""]
  configdata [shape=none, label="config.data"]
  rconfigdata [shape=none, label="config.data\n[MC copy]"]
  locksdata [shape=none, label="locks.data"]

  RapiD -> LuxiD [label="LUXI"]
  LuxiD -> WConfD [label="WConfD\nproto"]
  LuxiD -> Jobs [label="fork/exec"]
  Jobs -> WConfD [label="WConfD\nproto"]
  Jobs -> NodeD [label="RPC"]
  LuxiD -> NodeD [label="RPC"]
  rconfigdata -> RConfD
  configdata -> rconfigdata [label="sync via\nNodeD RPC"]
  WConfD -> NodeD [label="RPC"]
  WConfD -> configdata
  WConfD -> locksdata
  MonD -> RConfD [label="RConfD\nproto"]
  Clients -> LuxiD [label="LUXI"]
  p1 -> MonD [label="MonD proto"]
  p2 -> RapiD [label="RAPI"]
  p3 -> RConfD [label="RConfD\nproto"]
  p4 -> Clients [label="CLI"]

``LUXI daemon (LuxiD)``
  It will be written in Haskell. It will run on the master node and it will be
  the only LUXI server, replying to all the LUXI queries. These include both
  the queries about the live configuration of the cluster, previously served by
  QueryD, and the commands actually changing the status of the cluster by
  submitting jobs. Therefore, this daemon will also be the one responsible for
  managing the job queue. When a job needs to be executed, LuxiD will spawn
  a separate process tasked with the execution of that specific job, thus making
  it easier to terminate the job itself, if needed. When a job requires locks,
  LuxiD will request them from WConfD.
  In order to keep the cluster available in case of failure of the master
  node, LuxiD will replicate the job queue to the other master candidates, by
  RPCs to the NodeD running there (the choice of RPCs for this task might be
  revisited later, after implementing this design).

``Configuration management daemon (WConfD)``
  It will run on the master node and it will be responsible for the management
  of the authoritative copy of the cluster configuration (that is, it will be
  the daemon actually modifying the ``config.data`` file). All the requests for
  configuration changes will have to pass through this daemon, and will be
  performed using a LUXI-like protocol ("WConfD proto" in the graph; the exact
  protocol will be defined in the separate design document that will detail the
  WConfD separation). Having a single point of configuration management will
  also allow Ganeti to get rid of possible race conditions due to concurrent
  modifications of the configuration. When the configuration is updated, WConfD
  will have to push the received changes to the other master candidates, via
  RPCs, so that the RConfD daemons and (in case of a failure on the master node)
  the WConfD daemon on the new master can access an up-to-date version of it
  (the choice of RPCs for this task might be revisited later). This
  daemon will also be the one responsible for managing the locks, granting them
  to the jobs requesting them, and taking care of freeing them up if the jobs
  holding them crash or are terminated before releasing them. In order to do
  this, each job, after being spawned by LuxiD, will open a local Unix socket
  that will be used to communicate with it, and that will be destroyed when the
  job terminates. LuxiD will be able to check, after a timeout, whether the job
  is still running by connecting to this socket, and to ask WConfD to forcefully
  remove the locks if the socket is closed.
  Also, WConfD should hold a serialized list of the locks and their owners in a
  file (``locks.data``), so that it can keep track of their status in case it
  crashes and needs to be restarted (by asking LuxiD which of the owners are
  still running).
  Interaction with this daemon will be performed using Unix sockets.

``Configuration query daemon (RConfD)``
  It is written in Haskell, and it corresponds to the old ConfD. It will run on
  all the master candidates and it will serve information about the static
  configuration of the cluster (the one contained in ``config.data``). The
  provided information will be highly available (as in: a response will be
  available as long as a stable-enough connection between the client and at
  least one working master candidate is available) and its freshness will be
  best effort (the most recent reply from any of the master candidates will be
  returned, but it might still be older than the one available through WConfD).
  The information will be served through the ConfD protocol.

``RAPI daemon (RapiD)``
  It remains basically unchanged, with the only difference that all of its LUXI
  queries are directed towards LuxiD instead of being split between MasterD and
  QueryD.

``Monitoring daemon (MonD)``
  It remains unaffected by the changes in this design document. It will just get
  some of the data it needs from RConfD instead of the old ConfD, but the
  interfaces of the two are identical.

``Node daemon (NodeD)``
  It remains unaffected by the changes proposed in the design document. The only
  difference is that it will receive its RPCs from LuxiD (for job queue
  replication), from WConfD (for configuration replication) and from the
  processes executing single jobs (for all the operations to be performed by
  nodes) instead of receiving them just from MasterD.

This restructuring will allow us to reorganize and improve the codebase,
introducing cleaner interfaces and giving well-defined and more restricted tasks
to each daemon.

Furthermore, having better-defined interfaces will allow us to have easier
upgrade procedures, and to work towards the possibility of upgrading single
components of a cluster one at a time, without the need for immediately
upgrading the entire cluster in a single step.


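
As an illustration of the process-per-job model proposed for LuxiD, the
following Python sketch starts a stand-in job as a separate process and then
terminates it from the outside. It is not Ganeti code: ``spawn_job`` and
``terminate_job`` are hypothetical helpers, and the child merely sleeps where
a real job process would execute its opcodes.

.. code-block:: python

  import subprocess
  import sys


  def spawn_job(job_id):
      """Start a stand-in job as a separate process (hypothetical helper)."""
      # The child only sleeps; a real job process would run the job itself.
      code = "import time; time.sleep(60)"
      return subprocess.Popen([sys.executable, "-c", code, str(job_id)])


  def terminate_job(proc, timeout=5.0):
      """Ask the job process to exit, escalating to kill() if it does not."""
      proc.terminate()
      try:
          return proc.wait(timeout=timeout)
      except subprocess.TimeoutExpired:
          proc.kill()
          return proc.wait()


  if __name__ == "__main__":
      job = spawn_job(42)
      print("running:", job.poll() is None)
      print("exit status:", terminate_job(job))

Because the job runs in its own process, stopping it only requires a signal,
whereas Python offers no comparable way to stop a thread from the outside.
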
Implementation
==============

While performing this refactoring, we aim to increase the amount of
Haskell code, thus benefiting from the additional type safety provided by its
extensive compile-time checks. In particular, all the job queue management and
the configuration management daemon will be written in Haskell, taking over the
role currently fulfilled by Python code executed as part of MasterD.

The changes described by this design document are quite extensive, therefore
they will not be implemented all at the same time, but through a sequence of
steps, each leaving the codebase in a consistent and usable state.

#. Rename QueryD to LuxiD.
   A part of LuxiD, the one replying to configuration
   queries including live information about the system, already exists in the
   form of QueryD. This is being renamed to LuxiD, and will form the first part
   of the new daemon. NB: this is happening starting from Ganeti 2.8. At the
   beginning, only the already existing queries will be replied to by LuxiD.
   More queries will be implemented in the next versions.

#. Let LuxiD be the interface for the queries and MasterD be their executor.
   Currently, MasterD alone is responsible for receiving and executing LUXI
   queries, and for managing the jobs they create.
   Receiving the queries and managing the job queue will be extracted from
   MasterD into LuxiD.
   Actually executing jobs will still be done by MasterD, which contains all the
   logic for doing that and for properly managing locks and the configuration.
   At this stage, scheduling will simply consist of starting jobs until a fixed
   maximum number of simultaneously running jobs is reached.

#. Extract WConfD from MasterD.
   The logic for managing the configuration file is factored out to the
   dedicated WConfD daemon. All configuration changes, currently executed
   directly by MasterD, will be changed to be IPC requests sent to the new
   daemon.

#. Extract locking management from MasterD.
   The logic for managing and granting locks is extracted to WConfD as well.
   Locks will not be taken directly anymore, but requested from WConfD via IPC.
   This step can be executed on its own or at the same time as the previous one.

#. Jobs are executed as processes.
   The logic for running jobs is rewritten so that each job can be managed by an
   independent process. LuxiD will spawn a new (Python) process for every single
   job. The RPCs will remain unchanged, and the LU code will stay as is as much
   as possible.
   MasterD will cease to exist as a daemon on its own at this point, but not
   before.

#. Improve job scheduling algorithm.
   The simple algorithm for scheduling jobs will be replaced by a more
   intelligent one. Also, the implementation of :doc:`design-optables` can be
   started.

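
The simple scheduling rule of the second step (start jobs until a fixed maximum
number of them is running) can be sketched in a few lines of Python. The
function name and its parameters are illustrative only and do not come from
the Ganeti codebase.

.. code-block:: python

  def select_jobs_to_start(queued, running, max_running):
      """Return the queued job IDs that may be started now, in queue order."""
      # Number of jobs that can still be started without exceeding the limit.
      free_slots = max(0, max_running - len(running))
      return list(queued)[:free_slots]

With two jobs running and a limit of three, only the first queued job is
selected; once the limit is reached, the function returns an empty list.
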
WConfD details
--------------

WConfD will communicate with its clients through a Unix domain socket for both
configuration management and locking. Clients can issue multiple RPC calls
through one socket. For each such call the client sends a JSON request
document with a remote function name and data for its arguments. The server
replies with a JSON response document containing either the result of the call
or an indication of failure.

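
A minimal Python sketch of one such request/response exchange follows. The
wire format is not specified in this document, so the newline-terminated
framing and the field names (``method``, ``args``, ``status``, ``result``) are
assumptions made for the example; a connected socket pair stands in for the
WConfD socket.

.. code-block:: python

  import json
  import socket


  def send_msg(sock, obj):
      """Send one JSON document terminated by a newline (assumed framing)."""
      sock.sendall(json.dumps(obj).encode("utf-8") + b"\n")


  def recv_msg(reader):
      """Read one newline-terminated JSON document from a binary reader."""
      return json.loads(reader.readline().decode("utf-8"))


  def handle_request(functions, request):
      """Call the named remote function and wrap its result or its failure."""
      try:
          result = functions[request["method"]](*request["args"])
      except Exception as err:
          return {"status": False, "result": str(err)}
      return {"status": True, "result": result}


  def roundtrip(functions, method, *args):
      """Perform one request/response exchange over a connected socket pair."""
      client, server = socket.socketpair()
      with client, server, client.makefile("rb") as client_in, \
              server.makefile("rb") as server_in:
          send_msg(client, {"method": method, "args": list(args)})
          send_msg(server, handle_request(functions, recv_msg(server_in)))
          return recv_msg(client_in)

Calling ``roundtrip`` with a name that is missing from ``functions`` yields a
response whose ``status`` is false, which models the failure case.
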
There will be a special RPC call for identifying a client when connecting to
WConfD. The client will tell WConfD its job number and process ID. WConfD will
fail any other RPC calls before a client identifies itself this way.

Any state associated with client processes will be mirrored on persistent
storage and linked to the identity of processes so that the WConfD daemon will
be able to resume its operation at any point after a restart or a crash. WConfD
will track each client's process start time along with its process ID to be
able to detect if a process dies and its process ID is reused. WConfD will clear
all locks and other state associated with a client if it detects that the
client's process no longer exists.

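
The reason for tracking the start time together with the process ID can be
shown with a short Python sketch. ``ClientIdentity``, ``read_start_time`` and
``is_same_process`` are hypothetical names. The default lookup reads field 22
(``starttime``) of ``/proc/<pid>/stat``, which is Linux-specific; how WConfD
will actually obtain the start time is not specified here.

.. code-block:: python

  import os
  from collections import namedtuple

  ClientIdentity = namedtuple("ClientIdentity",
                              ["job_id", "pid", "start_time"])


  def read_start_time(pid):
      """Return the start time of ``pid`` from /proc (Linux only), or None."""
      try:
          with open("/proc/%d/stat" % pid) as stat_file:
              data = stat_file.read()
      except OSError:
          return None  # no such process, or no /proc on this system
      # The command name (field 2) may contain spaces, so split after its
      # closing parenthesis; field 22 (starttime) is then at index 19.
      return int(data.rsplit(")", 1)[1].split()[19])


  def is_same_process(identity, start_time_of=read_start_time):
      """True if the PID still belongs to the process that identified itself."""
      current = start_time_of(identity.pid)
      return current is not None and current == identity.start_time


  if __name__ == "__main__":
      pid = os.getpid()
      me = ClientIdentity(job_id=1, pid=pid, start_time=read_start_time(pid))
      print("still the same process:", is_same_process(me))

A reused process ID shows up as a different start time, so the comparison
fails and the state recorded for the old client can be cleared.
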
Configuration management
++++++++++++++++++++++++

The new configuration management protocol will be implemented in the following
steps:

#. Reimplement all current methods of ``ConfigWriter`` for reading and writing
   the configuration of a cluster in WConfD.
#. Expose each of those functions in WConfD as an RPC function. This will allow
   easy future extensions or modifications.
#. Replace ``ConfigWriter`` with a stub (preferably automatically generated
   from the Haskell code) that will contain the same methods as the current
   ``ConfigWriter`` and delegate all calls to its methods to WConfD.

After these steps it will be possible to access the configuration from separate
processes.

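
The delegation performed by such a stub can be sketched in Python as follows.
This sketch uses dynamic attribute lookup instead of code generated from
Haskell, and both ``ConfigWriterStub`` and the ``transport`` callable are
hypothetical; the method name in the usage note is for illustration only.

.. code-block:: python

  class ConfigWriterStub(object):
      """Forward every method call to WConfD through a transport callable."""

      def __init__(self, transport):
          # transport(method_name, args) performs the RPC and returns its result
          self._transport = transport

      def __getattr__(self, name):
          # Only reached for attributes that are not defined on the stub itself.
          if name.startswith("_"):
              raise AttributeError(name)

          def remote_method(*args):
              return self._transport(name, list(args))

          return remote_method

With this stub, a call such as ``cfg.GetClusterName()`` is turned into
``transport("GetClusterName", [])``, so callers keep using the familiar method
names while the work is done by WConfD.
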
Future aims:

-  Optionally refactor the RPC calls to reduce their number or improve their
   efficiency (for example by obtaining a larger set of data instead of
   querying items one by one).

Locking
+++++++

The new locking protocol will be implemented as follows:

Re-implement the current locking mechanism in WConfD and expose it for RPC
calls. All current locks will be mapped into a data structure that will
uniquely identify them (storing each lock's level together with its name).

WConfD will impose a linear order on locks. The order will be compatible
with the current ordering of lock levels so that existing code will work
without changes.

WConfD will keep the set of currently held locks for each client. The
protocol will allow the following operations on the set:

*Update:*
  Update the current set of locks according to a given list. The list contains
  locks and their desired level (release / shared / exclusive). To prevent
  deadlocks, WConfD will check that all newly requested locks (or already held
  locks requested to be upgraded to *exclusive*) are greater in the sense of
  the linear order than all currently held locks, and fail the operation if
  not. Only the locks in the list will be updated; other locks already held
  will be left intact. If the operation fails, the client's lock set will be
  left intact.
*Opportunistic union:*
  Add as many locks as possible from a given set to the current set within a
  given timeout. WConfD will again check the proper order of locks and
  acquire only the ones that are allowed with respect to the current set.
  Returns the set of acquired locks, possibly empty. Immediate. Never fails.
  (It would also be possible to extend the operation to try to wait until a
  given number of locks is available, or a given timeout elapses.)
*List:*
  List the current set of held locks. Immediate, never fails.
*Intersection:*
  Retain only a given set of locks in the current one. This function is
  provided for convenience; it is redundant with respect to *list* and
  *update*. Immediate, never fails.

After this step it will be possible to use locks from jobs running as separate
processes.

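
The order check of the *update* operation can be sketched for a single client
in Python. This is an illustration under several assumptions: locks are
modelled as ``(level, name)`` tuples whose tuple comparison stands in for the
linear order, an upgraded lock is compared only against the *other* held
locks, and contention between clients is ignored. ``ClientLockSet`` and
``LockOrderError`` are hypothetical names.

.. code-block:: python

  class LockOrderError(Exception):
      """Raised when a request would violate the lock order."""


  class ClientLockSet(object):
      """Locks held by one client; locks are (level, name) tuples."""

      def __init__(self):
          self.held = {}  # lock -> "shared" or "exclusive"

      def update(self, requests):
          """Apply a list of (lock, mode) pairs, or fail without changes."""
          for lock, mode in requests:
              if mode not in ("release", "shared", "exclusive"):
                  raise ValueError(mode)
              if mode == "release":
                  continue
              is_new = lock not in self.held
              is_upgrade = (mode == "exclusive"
                            and self.held.get(lock) == "shared")
              if not (is_new or is_upgrade):
                  continue
              # A lock never compares greater than itself, so this test only
              # considers the other locks that are currently held.
              if any(other > lock for other in self.held):
                  raise LockOrderError(lock)
          # All checks passed: modify the set only now, so that a failed
          # update leaves it intact.
          for lock, mode in requests:
              if mode == "release":
                  self.held.pop(lock, None)
              else:
                  self.held[lock] = mode

      def intersection(self, keep):
          """Release every held lock that is not in ``keep``."""
          self.update([(lock, "release") for lock in list(self.held)
                       if lock not in keep])

As noted for *intersection* above, it is expressed here purely in terms of the
held set and *update*.
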
The above set of operations allows the clients to use various workflows. In
particular:

Pessimistic strategy:
  Lock all potentially relevant resources (for example all nodes), determine
  which will be needed, and release all the others.
Optimistic strategy:
  Determine what locks need to be acquired without holding any. Lock the
  required set of locks. Determine the set of required locks again and check if
  they are all held. If not, release everything and restart.

.. COMMENTED OUT:
  Start with the smallest set of locks and when determining what more
  relevant resources will be needed, expand the set. If a *union* operation
  fails, release all locks, acquire the desired union and restart the
  operation so that all preconditions and possible concurrent changes are
  checked again.

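
The optimistic strategy amounts to a retry loop, sketched below in Python. The
three callables are hypothetical placeholders for the client's own logic and
for the WConfD operations, and the ``max_attempts`` bound is an addition of
this sketch; the strategy as described simply restarts.

.. code-block:: python

  def acquire_optimistically(compute_needed, acquire, release_all,
                             max_attempts=10):
      """Lock the set returned by compute_needed(), retrying if it changes."""
      for _ in range(max_attempts):
          # Determine the required locks without holding any.
          wanted = set(compute_needed())
          acquire(wanted)
          # Determine them again, now that the locks are held.
          needed_now = set(compute_needed())
          if needed_now <= wanted:
              return wanted
          # Something changed in the meantime: release everything, restart.
          release_all()
      raise RuntimeError("lock requirements kept changing")

The function returns the set of locks that is held once the second
computation no longer asks for anything outside it.
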
Future aims:

-  Add more fine-grained locks to prevent unnecessary blocking of jobs. This
   could include locks on parameters of entities or locks on their states (so
   that a node remains online, but otherwise can change, etc.). In particular,
   adding, moving and removing instances currently blocks the whole node.
-  Add checks that all modified configuration parameters belong to entities
   the client has locked and log violations.
-  Make the above checks mandatory.
-  Automate optimistic locking and checking the locks in logical units.
   For example, this could be accomplished by allowing some of the initial
   phases of ``LogicalUnit`` (such as ``ExpandNames`` and ``DeclareLocks``) to
   be run repeatedly, checking if the set of locks requested the second time is
   contained in the set acquired after the first pass.
-  Add the possibility for a job to reserve hardware resources such as disk
   space or memory on nodes. Most likely this would be done through a new,
   special kind of instance that only blocks its resources and can later be
   converted to a regular instance. This would allow long-running jobs such as
   instance creation or move to lock the corresponding nodes, acquire the
   resources and turn the locks into shared ones, keeping an exclusive lock
   only on the instance.
-  Use a more sophisticated algorithm for preventing deadlocks such as a
   `wait-for graph`_. This would allow fewer *union* failures and allow more
   optimistic, scalable acquisition of locks.

.. _`wait-for graph`: http://en.wikipedia.org/wiki/Wait-for_graph


Further considerations
======================

There is a possibility that a job will finish performing its task while LuxiD
and/or WConfD are not available.
In order to deal with this situation, each job will write the results of its
execution to a file. The name of this file will be known to LuxiD before
starting the job, and will be stored together with the job ID, and the
name of the job-unique socket.

The job, upon ending its execution, will signal LuxiD (through the socket), so
that it can read the result of the execution and release the locks as needed.

In case LuxiD is not available at that time, the job will just terminate without
signalling it, but still writing the results to the file as usual. When a new
LuxiD becomes available, it will have the most up-to-date list of running jobs
(received via replication from the former LuxiD), and go through it, cleaning up
all the terminated jobs.


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: