==========================
Ganeti daemons refactoring
==========================

.. contents:: :depth: 2

This is a design document detailing the plan for refactoring the internal
structure of Ganeti, and particularly the set of daemons it is divided into.

    
Current state and shortcomings
==============================

Ganeti comprises a growing number of daemons, each dealing with part of
the tasks the cluster has to face, and communicating with the other daemons
using a variety of protocols.

Specifically, as of Ganeti 2.8, the situation is as follows:

    
``Master daemon (MasterD)``
  It is responsible for managing the entire cluster, and it's written in Python.
  It is executed on a single node (the master node). It receives the commands
  given by the cluster administrator (through the remote API daemon or the
  command line tools) over the LUXI protocol.  The master daemon is responsible
  for creating and managing the jobs that will execute such commands, and for
  managing the locks that ensure the cluster will not incur race conditions.

  Each job is managed by a separate Python thread, which interacts with the node
  daemons via RPC calls.

  The master daemon is also responsible for managing the configuration of the
  cluster, changing it when required by some job. It is also responsible for
  copying the configuration to the other master candidates after updating it.

    
``RAPI daemon (RapiD)``
  It is written in Python and runs on the master node only. It waits for
  requests issued remotely through the remote API protocol. Then, it forwards
  them, using the LUXI protocol, to the master daemon (if they are commands) or
  to the query daemon if they are queries about the configuration (including
  live status) of the cluster.

    
``Node daemon (NodeD)``
  It is written in Python. It runs on all the nodes. It is responsible for
  receiving the master's requests over RPC and executing them, using the
  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives requests
  over RPC for the execution of queries gathering live data on behalf of the
  query daemon.

    
``Configuration daemon (ConfD)``
  It is written in Haskell. It runs on all the master candidates. Since the
  configuration is replicated only on the master candidates, this daemon exists
  in order to provide information about the configuration to the nodes needing
  it. The requests are made through ConfD's own protocol, HMAC signed,
  implemented over UDP, and meant to be used by querying all the master
  candidates (or a subset thereof) in parallel and taking the most up-to-date
  answer. This is meant as a way to provide a robust service even in case the
  master is temporarily unavailable.

    
``Query daemon (QueryD)``
  It is written in Haskell. It runs on all the master candidates. It replies
  to LUXI queries about the current status of the system, including live data it
  obtains by querying the node daemons through RPCs.

    
``Monitoring daemon (MonD)``
  It is written in Haskell. It runs on all nodes, including the ones that are
  not vm-capable. It is meant to provide information on the status of the
  system. Such information is related only to the specific node the daemon is
  running on, and it is provided as JSON encoded data over HTTP, to be easily
  readable by external tools.
  The monitoring daemon communicates with ConfD to get information about the
  configuration of the cluster. The choice of communicating with ConfD instead
  of MasterD allows it to obtain configuration information even when the cluster
  is heavily degraded (e.g. when the master and some, but not all, of the master
  candidates are unreachable).

    
The current structure of the Ganeti daemons is inefficient because many
different protocols are involved, each daemon needs to be able to use several
of them, and each daemon has to deal with many different things. This sometimes
makes it unclear which daemon is responsible for performing a specific task.

Also, with the current configuration, jobs are managed by the master daemon
using Python threads. This makes terminating a job after it has started a
difficult operation, and it is the main reason why this is not possible yet.

The master daemon currently has too many different tasks, which could be handled
better if split among different daemons.

    
Proposed changes
================

In order to improve on the current situation, a new daemon subdivision is
proposed, and presented hereafter.

    
.. digraph:: "new-daemons-structure"

  {rank=same; RConfD LuxiD;}
  {rank=same; Jobs rconfigdata;}
  node [shape=box]
  RapiD [label="RapiD [M]"]
  LuxiD [label="LuxiD [M]"]
  WConfD [label="WConfD [M]"]
  Jobs [label="Jobs [M]"]
  RConfD [label="RConfD [MC]"]
  MonD [label="MonD [All]"]
  NodeD [label="NodeD [All]"]
  Clients [label="gnt-*\nclients [M]"]
  p1 [shape=none, label=""]
  p2 [shape=none, label=""]
  p3 [shape=none, label=""]
  p4 [shape=none, label=""]
  configdata [shape=none, label="config.data"]
  rconfigdata [shape=none, label="config.data\n[MC copy]"]
  locksdata [shape=none, label="locks.data"]

  RapiD -> LuxiD [label="LUXI"]
  LuxiD -> WConfD [label="WConfD\nproto"]
  LuxiD -> Jobs [label="fork/exec"]
  Jobs -> WConfD [label="WConfD\nproto"]
  Jobs -> NodeD [label="RPC"]
  LuxiD -> NodeD [label="RPC"]
  rconfigdata -> RConfD
  configdata -> rconfigdata [label="sync via\nNodeD RPC"]
  WConfD -> NodeD [label="RPC"]
  WConfD -> configdata
  WConfD -> locksdata
  MonD -> RConfD [label="RConfD\nproto"]
  Clients -> LuxiD [label="LUXI"]
  p1 -> MonD [label="MonD proto"]
  p2 -> RapiD [label="RAPI"]
  p3 -> RConfD [label="RConfD\nproto"]
  p4 -> Clients [label="CLI"]

    
``LUXI daemon (LuxiD)``
  It will be written in Haskell. It will run on the master node and it will be
  the only LUXI server, replying to all the LUXI queries. These include both
  the queries about the live configuration of the cluster, previously served by
  QueryD, and the commands actually changing the status of the cluster by
  submitting jobs. Therefore, this daemon will also be the one responsible for
  managing the job queue. When a job needs to be executed, LuxiD will spawn
  a separate process tasked with the execution of that specific job, thus making
  it easier to terminate the job itself, if needed.  When a job requires locks,
  LuxiD will request them from WConfD.
  In order to keep availability of the cluster in case of failure of the master
  node, LuxiD will replicate the job queue to the other master candidates, by
  RPCs to the NodeD running there (the choice of RPCs for this task might be
  revisited later, after implementing this design).

    
``Configuration management daemon (WConfD)``
  It will run on the master node and it will be responsible for the management
  of the authoritative copy of the cluster configuration (that is, it will be
  the daemon actually modifying the ``config.data`` file). All requests for
  configuration changes will have to pass through this daemon, and will be
  performed using a LUXI-like protocol ("WConfD proto" in the graph; the exact
  protocol will be defined in the separate design document that will detail the
  WConfD separation).  Having a single point of configuration management will
  also allow Ganeti to get rid of possible race conditions due to concurrent
  modifications of the configuration.  When the configuration is updated, WConfD
  will have to push the received changes to the other master candidates, via
  RPCs, so that RConfD daemons and (in case of a failure on the master node)
  the WConfD daemon on the new master can access an up-to-date version of it
  (the choice of RPCs for this task might be revisited later). This
  daemon will also be the one responsible for managing the locks, granting them
  to the jobs requesting them, and taking care of freeing them up if the jobs
  holding them crash or are terminated before releasing them.  In order to do
  this, each job, after being spawned by LuxiD, will open a local Unix socket
  that will be used to communicate with it, and will be destroyed when the job
  terminates.  LuxiD will be able to check, after a timeout, whether the job is
  still running by connecting to this socket, and to ask WConfD to forcefully
  remove the locks if the socket is closed.
  Also, WConfD should hold a serialized list of the locks and their owners in a
  file (``locks.data``), so that it can keep track of their status in case it
  crashes and needs to be restarted (by asking LuxiD which of the owners are
  still running).
  Interaction with this daemon will be performed using Unix sockets.

    
``Configuration query daemon (RConfD)``
  It is written in Haskell, and it corresponds to the old ConfD. It will run on
  all the master candidates and it will serve information about the static
  configuration of the cluster (the one contained in ``config.data``). The
  provided information will be highly available (as in: a response will be
  available as long as a stable-enough connection between the client and at
  least one working master candidate is available) and its freshness will be
  best effort (the most recent reply from any of the master candidates will be
  returned, but it might still be older than the one available through WConfD).
  The information will be served through the ConfD protocol.

    
``RAPI daemon (RapiD)``
  It remains basically unchanged, with the only difference that all of its LUXI
  queries are directed towards LuxiD instead of being split between MasterD and
  QueryD.

    
``Monitoring daemon (MonD)``
  It remains unaffected by the changes in this design document. It will just get
  some of the data it needs from RConfD instead of the old ConfD, but the
  interfaces of the two are identical.

    
``Node daemon (NodeD)``
  It remains unaffected by the changes proposed in this design document. The
  only difference is that it will receive its RPCs from LuxiD (for job queue
  replication), from WConfD (for configuration replication) and from the
  processes executing single jobs (for all the operations to be performed by
  nodes) instead of receiving them just from MasterD.

    
This restructuring will allow us to reorganize and improve the codebase,
introducing cleaner interfaces and giving well defined and more restricted tasks
to each daemon.

Furthermore, having better-defined interfaces will allow us to have easier
upgrade procedures, and to work towards the possibility of upgrading single
components of a cluster one at a time, without the need for immediately
upgrading the entire cluster in a single step.

    
Implementation
==============

While performing this refactoring, we aim to increase the amount of
Haskell code, thus benefiting from the additional type safety provided by its
wide compile-time checks. In particular, all the job queue management and the
configuration management daemon will be written in Haskell, taking over the role
currently fulfilled by Python code executed as part of MasterD.

The changes described by this design document are quite extensive, therefore they
will not be implemented all at the same time, but through a sequence of steps,
leaving the codebase in a consistent and usable state.

    
#. Rename QueryD to LuxiD.
   A part of LuxiD, the one replying to configuration
   queries including live information about the system, already exists in the
   form of QueryD. This is being renamed to LuxiD, and will form the first part
   of the new daemon. NB: this is happening starting from Ganeti 2.8. At the
   beginning, only the already existing queries will be replied to by LuxiD.
   More queries will be implemented in the next versions.

#. Let LuxiD be the interface for the queries and MasterD be their executor.
   Currently, MasterD is solely responsible for receiving and executing LUXI
   queries, and for managing the jobs they create.
   Receiving the queries and managing the job queue will be extracted from
   MasterD into LuxiD.
   Actually executing jobs will still be done by MasterD, which contains all the
   logic for doing that and for properly managing locks and the configuration.
   At this stage, scheduling will simply consist of starting jobs until a fixed
   maximum number of simultaneously running jobs is reached.

#. Extract WConfD from MasterD.
   The logic for managing the configuration file is factored out to the
   dedicated WConfD daemon. All configuration changes, currently executed
   directly by MasterD, will be changed to be IPC requests sent to the new
   daemon.

#. Extract locking management from MasterD.
   The logic for managing and granting locks is extracted to WConfD as well.
   Locks will no longer be taken directly, but requested from WConfD via IPC.
   This step can be executed on its own or at the same time as the previous one.

#. Jobs are executed as processes.
   The logic for running jobs is rewritten so that each job can be managed by an
   independent process. LuxiD will spawn a new (Python) process for every single
   job. The RPCs will remain unchanged, and the LU code will stay as is as much
   as possible.
   MasterD will cease to exist as a daemon on its own at this point, but not
   before.

#. Improve job scheduling algorithm.
   The simple algorithm for scheduling jobs will be replaced by a more
   intelligent one. Also, the implementation of :doc:`design-optables` can be
   started.
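The fixed-limit scheduling mentioned in the second step above can be
illustrated with a short sketch. It is only an illustration under stated
assumptions, not Ganeti's scheduler: the function name
``select_jobs_to_start``, the job IDs and the limit of 20 running jobs are
all made up for the example.

```python
from collections import deque


def select_jobs_to_start(queued, running_count, max_running):
    """Pop queued job IDs (oldest first) until the running limit is reached.

    Hypothetical helper: the design only states that jobs are started until
    a fixed maximum number of simultaneously running jobs is reached.
    """
    free_slots = max(0, max_running - running_count)
    to_start = []
    while queued and len(to_start) < free_slots:
        to_start.append(queued.popleft())
    return to_start


# Example: 18 jobs are running, the (assumed) limit is 20, four jobs wait.
queue = deque([101, 102, 103, 104])
started = select_jobs_to_start(queue, running_count=18, max_running=20)
print(started)      # [101, 102]
print(list(queue))  # [103, 104]
```

A smarter scheduler, as planned in the last step, would replace only the
selection logic; the queue and the limit could stay as they are.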

    
Job death detection
-------------------

**Requirements:**

- It must be possible to reliably detect the death of a process even under
  uncommon conditions such as very heavy system load.
- A daemon must be able to detect the death of a process even if the
  daemon is restarted while the process is running.
- The solution must not rely on being able to communicate with
  a process.
- The solution must work for the current situation where multiple jobs
  run in a single process.
- It must be POSIX compliant.

These conditions rule out simple solutions like checking a process ID
(because the process might eventually be replaced by another process
with the same ID) or keeping an open connection to a process.

    
**Solution:** As a job process is spawned, before attempting to
communicate with any other process, it will create a designated empty
lock file, open it, acquire an *exclusive* lock on it, and keep it open.
When connecting to a daemon, the job process will provide it with the
path of the file. If the process dies unexpectedly, the operating system
kernel automatically cleans up the lock.

Therefore, daemons can check if a process is dead by trying to acquire
a *shared* lock on the lock file in a non-blocking mode:

- If the locking operation succeeds, it means that the exclusive lock is
  missing, therefore the process has died, but the lock
  file hasn't been cleaned up yet. The daemon should release the lock
  immediately. Optionally, the daemon may delete the lock file.
- If the file is missing, the process has died and the lock file has
  been cleaned up.
- If the locking operation fails due to a lock conflict, it means
  the process is alive.

Using shared locks for querying lock files ensures that the detection
works correctly even if multiple daemons query a file at the same time.
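The three cases above can be sketched in Python with the standard ``fcntl``
module, whose ``lockf`` call wraps POSIX ``fcntl()`` record locks. This is a
minimal sketch, not the code Ganeti ships: the function names
``hold_liveness_lock`` and ``is_job_alive`` are assumptions made for the
example.

```python
import errno
import fcntl


def hold_liveness_lock(path):
    """Job side: create the lock file and lock it exclusively.

    The returned file object must stay open for the lifetime of the job;
    the kernel drops the lock when the process exits or dies.
    """
    lock_file = open(path, "w")
    fcntl.lockf(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return lock_file


def is_job_alive(path):
    """Daemon side: probe the lock file with a non-blocking shared lock."""
    try:
        probe = open(path, "r")
    except FileNotFoundError:
        # File is missing: the job died and the file was cleaned up.
        return False
    with probe:
        try:
            fcntl.lockf(probe, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except OSError as err:
            if err.errno in (errno.EACCES, errno.EAGAIN):
                # Lock conflict: the exclusive lock is held, the job is alive.
                return True
            raise
        # Shared lock obtained: nobody holds the exclusive lock any more.
        # Release it immediately, as the text above requires.
        fcntl.lockf(probe, fcntl.LOCK_UN)
        return False
```

Note that POSIX record locks belong to a process, so the probe only reports
a conflict when it runs in a different process than the lock holder.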

    
A job should close and remove its lock file when it has completely finished.
The WConfD daemon will be responsible for removing stale lock files of
jobs that didn't remove their lock files themselves.

**Considered alternatives:** An alternative to creating a separate lock
file would be to lock the job status file. However, file locks are kept
only as long as the file is open. Therefore any operation followed by
closing the file would cause the process to release the lock. In
particular, with jobs as threads, the master daemon wouldn't be able to
keep locks and operate on job files at the same time.

    
WConfD details
--------------

WConfD will communicate with its clients through a Unix domain socket for both
configuration management and locking. Clients can issue multiple RPC calls
through one socket. For each such call the client sends a JSON request
document with a remote function name and data for its arguments. The server
replies with a JSON response document either containing the result or
signalling a failure.
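The request/response exchange can be pictured with a small sketch. The exact
wire format is not defined in this document, so every field name here
(``method``, ``args``, ``success``, ``result``) and the function name
``identify_client`` are assumptions chosen only for illustration.

```python
import json


def build_request(function_name, arguments):
    """Encode one RPC call as a JSON request document (hypothetical layout)."""
    return json.dumps({"method": function_name, "args": arguments})


def parse_response(raw):
    """Decode a JSON response: return the result or raise on a failure."""
    response = json.loads(raw)
    if not response["success"]:
        raise RuntimeError("RPC failed: %s" % (response["result"],))
    return response["result"]


# A client could send several such requests over the same socket.
request = build_request("identify_client", {"job_id": 42, "pid": 1234})
print(request)
print(parse_response('{"success": true, "result": ["lock-a"]}'))
```

Keeping encoding and decoding in two small functions makes it easy to swap
the assumed layout for the real one once the protocol is specified.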

    
There will be a special RPC call for identifying a client when connecting to
WConfD. The client will tell WConfD its job number and process ID. WConfD will
fail any other RPC call until the client has identified itself this way.

Any state associated with client processes will be mirrored on persistent
storage and linked to the identity of processes so that the WConfD daemon will
be able to resume its operation at any point after a restart or a crash. WConfD
will track each client's process start time along with its process ID to be
able to detect if a process dies and its process ID is reused.  WConfD will clear
all locks and other state associated with a client if it detects its process
no longer exists.
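Why the start time is tracked next to the process ID can be shown with a
short sketch. The names ``ClientIdentity`` and ``client_still_exists`` are
hypothetical, and the way the start time of a PID is looked up is left to a
caller-supplied function, since this document does not specify it.

```python
from typing import Callable, NamedTuple, Optional


class ClientIdentity(NamedTuple):
    """Identity recorded when a client identifies itself (assumed layout)."""
    job_id: int
    pid: int
    start_time: float  # start time of the process, as seen at identification


def client_still_exists(client: ClientIdentity,
                        lookup_start_time: Callable[[int], Optional[float]]) -> bool:
    """True only if the PID exists and still belongs to the same process."""
    current = lookup_start_time(client.pid)
    return current is not None and current == client.start_time


# Simulated process table: PID -> start time.
process_table = {1234: 1000.0}
client = ClientIdentity(job_id=42, pid=1234, start_time=1000.0)
print(client_still_exists(client, process_table.get))  # True

# The job dies and an unrelated process later reuses PID 1234.
process_table[1234] = 2000.0
print(client_still_exists(client, process_table.get))  # False
```

Comparing the PID alone would have reported the second case as alive.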

    
Configuration management
++++++++++++++++++++++++

The new configuration management protocol will be implemented in the following
steps:

Step 1:
  #. Implement the following functions in WConfD and export them through
     RPC:

     - Obtain a single internal lock, either in shared or
       exclusive mode. This lock will replace the current lock
       ``_config_lock`` in config.py.
     - Release the lock.
     - Return the whole configuration data to a client.
     - Receive the whole configuration data from a client and replace the
       current configuration with it. Distribute it to master candidates
       and distribute the corresponding *ssconf*.

     WConfD must detect deaths of its clients (see `Job death
     detection`_) and release locks automatically.

  #. In config.py modify public methods that access configuration:

     - Instead of acquiring a local lock, obtain a lock from WConfD
       using the above functions.
     - Fetch the current configuration from WConfD.
     - Use it to perform the method's task.
     - If the configuration was modified, send it to WConfD at the end.
     - Release the lock to WConfD.

  This will decouple the configuration management from the master daemon,
  even though the specific configuration tasks will still be performed by
  individual jobs.

  After this step it'll be possible to access the configuration from separate
  processes.

    
Step 2:
  #. Reimplement all current methods of ``ConfigWriter`` for reading and
     writing the configuration of a cluster in WConfD.
  #. Expose each of those functions in WConfD as a separate RPC function.
     This will allow easy future extensions or modifications.
  #. Replace ``ConfigWriter`` with a stub (preferably automatically
     generated from the Haskell code) that will contain the same methods
     as the current ``ConfigWriter`` and delegate all calls to its
     methods to WConfD.

Step 3:
  #. Remove WConfD's RPC functions for obtaining/releasing the single
     internal lock from Step 1.
  #. Remove WConfD's RPC functions for sending/receiving the whole
     configuration from Step 1.

Future aims:

-  Optionally refactor the RPC calls to reduce their number or improve their
   efficiency (for example by obtaining a larger set of data instead of
   querying items one by one).
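The client-side sequence of Step 1 (lock, fetch, work, send back if changed,
unlock) can be sketched as follows. Everything in this sketch is an
assumption made for illustration: ``InMemoryWConfD`` is a stand-in that
keeps the configuration in a dictionary instead of talking to a daemon, its
method names are invented, and shared locking is simplified to a single
holder.

```python
import copy
from contextlib import contextmanager


class InMemoryWConfD:
    """Hypothetical stand-in for the daemon: one lock, one configuration."""

    def __init__(self, config):
        self._config = config
        self.locked_by = None

    def lock_config(self, client, shared):
        if self.locked_by is not None:
            raise RuntimeError("configuration lock is already held")
        self.locked_by = (client, shared)

    def unlock_config(self, client):
        self.locked_by = None

    def read_config(self):
        return copy.deepcopy(self._config)

    def write_config(self, config):
        self._config = copy.deepcopy(config)


@contextmanager
def locked_config(wconfd, client, shared=False):
    """Lock, fetch the configuration, and write it back only if it changed."""
    wconfd.lock_config(client, shared)
    try:
        config = wconfd.read_config()
        original = copy.deepcopy(config)
        yield config
        if not shared and config != original:
            wconfd.write_config(config)
    finally:
        # The lock is released even if the caller's code raised.
        wconfd.unlock_config(client)


daemon = InMemoryWConfD({"cluster_name": "example", "serial_no": 1})
with locked_config(daemon, client="job-7") as cfg:
    cfg["serial_no"] += 1
print(daemon.read_config()["serial_no"])  # 2
```

A context manager keeps the release step tied to the acquisition, which
mirrors the requirement that locks must not outlive the work they protect.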

    
Locking
+++++++

The new locking protocol will be implemented as follows:

Re-implement the current locking mechanism in WConfD and expose it for RPC
calls. All current locks will be mapped into a data structure that will
uniquely identify them (storing lock's level together with its name).

WConfD will impose a linear order on locks. The order will be compatible
with the current ordering of lock levels so that existing code will work
without changes.

WConfD will keep the set of currently held locks for each client. The
protocol will allow the following operations on the set:

*Update:*
  Update the current set of locks according to a given list. The list contains
  locks and their desired level (release / shared / exclusive). To prevent
  deadlocks, WConfD will check that all newly requested locks (or already held
  locks requested to be upgraded to *exclusive*) are greater in the sense of
  the linear order than all currently held locks, and fail the operation if
  not. Only the locks in the list will be updated, other locks already held
  will be left intact. If the operation fails, the client's lock set will be
  left intact.
*Opportunistic union:*
  Add as many locks as possible from a given set to the current set within a
  given timeout. WConfD will again check the proper order of locks and
  acquire only the ones that are allowed wrt. the current set.  Returns the
  set of acquired locks, possibly empty. Immediate. Never fails. (It would also
  be possible to extend the operation to try to wait until a given number of
  locks is available, or a given timeout elapses.)
*List:*
  List the current set of held locks. Immediate, never fails.
*Intersection:*
  Retain only a given set of locks in the current one. This function is
  provided for convenience, it's redundant wrt. *list* and *update*. Immediate,
  never fails.
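The order check performed by *update* can be sketched for a single client.
This is a simplified model under stated assumptions: locks are represented
as ``(level, name)`` tuples so that Python's tuple comparison provides the
linear order, contention between clients is ignored, an upgraded lock is
compared against the *other* held locks, and the names ``update_locks`` and
``LockOrderError`` are invented for the example.

```python
class LockOrderError(Exception):
    """Raised when a request would violate the linear lock order."""


def update_locks(held, requests):
    """Return the new lock set, or raise and leave ``held`` untouched.

    ``held`` maps a lock to "shared" or "exclusive"; ``requests`` is a list
    of (lock, level) pairs where level may also be "release".
    """
    new_held = dict(held)  # work on a copy so a failure changes nothing
    for lock, level in requests:
        if level == "release":
            new_held.pop(lock, None)
            continue
        if level not in ("shared", "exclusive"):
            raise ValueError("unknown lock level: %r" % (level,))
        newly_requested = lock not in held
        upgrade = held.get(lock) == "shared" and level == "exclusive"
        if newly_requested or upgrade:
            others = [other for other in held if other != lock]
            if others and lock < max(others):
                raise LockOrderError("%r is not above all held locks" % (lock,))
        new_held[lock] = level
    return new_held


held = {(1, "node1"): "shared"}
held = update_locks(held, [((1, "node2"), "exclusive")])
try:
    update_locks(held, [((0, "group1"), "shared")])
except LockOrderError as err:
    print("rejected:", err)
print(sorted(held))  # [(1, 'node1'), (1, 'node2')]
```

Because the function only ever returns a fresh dictionary, a rejected
request leaves the caller's lock set exactly as it was.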

    
Additional restrictions due to lock implications:
  Ganeti supports locks that act as if a lock on a whole group (like all nodes)
  were held. To avoid deadlocks caused by the additional blockage of those
  group locks, we impose certain restrictions. Whenever `A` is a group lock and
  `B` belongs to `A`, then the following holds.

  - `A` is in lock order before `B`.
  - All locks that are in the lock order between `A` and `B` also belong to `A`.
  - It is considered a lock-order violation to ask for an exclusive lock on `B`
    while holding a shared lock on `A`.

After this step it'll be possible to use locks from jobs as separate processes.

    
The above set of operations allows the clients to use various workflows. In
particular:

Pessimistic strategy:
  Lock all potentially relevant resources (for example all nodes), determine
  which will be needed, and release all the others.
Optimistic strategy:
  Determine what locks need to be acquired without holding any. Lock the
  required set of locks. Determine the set of required locks again and check if
  they are all held. If not, release everything and restart.
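The optimistic strategy amounts to a retry loop, sketched below. The three
callbacks and the ``max_attempts`` bound are assumptions added for the
example; the text above only says to release everything and restart.

```python
def run_optimistically(determine_needed, acquire, release_all, max_attempts=5):
    """Acquire the locks a job needs, retrying if the needed set changes.

    ``determine_needed`` returns the currently required set of locks,
    ``acquire`` locks a whole set, ``release_all`` drops every held lock.
    All three are hypothetical callbacks supplied by the caller.
    """
    for _ in range(max_attempts):
        wanted = set(determine_needed())   # computed without holding locks
        acquire(wanted)
        needed_now = set(determine_needed())
        if needed_now <= wanted:           # everything required is held
            return wanted
        release_all()                      # requirements changed: restart
    raise RuntimeError("required lock set kept changing")


# Simulation: the required set grows once while the job is acquiring locks.
answers = iter([{"node1"}, {"node1", "node2"},
                {"node1", "node2"}, {"node1", "node2"}])
held = set()
result = run_optimistically(
    determine_needed=lambda: next(answers),
    acquire=held.update,
    release_all=held.clear,
)
print(sorted(result))  # ['node1', 'node2']
```

The second attempt succeeds because the set computed after locking is
contained in the set that was locked.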

    
.. COMMENTED OUT:
  Start with the smallest set of locks and when determining what more
  relevant resources will be needed, expand the set. If an *union* operation
  fails, release all locks, acquire the desired union and restart the
  operation so that all preconditions and possible concurrent changes are
  checked again.

    
Future aims:

-  Add more fine-grained locks to prevent unnecessary blocking of jobs. This
   could include locks on parameters of entities or locks on their states (so that
   a node remains online, but otherwise can change, etc.). In particular,
   adding, moving and removing instances currently blocks the whole node.
-  Add checks that all modified configuration parameters belong to entities
   the client has locked and log violations.
-  Make the above checks mandatory.
-  Automate optimistic locking and checking the locks in logical units.
   For example, this could be accomplished by allowing some of the initial
   phases of `LogicalUnit` (such as `ExpandNames` and `DeclareLocks`) to be run
   repeatedly, checking if the set of locks requested the second time is
   contained in the set acquired after the first pass.
-  Add the possibility for a job to reserve hardware resources such as disk
   space or memory on nodes. Most likely as a new, special kind of instance
   that would only block its resources and could later be converted to a regular
   instance. This would allow long-running jobs such as instance creation or
   move to lock the corresponding nodes, acquire the resources and turn the
   locks into shared ones, keeping an exclusive lock only on the instance.
-  Use a more sophisticated algorithm for preventing deadlocks such as a
   `wait-for graph`_. This would allow fewer *union* failures and allow more
   optimistic, scalable acquisition of locks.

.. _`wait-for graph`: http://en.wikipedia.org/wiki/Wait-for_graph

    
Further considerations
======================

There is a possibility that a job will finish performing its task while LuxiD
and/or WConfD are not available.
In order to deal with this situation, each job will update its job file
in the queue. This is race free, as LuxiD will no longer touch the job file,
once the job is started; a corollary of this is that the job also has to
take care of replicating updates to the job file. LuxiD will watch job files for
changes to determine when a job has cleanly finished. To determine jobs
that died without having the chance of updating the job file, the `Job death
detection`_ mechanism will be used.

    
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: