Ganeti 2.0 Master daemon
========================

Objective
---------

Many of the important features of Ganeti 2.0 — job queue, granular
locking, external API, etc. — will be integrated via a master
daemon. While not absolutely necessary, it is the best way to
integrate all these components.

Background
----------

Currently there is no "master" daemon in Ganeti (1.2). Each command
tries to acquire the so-called *cmd* lock and, when it succeeds,
takes complete ownership of the cluster configuration and state. The
scheduled improvements to Ganeti require, or can take advantage of, a
daemon that coordinates the scheduled activities and jobs.

Overview
--------

The master daemon will be the central point of the cluster: command
line tools and the external API will interact with the cluster via
this daemon, and it will be the one coordinating the node daemons.

This design doc is best read in the context of the accompanying design
docs for Ganeti 2.0: the granular locking design and the job queue
design.


Detailed Design
---------------

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

The interaction paths are:

- between the CLI tools/RAPI daemon and the master daemon, via the
  so-called *luxi* API
- between the master daemon and the node daemons, via the node RPC

The protocol between the master daemon and the node daemons will be
changed to HTTP(S), using a simple PUT/GET of JSON-encoded
messages. This is done due to difficulties in working with the Twisted
protocols in a multithreaded environment, which we can overcome by
using a simpler stack (see the caveats section). The protocol between
the CLI/RAPI and the master daemon will be a custom one: the CLI/RAPI
will speak HTTP to the master daemon over a UNIX socket on the master
node, with access rights restricted by filesystem permissions.
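
As a brief illustration of this node RPC model, a master-side call
could look like the following sketch; the URL layout, port number and
``call_node`` helper are assumptions made for illustration, not the
actual Ganeti code::

  import http.client
  import json

  def call_node(node, port, procedure, args):
      """PUT one JSON-encoded request to a node daemon, decode the reply."""
      conn = http.client.HTTPConnection(node, port)
      try:
          conn.request("PUT", "/" + procedure, body=json.dumps(args),
                       headers={"Content-Type": "application/json"})
          return json.loads(conn.getresponse().read())
      finally:
          conn.close()

  # e.g. ask a node daemon to start an instance (names are illustrative)
  result = call_node("node1.example.com", 1811, "instance_start",
                     {"instance_name": "instance1"})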

The operations supported over this internal protocol will be encoded
via a Python library that will expose a simple API to its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.
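
For example, the client side of such a library might look like the
following sketch; the socket path and the "method"/"args" message
fields are assumptions used for illustration, not the final wire
format::

  import json
  import socket

  MASTER_SOCKET = "/var/run/ganeti/master.sock"  # hypothetical path

  def call_master(method, args):
      """Encode a request as JSON, send it over the UNIX socket and
      decode the JSON-encoded reply."""
      sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      try:
          sock.connect(MASTER_SOCKET)
          sock.sendall(json.dumps({"method": method,
                                   "args": args}).encode())
          sock.shutdown(socket.SHUT_WR)  # signal end of request
          data = b""
          while True:
              chunk = sock.recv(4096)
              if not chunk:
                  break
              data += chunk
          return json.loads(data)
      finally:
          sock.close()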

The LUXI protocol
~~~~~~~~~~~~~~~~~

We will have two main classes of operations over the master daemon API:

- cluster query functions
- job-related functions

The cluster query functions are usually short-duration, and are the
equivalent of the OP_QUERY_* opcodes in Ganeti 1.2 (and they are
still implemented internally with these opcodes). The clients are
guaranteed to receive the response within a reasonable time, enforced
via a timeout.

The job-related functions will be (a usage sketch follows the list):

- submit job
- query job (which could also be categorized among the query functions)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling
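
As a sketch of how a client might drive these four calls through the
Python library mentioned above; the ``luxi.Client`` class, its method
names and the opcode encoding are assumptions for illustration::

  from ganeti import luxi  # hypothetical import path

  client = luxi.Client()

  # Fire-and-forget submission: returns a job ID immediately.
  job_id = client.SubmitJob([("OP_INSTANCE_STARTUP", "instance1")])

  # Wait for a change instead of polling the job status.
  client.WaitForJobChange(job_id, fields=["status"])

  # Query the result, then archive the finished job.
  print(client.QueryJobs([job_id], fields=["status"]))
  client.ArchiveJob(job_id)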

Daemon implementation
~~~~~~~~~~~~~~~~~~~~~

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that handles the setup and shutdown
of the other threads and thread pools.

There will be two other classes of threads in the daemon (a layout
sketch follows the list):

- job processing threads, which are part of a thread pool and are
  long-lived, started at daemon startup and terminated only at
  shutdown time
- client I/O threads, which are the ones that speak the local protocol
  to the clients
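
A minimal sketch of this thread layout, assuming a queue feeding the
worker pool; the names and pool size are illustrative, not the actual
implementation::

  import queue
  import threading

  POOL_SIZE = 4              # size of the job-processing pool (assumed)
  job_queue = queue.Queue()  # filled by the client I/O threads

  def job_worker():
      """Long-lived job processing thread: started at daemon startup,
      stopped only at shutdown time (via the None sentinel)."""
      while True:
          job = job_queue.get()
          if job is None:
              break
          job.execute()  # hypothetical job interface

  def client_thread(conn):
      """Client I/O thread: speaks the local protocol to one client."""
      with conn:
          request = conn.recv(4096)
          # ... decode the request, submit a job or answer a query ...

  def main_loop(listener):
      """Main I/O thread: sets up the pool, accepts new requests and
      shuts the other threads down at exit."""
      workers = [threading.Thread(target=job_worker)
                 for _ in range(POOL_SIZE)]
      for w in workers:
          w.start()
      try:
          while True:
              conn, _ = listener.accept()
              threading.Thread(target=client_thread, args=(conn,)).start()
      finally:
          for _ in workers:
              job_queue.put(None)  # ask each worker to exit
          for w in workers:
              w.join()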

Master startup/failover
~~~~~~~~~~~~~~~~~~~~~~~

In Ganeti 1.x there is no protection against failing over the master
to a node with a stale configuration. In effect, the responsibility
for correct failovers falls on the admin. This is true both for
failover to a new master and for when an old, offline master starts
up again.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will run whenever a node is ready to
transition to the master role, either at startup time or at node
failover (a sketch of the quorum check in step 2 follows the list):

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one of the nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      the highest job ID on the job queue)

    - not even a single node has a newer configuration file

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   a reason such as "master failed" or similar
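
The quorum check of step 2 could be sketched as follows;
``query_node_version`` stands in for the real per-node RPC and is a
hypothetical helper, as is the exact comparison of version data::

  def has_quorum(my_serial, my_top_job, node_list):
      """Return True if at least half plus one of the nodes confirm
      that our configuration and job list are the newest."""
      needed = len(node_list) // 2 + 1
      votes = 0
      for node in node_list:
          try:
              serial, top_job = query_node_version(node)  # hypothetical RPC
          except OSError:
              continue  # unreachable nodes cannot vote
          if (serial, top_job) > (my_serial, my_top_job):
              return False  # some node has newer state: abort the takeover
          votes += 1
      return votes >= needed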


Logging
~~~~~~~

The logging system will be switched completely to the standard
``logging`` module; currently it is already based on ``logging`` but
exposes a different API, which is just overhead. As such, the code
will be switched over to standard logging calls, and only the setup
will be custom.
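
A short sketch of such a setup: only the handler configuration is
custom, and all other code uses the standard ``logging`` calls (the
log path and format string are illustrative)::

  import logging

  def setup_logging(logfile="/var/log/ganeti/master-daemon.log"):
      handler = logging.FileHandler(logfile)
      handler.setFormatter(logging.Formatter(
          "%(asctime)s %(levelname)s %(message)s"))
      root = logging.getLogger("")
      root.addHandler(handler)
      root.setLevel(logging.INFO)

  # afterwards, code simply calls the standard API:
  # logging.info("job %s finished", job_id)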

With this change, we will remove the separate debug/info/error logs
and instead always have one log file per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format, for possible parsing by other tools

Since the watcher will only submit jobs to the master for the startup
of instances, its log file will contain less information than before:
mainly that it requested the start of an instance, but not the result.

Caveats
-------

A discussed alternative is to keep the current model of individual
processes touching the cluster configuration. The reasons we have not
chosen this approach are:

- the cost of reading and deserializing the cluster state today is
  not small enough that we can ignore it; the addition of the job
  queue will make the startup cost even higher. While this cost is
  low in absolute terms, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines of "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized'; we were
  able to keep the 1.x code clean by hacking around Twisted in an
  unsupported, unrecommended way, and the only alternative would have
  been to write all the code for Twisted
- it has some weaknesses in working with multiple threads, since its
  base model is designed to replace thread usage with deferreds, so
  while it can use threads, it is not very flexible in doing so

And, since we already have an HTTP server library (for the RAPI), we
can just reuse that for inter-node communication.
