Ganeti 2.0 Master daemon
========================

.. contents::

Objective
---------

Many of the important features of Ganeti 2.0 (job queue, granular
locking, external API, etc.) will be integrated via a master
daemon. While not absolutely necessary, it is the best way to
integrate all these components.

Background
----------

Currently there is no "master" daemon in Ganeti (1.2). Each command
tries to acquire the so-called *cmd* lock and, when it succeeds, it
takes complete ownership of the cluster configuration and state. The
scheduled improvements to Ganeti require, or can make use of, a daemon
that coordinates the scheduled activities and jobs.

Overview
--------

The master daemon will be the central point of the cluster; command
line tools and the external API will interact with the cluster via
this daemon; it will be the one coordinating the node daemons.

This design doc is best read in the context of the accompanying design
docs for Ganeti 2.0: Granular locking design and Job queue design.

Detailed Design
---------------

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

Interaction paths are between:

- (CLI tools/RAPI daemon) and the master daemon, via the so-called *luxi* API
- the master daemon and the node daemons, via the node RPC

The protocol between the master daemon and the node daemons will be
changed to HTTP(S), using a simple PUT/GET of JSON-encoded
messages. This is done due to difficulties in working with the Twisted
framework and its protocols in a multithreaded environment, which we can
overcome by using a simpler stack (see the caveats section). The protocol
between the CLI/RAPI and the master daemon will be a custom one: over a UNIX
socket on the master node, with rights restricted by filesystem
permissions, the CLI/RAPI will talk to the master daemon using JSON-encoded
messages.
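
As an illustration only, a master-to-node RPC call could then look
roughly like the sketch below; the URL scheme, port number and payload
shape are assumptions made for this example, not the actual protocol
definition::

    # Hypothetical master -> node call: PUT a JSON-encoded request and
    # read back a JSON-encoded response over HTTP.
    import http.client
    import json

    def node_rpc(node, method, args, port=8000):
        conn = http.client.HTTPConnection(node, port, timeout=60)
        try:
            conn.request("PUT", "/%s" % method, body=json.dumps(args),
                         headers={"Content-Type": "application/json"})
            return json.loads(conn.getresponse().read())
        finally:
            conn.close()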

    
The operations supported over this internal (LUXI) protocol will be
encoded via a Python library that will expose a simple API to its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.
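
As a sketch of what such a library call might do on the client side,
assuming a hypothetical socket path and a simple newline-terminated
framing of the JSON messages (neither of which is specified here)::

    # Hypothetical CLI/RAPI -> master daemon request over the local
    # UNIX socket; field names are illustrative only.
    import json
    import socket

    def luxi_call(method, args, address="/var/run/ganeti-master.sock"):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(address)
            sock.sendall(json.dumps({"method": method,
                                     "args": args}).encode("utf-8") + b"\n")
            data = b""
            while not data.endswith(b"\n"):
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
            return json.loads(data)
        finally:
            sock.close()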

    
The LUXI protocol
~~~~~~~~~~~~~~~~~

We will have two main classes of operations over the master daemon API:

- cluster query functions
- job-related functions

The cluster query functions are usually short-duration, and are the
equivalent of the OP_QUERY_* opcodes in Ganeti 1.2 (internally they are
still implemented via these opcodes). The clients are guaranteed to
receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized as a query function)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details, see the job queue design document.
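
Building on the hypothetical ``luxi_call`` sketch above, the
job-related operations could be driven as follows; the method names,
argument layout and reply fields are invented for illustration and are
defined for real by the job queue design and its implementation::

    # Submit a job (a list of opcodes) and get back its job ID.
    job_id = luxi_call("SubmitJob", [[{"OP_ID": "OP_INSTANCE_STARTUP",
                                       "instance_name": "instance1"}]])

    # Wait without polling: returns only when the job's state changes.
    status = luxi_call("WaitForJobChange", [job_id])

    # Short-duration query, answered within the client timeout.
    info = luxi_call("QueryJobs", [[job_id], ["status", "opresult"]])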

    
Daemon implementation
~~~~~~~~~~~~~~~~~~~~~

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other threads (pools).

There will be two other classes of threads in the daemon (a rough
sketch of this structure follows the list):

- job processing threads, part of a thread pool, which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  to the clients
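
A rough sketch of this thread structure, with sizes, names and the job
representation all chosen arbitrarily for the example::

    # Illustrative only: a fixed pool of long-lived job workers fed
    # from a queue, plus per-client I/O threads spawned by the main
    # I/O thread's accept loop.
    import queue
    import threading

    job_queue = queue.Queue()

    def job_worker():
        # Long-lived: started at daemon startup, terminated only at
        # shutdown time (signalled here by a None sentinel).
        while True:
            job = job_queue.get()
            if job is None:
                break
            job()  # execute the queued job

    def serve_client(conn):
        # Client I/O thread: speaks the local protocol to one client,
        # turning requests into queued jobs or direct queries.
        pass

    workers = [threading.Thread(target=job_worker) for _ in range(16)]
    for worker in workers:
        worker.start()
    # The main I/O thread would now accept() connections and start a
    # threading.Thread(target=serve_client, args=(conn,)) per client.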

    
Master startup/failover
~~~~~~~~~~~~~~~~~~~~~~~

In Ganeti 1.x there is no protection against failing over the master
to a node with a stale configuration. In effect, the responsibility for
correct failovers falls on the admin. This is true both for the new
master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will run whenever a node is ready to
transition to the master role, either at startup time or at node
failover (a sketch of the quorum check follows the list):

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      highest job ID on the job queue)

    - there is not even a single node having a newer
      configuration file

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)
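
A sketch of the quorum check in the second step, assuming a
hypothetical ``get_node_state`` helper that returns a node's
configuration serial number and highest job ID (or ``None`` if the
node is unreachable); error handling and the actual RPC are omitted::

    def have_quorum(nodes, my_serial, my_max_job_id, get_node_state):
        votes = 0
        for node in nodes:
            state = get_node_state(node)
            if state is None:
                continue  # unreachable node: no vote either way
            if (state["config_serial"] > my_serial or
                    state["max_job_id"] > my_max_job_id):
                return False  # some node has newer data than we do
            votes += 1
        # agreement from at least half plus one of the nodes
        return votes >= len(nodes) // 2 + 1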

    
Logging
~~~~~~~

The logging system will be switched completely to the standard logging
module; currently it is built on top of the logging module but exposes
a different API, which is just overhead. As such, the code will be
switched over to standard logging calls, and only the setup will be
custom.
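
A minimal example of the kind of setup this implies, using only the
standard logging module (the log directory and format string are
illustrative, not prescribed here)::

    # Custom setup, standard calls: each daemon attaches one handler
    # for its own log file, then plain logging is used everywhere.
    import logging

    def setup_logging(logfile, debug=False):
        handler = logging.FileHandler(logfile)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(message)s"))
        root = logging.getLogger()
        root.addHandler(handler)
        root.setLevel(logging.DEBUG if debug else logging.INFO)

    setup_logging("/var/log/ganeti/master-daemon.log")
    logging.info("master daemon starting")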

    
With this change, we will remove the separate debug/info/error logs,
and instead always have one log file per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format, for possible parsing by other tools

    
Since the watcher will only submit jobs to the master for starting up
the instances, its log file will contain less information than before:
mainly that it asked for an instance to be started, but not the results.

Caveats
-------

A discussed alternative is to keep the current model of individual
processes touching the cluster configuration. The reasons we have not
chosen this approach are:

- the cost of reading and deserializing the cluster state today is
  not small enough that we can ignore it, and the addition of the
  job queue will make the startup cost even higher. While this
  runtime cost is only on the order of a few seconds on bigger
  clusters, for very quick commands it is comparable to the actual
  duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines of "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized'; we were able
  to keep the 1.x code clean by hacking around Twisted in an
  unsupported, unrecommended way, and the only alternative would have
  been to rewrite all the code for Twisted
- it has some weaknesses in working with multiple threads, since its
  base model is designed to replace thread usage with deferred calls,
  so while it can use threads, it is less flexible in doing so

And, since we already have an HTTP server library for the RAPI, we
can just reuse that for inter-node communication.