1 | bb083b25 | Iustin Pop | Ganeti 2.0 Master daemon |
---|---|---|---|
2 | bb083b25 | Iustin Pop | ======================== |
3 | bb083b25 | Iustin Pop | |
4 | 47eb4b45 | Guido Trotter | .. contents:: |
5 | 47eb4b45 | Guido Trotter | |
6 | bb083b25 | Iustin Pop | Objective |
7 | bb083b25 | Iustin Pop | --------- |
8 | bb083b25 | Iustin Pop | |
9 | bb083b25 | Iustin Pop | Many of the important features of Ganeti 2.0 — job queue, granular |
10 | bb083b25 | Iustin Pop | locking, external API, etc. — will be integrated via a master |
11 | bb083b25 | Iustin Pop | daemon. While not absolutely necessary, it is the best way to |
12 | bb083b25 | Iustin Pop | integrate all these components. |
13 | bb083b25 | Iustin Pop | |
14 | bb083b25 | Iustin Pop | Background |
15 | bb083b25 | Iustin Pop | ---------- |
16 | bb083b25 | Iustin Pop | |
17 | bb083b25 | Iustin Pop | Currently there is no "master" daemon in Ganeti (1.2). Each command |
18 | bb083b25 | Iustin Pop | tries to acquire the so-called *cmd* lock and, when it succeeds, it |
19 | bb083b25 | Iustin Pop | takes complete ownership of the cluster configuration and state. The |
20 | bb083b25 | Iustin Pop | planned improvements to Ganeti either require or can benefit from a daemon that |
21 | bb083b25 | Iustin Pop | coordinates the scheduled activities and jobs. |
22 | bb083b25 | Iustin Pop | |
23 | bb083b25 | Iustin Pop | Overview |
24 | bb083b25 | Iustin Pop | -------- |
25 | bb083b25 | Iustin Pop | |
26 | bb083b25 | Iustin Pop | The master daemon will be the central point of the cluster; command |
27 | bb083b25 | Iustin Pop | line tools and the external API will interact with the cluster via |
28 | bb083b25 | Iustin Pop | this daemon; it will be the one coordinating the node daemons. |
29 | bb083b25 | Iustin Pop | |
30 | bb083b25 | Iustin Pop | This design doc is best read in the context of the accompanying design |
31 | bb083b25 | Iustin Pop | docs for Ganeti 2.0: Granular locking design and Job queue design. |
32 | bb083b25 | Iustin Pop | |
33 | bb083b25 | Iustin Pop | |
34 | bb083b25 | Iustin Pop | Detailed Design |
35 | bb083b25 | Iustin Pop | --------------- |
36 | bb083b25 | Iustin Pop | |
37 | bb083b25 | Iustin Pop | In Ganeti 2.0, we will have the following *entities*: |
38 | bb083b25 | Iustin Pop | |
39 | bb083b25 | Iustin Pop | - the master daemon (on master node) |
40 | bb083b25 | Iustin Pop | - the node daemon (all nodes) |
41 | bb083b25 | Iustin Pop | - the command line tools (master node) |
42 | bb083b25 | Iustin Pop | - the RAPI daemon (master node) |
43 | bb083b25 | Iustin Pop | |
44 | bb083b25 | Iustin Pop | Interaction paths are between: |
45 | bb083b25 | Iustin Pop | |
46 | bb083b25 | Iustin Pop | - (CLI tools/RAPI daemon) and the master daemon, via the so-called *luxi* API |
47 | bb083b25 | Iustin Pop | - the master daemon and the node daemons, via the node RPC |
48 | bb083b25 | Iustin Pop | |
49 | bb083b25 | Iustin Pop | The protocol between the master daemon and the node daemons will be |
50 | bb083b25 | Iustin Pop | changed to HTTP(S), using a simple PUT/GET of JSON-encoded |
51 | efd0d44f | Michael Hanselmann | messages. This is done due to difficulties in working with the Twisted |
52 | efd0d44f | Michael Hanselmann | framework and its protocols in a multithreaded environment, which we can |
53 | efd0d44f | Michael Hanselmann | overcome by using a simpler stack (see the caveats section). The protocol |
54 | efd0d44f | Michael Hanselmann | between the CLI/RAPI and the master daemon will be a custom one: on a UNIX |
55 | bb083b25 | Iustin Pop | socket on the master node, with rights restricted by filesystem |
56 | efd0d44f | Michael Hanselmann | permissions, the CLI/RAPI will talk to the master daemon using JSON-encoded |
57 | efd0d44f | Michael Hanselmann | messages. |
58 | bb083b25 | Iustin Pop | |
59 | bb083b25 | Iustin Pop | The operations supported over this internal protocol will be encoded |
60 | bb083b25 | Iustin Pop | via a Python library that will expose a simple API for its |
61 | bb083b25 | Iustin Pop | users. Internally, the protocol will simply encode all objects in JSON |
62 | bb083b25 | Iustin Pop | format and decode them on the receiver side. |
63 | bb083b25 | Iustin Pop | |
64 | bb083b25 | Iustin Pop | The LUXI protocol |
65 | bb083b25 | Iustin Pop | ~~~~~~~~~~~~~~~~~ |
66 | bb083b25 | Iustin Pop | |
67 | bb083b25 | Iustin Pop | We will have two main classes of operations over the master daemon API: |
68 | bb083b25 | Iustin Pop | |
69 | bb083b25 | Iustin Pop | - cluster query functions |
70 | bb083b25 | Iustin Pop | - job related functions |
71 | bb083b25 | Iustin Pop | |
72 | bb083b25 | Iustin Pop | The cluster query functions are usually short-duration, and are the |
73 | bb083b25 | Iustin Pop | equivalent of the OP_QUERY_* opcodes in Ganeti 1.2 (and they are |
74 | bb083b25 | Iustin Pop | still implemented internally with these opcodes). The clients are |
75 | bb083b25 | Iustin Pop | guaranteed to receive the response in a reasonable time via a timeout. |
76 | bb083b25 | Iustin Pop | |
77 | bb083b25 | Iustin Pop | The job-related functions will be: |
78 | bb083b25 | Iustin Pop | |
79 | bb083b25 | Iustin Pop | - submit job |
80 | bb083b25 | Iustin Pop | - query job (which could also be categorized in the query-functions) |
81 | bb083b25 | Iustin Pop | - archive job (see the job queue design doc) |
82 | bb083b25 | Iustin Pop | - wait for job change, which allows a client to wait without polling |
83 | bb083b25 | Iustin Pop | |
84 | efd0d44f | Michael Hanselmann | For more details, see the job queue design document. |
85 | efd0d44f | Michael Hanselmann | |
86 | bb083b25 | Iustin Pop | Daemon implementation |
87 | bb083b25 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~ |
88 | bb083b25 | Iustin Pop | |
89 | bb083b25 | Iustin Pop | The daemon will be based around a main I/O thread that will wait for |
90 | bb083b25 | Iustin Pop | new requests from the clients, and that does the setup/shutdown of the |
91 | bb083b25 | Iustin Pop | other threads and thread pools. |
92 | bb083b25 | Iustin Pop | |
93 | bb083b25 | Iustin Pop | There will be two other classes of threads in the daemon: |
94 | bb083b25 | Iustin Pop | |
95 | bb083b25 | Iustin Pop | - job processing threads, part of a thread pool, and which are |
96 | bb083b25 | Iustin Pop | long-lived, started at daemon startup and terminated only at shutdown |
97 | bb083b25 | Iustin Pop | time |
98 | bb083b25 | Iustin Pop | - client I/O threads, which are the ones that talk the local protocol |
99 | bb083b25 | Iustin Pop | to the clients |
100 | bb083b25 | Iustin Pop | |
101 | bb083b25 | Iustin Pop | Master startup/failover |
102 | bb083b25 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~ |
103 | bb083b25 | Iustin Pop | |
104 | bb083b25 | Iustin Pop | In Ganeti 1.x there is no protection against failing over the master |
105 | bb083b25 | Iustin Pop | to a node with stale configuration. In effect, the responsibility of |
106 | bb083b25 | Iustin Pop | correct failovers falls on the admin. This is true both for the new |
107 | bb083b25 | Iustin Pop | master and for when an old, offline master starts up. |
108 | bb083b25 | Iustin Pop | |
109 | bb083b25 | Iustin Pop | Since in 2.x we are extending the cluster state to cover the job queue |
110 | bb083b25 | Iustin Pop | and have a daemon that will execute the job queue by itself, we want |
111 | bb083b25 | Iustin Pop | to have more resilience for the master role. |
112 | bb083b25 | Iustin Pop | |
113 | bb083b25 | Iustin Pop | The following algorithm will run whenever a node is ready to |
114 | bb083b25 | Iustin Pop | transition to the master role, either at startup time or at node |
115 | bb083b25 | Iustin Pop | failover: |
116 | bb083b25 | Iustin Pop | |
117 | bb083b25 | Iustin Pop | #. read the configuration file and parse the node list |
118 | bb083b25 | Iustin Pop | contained within |
119 | bb083b25 | Iustin Pop | |
120 | bb083b25 | Iustin Pop | #. query all the nodes and make sure we obtain an agreement via |
121 | bb083b25 | Iustin Pop | a quorum of at least half plus one of the nodes for the following: |
122 | bb083b25 | Iustin Pop | |
123 | bb083b25 | Iustin Pop | - we have the latest configuration and job list (as |
124 | bb083b25 | Iustin Pop | determined by the serial number on the configuration and |
125 | bb083b25 | Iustin Pop | highest job ID on the job queue) |
126 | bb083b25 | Iustin Pop | |
127 | bb083b25 | Iustin Pop | - there is not even a single node having a newer |
128 | bb083b25 | Iustin Pop | configuration file |
129 | bb083b25 | Iustin Pop | |
130 | bb083b25 | Iustin Pop | - if we are not failing over (but just starting), the |
131 | bb083b25 | Iustin Pop | quorum agrees that we are the designated master |
132 | bb083b25 | Iustin Pop | |
133 | bb083b25 | Iustin Pop | #. at this point, the node transitions to the master role |
134 | bb083b25 | Iustin Pop | |
135 | bb083b25 | Iustin Pop | #. for all the in-progress jobs, mark them as failed, with |
136 | bb083b25 | Iustin Pop | reason unknown or something similar (master failed, etc.) |
137 | bb083b25 | Iustin Pop | |
138 | bb083b25 | Iustin Pop | |
139 | bb083b25 | Iustin Pop | Logging |
140 | bb083b25 | Iustin Pop | ~~~~~~~ |
141 | bb083b25 | Iustin Pop | |
142 | bb083b25 | Iustin Pop | The logging system will be switched completely to the logging module; |
143 | bb083b25 | Iustin Pop | currently it's logging-based, but exposes a different API, which is |
144 | bb083b25 | Iustin Pop | just overhead. As such, the code will be switched over to standard |
145 | bb083b25 | Iustin Pop | logging calls, and only the setup will be custom. |
146 | bb083b25 | Iustin Pop | |
147 | bb083b25 | Iustin Pop | With this change, we will remove the separate debug/info/error logs, |
148 | bb083b25 | Iustin Pop | and instead always have one logfile per daemon type: |
149 | bb083b25 | Iustin Pop | |
150 | bb083b25 | Iustin Pop | - master-daemon.log for the master daemon |
151 | bb083b25 | Iustin Pop | - node-daemon.log for the node daemon (this is the same as in 1.2) |
152 | bb083b25 | Iustin Pop | - rapi-daemon.log for the RAPI daemon logs |
153 | bb083b25 | Iustin Pop | - rapi-access.log, an additional log file for the RAPI that will be |
154 | bb083b25 | Iustin Pop | in the standard http log format for possible parsing by other tools |
155 | bb083b25 | Iustin Pop | |
156 | bb083b25 | Iustin Pop | Since the watcher will only submit jobs to the master for startup of |
157 | bb083b25 | Iustin Pop | the instances, its log file will contain less information than before, |
158 | bb083b25 | Iustin Pop | mainly that it will start the instance, but not the results. |
159 | bb083b25 | Iustin Pop | |
160 | bb083b25 | Iustin Pop | Caveats |
161 | bb083b25 | Iustin Pop | ------- |
162 | bb083b25 | Iustin Pop | |
163 | bb083b25 | Iustin Pop | A discussed alternative is to keep the current individual processes |
164 | bb083b25 | Iustin Pop | touching the cluster configuration model. The reasons we have not |
165 | bb083b25 | Iustin Pop | chosen this approach are: |
166 | bb083b25 | Iustin Pop | |
167 | bb083b25 | Iustin Pop | - the time needed to read and unserialize the cluster state |
168 | bb083b25 | Iustin Pop | today is not small enough that we can ignore it; the addition of |
169 | bb083b25 | Iustin Pop | the job queue will make the startup cost even higher. While this |
170 | bb083b25 | Iustin Pop | runtime cost is low, it can be on the order of a few seconds on |
171 | bb083b25 | Iustin Pop | bigger clusters, which for very quick commands is comparable to |
172 | bb083b25 | Iustin Pop | the actual duration of the computation itself |
173 | bb083b25 | Iustin Pop | |
174 | bb083b25 | Iustin Pop | - individual commands would make it harder to implement a |
175 | bb083b25 | Iustin Pop | fire-and-forget job request, along the lines of "start this |
176 | bb083b25 | Iustin Pop | instance but do not wait for it to finish"; it would require a |
177 | bb083b25 | Iustin Pop | model of backgrounding the operation and other things that are |
178 | bb083b25 | Iustin Pop | much better served by a daemon-based model |
179 | bb083b25 | Iustin Pop | |
180 | bb083b25 | Iustin Pop | Another area of discussion is moving away from Twisted in this new |
181 | bb083b25 | Iustin Pop | implementation. While Twisted has its advantages, there are also many |
182 | bb083b25 | Iustin Pop | disadvantages to using it: |
183 | bb083b25 | Iustin Pop | |
184 | bb083b25 | Iustin Pop | - first and foremost, it's not a library, but a framework; thus, if |
185 | bb083b25 | Iustin Pop | you use Twisted, all the code needs to be 'twisted-ized'; we were able |
186 | bb083b25 | Iustin Pop | to keep the 1.x code clean by hacking around Twisted in an |
187 | bb083b25 | Iustin Pop | unsupported, unrecommended way, and the only alternative would have |
188 | bb083b25 | Iustin Pop | been to make all the code be written for Twisted |
189 | bb083b25 | Iustin Pop | - it has some weaknesses in working with multiple threads, since its base |
190 | efd0d44f | Michael Hanselmann | model is designed to replace thread usage by using deferred calls, so while |
191 | efd0d44f | Michael Hanselmann | it can use threads, it is less flexible in doing so |
192 | bb083b25 | Iustin Pop | |
193 | efd0d44f | Michael Hanselmann | And, since we already have an HTTP server library for the RAPI, we |
194 | bb083b25 | Iustin Pop | can just reuse that for inter-node communication. |
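
The quorum check described under "Master startup/failover" can be illustrated with a small sketch. It is not taken from the Ganeti code base: the `NodeVote` record, the `may_become_master` function and all parameter names are hypothetical, and the sketch assumes that a node "agrees" when its configuration serial number and highest job ID are not newer than the candidate's. Nodes that did not answer simply do not appear in `votes`; whether the candidate counts its own vote is left to the caller.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class NodeVote:
    """One node's answer to the candidate master (hypothetical shape)."""
    node: str
    config_serial: int    # serial number of the node's configuration copy
    highest_job_id: int   # highest job ID in the node's job queue copy
    believed_master: str  # node this voter considers the designated master


def may_become_master(candidate: str,
                      my_serial: int,
                      my_job_id: int,
                      node_list: Sequence[str],
                      votes: Iterable[NodeVote],
                      failing_over: bool) -> bool:
    """Return True if the candidate may transition to the master role."""
    known = set(node_list)
    quorum = len(known) // 2 + 1  # "half plus one" of the configured nodes

    # Keep one vote per known node; ignore answers from unknown nodes.
    by_node = {}
    for vote in votes:
        if vote.node in known:
            by_node[vote.node] = vote

    # Not even a single node may hold a newer configuration file.
    if any(v.config_serial > my_serial for v in by_node.values()):
        return False

    agreeing = 0
    for v in by_node.values():
        # Assumption: "latest" means no voter has a higher job ID than ours.
        up_to_date = v.highest_job_id <= my_job_id
        # On a plain startup the voter must also see us as the master;
        # during a failover this condition is skipped.
        master_ok = failing_over or v.believed_master == candidate
        if up_to_date and master_ok:
            agreeing += 1

    return agreeing >= quorum
```

For a five-node cluster the quorum is three, so three agreeing answers are enough, while a single node reporting a newer configuration serial vetoes the transition regardless of how many other nodes agree.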