/doc/design-daemons.rst - Annotate - snf-ganeti - Greek Research and Technology Network's projects

| Branch: | Tag: | Revision:

root / doc / design-daemons.rst @ bb3011ad

History | View | Annotate | Download (23.7 kB)

1	ffedf64d	Michele Tartara	==========================
2	ffedf64d	Michele Tartara	Ganeti daemons refactoring
3	ffedf64d	Michele Tartara	==========================
4	ffedf64d	Michele Tartara
5	ffedf64d	Michele Tartara	.. contents:: :depth: 2
6	ffedf64d	Michele Tartara
7	ffedf64d	Michele Tartara	This is a design document detailing the plan for refactoring the internal
8	ffedf64d	Michele Tartara	structure of Ganeti, and particularly the set of daemons it is divided into.
9	ffedf64d	Michele Tartara
10	ffedf64d	Michele Tartara
11	ffedf64d	Michele Tartara	Current state and shortcomings
12	ffedf64d	Michele Tartara	==============================
13	ffedf64d	Michele Tartara
14	ffedf64d	Michele Tartara	Ganeti is comprised of a growing number of daemons, each dealing with part of
15	ffedf64d	Michele Tartara	the tasks the cluster has to face, and communicating with the other daemons
16	ffedf64d	Michele Tartara	using a variety of protocols.
17	ffedf64d	Michele Tartara
18	ffedf64d	Michele Tartara	Specifically, as of Ganeti 2.8, the situation is as follows:
19	ffedf64d	Michele Tartara
20	ffedf64d	Michele Tartara	``Master daemon (MasterD)``
21	ffedf64d	Michele Tartara	It is responsible for managing the entire cluster, and it's written in Python.
22	ffedf64d	Michele Tartara	It is executed on a single node (the master node). It receives the commands
23	ffedf64d	Michele Tartara	given by the cluster administrator (through the remote API daemon or the
24	ffedf64d	Michele Tartara	command line tools) over the LUXI protocol. The master daemon is responsible
25	ffedf64d	Michele Tartara	for creating and managing the jobs that will execute such commands, and for
26	ffedf64d	Michele Tartara	managing the locks that ensure the cluster will not incur in race conditions.
27	ffedf64d	Michele Tartara
28	ffedf64d	Michele Tartara	Each job is managed by a separate Python thread, that interacts with the node
29	ffedf64d	Michele Tartara	daemons via RPC calls.
30	ffedf64d	Michele Tartara
31	ffedf64d	Michele Tartara	The master daemon is also responsible for managing the configuration of the
32	ffedf64d	Michele Tartara	cluster, changing it when required by some job. It is also responsible for
33	ffedf64d	Michele Tartara	copying the configuration to the other master candidates after updating it.
34	ffedf64d	Michele Tartara
35	ffedf64d	Michele Tartara	``RAPI daemon (RapiD)``
36	ffedf64d	Michele Tartara	It is written in Python and runs on the master node only. It waits for
37	ffedf64d	Michele Tartara	requests issued remotely through the remote API protocol. Then, it forwards
38	ffedf64d	Michele Tartara	them, using the LUXI protocol, to the master daemon (if they are commands) or
39	ffedf64d	Michele Tartara	to the query daemon if they are queries about the configuration (including
40	ffedf64d	Michele Tartara	live status) of the cluster.
41	ffedf64d	Michele Tartara
42	ffedf64d	Michele Tartara	``Node daemon (NodeD)``
43	ffedf64d	Michele Tartara	It is written in Python. It runs on all the nodes. It is responsible for
44	ffedf64d	Michele Tartara	receiving the master requests over RPC and execute them, using the appropriate
45	ffedf64d	Michele Tartara	backend (hypervisors, DRBD, LVM, etc.). It also receives requests over RPC for
46	ffedf64d	Michele Tartara	the execution of queries gathering live data on behalf of the query daemon.
47	ffedf64d	Michele Tartara
48	ffedf64d	Michele Tartara	``Configuration daemon (ConfD)``
49	ffedf64d	Michele Tartara	It is written in Haskell. It runs on all the master candidates. Since the
50	ffedf64d	Michele Tartara	configuration is replicated only on the master node, this daemon exists in
51	ffedf64d	Michele Tartara	order to provide information about the configuration to nodes needing them.
52	ffedf64d	Michele Tartara	The requests are done through ConfD's own protocol, HMAC signed,
53	ffedf64d	Michele Tartara	implemented over UDP, and meant to be used by parallely querying all the
54	ffedf64d	Michele Tartara	master candidates (or a subset thereof) and getting the most up to date
55	ffedf64d	Michele Tartara	answer. This is meant as a way to provide a robust service even in case master
56	ffedf64d	Michele Tartara	is temporarily unavailable.
57	ffedf64d	Michele Tartara
58	ffedf64d	Michele Tartara	``Query daemon (QueryD)``
59	ffedf64d	Michele Tartara	It is written in Haskell. It runs on all the master candidates. It replies
60	ffedf64d	Michele Tartara	to Luxi queries about the current status of the system, including live data it
61	ffedf64d	Michele Tartara	obtains by querying the node daemons through RPCs.
62	ffedf64d	Michele Tartara
63	ffedf64d	Michele Tartara	``Monitoring daemon (MonD)``
64	ffedf64d	Michele Tartara	It is written in Haskell. It runs on all nodes, including the ones that are
65	ffedf64d	Michele Tartara	not vm-capable. It is meant to provide information on the status of the
66	ffedf64d	Michele Tartara	system. Such information is related only to the specific node the daemon is
67	ffedf64d	Michele Tartara	running on, and it is provided as JSON encoded data over HTTP, to be easily
68	ffedf64d	Michele Tartara	readable by external tools.
69	ffedf64d	Michele Tartara	The monitoring daemon communicates with ConfD to get information about the
70	ffedf64d	Michele Tartara	configuration of the cluster. The choice of communicating with ConfD instead
71	ffedf64d	Michele Tartara	of MasterD allows it to obtain configuration information even when the cluster
72	ffedf64d	Michele Tartara	is heavily degraded (e.g.: when master and some, but not all, of the master
73	ffedf64d	Michele Tartara	candidates are unreachable).
74	ffedf64d	Michele Tartara
75	ffedf64d	Michele Tartara	The current structure of the Ganeti daemons is inefficient because there are
76	ffedf64d	Michele Tartara	many different protocols involved, and each daemon needs to be able to use
77	ffedf64d	Michele Tartara	multiple ones, and has to deal with doing different things, thus making
78	ffedf64d	Michele Tartara	sometimes unclear which daemon is responsible for performing a specific task.
79	ffedf64d	Michele Tartara
80	ffedf64d	Michele Tartara	Also, with the current configuration, jobs are managed by the master daemon
81	ffedf64d	Michele Tartara	using python threads. This makes terminating a job after it has started a
82	ffedf64d	Michele Tartara	difficult operation, and it is the main reason why this is not possible yet.
83	ffedf64d	Michele Tartara
84	ffedf64d	Michele Tartara	The master daemon currently has too many different tasks, that could be handled
85	ffedf64d	Michele Tartara	better if split among different daemons.
86	ffedf64d	Michele Tartara
87	ffedf64d	Michele Tartara
88	ffedf64d	Michele Tartara	Proposed changes
89	ffedf64d	Michele Tartara	================
90	ffedf64d	Michele Tartara
91	ffedf64d	Michele Tartara	In order to improve on the current situation, a new daemon subdivision is
92	ffedf64d	Michele Tartara	proposed, and presented hereafter.
93	ffedf64d	Michele Tartara
94	ffedf64d	Michele Tartara	.. digraph:: "new-daemons-structure"
95	ffedf64d	Michele Tartara
96	ffedf64d	Michele Tartara	{rank=same; RConfD LuxiD;}
97	ffedf64d	Michele Tartara	{rank=same; Jobs rconfigdata;}
98	ffedf64d	Michele Tartara	node [shape=box]
99	ffedf64d	Michele Tartara	RapiD [label="RapiD [M]"]
100	ffedf64d	Michele Tartara	LuxiD [label="LuxiD [M]"]
101	ffedf64d	Michele Tartara	WConfD [label="WConfD [M]"]
102	ffedf64d	Michele Tartara	Jobs [label="Jobs [M]"]
103	ffedf64d	Michele Tartara	RConfD [label="RConfD [MC]"]
104	ffedf64d	Michele Tartara	MonD [label="MonD [All]"]
105	ffedf64d	Michele Tartara	NodeD [label="NodeD [All]"]
106	ffedf64d	Michele Tartara	Clients [label="gnt-*\nclients [M]"]
107	ffedf64d	Michele Tartara	p1 [shape=none, label=""]
108	ffedf64d	Michele Tartara	p2 [shape=none, label=""]
109	ffedf64d	Michele Tartara	p3 [shape=none, label=""]
110	ffedf64d	Michele Tartara	p4 [shape=none, label=""]
111	ffedf64d	Michele Tartara	configdata [shape=none, label="config.data"]
112	ffedf64d	Michele Tartara	rconfigdata [shape=none, label="config.data\n[MC copy]"]
113	ffedf64d	Michele Tartara	locksdata [shape=none, label="locks.data"]
114	ffedf64d	Michele Tartara
115	ffedf64d	Michele Tartara	RapiD -> LuxiD [label="LUXI"]
116	ffedf64d	Michele Tartara	LuxiD -> WConfD [label="WConfD\nproto"]
117	ffedf64d	Michele Tartara	LuxiD -> Jobs [label="fork/exec"]
118	ffedf64d	Michele Tartara	Jobs -> WConfD [label="WConfD\nproto"]
119	ffedf64d	Michele Tartara	Jobs -> NodeD [label="RPC"]
120	ffedf64d	Michele Tartara	LuxiD -> NodeD [label="RPC"]
121	ffedf64d	Michele Tartara	rconfigdata -> RConfD
122	ffedf64d	Michele Tartara	configdata -> rconfigdata [label="sync via\nNodeD RPC"]
123	ffedf64d	Michele Tartara	WConfD -> NodeD [label="RPC"]
124	ffedf64d	Michele Tartara	WConfD -> configdata
125	ffedf64d	Michele Tartara	WConfD -> locksdata
126	ffedf64d	Michele Tartara	MonD -> RConfD [label="RConfD\nproto"]
127	ffedf64d	Michele Tartara	Clients -> LuxiD [label="LUXI"]
128	ffedf64d	Michele Tartara	p1 -> MonD [label="MonD proto"]
129	ffedf64d	Michele Tartara	p2 -> RapiD [label="RAPI"]
130	ffedf64d	Michele Tartara	p3 -> RConfD [label="RConfD\nproto"]
131	ffedf64d	Michele Tartara	p4 -> Clients [label="CLI"]
132	ffedf64d	Michele Tartara
133	ffedf64d	Michele Tartara	``LUXI daemon (LuxiD)``
134	ffedf64d	Michele Tartara	It will be written in Haskell. It will run on the master node and it will be
135	ffedf64d	Michele Tartara	the only LUXI server, replying to all the LUXI queries. These includes both
136	ffedf64d	Michele Tartara	the queries about the live configuration of the cluster, previously served by
137	ffedf64d	Michele Tartara	QueryD, and the commands actually changing the status of the cluster by
138	ffedf64d	Michele Tartara	submitting jobs. Therefore, this daemon will also be the one responsible with
139	ffedf64d	Michele Tartara	managing the job queue. When a job needs to be executed, the LuxiD will spawn
140	ffedf64d	Michele Tartara	a separate process tasked with the execution of that specific job, thus making
141	ffedf64d	Michele Tartara	it easier to terminate the job itself, if needeed. When a job requires locks,
142	ffedf64d	Michele Tartara	LuxiD will request them from WConfD.
143	ffedf64d	Michele Tartara	In order to keep availability of the cluster in case of failure of the master
144	ffedf64d	Michele Tartara	node, LuxiD will replicate the job queue to the other master candidates, by
145	ffedf64d	Michele Tartara	RPCs to the NodeD running there (the choice of RPCs for this task might be
146	ffedf64d	Michele Tartara	reviewed at a second time, after implementing this design).
147	ffedf64d	Michele Tartara
148	ffedf64d	Michele Tartara	``Configuration management daemon (WConfD)``
149	ffedf64d	Michele Tartara	It will run on the master node and it will be responsible for the management
150	ffedf64d	Michele Tartara	of the authoritative copy of the cluster configuration (that is, it will be
151	ffedf64d	Michele Tartara	the daemon actually modifying the ``config.data`` file). All the requests of
152	ffedf64d	Michele Tartara	configuration changes will have to pass through this daemon, and will be
153	ffedf64d	Michele Tartara	performed using a LUXI-like protocol ("WConfD proto" in the graph. The exact
154	ffedf64d	Michele Tartara	protocol will be defined in the separate design document that will detail the
155	ffedf64d	Michele Tartara	WConfD separation). Having a single point of configuration management will
156	ffedf64d	Michele Tartara	also allow Ganeti to get rid of possible race conditions due to concurrent
157	ffedf64d	Michele Tartara	modifications of the configuration. When the configuration is updated, it
158	ffedf64d	Michele Tartara	will have to push the received changes to the other master candidates, via
159	ffedf64d	Michele Tartara	RPCs, so that RConfD daemons and (in case of a failure on the master node)
160	ffedf64d	Michele Tartara	the WConfD daemon on the new master can access an up-to-date version of it
161	ffedf64d	Michele Tartara	(the choice of RPCs for this task might be reviewed at a second time). This
162	ffedf64d	Michele Tartara	daemon will also be the one responsible for managing the locks, granting them
163	ffedf64d	Michele Tartara	to the jobs requesting them, and taking care of freeing them up if the jobs
164	ffedf64d	Michele Tartara	holding them crash or are terminated before releasing them. In order to do
165	ffedf64d	Michele Tartara	this, each job, after being spawned by LuxiD, will open a local unix socket
166	ffedf64d	Michele Tartara	that will be used to communicate with it, and will be destroyed when the job
167	ffedf64d	Michele Tartara	terminates. LuxiD will be able to check, after a timeout, whether the job is
168	ffedf64d	Michele Tartara	still running by connecting here, and to ask WConfD to forcefully remove the
169	ffedf64d	Michele Tartara	locks if the socket is closed.
170	ffedf64d	Michele Tartara	Also, WConfD should hold a serialized list of the locks and their owners in a
171	ffedf64d	Michele Tartara	file (``locks.data``), so that it can keep track of their status in case it
172	ffedf64d	Michele Tartara	crashes and needs to be restarted (by asking LuxiD which of them are still
173	ffedf64d	Michele Tartara	running).
174	ffedf64d	Michele Tartara	Interaction with this daemon will be performed using Unix sockets.
175	ffedf64d	Michele Tartara
176	ffedf64d	Michele Tartara	``Configuration query daemon (RConfD)``
177	ffedf64d	Michele Tartara	It is written in Haskell, and it corresponds to the old ConfD. It will run on
178	ffedf64d	Michele Tartara	all the master candidates and it will serve information about the the static
179	ffedf64d	Michele Tartara	configuration of the cluster (the one contained in ``config.data``). The
180	ffedf64d	Michele Tartara	provided information will be highly available (as in: a response will be
181	ffedf64d	Michele Tartara	available as long as a stable-enough connection between the client and at
182	ffedf64d	Michele Tartara	least one working master candidate is available) and its freshness will be
183	ffedf64d	Michele Tartara	best effort (the most recent reply from any of the master candidates will be
184	ffedf64d	Michele Tartara	returned, but it might still be older than the one available through WConfD).
185	ffedf64d	Michele Tartara	The information will be served through the ConfD protocol.
186	ffedf64d	Michele Tartara
187	ffedf64d	Michele Tartara	``Rapi daemon (RapiD)``
188	ffedf64d	Michele Tartara	It remains basically unchanged, with the only difference that all of its LUXI
189	ffedf64d	Michele Tartara	query are directed towards LuxiD instead of being split between MasterD and
190	ffedf64d	Michele Tartara	QueryD.
191	ffedf64d	Michele Tartara
192	ffedf64d	Michele Tartara	``Monitoring daemon (MonD)``
193	ffedf64d	Michele Tartara	It remains unaffected by the changes in this design document. It will just get
194	ffedf64d	Michele Tartara	some of the data it needs from RConfD instead of the old ConfD, but the
195	ffedf64d	Michele Tartara	interfaces of the two are identical.
196	ffedf64d	Michele Tartara
197	ffedf64d	Michele Tartara	``Node daemon (NodeD)``
198	ffedf64d	Michele Tartara	It remains unaffected by the changes proposed in the design document. The only
199	ffedf64d	Michele Tartara	difference being that it will receive its RPCs from LuxiD (for job queue
200	ffedf64d	Michele Tartara	replication), from WConfD (for configuration replication) and for the
201	ffedf64d	Michele Tartara	processes executing single jobs (for all the operations to be performed by
202	ffedf64d	Michele Tartara	nodes) instead of receiving them just from MasterD.
203	ffedf64d	Michele Tartara
204	ffedf64d	Michele Tartara	This restructuring will allow us to reorganize and improve the codebase,
205	ffedf64d	Michele Tartara	introducing cleaner interfaces and giving well defined and more restricted tasks
206	ffedf64d	Michele Tartara	to each daemon.
207	ffedf64d	Michele Tartara
208	ffedf64d	Michele Tartara	Furthermore, having more well-defined interfaces will allow us to have easier
209	ffedf64d	Michele Tartara	upgrade procedures, and to work towards the possibility of upgrading single
210	ffedf64d	Michele Tartara	components of a cluster one at a time, without the need for immediately
211	ffedf64d	Michele Tartara	upgrading the entire cluster in a single step.
212	ffedf64d	Michele Tartara
213	ffedf64d	Michele Tartara
214	ffedf64d	Michele Tartara	Implementation
215	ffedf64d	Michele Tartara	==============
216	ffedf64d	Michele Tartara
217	ffedf64d	Michele Tartara	While performing this refactoring, we aim to increase the amount of
218	ffedf64d	Michele Tartara	Haskell code, thus benefiting from the additional type safety provided by its
219	ffedf64d	Michele Tartara	wide compile-time checks. In particular, all the job queue management and the
220	ffedf64d	Michele Tartara	configuration management daemon will be written in Haskell, taking over the role
221	ffedf64d	Michele Tartara	currently fulfilled by Python code executed as part of MasterD.
222	ffedf64d	Michele Tartara
223	ffedf64d	Michele Tartara	The changes describe by this design document are quite extensive, therefore they
224	ffedf64d	Michele Tartara	will not be implemented all at the same time, but through a sequence of steps,
225	ffedf64d	Michele Tartara	leaving the codebase in a consistent and usable state.
226	ffedf64d	Michele Tartara
227	ffedf64d	Michele Tartara	#. Rename QueryD to LuxiD.
228	ffedf64d	Michele Tartara	A part of LuxiD, the one replying to configuration
229	ffedf64d	Michele Tartara	queries including live information about the system, already exists in the
230	ffedf64d	Michele Tartara	form of QueryD. This is being renamed to LuxiD, and will form the first part
231	ffedf64d	Michele Tartara	of the new daemon. NB: this is happening starting from Ganeti 2.8. At the
232	ffedf64d	Michele Tartara	beginning, only the already existing queries will be replied to by LuxiD.
233	ffedf64d	Michele Tartara	More queries will be implemented in the next versions.
234	ffedf64d	Michele Tartara
235	ffedf64d	Michele Tartara	#. Let LuxiD be the interface for the queries and MasterD be their executor.
236	ffedf64d	Michele Tartara	Currently, MasterD is the only responsible for receiving and executing LUXI
237	ffedf64d	Michele Tartara	queries, and for managing the jobs they create.
238	ffedf64d	Michele Tartara	Receiving the queries and managing the job queue will be extracted from
239	ffedf64d	Michele Tartara	MasterD into LuxiD.
240	ffedf64d	Michele Tartara	Actually executing jobs will still be done by MasterD, that contains all the
241	ffedf64d	Michele Tartara	logic for doing that and for properly managing locks and the configuration.
242	ce10eb31	Klaus Aehlig	At this stage, scheduling will simply consist in starting jobs until a fixed
243	ce10eb31	Klaus Aehlig	maximum number of simultaneously running jobs is reached.
244	ffedf64d	Michele Tartara
245	ffedf64d	Michele Tartara	#. Extract WConfD from MasterD.
246	ffedf64d	Michele Tartara	The logic for managing the configuration file is factored out to the
247	ffedf64d	Michele Tartara	dedicated WConfD daemon. All configuration changes, currently executed
248	ffedf64d	Michele Tartara	directly by MasterD, will be changed to be IPC requests sent to the new
249	ffedf64d	Michele Tartara	daemon.
250	ffedf64d	Michele Tartara
251	ffedf64d	Michele Tartara	#. Extract locking management from MasterD.
252	ffedf64d	Michele Tartara	The logic for managing and granting locks is extracted to WConfD as well.
253	ffedf64d	Michele Tartara	Locks will not be taken directly anymore, but asked via IPC to WConfD.
254	ffedf64d	Michele Tartara	This step can be executed on its own or at the same time as the previous one.
255	ffedf64d	Michele Tartara
256	ffedf64d	Michele Tartara	#. Jobs are executed as processes.
257	ffedf64d	Michele Tartara	The logic for running jobs is rewritten so that each job can be managed by an
258	ffedf64d	Michele Tartara	independent process. LuxiD will spawn a new (Python) process for every single
259	ffedf64d	Michele Tartara	job. The RPCs will remain unchanged, and the LU code will stay as is as much
260	ffedf64d	Michele Tartara	as possible.
261	ffedf64d	Michele Tartara	MasterD will cease to exist as a deamon on its own at this point, but not
262	ffedf64d	Michele Tartara	before.
263	ffedf64d	Michele Tartara
264	ce10eb31	Klaus Aehlig	#. Improve job scheduling algorithm.
265	ce10eb31	Klaus Aehlig	The simple algorithm for scheduling jobs will be replaced by a more
266	ce10eb31	Klaus Aehlig	intelligent one. Also, the implementation of :doc:`design-optables` can be
267	ce10eb31	Klaus Aehlig	started.
268	ce10eb31	Klaus Aehlig
269	2de55c83	Petr Pudlak	Job death detection
270	2de55c83	Petr Pudlak	-------------------
271	2de55c83	Petr Pudlak
272	2de55c83	Petr Pudlak	Requirements:
273	2de55c83	Petr Pudlak
274	2de55c83	Petr Pudlak	- It must be possible to reliably detect a death of a process even under
275	2de55c83	Petr Pudlak	uncommon conditions such as very heavy system load.
276	2de55c83	Petr Pudlak	- A daemon must be able to detect a death of a process even if the
277	2de55c83	Petr Pudlak	daemon is restarted while the process is running.
278	2de55c83	Petr Pudlak	- The solution must not rely on being able to communicate with
279	2de55c83	Petr Pudlak	a process.
280	2de55c83	Petr Pudlak	- The solution must work for the current situation where multiple jobs
281	2de55c83	Petr Pudlak	run in a single process.
282	2de55c83	Petr Pudlak	- It must be POSIX compliant.
283	2de55c83	Petr Pudlak
284	2de55c83	Petr Pudlak	These conditions rule out simple solutions like checking a process ID
285	2de55c83	Petr Pudlak	(because the process might be eventually replaced by another process
286	2de55c83	Petr Pudlak	with the same ID) or keeping an open connection to a process.
287	2de55c83	Petr Pudlak
288	2de55c83	Petr Pudlak	Solution: As a job process is spawned, before attempting to
289	2de55c83	Petr Pudlak	communicate with any other process, it will create a designated empty
290	2de55c83	Petr Pudlak	lock file, open it, acquire an exclusive lock on it, and keep it open.
291	2de55c83	Petr Pudlak	When connecting to a daemon, the job process will provide it with the
292	2de55c83	Petr Pudlak	path of the file. If the process dies unexpectedly, the operating system
293	2de55c83	Petr Pudlak	kernel automatically cleans up the lock.
294	2de55c83	Petr Pudlak
295	2de55c83	Petr Pudlak	Therefore, daemons can check if a process is dead by trying to acquire
296	2de55c83	Petr Pudlak	a shared lock on the lock file in a non-blocking mode:
297	2de55c83	Petr Pudlak
298	2de55c83	Petr Pudlak	- If the locking operation succeeds, it means that the exclusive lock is
299	2de55c83	Petr Pudlak	missing, therefore the process has died, but the lock
300	2de55c83	Petr Pudlak	file hasn't been cleaned up yet. The daemon should release the lock
301	2de55c83	Petr Pudlak	immediately. Optionally, the daemon may delete the lock file.
302	2de55c83	Petr Pudlak	- If the file is missing, the process has died and the lock file has
303	2de55c83	Petr Pudlak	been cleaned up.
304	2de55c83	Petr Pudlak	- If the locking operation fails due to a lock conflict, it means
305	2de55c83	Petr Pudlak	the process is alive.
306	2de55c83	Petr Pudlak
307	2de55c83	Petr Pudlak	Using shared locks for querying lock files ensures that the detection
308	2de55c83	Petr Pudlak	works correctly even if multiple daemons query a file at the same time.
309	2de55c83	Petr Pudlak
310	2de55c83	Petr Pudlak	A job should close and remove its lock file when completely finishes.
311	2de55c83	Petr Pudlak	The WConfD daemon will be responsible for removing stale lock files of
312	2de55c83	Petr Pudlak	jobs that didn't remove its lock files themselves.
313	2de55c83	Petr Pudlak
314	2de55c83	Petr Pudlak	Considered alternatives: An alternative to creating a separate lock
315	2de55c83	Petr Pudlak	file would be to lock the job status file. However, file locks are kept
316	2de55c83	Petr Pudlak	only as long as the file is open. Therefore any operation followed by
317	2de55c83	Petr Pudlak	closing the file would cause the process to release the lock. In
318	2de55c83	Petr Pudlak	particular, with jobs as threads, the master daemon wouldn't be able to
319	2de55c83	Petr Pudlak	keep locks and operate on job files at the same time.
320	2de55c83	Petr Pudlak
321	5eeb7168	Petr Pudlak	WConfD details
322	5eeb7168	Petr Pudlak	--------------
323	5eeb7168	Petr Pudlak
324	5eeb7168	Petr Pudlak	WConfD will communicate with its clients through a Unix domain socket for both
325	5eeb7168	Petr Pudlak	configuration management and locking. Clients can issue multiple RPC calls
326	5eeb7168	Petr Pudlak	through one socket. For each such a call the client sends a JSON request
327	5eeb7168	Petr Pudlak	document with a remote function name and data for its arguments. The server
328	5eeb7168	Petr Pudlak	replies with a JSON response document containing either the result of
329	5eeb7168	Petr Pudlak	signalling a failure.
330	5eeb7168	Petr Pudlak
331	5eeb7168	Petr Pudlak	There will be a special RPC call for identifying a client when connecting to
332	5eeb7168	Petr Pudlak	WConfD. The client will tell WConfD it's job number and process ID. WConfD will
333	5eeb7168	Petr Pudlak	fail any other RPC calls before a client identifies this way.
334	5eeb7168	Petr Pudlak
335	5eeb7168	Petr Pudlak	Any state associated with client processes will be mirrored on persistent
336	5eeb7168	Petr Pudlak	storage and linked to the identity of processes so that the WConfD daemon will
337	5eeb7168	Petr Pudlak	be able to resume its operation at any point after a restart or a crash. WConfD
338	5eeb7168	Petr Pudlak	will track each client's process start time along with its process ID to be
339	5eeb7168	Petr Pudlak	able detect if a process dies and it's process ID is reused. WConfD will clear
340	5eeb7168	Petr Pudlak	all locks and other state associated with a client if it detects it's process
341	5eeb7168	Petr Pudlak	no longer exists.
342	5eeb7168	Petr Pudlak
343	5eeb7168	Petr Pudlak	Configuration management
344	5eeb7168	Petr Pudlak	++++++++++++++++++++++++
345	5eeb7168	Petr Pudlak
346	5eeb7168	Petr Pudlak	The new configuration management protocol will be implemented in the following
347	5eeb7168	Petr Pudlak	steps:
348	5eeb7168	Petr Pudlak
349	b0159850	Petr Pudlak	Step 1:
350	b0159850	Petr Pudlak	#. Implement the following functions in WConfD and export them through
351	b0159850	Petr Pudlak	RPC:
352	b0159850	Petr Pudlak
353	b0159850	Petr Pudlak	- Obtain a single internal lock, either in shared or
354	b0159850	Petr Pudlak	exclusive mode. This lock will substitute the current lock
355	b0159850	Petr Pudlak	``_config_lock`` in config.py.
356	b0159850	Petr Pudlak	- Release the lock.
357	b0159850	Petr Pudlak	- Return the whole configuration data to a client.
358	b0159850	Petr Pudlak	- Receive the whole configuration data from a client and replace the
359	b0159850	Petr Pudlak	current configuration with it. Distribute it to master candidates
360	b0159850	Petr Pudlak	and distribute the corresponding ssconf.
361	b0159850	Petr Pudlak
362	b0159850	Petr Pudlak	WConfD must detect deaths of its clients (see `Job death
363	b0159850	Petr Pudlak	detection`_) and release locks automatically.
364	b0159850	Petr Pudlak
365	b0159850	Petr Pudlak	#. In config.py modify public methods that access configuration:
366	b0159850	Petr Pudlak
367	b0159850	Petr Pudlak	- Instead of acquiring a local lock, obtain a lock from WConfD
368	b0159850	Petr Pudlak	using the above functions
369	b0159850	Petr Pudlak	- Fetch the current configuration from WConfD.
370	b0159850	Petr Pudlak	- Use it to perform the method's task.
371	b0159850	Petr Pudlak	- If the configuration was modified, send it to WConfD at the end.
372	b0159850	Petr Pudlak	- Release the lock to WConfD.
373	b0159850	Petr Pudlak
374	b0159850	Petr Pudlak	This will decouple the configuration management from the master daemon,
375	b0159850	Petr Pudlak	even though the specific configuration tasks will still performed by
376	b0159850	Petr Pudlak	individual jobs.
377	b0159850	Petr Pudlak
378	b0159850	Petr Pudlak	After this step it'll be possible access the configuration from separate
379	b0159850	Petr Pudlak	processes.
380	b0159850	Petr Pudlak
381	b0159850	Petr Pudlak	Step 2:
382	b0159850	Petr Pudlak	#. Reimplement all current methods of ``ConfigWriter`` for reading and
383	b0159850	Petr Pudlak	writing the configuration of a cluster in WConfD.
384	b0159850	Petr Pudlak	#. Expose each of those functions in WConfD as a separate RPC function.
385	b0159850	Petr Pudlak	This will allow easy future extensions or modifications.
386	b0159850	Petr Pudlak	#. Replace ``ConfigWriter`` with a stub (preferably automatically
387	b0159850	Petr Pudlak	generated from the Haskell code) that will contain the same methods
388	b0159850	Petr Pudlak	as the current ``ConfigWriter`` and delegate all calls to its
389	b0159850	Petr Pudlak	methods to WConfD.
390	b0159850	Petr Pudlak
391	b0159850	Petr Pudlak	Step 3:
392	b0159850	Petr Pudlak	#. Remove WConfD's RPC functions for obtaining/releasing the single
393	b0159850	Petr Pudlak	internal lock from Step 1.
394	b0159850	Petr Pudlak	#. Remove WConfD's RPC functions for sending/receiving the whole
395	b0159850	Petr Pudlak	configuration from Step 1.
396	5eeb7168	Petr Pudlak
397	5eeb7168	Petr Pudlak	Future aims:
398	5eeb7168	Petr Pudlak
399	5eeb7168	Petr Pudlak	- Optionally refactor the RPC calls to reduce their number or improve their
400	5eeb7168	Petr Pudlak	efficiency (for example by obtaining a larger set of data instead of
401	5eeb7168	Petr Pudlak	querying items one by one).
402	5eeb7168	Petr Pudlak
403	5eeb7168	Petr Pudlak	Locking
404	5eeb7168	Petr Pudlak	+++++++
405	5eeb7168	Petr Pudlak
406	5eeb7168	Petr Pudlak	The new locking protocol will be implemented as follows:
407	5eeb7168	Petr Pudlak
408	5eeb7168	Petr Pudlak	Re-implement the current locking mechanism in WConfD and expose it for RPC
409	5eeb7168	Petr Pudlak	calls. All current locks will be mapped into a data structure that will
410	5eeb7168	Petr Pudlak	uniquely identify them (storing lock's level together with it's name).
411	5eeb7168	Petr Pudlak
412	5eeb7168	Petr Pudlak	WConfD will impose a linear order on locks. The order will be compatible
413	5eeb7168	Petr Pudlak	with the current ordering of lock levels so that existing code will work
414	5eeb7168	Petr Pudlak	without changes.
415	5eeb7168	Petr Pudlak
416	5eeb7168	Petr Pudlak	WConfD will keep the set of currently held locks for each client. The
417	5eeb7168	Petr Pudlak	protocol will allow the following operations on the set:
418	5eeb7168	Petr Pudlak
419	5eeb7168	Petr Pudlak	Update:
420	5eeb7168	Petr Pudlak	Update the current set of locks according to a given list. The list contains
421	5eeb7168	Petr Pudlak	locks and their desired level (release / shared / exclusive). To prevent
422	5eeb7168	Petr Pudlak	deadlocks, WConfD will check that all newly requested locks (or already held
423	5eeb7168	Petr Pudlak	locks requested to be upgraded to exclusive) are greater in the sense of
424	5eeb7168	Petr Pudlak	the linear order than all currently held locks, and fail the operation if
425	5eeb7168	Petr Pudlak	not. Only the locks in the list will be updated, other locks already held
426	5eeb7168	Petr Pudlak	will be left intact. If the operation fails, the client's lock set will be
427	5eeb7168	Petr Pudlak	left intact.
428	5eeb7168	Petr Pudlak	Opportunistic union:
429	5eeb7168	Petr Pudlak	Add as much as possible locks from a given set to the current set within a
430	5eeb7168	Petr Pudlak	given timeout. WConfD will again check the proper order of locks and
431	5eeb7168	Petr Pudlak	acquire only the ones that are allowed wrt. the current set. Returns the
432	5eeb7168	Petr Pudlak	set of acquired locks, possibly empty. Immediate. Never fails. (It would also
433	5eeb7168	Petr Pudlak	be possible to extend the operation to try to wait until a given number of
434	5eeb7168	Petr Pudlak	locks is available, or a given timeout elapses.)
435	5eeb7168	Petr Pudlak	List:
436	5eeb7168	Petr Pudlak	List the current set of held locks. Immediate, never fails.
437	5eeb7168	Petr Pudlak	Intersection:
438	5eeb7168	Petr Pudlak	Retain only a given set of locks in the current one. This function is
439	5eeb7168	Petr Pudlak	provided for convenience, it's redundant wrt. list and update. Immediate,
440	5eeb7168	Petr Pudlak	never fails.
441	5eeb7168	Petr Pudlak
442	5eeb7168	Petr Pudlak	After this step it'll be possible to use locks from jobs as separate processes.
443	5eeb7168	Petr Pudlak
444	5eeb7168	Petr Pudlak	The above set of operations allows the clients to use various work-flows. In particular:
445	5eeb7168	Petr Pudlak
446	5eeb7168	Petr Pudlak	Pessimistic strategy:
447	5eeb7168	Petr Pudlak	Lock all potentially relevant resources (for example all nodes), determine
448	5eeb7168	Petr Pudlak	which will be needed, and release all the others.
449	5eeb7168	Petr Pudlak	Optimistic strategy:
450	5eeb7168	Petr Pudlak	Determine what locks need to be acquired without holding any. Lock the
451	5eeb7168	Petr Pudlak	required set of locks. Determine the set of required locks again and check if
452	5eeb7168	Petr Pudlak	they are all held. If not, release everything and restart.
453	5eeb7168	Petr Pudlak
454	5eeb7168	Petr Pudlak	.. COMMENTED OUT:
455	5eeb7168	Petr Pudlak	Start with the smallest set of locks and when determining what more
456	5eeb7168	Petr Pudlak	relevant resources will be needed, expand the set. If an union operation
457	5eeb7168	Petr Pudlak	fails, release all locks, acquire the desired union and restart the
458	5eeb7168	Petr Pudlak	operation so that all preconditions and possible concurrent changes are
459	5eeb7168	Petr Pudlak	checked again.
460	5eeb7168	Petr Pudlak
461	5eeb7168	Petr Pudlak	Future aims:
462	5eeb7168	Petr Pudlak
463	5eeb7168	Petr Pudlak	- Add more fine-grained locks to prevent unnecessary blocking of jobs. This
464	5eeb7168	Petr Pudlak	could include locks on parameters of entities or locks on their states (so that
465	5eeb7168	Petr Pudlak	a node remains online, but otherwise can change, etc.). In particular,
466	5eeb7168	Petr Pudlak	adding, moving and removing instances currently blocks the whole node.
467	5eeb7168	Petr Pudlak	- Add checks that all modified configuration parameters belong to entities
468	5eeb7168	Petr Pudlak	the client has locked and log violations.
469	5eeb7168	Petr Pudlak	- Make the above checks mandatory.
470	5eeb7168	Petr Pudlak	- Automate optimistic locking and checking the locks in logical units.
471	5eeb7168	Petr Pudlak	For example, this could be accomplished by allowing some of the initial
472	5eeb7168	Petr Pudlak	phases of `LogicalUnit` (such as `ExpandNames` and `DeclareLocks`) to be run
473	5eeb7168	Petr Pudlak	repeatedly, checking if the set of locks requested the second time is
474	5eeb7168	Petr Pudlak	contained in the set acquired after the first pass.
475	5eeb7168	Petr Pudlak	- Add the possibility for a job to reserve hardware resources such as disk
476	5eeb7168	Petr Pudlak	space or memory on nodes. Most likely as a new, special kind of instances
477	5eeb7168	Petr Pudlak	that would only block its resources and allow to be converted to a regular
478	5eeb7168	Petr Pudlak	instance. This would allow long-running jobs such as instance creation or
479	5eeb7168	Petr Pudlak	move to lock the corresponding nodes, acquire the resources and turn the
480	5eeb7168	Petr Pudlak	locks into shared ones, keeping an exclusive lock only on the instance.
481	5eeb7168	Petr Pudlak	- Use more sophisticated algorithm for preventing deadlocks such as a
482	5eeb7168	Petr Pudlak	`wait-for graph`_. This would allow less union failures and allow more
483	5eeb7168	Petr Pudlak	optimistic, scalable acquisition of locks.
484	5eeb7168	Petr Pudlak
485	5eeb7168	Petr Pudlak	.. _`wait-for graph`: http://en.wikipedia.org/wiki/Wait-for_graph
486	5eeb7168	Petr Pudlak
487	5eeb7168	Petr Pudlak
488	ffedf64d	Michele Tartara	Further considerations
489	ffedf64d	Michele Tartara	======================
490	ffedf64d	Michele Tartara
491	ffedf64d	Michele Tartara	There is a possibility that a job will finish performing its task while LuxiD
492	ffedf64d	Michele Tartara	and/or WConfD will not be available.
493	9269d118	Klaus Aehlig	In order to deal with this situation, each job will update its job file
494	9269d118	Klaus Aehlig	in the queue. This is race free, as LuxiD will no longer touch the job file,
495	9269d118	Klaus Aehlig	once the job is started; a corollary of this is that the job also has to
496	9269d118	Klaus Aehlig	take care of replicating updates to the job file. LuxiD will watch job files for
497	9269d118	Klaus Aehlig	changes to determine when a job as cleanly finished. To determine jobs
498	9269d118	Klaus Aehlig	that died without having the chance of updating the job file, the `Job death
499	9269d118	Klaus Aehlig	detection`_ mechanism will be used.
500	ffedf64d	Michele Tartara
501	ffedf64d	Michele Tartara	.. vim: set textwidth=72 :
502	ffedf64d	Michele Tartara	.. Local Variables:
503	ffedf64d	Michele Tartara	.. mode: rst
504	ffedf64d	Michele Tartara	.. fill-column: 72
505	ffedf64d	Michele Tartara	.. End:

Synnefo » snf-ganeti

root / doc / design-daemons.rst @ bb3011ad