==========
KVM daemon
==========

.. toctree::
   :maxdepth: 2

This design document describes the KVM daemon, which is responsible for
determining whether a given KVM instance was shut down by an
administrator or a user.


Current state and shortcomings
==============================

This design document describes the KVM daemon, which addresses the KVM
side of the user-initiated shutdown problem introduced in
:doc:`design-internal-shutdown`. We are also interested in keeping this
functionality optional. That is, an administrator does not necessarily
have to run the KVM daemon, either because the installation runs Xen
or because, even on KVM, instance shutdown detection is not needed.
This requirement is important because it means the KVM daemon should
be a modular component in the overall Ganeti design, i.e., it should
be easy to enable and disable.

Proposed changes
================

The instance shutdown feature for KVM requires listening for events on
the QEMU Machine Protocol (QMP) Unix socket, which is created together
with a KVM instance. A QMP socket typically looks like
``/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp`` and implements
the QMP protocol. This is a bidirectional protocol that allows Ganeti
to send commands, such as system powerdown, as well as receive events,
such as the powerdown and shutdown events.

Listening in on these events allows Ganeti to determine whether a given
KVM instance was shut down by an administrator, either through
``gnt-instance stop|remove <instance>`` or ``kill -KILL
<instance-pid>``, or by a user, through ``poweroff`` from inside the
instance. Upon an administrator powerdown, QMP sends two events,
namely a powerdown event and a shutdown event, whereas upon a user
shutdown only the shutdown event is sent. This is enough to
distinguish between an administrator and a user shutdown. However,
there is one limitation: ``kill -TERM <instance-pid>``. Even though
this is an action performed by the administrator, it will be
considered a user shutdown by the approach described in this document.
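
The event-based rule above can be sketched as follows. This is only an
illustration, not the daemon's actual code: the function name
``classify_shutdown`` is hypothetical, and the sketch assumes that each
QMP message arrives as one JSON object per line and that the two events
carry QEMU's event names ``POWERDOWN`` and ``SHUTDOWN``.

```python
import json


def classify_shutdown(qmp_lines):
    """Classify a shutdown from QMP messages (hypothetical helper).

    Returns "admin" if a POWERDOWN event preceded the SHUTDOWN event,
    "user" if only SHUTDOWN was seen, and None if no SHUTDOWN event
    was seen at all (for example, after a crash or ``kill -KILL``).
    """
    saw_powerdown = False
    for line in qmp_lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        # Command replies and the greeting have no "event" key.
        event = message.get("event")
        if event == "POWERDOWN":
            saw_powerdown = True
        elif event == "SHUTDOWN":
            return "admin" if saw_powerdown else "user"
    return None
```

Note that under this rule ``kill -TERM`` still yields ``"user"``, since
only the shutdown event is emitted in that case.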

Several design strategies were considered. Most of these strategies
consisted of spawning some process listening on the QMP socket when a
KVM instance is created. However, having a listener process per KVM
instance is not scalable. Therefore, a different strategy is proposed,
namely, having a single process, called the KVM daemon, listening on the
QMP sockets of all KVM instances within a node. That also means there
is an instance of the KVM daemon on each node.

In order to implement the KVM daemon, two problems need to be addressed,
namely, how the KVM daemon knows when to open a connection to a given
QMP socket and how the KVM daemon communicates to Ganeti whether a
given instance was shut down by an administrator or a user.

QMP connections management
--------------------------

As mentioned before, the QMP sockets reside in the KVM control
directory, which is usually located under
``/var/run/ganeti/kvm-hypervisor/ctrl/``. When a KVM instance is
created, a new QMP socket for this instance is also created in this
directory.

In order to simplify the design of the KVM daemon, instead of having
Ganeti notify this daemon through a pipe or socket of the creation
of a new KVM instance, and thus a new QMP socket, this daemon will
monitor the KVM control directory using ``inotify``. As a result, the
daemon is not only able to deal with KVM instances being created and
removed, but also capable of overcoming other problematic situations
concerning the filesystem, such as the case when the KVM control
directory does not exist because, for example, Ganeti was not yet
started, or the KVM control directory was removed, for example, as a
result of a Ganeti reinstallation.

Shutdown detection
------------------

As mentioned before, the KVM daemon is responsible for opening a
connection to the QMP socket of a given instance and listening in on the
shutdown and powerdown events, which allow the KVM daemon to determine
whether the instance stopped because of an administrator or user
shutdown. Once the instance is stopped, the KVM daemon needs to
communicate to Ganeti whether the user was responsible for shutting down
the instance.

In order to achieve this, the KVM daemon writes an empty file, called
the shutdown file, in the KVM control directory with a name similar to
the QMP socket file but with the extension ``.qmp`` replaced with
``.shutdown``. The presence of this file indicates that the shutdown
was initiated by a user, whereas the absence of this file indicates that
the shutdown was caused by an administrator. This strategy also allows
crashes and signals, such as ``SIGKILL``, to be handled correctly,
given that in these cases the KVM daemon never receives the powerdown
and shutdown events and, therefore, never creates the shutdown file.
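
The shutdown file convention can be illustrated with a minimal sketch.
The helper names below (``shutdown_file_for``, ``mark_user_shutdown``
and ``was_user_shutdown``) are hypothetical and do not come from the
Ganeti code base; only the ``.qmp`` to ``.shutdown`` renaming and the
presence/absence semantics are taken from the text above.

```python
import os


def shutdown_file_for(qmp_socket_path):
    """Map '<instance>.qmp' to '<instance>.shutdown' (hypothetical)."""
    root, ext = os.path.splitext(qmp_socket_path)
    if ext != ".qmp":
        raise ValueError("not a QMP socket path: %r" % qmp_socket_path)
    return root + ".shutdown"


def mark_user_shutdown(qmp_socket_path):
    """Create the empty shutdown file, as the KVM daemon would."""
    path = shutdown_file_for(qmp_socket_path)
    with open(path, "w"):
        pass  # the file is intentionally left empty
    return path


def was_user_shutdown(qmp_socket_path):
    """Presence of the file means the user shut the instance down."""
    return os.path.exists(shutdown_file_for(qmp_socket_path))
```

Because only the final extension is replaced, instance names that
contain dots, such as fully qualified host names, are preserved.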

KVM daemon launch
-----------------

With the above issues addressed, a question remains as to when the KVM
daemon should be started. The KVM daemon is different from other Ganeti
daemons, which start together with the Ganeti service, because the KVM
daemon is optional: it is specific to KVM and should not be run on
installations containing only Xen, and, even in a KVM installation,
the user might still choose not to enable it. In addition, the KVM
daemon is not really necessary until the first KVM instance is
started. For these reasons, the KVM daemon is started from within
Ganeti when a KVM instance is started, and the job process spawned by
the node daemon is responsible for starting it.

Given the current design of Ganeti, in which the node daemon spawns a
job process to handle the creation of the instance, when launching the
KVM daemon it is necessary to first check whether an instance of this
daemon is already running. Only if this is not the case can the KVM
daemon be safely started.
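
This document does not prescribe how the "already running" check is
performed. One common technique, shown here purely as an assumption and
not as Ganeti's mechanism, is a non-blocking exclusive ``flock`` on a
lock file: whichever process holds the lock is the running daemon. The
function name and the lock file path are hypothetical.

```python
import fcntl


def try_acquire_daemon_lock(lock_path):
    """Try to become the single running daemon (hypothetical helper).

    Returns an open file object holding the lock on success, or None
    if another open file already holds the lock. The caller must keep
    the returned object open for as long as the daemon runs.
    """
    handle = open(lock_path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

The lock is released automatically when the holding process exits, so
a crashed daemon does not leave a stale lock behind.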

Design alternatives
===================

At first, it might seem natural to include the instance shutdown
detection for KVM in the node daemon. After all, the node daemon is
already responsible for managing instances, for example, starting and
stopping an instance. Nevertheless, the node daemon is more complicated
than it might seem at first.

The node daemon is composed of the main loop, which runs in the main
thread and is responsible for receiving requests and spawning jobs for
handling these requests, and the jobs, which are independent processes
spawned for executing the actual tasks, such as creating an instance.

Including instance shutdown detection in the node daemon is not viable
because adding it to the main loop would cause KVM-specific code to
taint the generality of the node daemon. In order to add it to the job
processes, it would be possible to spawn either a foreground or a
background process. However, these options are also not viable because
they would lead to the situation described before where there would be a
monitoring process per instance, which is not scalable. Moreover, the
foreground process has an additional disadvantage: it would require
modifications to the node daemon so that it no longer expects a job to
terminate, which is what the current node daemon design assumes.

There is another design issue to keep in mind. We could reconsider
where to write the data that tells Ganeti whether an instance was
shut down by an administrator or the user. Instead of using the KVM
shutdown files presented above, in which the presence of the file
indicates a user shutdown and its absence an administrator shutdown, we
could store a value in the KVM runtime state file, which is where the
relevant KVM state information is. The advantage of this approach is
that it would keep the KVM-related information in one place, thus making
it easier to manage. However, it would lead to a more complex
implementation and, in the context of the general transition in Ganeti
from Python to Haskell, a simpler implementation is preferred.

Finally, it should be noted that the KVM runtime state file benefits
from automatic migration. That is, when an instance is migrated so is
the KVM state file. However, the instance shutdown detection for KVM
does not require this feature and, in fact, migrating the instance
shutdown state would be incorrect.

Further considerations
======================

There are potential race conditions between Ganeti and the KVM daemon;
however, in practice they seem unlikely. For example, the KVM daemon
needs to add and remove watches to the parent directories of the KVM
control directory until this directory is finally created. It is
possible that Ganeti creates this directory and a KVM instance before
the KVM daemon has a chance to add a watch to the KVM control directory,
thus causing this daemon to miss the ``inotify`` creation event for the
QMP socket.

There are other problems which arise from the limitations of
``inotify``. For example, if the KVM daemon is started after the first
Ganeti instance has been created, then ``inotify`` will not produce
any event for the creation of the QMP socket. This can happen, for
example, if the KVM daemon needs to be restarted or upgraded. As a
result, it might be necessary to have an additional mechanism that runs
at KVM daemon startup or at regular intervals to ensure that the
internal state of the KVM daemon is consistent with the actual contents
of the KVM control directory.
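
Such a consistency check could look roughly like the sketch below. It
is an illustration under stated assumptions, not part of the design:
the function name ``reconcile`` is hypothetical, and the daemon's
internal state is modelled simply as a set of socket file names.

```python
import os


def reconcile(control_dir, known_sockets):
    """Compare the daemon's view with the control directory.

    ``known_sockets`` is the set of QMP socket file names the daemon
    believes it is connected to. Returns a pair of sets: sockets that
    exist but are not yet known, and known sockets that are gone.
    """
    try:
        names = os.listdir(control_dir)
    except FileNotFoundError:
        # The control directory may not exist yet or may have been
        # removed, so treat it as empty.
        names = []
    present = {name for name in names if name.endswith(".qmp")}
    to_connect = present - known_sockets
    to_drop = known_sockets - present
    return to_connect, to_drop
```

Running this once at startup covers sockets created before the daemon
started, and running it periodically covers missed ``inotify`` events.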

Another race condition occurs when Ganeti shuts down a KVM instance
using force. Ganeti uses ``TERM`` signals to stop KVM instances when
force is specified or ACPI is not enabled. However, as mentioned
before, ``TERM`` signals are interpreted by the KVM daemon as a user
shutdown. As a result, the KVM daemon creates a shutdown file which
then must be removed by Ganeti. The race condition occurs because the
KVM daemon might create the shutdown file after the hypervisor code that
tries to remove this file has already run. In practice, the race
condition seems unlikely because Ganeti stops the KVM instance in a
retry loop, which allows Ganeti to stop the instance and clean up its
runtime information.

It is possible to determine if a process, in this particular case the
KVM process, was terminated by a ``TERM`` signal, using the `proc
connector and socket filters
<https://web.archive.org/web/20121025062848/http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/>`_.
The proc connector is a socket connected between a userspace process and
the kernel through the netlink protocol and can be used to receive
notifications of process events, and socket filters are a mechanism
for subscribing only to events that are relevant. There are several
`process events <http://lwn.net/Articles/157150/>`_ which can be
subscribed to; however, in this case, we are interested only in the exit
event, which carries information about the exit signal.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: