==========
KVM daemon
==========

.. toctree::
   :maxdepth: 2

This design document describes the KVM daemon, which is responsible for
determining whether a given KVM instance was shut down by an
administrator or a user.


Current state and shortcomings
==============================

This design document describes the KVM daemon, which addresses the KVM
side of the user-initiated shutdown problem introduced in
:doc:`design-internal-shutdown`. We are also interested in keeping this
functionality optional. That is, an administrator does not have to run
the KVM daemon if they are running Xen or if, while running KVM, they
are not interested in instance shutdown detection. This requirement is
important because it means the KVM daemon should be a modular component
in the overall Ganeti design, i.e., it should be easy to enable and
disable.

Proposed changes
================

The instance shutdown feature for KVM requires listening for events on
the QEMU Machine Protocol (QMP) Unix socket, which is created together
with a KVM instance. A QMP socket typically looks like
``/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp``. QMP is a
bidirectional protocol that allows Ganeti to send commands, such as
system powerdown, as well as receive events, such as the powerdown and
shutdown events.

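To illustrate the listening side, the following is a minimal Python
sketch, not Ganeti's implementation, of how a client might prepare a QMP
connection for receiving events. It assumes QEMU's default output of
one JSON message per line. The names ``qmp_handshake`` and
``connect_qmp`` are hypothetical; the greeting message and the
``qmp_capabilities`` command are part of QMP itself.

.. code-block:: python

  import json
  import socket


  def qmp_handshake(sock):
      """Read the QMP greeting and leave capabilities negotiation mode.

      Returns a text reader from which later messages can be read,
      one JSON document per line.
      """
      reader = sock.makefile("r", encoding="utf-8")
      greeting = json.loads(reader.readline())
      if "QMP" not in greeting:
          raise ValueError("unexpected QMP greeting: %r" % (greeting,))
      # No other command is accepted until this one has succeeded.
      command = {"execute": "qmp_capabilities"}
      sock.sendall(json.dumps(command).encode("utf-8") + b"\r\n")
      while True:
          reply = json.loads(reader.readline())
          if "return" in reply:
              return reader
          if "error" in reply:
              raise ValueError("negotiation failed: %r" % (reply,))


  def connect_qmp(path):
      """Connect to the QMP Unix socket at ``path`` and negotiate."""
      sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      sock.connect(path)
      return sock, qmp_handshake(sock)

After the handshake, each line read from the returned reader is either a
command reply or an event.
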
Listening in on these events allows Ganeti to determine whether a given
KVM instance was shut down by an administrator, either through
``gnt-instance stop|remove <instance>`` or ``kill -KILL
<instance-pid>``, or by a user, through ``poweroff`` from inside the
instance. Upon an administrator powerdown, QMP sends two events, namely
a powerdown event and a shutdown event, whereas upon a user shutdown
only the shutdown event is sent. This is enough to distinguish between
an administrator and a user shutdown. However, there is one limitation:
``kill -TERM <instance-pid>``. Even though this is an action performed
by the administrator, it will be considered a user shutdown by the
approach described in this document.

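The event-based distinction can be sketched in a few lines of Python.
This is an illustration under stated assumptions rather than the actual
daemon code: ``classify_shutdown`` is a hypothetical name, and the input
is assumed to be an iterable of QMP messages, one JSON document per
line. ``POWERDOWN`` and ``SHUTDOWN`` are the QMP event names for the
powerdown and shutdown events.

.. code-block:: python

  import json


  def classify_shutdown(lines):
      """Classify a stream of QMP messages.

      Returns "administrator" if a POWERDOWN event preceded SHUTDOWN,
      "user" if SHUTDOWN arrived on its own, and None if the stream
      ended without any SHUTDOWN event.
      """
      saw_powerdown = False
      for line in lines:
          line = line.strip()
          if not line:
              continue
          event = json.loads(line).get("event")
          if event == "POWERDOWN":
              saw_powerdown = True
          elif event == "SHUTDOWN":
              return "administrator" if saw_powerdown else "user"
      return None

A result of ``None`` means the connection ended without a shutdown
event, for example after a crash or a ``kill -KILL``.
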
Several design strategies were considered. Most of these strategies
consisted of spawning a process that listens on the QMP socket when a
KVM instance is created. However, having a listener process per KVM
instance is not scalable. Therefore, a different strategy is proposed,
namely, having a single process, called the KVM daemon, listening on the
QMP sockets of all KVM instances within a node. That also means there
is an instance of the KVM daemon on each node.

In order to implement the KVM daemon, two problems need to be addressed,
namely, how the KVM daemon knows when to open a connection to a given
QMP socket and how the KVM daemon communicates to Ganeti whether a
given instance was shut down by an administrator or a user.

QMP connections management
--------------------------

As mentioned before, the QMP sockets reside in the KVM control
directory, which is usually located under
``/var/run/ganeti/kvm-hypervisor/ctrl/``. When a KVM instance is
created, a new QMP socket for this instance is also created in this
directory.

In order to simplify the design of the KVM daemon, Ganeti does not
notify the daemon through a pipe or socket about the creation of a new
KVM instance, and thus a new QMP socket. Instead, the daemon monitors
the KVM control directory using ``inotify``. As a result, the daemon is
not only able to deal with KVM instances being created and removed, but
is also capable of overcoming other problematic situations concerning
the filesystem, such as the KVM control directory not existing because,
for example, Ganeti was not yet started, or the KVM control directory
having been removed, for example, as a result of a Ganeti
reinstallation.

Shutdown detection
------------------

As mentioned before, the KVM daemon is responsible for opening a
connection to the QMP socket of a given instance and listening in on the
shutdown and powerdown events, which allow the KVM daemon to determine
whether the instance stopped because of an administrator or user
shutdown. Once the instance is stopped, the KVM daemon needs to
communicate to Ganeti whether the user was responsible for shutting down
the instance.

In order to achieve this, the KVM daemon writes an empty file, called
the shutdown file, in the KVM control directory with the same name as
the QMP socket file but with the extension ``.qmp`` replaced with
``.shutdown``. The presence of this file indicates that the shutdown
was initiated by a user, whereas the absence of this file indicates that
the shutdown was caused by an administrator. This strategy also handles
crashes and signals such as ``SIGKILL`` correctly, given that in these
cases the KVM daemon never receives the powerdown and shutdown events
and, therefore, never creates the shutdown file.

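The shutdown file convention is simple enough to express directly. The
following Python sketch is illustrative only; the function names are
hypothetical and do not correspond to Ganeti's own code. It derives the
shutdown file name from the QMP socket path, creates the empty file, and
checks for its presence.

.. code-block:: python

  import os


  def shutdown_file_path(qmp_path):
      """Map ``<instance>.qmp`` to ``<instance>.shutdown``."""
      root, ext = os.path.splitext(qmp_path)
      if ext != ".qmp":
          raise ValueError("not a QMP socket path: %r" % (qmp_path,))
      return root + ".shutdown"


  def record_user_shutdown(qmp_path):
      """Create the empty shutdown file next to the QMP socket."""
      path = shutdown_file_path(qmp_path)
      with open(path, "w"):
          pass  # The file carries no content; only its presence matters.
      return path


  def was_user_shutdown(qmp_path):
      """Return True if the shutdown file exists for this instance."""
      return os.path.exists(shutdown_file_path(qmp_path))

Because only the final extension is replaced, instance names that
contain dots, such as fully qualified host names, are left intact.
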
KVM daemon launch
-----------------

With the above issues addressed, a question remains as to when the KVM
daemon should be started. The KVM daemon is different from other Ganeti
daemons, which start together with the Ganeti service, for three
reasons. First, it is specific to KVM and should not be run on
installations containing only Xen. Second, even in a KVM installation,
the user might still choose not to enable it. Finally, the KVM daemon
is not really necessary until the first KVM instance is started. For
these reasons, the KVM daemon is started from within Ganeti when a KVM
instance is started, and the job process spawned by the node daemon is
responsible for starting it.

Given the current design of Ganeti, in which the node daemon spawns a
job process to handle the creation of each instance, several job
processes may try to launch the KVM daemon. When launching the KVM
daemon, it is therefore necessary to first check whether an instance of
this daemon is already running and to start it only if this is not the
case.

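This document does not prescribe how the "already running" check is
performed. One common technique, shown here purely as an assumption and
not as Ganeti's mechanism, is an exclusive advisory lock on a well-known
file. The name ``try_acquire_daemon_lock`` and the lock file location
are hypothetical.

.. code-block:: python

  import fcntl
  import os


  def try_acquire_daemon_lock(lock_path):
      """Try to take an exclusive, non-blocking lock on ``lock_path``.

      Returns the open file descriptor on success, which the daemon
      must keep open for as long as it runs, or None if another
      process already holds the lock.
      """
      fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
      try:
          fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
      except BlockingIOError:
          os.close(fd)
          return None
      return fd

With this approach, the kernel releases the lock when the daemon exits,
even after a crash, so no stale state has to be cleaned up before the
next launch.
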
Design alternatives
===================

At first, it might seem natural to include the instance shutdown
detection for KVM in the node daemon. After all, the node daemon is
already responsible for managing instances, for example, starting and
stopping an instance. Nevertheless, the node daemon is more complicated
than it might seem at first.

The node daemon is composed of the main loop, which runs in the main
thread and is responsible for receiving requests and spawning jobs for
handling these requests, and the jobs, which are independent processes
spawned for executing the actual tasks, such as creating an instance.

Including instance shutdown detection in the node daemon is not viable
because adding it to the main loop would cause KVM-specific code to
taint the generality of the node daemon. In order to add it to the job
processes, it would be possible to spawn either a foreground or a
background process. However, these options are also not viable because
they would lead to the situation described before, where there would be
a monitoring process per instance, which is not scalable. Moreover, the
foreground process has an additional disadvantage: it would require
modifying the node daemon so that it no longer expects jobs to
terminate, which is what the current node daemon design assumes.

There is another design issue to keep in mind. We could reconsider
where to write the data that tells Ganeti whether an instance was
shut down by an administrator or the user. Instead of using the KVM
shutdown files presented above, in which the presence of the file
indicates a user shutdown and its absence an administrator shutdown, we
could store a value in the KVM runtime state file, which is where the
relevant KVM state information is kept. The advantage of this approach
is that it would keep the KVM-related information in one place, thus
making it easier to manage. However, it would lead to a more complex
implementation and, in the context of the general transition in Ganeti
from Python to Haskell, a simpler implementation is preferred.

Finally, it should be noted that the KVM runtime state file benefits
from automatic migration. That is, when an instance is migrated, so is
the KVM state file. However, the instance shutdown detection for KVM
does not require this feature and, in fact, migrating the instance
shutdown state would be incorrect.

Further considerations
======================

There are potential race conditions between Ganeti and the KVM daemon,
although in practice they seem unlikely. For example, the KVM daemon
needs to add and remove watches on the parent directories of the KVM
control directory until this directory is finally created. It is
possible that Ganeti creates this directory and a KVM instance before
the KVM daemon has a chance to add a watch to the KVM control directory,
thus causing this daemon to miss the ``inotify`` creation event for the
QMP socket.

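The choice of which directory to watch while the control directory does
not yet exist can be sketched as follows. This is an illustrative Python
helper with a hypothetical name, ``nearest_existing_dir``; it only picks
the directory and leaves the actual ``inotify`` calls out.

.. code-block:: python

  import os


  def nearest_existing_dir(path):
      """Return ``path`` if it is a directory, else its closest
      existing ancestor directory."""
      current = os.path.abspath(path)
      while not os.path.isdir(current):
          parent = os.path.dirname(current)
          if parent == current:
              break  # Reached the filesystem root.
          current = parent
      return current

A daemon built this way might re-run the lookup after each creation
event and, once the control directory itself is being watched, list its
contents so that sockets created in the meantime are picked up.
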
There are other problems which arise from the limitations of
``inotify``. For example, if the KVM daemon is started after the first
Ganeti instance has been created, then ``inotify`` will not produce
any event for the creation of the QMP socket. This can happen, for
example, if the KVM daemon needs to be restarted or upgraded. As a
result, it might be necessary to have an additional mechanism that runs
at KVM daemon startup or at regular intervals to ensure that the KVM
daemon's internal state is consistent with the actual contents of the
KVM control directory.

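One possible shape for such a consistency check is sketched below in
Python. It is an assumption about how the mechanism could look, not a
description of existing code; ``reconcile`` is a hypothetical name, and
the daemon is assumed to track the file names of the QMP sockets it is
currently connected to.

.. code-block:: python

  import os


  def reconcile(control_dir, known):
      """Compare QMP sockets on disk with those the daemon tracks.

      Returns a pair ``(to_open, to_close)`` of sorted file name
      lists: sockets present but not tracked, and sockets tracked
      but no longer present.
      """
      known = set(known)
      try:
          names = os.listdir(control_dir)
      except FileNotFoundError:
          names = []  # A missing control directory holds no sockets.
      on_disk = {name for name in names if name.endswith(".qmp")}
      return sorted(on_disk - known), sorted(known - on_disk)

Running this once at startup would cover instances created before the
daemon was launched, and running it periodically would cover events
missed for any other reason.
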
Another race condition occurs when Ganeti shuts down a KVM instance
using force. Ganeti uses ``TERM`` signals to stop KVM instances when
force is specified or ACPI is not enabled. However, as mentioned
before, ``TERM`` signals are interpreted by the KVM daemon as a user
shutdown. As a result, the KVM daemon creates a shutdown file, which
then must be removed by Ganeti. The race condition occurs because the
KVM daemon might create the shutdown file after the hypervisor code that
tries to remove this file has already run. In practice, the race
condition seems unlikely because Ganeti stops the KVM instance in a
retry loop, which allows Ganeti to stop the instance and clean up its
runtime information.

It is possible to determine whether a process, in this particular case
the KVM process, was terminated by a ``TERM`` signal, using the `proc
connector and socket filters
<https://web.archive.org/web/20121025062848/http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/>`_.
The proc connector is a socket connected between a userspace process and
the kernel through the netlink protocol and can be used to receive
notifications of process events, and socket filters are a mechanism
for subscribing only to events that are relevant. There are several
`process events <http://lwn.net/Articles/157150/>`_ which can be
subscribed to; however, in this case, we are interested only in the exit
event, which carries information about the exit signal.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: