==========
KVM daemon
==========

.. toctree::
   :maxdepth: 2

This design document describes the KVM daemon, which is responsible for
determining whether a given KVM instance was shut down by an
administrator or a user.


Current state and shortcomings
==============================

The KVM daemon addresses the KVM side of the user-initiated shutdown
problem introduced in :doc:`design-internal-shutdown`.  We are also
interested in keeping this functionality optional.  That is, an
administrator does not have to run the KVM daemon if they are running
Xen or if, while running KVM, they are not interested in instance
shutdown detection.  This requirement is important because it means the
KVM daemon should be a modular component in the overall Ganeti design,
i.e., it should be easy to enable and disable.

Proposed changes
================

The instance shutdown feature for KVM requires listening for events on
the QEMU Machine Protocol (QMP) Unix socket, which is created together
with a KVM instance.  A QMP socket typically looks like
``/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp``.  QMP is a
bidirectional protocol that allows Ganeti to send commands, such as
system powerdown, as well as receive events, such as the powerdown and
shutdown events.

Listening in on these events allows Ganeti to determine whether a given
KVM instance was shut down by an administrator, either through
``gnt-instance stop|remove <instance>`` or ``kill -KILL
<instance-pid>``, or by a user, through ``poweroff`` from inside the
instance.  Upon an administrator powerdown, QMP sends two events,
namely a powerdown event and a shutdown event, whereas upon a user
shutdown only the shutdown event is sent.  This is enough to
distinguish between an administrator and a user shutdown.  However,
there is one limitation, namely ``kill -TERM <instance-pid>``.  Even
though this is an action performed by the administrator, it will be
considered a user shutdown by the approach described in this document.
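
The event-based distinction described above can be sketched as a small
classifier.  This is only an illustration, not the daemon's actual
code: it assumes that QMP messages arrive as one JSON object per line
and that the powerdown and shutdown events carry the names
``POWERDOWN`` and ``SHUTDOWN``; the function name ``classify_shutdown``
is hypothetical.

```python
import json


def classify_shutdown(event_lines):
    """Classify a stop from the QMP messages seen on one instance's socket.

    Returns "admin" if a POWERDOWN event was seen before SHUTDOWN,
    "user" if only SHUTDOWN was seen, and None if no SHUTDOWN event
    arrived at all (for example, after a crash or SIGKILL).
    """
    seen_powerdown = False
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if not isinstance(message, dict):
            continue
        # Greetings and command replies carry no "event" key.
        event = message.get("event")
        if event == "POWERDOWN":
            seen_powerdown = True
        elif event == "SHUTDOWN":
            return "admin" if seen_powerdown else "user"
    return None
```

Note that a ``kill -TERM`` on the KVM process would be classified as
``"user"`` by this sketch, which matches the limitation stated above.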

Several design strategies were considered.  Most of these strategies
consisted of spawning a process that listens on the QMP socket when a
KVM instance is created.  However, having a listener process per KVM
instance does not scale.  Therefore, a different strategy is proposed,
namely, having a single process, called the KVM daemon, listening on the
QMP sockets of all KVM instances within a node.  That also means there
is an instance of the KVM daemon on each node.

In order to implement the KVM daemon, two problems need to be addressed,
namely, how the KVM daemon knows when to open a connection to a given
QMP socket and how the KVM daemon communicates to Ganeti whether a
given instance was shut down by an administrator or a user.

QMP connections management
--------------------------

As mentioned before, the QMP sockets reside in the KVM control
directory, which is usually located under
``/var/run/ganeti/kvm-hypervisor/ctrl/``.  When a KVM instance is
created, a new QMP socket for this instance is also created in this
directory.

In order to simplify the design of the KVM daemon, Ganeti does not
notify this daemon through a pipe or socket when a new KVM instance,
and thus a new QMP socket, is created.  Instead, the daemon monitors
the KVM control directory using ``inotify``.  As a result, the daemon is
not only able to deal with KVM instances being created and removed, but
is also capable of overcoming other problematic situations concerning
the filesystem, such as the case when the KVM control directory does
not exist because, for example, Ganeti was not yet started, or the case
when the KVM control directory was removed, for example, as a result of
a Ganeti reinstallation.

Shutdown detection
------------------

As mentioned before, the KVM daemon is responsible for opening a
connection to the QMP socket of a given instance and listening in on the
shutdown and powerdown events, which allow the KVM daemon to determine
whether the instance stopped because of an administrator or user
shutdown.  Once the instance is stopped, the KVM daemon needs to
communicate to Ganeti whether the user was responsible for shutting down
the instance.

In order to achieve this, the KVM daemon writes an empty file, called
the shutdown file, in the KVM control directory with the same name as
the QMP socket file but with the extension ``.qmp`` replaced with
``.shutdown``.  The presence of this file indicates that the shutdown
was initiated by a user, whereas the absence of this file indicates that
the shutdown was caused by an administrator.  This strategy also handles
crashes and signals such as ``SIGKILL`` correctly, given that in these
cases the KVM daemon never receives the powerdown and shutdown events
and, therefore, never creates the shutdown file.
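
The shutdown file convention can be illustrated with a minimal sketch.
The helper names below (``shutdown_file_for``, ``mark_user_shutdown``
and ``was_user_shutdown``) are hypothetical and only show the naming
rule and the presence check; they are not taken from the Ganeti code
base.

```python
import os

QMP_SUFFIX = ".qmp"
SHUTDOWN_SUFFIX = ".shutdown"


def shutdown_file_for(qmp_socket_path):
    """Map '<dir>/<instance>.qmp' to '<dir>/<instance>.shutdown'."""
    if not qmp_socket_path.endswith(QMP_SUFFIX):
        raise ValueError("not a QMP socket path: %r" % qmp_socket_path)
    return qmp_socket_path[:-len(QMP_SUFFIX)] + SHUTDOWN_SUFFIX


def mark_user_shutdown(qmp_socket_path):
    """Create the empty shutdown file (what the daemon would do)."""
    path = shutdown_file_for(qmp_socket_path)
    with open(path, "w"):
        pass  # the file carries no content; only its presence matters
    return path


def was_user_shutdown(qmp_socket_path):
    """Check for the shutdown file (what Ganeti would do)."""
    return os.path.exists(shutdown_file_for(qmp_socket_path))
```

Because only the presence of the file matters, a reader never has to
parse anything, which keeps both sides of the protocol simple.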

KVM daemon launch
-----------------

With the above issues addressed, a question remains as to when the KVM
daemon should be started.  The KVM daemon is different from other Ganeti
daemons, which start together with the Ganeti service, for two reasons.
First, the KVM daemon is optional: it is specific to KVM and should not
be run on installations containing only Xen, and, even in a KVM
installation, the user might still choose not to enable it.  Second,
the KVM daemon is not really necessary until the first KVM instance is
started.  For these reasons, the KVM daemon is started from within
Ganeti when a KVM instance is started, and the job process spawned by
the node daemon is responsible for starting it.

Given the current design of Ganeti, in which the node daemon spawns a
job process to handle the creation of the instance, launching the KVM
daemon requires first checking whether an instance of this daemon is
already running; only if this is not the case can the KVM daemon be
safely started.
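
One way to perform such a check is sketched below.  This document does
not say how the check is implemented, so the PID file approach, the
``start`` callback and all function names here are assumptions made
purely for illustration.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def daemon_is_running(pidfile):
    """Return True if the PID file names a live process."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable or malformed PID file
    return pid > 0 and pid_is_alive(pid)


def ensure_daemon(pidfile, start):
    """Call start() unless the daemon already runs; report whether it ran."""
    if daemon_is_running(pidfile):
        return False
    start()
    return True
```

A check followed by a start is not atomic: two job processes could both
find no running daemon and both try to start one, so a real
implementation would additionally need some form of locking.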

Design alternatives
===================

At first, it might seem natural to include the instance shutdown
detection for KVM in the node daemon.  After all, the node daemon is
already responsible for managing instances, for example, starting and
stopping an instance.  Nevertheless, the node daemon is more complicated
than it might seem at first.

The node daemon is composed of the main loop, which runs in the main
thread and is responsible for receiving requests and spawning jobs for
handling these requests, and the jobs, which are independent processes
spawned for executing the actual tasks, such as creating an instance.

Including instance shutdown detection in the node daemon is not viable
because adding it to the main loop would cause KVM-specific code to
taint the generality of the node daemon.  In order to add it to the job
processes, it would be possible to spawn either a foreground or a
background process.  However, these options are also not viable because
they would lead to the situation described before, where there would be
a monitoring process per instance, which does not scale.  Moreover, the
foreground process has an additional disadvantage: it would require
modifications to the node daemon so that it no longer expects a job to
terminate, which is what the current node daemon design assumes.

There is another design issue to keep in mind.  We could reconsider
where to write the data that tells Ganeti whether an instance was shut
down by an administrator or the user.  Instead of using the KVM
shutdown files presented above, in which the presence of the file
indicates a user shutdown and its absence an administrator shutdown, we
could store a value in the KVM runtime state file, which is where the
relevant KVM state information is.  The advantage of this approach is
that it would keep the KVM-related information in one place, thus making
it easier to manage.  However, it would lead to a more complex
implementation and, in the context of the general transition in Ganeti
from Python to Haskell, a simpler implementation is preferred.

Finally, it should be noted that the KVM runtime state file benefits
from automatic migration.  That is, when an instance is migrated, so is
the KVM state file.  However, the instance shutdown detection for KVM
does not require this feature and, in fact, migrating the instance
shutdown state would be incorrect.

Further considerations
======================

There are potential race conditions between Ganeti and the KVM daemon;
however, in practice they seem unlikely.  For example, the KVM daemon
needs to add and remove watches on the parent directories of the KVM
control directory until this directory is finally created.  It is
possible that Ganeti creates this directory and a KVM instance before
the KVM daemon has a chance to add a watch on the KVM control directory,
thus causing this daemon to miss the ``inotify`` creation event for the
QMP socket.

There are other problems which arise from the limitations of
``inotify``.  For example, if the KVM daemon is started after the first
KVM instance has been created, then ``inotify`` will not produce any
event for the creation of the QMP socket.  This can happen, for
example, if the KVM daemon needs to be restarted or upgraded.  As a
result, it might be necessary to have an additional mechanism that runs
at KVM daemon startup or at regular intervals to ensure that the KVM
daemon's internal state is consistent with the actual contents of the
KVM control directory.
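
Such a consistency check could look like the following sketch.  It is
only an illustration under stated assumptions: the function name
``reconcile`` and the representation of the daemon's internal state as
a set of socket paths are hypothetical.

```python
import os


def reconcile(control_dir, known_sockets):
    """Compare the daemon's known QMP sockets with the control directory.

    Returns a pair (to_open, to_close): socket paths present on disk
    but unknown to the daemon, and paths the daemon still tracks that
    are no longer on disk.
    """
    try:
        names = os.listdir(control_dir)
    except FileNotFoundError:
        names = []  # the control directory does not exist (yet)
    present = {os.path.join(control_dir, name)
               for name in names if name.endswith(".qmp")}
    known = set(known_sockets)
    return sorted(present - known), sorted(known - present)
```

Running this once at startup would cover sockets created before the
daemon was (re)started; running it periodically would also cover
creation events missed because of the watch race described above.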

Another race condition occurs when Ganeti shuts down a KVM instance
using force.  Ganeti uses ``TERM`` signals to stop KVM instances when
force is specified or ACPI is not enabled.  However, as mentioned
before, ``TERM`` signals are interpreted by the KVM daemon as a user
shutdown.  As a result, the KVM daemon creates a shutdown file which
then must be removed by Ganeti.  The race condition occurs because the
KVM daemon might create the shutdown file after the hypervisor code that
tries to remove this file has already run.  In practice, the race
condition seems unlikely because Ganeti stops the KVM instance in a
retry loop, which allows Ganeti to stop the instance and clean up its
runtime information.

It is possible to determine if a process, in this particular case the
KVM process, was terminated by a ``TERM`` signal, using the `proc
connector and socket filters
<https://web.archive.org/web/20121025062848/http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/>`_.
The proc connector is a socket connected between a userspace process and
the kernel through the netlink protocol and can be used to receive
notifications of process events, and socket filters are a mechanism
for subscribing only to events that are relevant.  There are several
`process events <http://lwn.net/Articles/157150/>`_ which can be
subscribed to; however, in this case, we are interested only in the exit
event, which carries information about the exit signal.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: