Revision 333bd799

b/Makefile.am
531 531
	doc/design-hugepages-support.rst \
532 532
	doc/design-impexp2.rst \
533 533
	doc/design-internal-shutdown.rst \
534
	doc/design-kvmd.rst \
534 535
	doc/design-linuxha.rst \
535 536
	doc/design-lu-generated-jobs.rst \
536 537
	doc/design-monitoring-agent.rst \
b/doc/design-kvmd.rst
1
==========
2
KVM daemon
3
==========
4

  
5
.. toctree::
6
   :maxdepth: 2
7

  
8
This design document describes the KVM daemon, which is responsible for
9
determining whether a given KVM instance was shut down by an
10
administrator or a user.
11

  
12

  
13
Current state and shortcomings
14
==============================
15

  
16
This design document describes the KVM daemon which addresses the KVM
17
side of the user-initiated shutdown problem introduced in
18
:doc:`design-internal-shutdown`.  We are also interested in keeping this
19
functionality optional.  That is, an administrator does not necessarily
20
have to run the KVM daemon if they are running Xen or if, while running
21
KVM, they are not interested in instance shutdown detection.
22
This requirement is important because it means the KVM daemon should
23
be a modular component in the overall Ganeti design, i.e., it should
24
be easy to enable and disable it.
25

  
26
Proposed changes
27
================
28

  
29
The instance shutdown feature for KVM requires listening on events from
30
the Qemu Machine Protocol (QMP) Unix socket, which is created together
31
with a KVM instance.  A QMP socket typically looks like
32
``/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp`` and implements
33
the QMP protocol.  This is a bidirectional protocol that allows Ganeti
34
to send commands, such as system powerdown, and to receive events,
35
such as the powerdown and shutdown events.
36

  
37
Listening in on these events allows Ganeti to determine whether a given
38
KVM instance was shut down by an administrator, either through
39
``gnt-instance stop|remove <instance>`` or ``kill -KILL
40
<instance-pid>``, or by a user, through ``poweroff`` from inside the
41
instance.  Upon an administrator powerdown, the QMP protocol sends two
42
events, namely, a powerdown event and a shutdown event, whereas upon a
43
user shutdown only the shutdown event is sent.  This is enough to
44
distinguish between an administrator and a user shutdown.  However,
45
there is one limitation, namely ``kill -TERM <instance-pid>``.  Even
46
though this is an action performed by the administrator, it will be
47
considered a user shutdown by the approach described in this document.
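
The distinction described above can be sketched as a small classifier
over the events read from a QMP socket.  This is only an illustration,
not the daemon's actual code: it assumes one JSON message per line, that
the events are named ``POWERDOWN`` and ``SHUTDOWN`` on the wire, and the
function name ``classify_shutdown`` is hypothetical.

```python
import json


def classify_shutdown(event_lines):
    """Classify a stopped instance from QMP messages (one JSON per line).

    Returns "admin" if a POWERDOWN event preceded the SHUTDOWN event,
    "user" if only SHUTDOWN was seen, and None if no SHUTDOWN event was
    seen at all (for example after a crash or ``kill -KILL``).
    """
    seen_powerdown = False
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        # Command replies and the greeting carry no "event" key.
        event = message.get("event")
        if event == "POWERDOWN":
            seen_powerdown = True
        elif event == "SHUTDOWN":
            return "admin" if seen_powerdown else "user"
    return None
```

Note that a ``kill -TERM`` produces only the shutdown event, so this
sketch reports it as a user shutdown, which matches the limitation
stated above.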
48

  
49
Several design strategies were considered.  Most of these strategies
50
consisted of spawning some process listening on the QMP socket when a
51
KVM instance is created.  However, having a listener process per KVM
52
instance is not scalable.  Therefore, a different strategy is proposed,
53
namely, having a single process, called the KVM daemon, listening on the
54
QMP sockets of all KVM instances within a node.  That also means there
55
is an instance of the KVM daemon on each node.
56

  
57
In order to implement the KVM daemon, two problems need to be addressed,
58
namely, how the KVM daemon knows when to open a connection to a given
59
QMP socket and how the KVM daemon communicates with Ganeti whether a
60
given instance was shut down by an administrator or a user.
61

  
62
QMP connections management
63
--------------------------
64

  
65
As mentioned before, the QMP sockets reside in the KVM control
66
directory, which is usually located under
67
``/var/run/ganeti/kvm-hypervisor/ctrl/``.  When a KVM instance is
68
created, a new QMP socket for this instance is also created in this
69
directory.
70

  
71
In order to simplify the design of the KVM daemon, instead of having
72
Ganeti communicate to this daemon through a pipe or socket the creation
73
of a new KVM instance, and thus a new QMP socket, this daemon will
74
monitor the KVM control directory using ``inotify``.  As a result, the
75
daemon is not only able to deal with KVM instances being created and
76
removed, but also capable of overcoming other problematic situations
77
concerning the filesystem, such as the case when the KVM control
78
directory does not exist because, for example, Ganeti was not yet
79
started, or the KVM control directory was removed, for example, as a
80
result of a Ganeti reinstallation.
81

  
82
Shutdown detection
83
------------------
84

  
85
As mentioned before, the KVM daemon is responsible for opening a
86
connection to the QMP socket of a given instance and listening in on the
87
shutdown and powerdown events, which allow the KVM daemon to determine
88
whether the instance stopped because of an administrator or user
89
shutdown.  Once the instance is stopped, the KVM daemon needs to
90
communicate to Ganeti whether the user was responsible for shutting down
91
the instance.
92

  
93
In order to achieve this, the KVM daemon writes an empty file, called
94
the shutdown file, in the KVM control directory with a name similar to
95
the QMP socket file but with the extension ``.qmp`` replaced with
96
``.shutdown``.  The presence of this file indicates that the shutdown
97
was initiated by a user, whereas the absence of this file indicates that
98
the shutdown was caused by an administrator.  This strategy also allows
99
crashes and signals, such as ``SIGKILL``, to be handled correctly,
100
given that in these cases the KVM daemon never receives the powerdown
101
and shutdown events and, therefore, never creates the shutdown file.
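
The shutdown file convention can be illustrated with a few helper
functions.  This is a minimal sketch under the naming rule stated above
(``.qmp`` replaced with ``.shutdown``); the function names are
hypothetical and do not come from the Ganeti code base.

```python
import os

QMP_SUFFIX = ".qmp"
SHUTDOWN_SUFFIX = ".shutdown"


def shutdown_file_for(qmp_socket_path):
    """Derive the shutdown file path from a QMP socket path."""
    if not qmp_socket_path.endswith(QMP_SUFFIX):
        raise ValueError("not a QMP socket path: %r" % qmp_socket_path)
    return qmp_socket_path[:-len(QMP_SUFFIX)] + SHUTDOWN_SUFFIX


def mark_user_shutdown(qmp_socket_path):
    """Create the empty shutdown file (daemon side)."""
    path = shutdown_file_for(qmp_socket_path)
    with open(path, "w"):
        pass  # the file is intentionally left empty
    return path


def was_user_shutdown(qmp_socket_path):
    """Check for the shutdown file (Ganeti side)."""
    return os.path.exists(shutdown_file_for(qmp_socket_path))
```

Because only the presence of the file carries meaning, a daemon that
never sees the shutdown event simply leaves nothing behind, and the
reader falls back to the administrator case.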
102

  
103
KVM daemon launch
104
-----------------
105

  
106
With the above issues addressed, a question remains as to when the KVM
107
daemon should be started.  The KVM daemon is different from other Ganeti
108
daemons, which start together with the Ganeti service, because the KVM
109
daemon is optional, given that it is specific to KVM and should not be
110
run on installations containing only Xen, and, even in a KVM
111
installation, the user might still choose not to enable it.  Moreover,
112
the KVM daemon is not really necessary until the first KVM
113
instance is started.  For these reasons, the KVM daemon is started from
114
within Ganeti when a KVM instance is started: the job process
115
spawned by the node daemon is responsible for starting the KVM daemon.
116

  
117
Given the current design of Ganeti, in which the node daemon spawns a
118
job process to handle the creation of the instance, when launching the
119
KVM daemon it is necessary to first check whether an instance of this
120
daemon is already running and, if this is not the case, then the KVM
121
daemon can be safely started.
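
One possible way to perform the "already running" check is an advisory
lock on a well-known file.  The document does not say how the check is
implemented, so the lock-file approach, the function name, and any lock
path passed to it are assumptions made purely for illustration.

```python
import fcntl
import os


def try_acquire_daemon_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns an open file descriptor holding the lock, or None if some
    other holder (presumably a running daemon) already has it.  The
    caller must keep the descriptor open for as long as it runs.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # The lock is held elsewhere: do not start a second daemon.
        os.close(fd)
        return None
    return fd
```

With such a check, several job processes starting KVM instances at the
same time could all attempt the launch, and only the one that obtains
the lock would go on to run the daemon.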
122

  
123
Design alternatives
124
===================
125

  
126
At first, it might seem natural to include the instance shutdown
127
detection for KVM in the node daemon.  After all, the node daemon is
128
already responsible for managing instances, for example, starting and
129
stopping an instance.  Nevertheless, the node daemon is more complicated
130
than it might seem at first.
131

  
132
The node daemon is composed of the main loop, which runs in the main
133
thread and is responsible for receiving requests and spawning jobs for
134
handling these requests, and the jobs, which are independent processes
135
spawned for executing the actual tasks, such as creating an instance.
136

  
137
Including instance shutdown detection in the node daemon is not viable
138
because adding it to the main loop would cause KVM-specific code to
139
taint the generality of the node daemon.  In order to add it to the job
140
processes, it would be possible to spawn either a foreground or a
141
background process.  However, these options are also not viable because
142
they would lead to the situation described before where there would be a
143
monitoring process per instance, which is not scalable.  Moreover, the
144
foreground process has an additional disadvantage: it would require
145
modifications to the node daemon in order not to expect a terminating job,
146
which is the current node daemon design.
147

  
148
There is another design issue to keep in mind.  We could reconsider
149
where to write the data that tells Ganeti whether an instance was
150
shut down by an administrator or the user.  Instead of using the KVM
151
shutdown files presented above, in which the presence of the file
152
indicates a user shutdown and its absence an administrator shutdown, we
153
could store a value in the KVM runtime state file, which is where the
154
relevant KVM state information is.  The advantage of this approach is
155
that it would keep the KVM-related information in one place, thus making
156
it easier to manage.  However, it would lead to a more complex
157
implementation and, in the context of the general transition in Ganeti
158
from Python to Haskell, a simpler implementation is preferred.
159

  
160
Finally, it should be noted that the KVM runtime state file benefits
161
from automatic migration.  That is, when an instance is migrated so is
162
the KVM state file.  However, the instance shutdown detection for KVM
163
does not require this feature and, in fact, migrating the instance
164
shutdown state would be incorrect.
165

  
166
Further considerations
167
======================
168

  
169
There are potential race conditions between Ganeti and the KVM daemon,
170
although in practice they seem unlikely.  For example, the KVM daemon
171
needs to add and remove watches to the parent directories of the KVM
172
control directory until this directory is finally created.  It is
173
possible that Ganeti creates this directory and a KVM instance before
174
the KVM daemon has a chance to add a watch to the KVM control directory,
175
thus causing this daemon to miss the ``inotify`` creation event for the
176
QMP socket.
177

  
178
There are other problems which arise from the limitations of
179
``inotify``.  For example, if the KVM daemon is started after the first
180
Ganeti instance has been created, then ``inotify`` will not produce
181
any event for the creation of the QMP socket.  This can happen, for
182
example, if the KVM daemon needs to be restarted or upgraded.  As a
183
result, it might be necessary to have an additional mechanism that runs
184
at KVM daemon startup or at regular intervals to ensure that the current
185
KVM internal state is consistent with the actual contents of the KVM
186
control directory.
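
Such a consistency check could be as simple as comparing the QMP sockets
the daemon believes it is following with a fresh listing of the control
directory.  The sketch below is an assumption about how this might look;
``reconcile`` and its arguments are hypothetical names.

```python
import os


def reconcile(control_dir, known_sockets):
    """Compare the daemon's view with the control directory contents.

    known_sockets is the collection of QMP socket file names the daemon
    currently follows.  Returns (to_open, to_close): sockets that exist
    on disk but are not followed yet, and followed sockets that have
    disappeared.  A missing control directory is treated as empty.
    """
    try:
        names = os.listdir(control_dir)
    except FileNotFoundError:
        names = []
    present = {name for name in names if name.endswith(".qmp")}
    known = set(known_sockets)
    return sorted(present - known), sorted(known - present)
```

Running this once at daemon startup would cover instances created
before the daemon was started, and running it periodically would also
catch events missed while watches were being set up.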
187

  
188
Another race condition occurs when Ganeti shuts down a KVM instance
189
using force.  Ganeti uses ``TERM`` signals to stop KVM instances when
190
force is specified or ACPI is not enabled.  However, as mentioned
191
before, ``TERM`` signals are interpreted by the KVM daemon as a user
192
shutdown.  As a result, the KVM daemon creates a shutdown file which
193
then must be removed by Ganeti.  The race condition occurs because the
194
KVM daemon might create the shutdown file after the hypervisor code that
195
tries to remove this file has already run.  In practice, the race
196
condition seems unlikely because Ganeti stops the KVM instance in a
197
retry loop, which allows Ganeti to stop the instance and clean up its
198
runtime information.
199

  
200
It is possible to determine if a process, in this particular case the
201
KVM process, was terminated by a ``TERM`` signal, using the `proc
202
connector and socket filters
203
<https://web.archive.org/web/20121025062848/http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/>`_.
204
The proc connector is a socket connected between a userspace process and
205
the kernel through the netlink protocol and can be used to receive
206
notifications of process events, and socket filters are a mechanism
207
for subscribing only to events that are relevant.  There are several
208
`process events <http://lwn.net/Articles/157150/>`_ which can be
209
subscribed to; in this case, however, we are interested only in the exit
210
event, which carries information about the exit signal.
211

  
212
.. vim: set textwidth=72 :
213
.. Local Variables:
214
.. mode: rst
215
.. fill-column: 72
216
.. End:
b/doc/index.rst
116 116
   design-device-uuid-name.rst
117 117
   design-hroller.rst
118 118
   design-hotplug.rst
119
   design-kvmd.rst
119 120
   design-linuxha.rst
120 121
   design-lu-generated-jobs.rst
121 122
   design-monitoring-agent.rst
