==========
KVM daemon
==========

.. toctree::
   :maxdepth: 2

This design document describes the KVM daemon, which is responsible for
determining whether a given KVM instance was shut down by an
administrator or a user.


Current state and shortcomings
==============================

This design document describes the KVM daemon, which addresses the KVM
side of the user-initiated shutdown problem introduced in
:doc:`design-internal-shutdown`. We are also interested in keeping this
functionality optional. That is, an administrator does not have to run
the KVM daemon when running Xen, or when running KVM without any
interest in instance shutdown detection. This requirement is important
because it means the KVM daemon should be a modular component in the
overall Ganeti design, i.e., it should be easy to enable and disable.

Proposed changes
================

The instance shutdown feature for KVM requires listening for events on
the QEMU Machine Protocol (QMP) Unix socket, which is created together
with a KVM instance. A QMP socket typically looks like
``/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp``. QMP is a
bidirectional protocol that allows Ganeti to send commands, such as
system powerdown, as well as receive events, such as the powerdown and
shutdown events.

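As an illustration, the following minimal Python sketch shows how a
listener might interpret the messages read from a QMP socket. It assumes
that QEMU emits one JSON object per line: the server opens with a
greeting carrying a ``QMP`` key, the client must send
``qmp_capabilities`` before anything else, and asynchronous events carry
an ``event`` key. The helper names ``encode_command``,
``classify_message`` and ``listen`` are hypothetical and not part of
Ganeti, and error handling is reduced to the bare minimum.

```python
import json
import socket


def encode_command(name):
    """Serialize a QMP command without arguments as one JSON line."""
    return (json.dumps({"execute": name}) + "\r\n").encode("utf-8")


def classify_message(line):
    """Return a (kind, payload) pair for one JSON line read from QMP."""
    message = json.loads(line)
    if "QMP" in message:
        return ("greeting", message["QMP"])
    if "event" in message:
        return ("event", message["event"])
    if "return" in message:
        return ("return", message["return"])
    if "error" in message:
        return ("error", message["error"])
    return ("unknown", message)


def listen(path):
    """Yield event names from the QMP socket at path until it closes."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        stream = sock.makefile("rwb")
        greeting = stream.readline()
        if not greeting or classify_message(greeting)[0] != "greeting":
            raise ConnectionError("no QMP greeting received")
        # Commands and events are only available after negotiation.
        stream.write(encode_command("qmp_capabilities"))
        stream.flush()
        for line in stream:
            kind, payload = classify_message(line)
            if kind == "event":
                yield payload
```

With such helpers, ``encode_command("system_powerdown")`` would produce
the powerdown command mentioned above, and ``POWERDOWN`` and
``SHUTDOWN`` are the event names of interest to the listener.
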
Listening in on these events allows Ganeti to determine whether a given
KVM instance was shut down by an administrator, either through
``gnt-instance stop|remove <instance>`` or ``kill -KILL
<instance-pid>``, or by a user, through ``poweroff`` from inside the
instance. Upon an administrator powerdown, QMP sends two events,
namely, a powerdown event and a shutdown event, whereas upon a user
shutdown only the shutdown event is sent. This is enough to
distinguish between an administrator and a user shutdown. However,
there is one limitation, namely ``kill -TERM <instance-pid>``. Even
though this is an action performed by the administrator, it will be
considered a user shutdown by the approach described in this document.

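The rule above can be sketched as a small Python function. This is only
an illustration under the assumption that the listener collects QMP
event names in arrival order; the function name ``classify_shutdown``
and the return values are hypothetical.

```python
def classify_shutdown(events):
    """Classify a stopped instance from the QMP event names seen so far.

    Returns "admin" if a POWERDOWN event preceded the SHUTDOWN event,
    "user" if SHUTDOWN arrived on its own, and None if no SHUTDOWN
    event was seen at all.
    """
    seen_powerdown = False
    for name in events:
        if name == "POWERDOWN":
            seen_powerdown = True
        elif name == "SHUTDOWN":
            return "admin" if seen_powerdown else "user"
    return None
```

Note that a ``kill -TERM`` of the KVM process would end up in the
``"user"`` branch, which is exactly the limitation described above.
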
Several design strategies were considered. Most of these strategies
consisted of spawning some process listening on the QMP socket when a
KVM instance is created. However, having a listener process per KVM
instance does not scale. Therefore, a different strategy is proposed,
namely, having a single process, called the KVM daemon, listening on the
QMP sockets of all KVM instances within a node. That also means there
is an instance of the KVM daemon on each node.

In order to implement the KVM daemon, two problems need to be addressed,
namely, how the KVM daemon knows when to open a connection to a given
QMP socket and how the KVM daemon communicates to Ganeti whether a
given instance was shut down by an administrator or a user.

QMP connections management
--------------------------

As mentioned before, the QMP sockets reside in the KVM control
directory, which is usually located under
``/var/run/ganeti/kvm-hypervisor/ctrl/``. When a KVM instance is
created, a new QMP socket for this instance is also created in this
directory.

In order to simplify the design of the KVM daemon, Ganeti does not
notify this daemon through a pipe or socket about the creation of a new
KVM instance, and thus a new QMP socket. Instead, the daemon monitors
the KVM control directory using ``inotify``. As a result, the daemon is
not only able to deal with KVM instances being created and removed, but
is also capable of overcoming other problematic situations concerning
the filesystem, such as the KVM control directory not existing because,
for example, Ganeti was not yet started, or the KVM control directory
having been removed, for example, as a result of a Ganeti
reinstallation.

Shutdown detection
------------------

As mentioned before, the KVM daemon is responsible for opening a
connection to the QMP socket of a given instance and listening in on the
shutdown and powerdown events, which allow the KVM daemon to determine
whether the instance stopped because of an administrator or user
shutdown. Once the instance is stopped, the KVM daemon needs to
communicate to Ganeti whether the user was responsible for shutting down
the instance.

In order to achieve this, the KVM daemon writes an empty file, called
the shutdown file, in the KVM control directory with a name similar to
the QMP socket file but with the extension ``.qmp`` replaced with
``.shutdown``. The presence of this file indicates that the shutdown
was initiated by a user, whereas the absence of this file indicates that
the shutdown was caused by an administrator. This strategy also allows
crashes and signals, such as ``SIGKILL``, to be handled correctly,
given that in these cases the KVM daemon never receives the powerdown
and shutdown events and, therefore, never creates the shutdown file.

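The shutdown file convention can be illustrated with a few lines of
Python. This is a sketch only; the function names are hypothetical and
do not correspond to actual Ganeti code.

```python
import os


def shutdown_file_path(qmp_socket_path):
    """Return the shutdown file path for a given QMP socket path."""
    base, extension = os.path.splitext(qmp_socket_path)
    if extension != ".qmp":
        raise ValueError("not a QMP socket path: %r" % qmp_socket_path)
    return base + ".shutdown"


def mark_user_shutdown(qmp_socket_path):
    """Create the empty shutdown file (the KVM daemon side)."""
    with open(shutdown_file_path(qmp_socket_path), "w"):
        pass


def was_user_shutdown(qmp_socket_path):
    """Check for the shutdown file (the Ganeti side)."""
    return os.path.exists(shutdown_file_path(qmp_socket_path))
```

Because only the final extension is replaced, an instance name that
itself contains dots, such as a fully qualified domain name, keeps its
full name in the shutdown file.
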
KVM daemon launch
-----------------

With the above issues addressed, a question remains as to when the KVM
daemon should be started. The KVM daemon is different from other Ganeti
daemons, which start together with the Ganeti service, because the KVM
daemon is optional: it is specific to KVM and should not be run on
installations containing only Xen, and, even in a KVM installation, the
user might still choose not to enable it. In addition, the KVM daemon
is not really necessary until the first KVM instance is started. For
these reasons, the KVM daemon is started from within Ganeti when a KVM
instance is started, and the job process spawned by the node daemon is
responsible for starting the KVM daemon.

Given the current design of Ganeti, in which the node daemon spawns a
job process to handle the creation of the instance, when launching the
KVM daemon it is necessary to first check whether an instance of this
daemon is already running. Only if this is not the case can the KVM
daemon be safely started.

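This document does not prescribe how the "already running" check is
performed. One common technique, assumed here purely for illustration,
is an exclusive ``flock`` lock on a PID file that the daemon holds for
its whole lifetime. The function name ``try_acquire_pidfile`` and the
PID file location are hypothetical.

```python
import fcntl
import os


def try_acquire_pidfile(path):
    """Try to become the only running daemon instance.

    Returns an open file object holding an exclusive lock on the PID
    file, or None if another process (or open file) already holds it.
    The lock is released when the returned file object is closed.
    """
    handle = open(path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # We own the lock: replace any stale PID with our own.
    handle.seek(0)
    handle.truncate()
    handle.write("%d\n" % os.getpid())
    handle.flush()
    return handle
```

If the daemon itself takes the lock at startup and exits when it cannot,
several job processes may attempt to launch it concurrently without
ending up with more than one running instance.
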
Design alternatives
===================

At first, it might seem natural to include the instance shutdown
detection for KVM in the node daemon. After all, the node daemon is
already responsible for managing instances, for example, starting and
stopping an instance. Nevertheless, the node daemon is more complicated
than it might seem at first.

The node daemon is composed of the main loop, which runs in the main
thread and is responsible for receiving requests and spawning jobs for
handling these requests, and the jobs, which are independent processes
spawned for executing the actual tasks, such as creating an instance.

Including instance shutdown detection in the node daemon is not viable
because adding it to the main loop would cause KVM-specific code to
taint the generality of the node daemon. In order to add it to the job
processes, it would be possible to spawn either a foreground or a
background process. However, these options are also not viable because
they would lead to the situation described before where there would be a
monitoring process per instance, which does not scale. Moreover, the
foreground process has an additional disadvantage: it would require
modifying the node daemon so that it does not expect a job to
terminate, which is what the current node daemon design assumes.

There is another design issue to keep in mind. We could reconsider
where to write the data that tells Ganeti whether an instance was
shut down by an administrator or the user. Instead of using the KVM
shutdown files presented above, in which the presence of the file
indicates a user shutdown and its absence an administrator shutdown, we
could store a value in the KVM runtime state file, which is where the
relevant KVM state information is kept. The advantage of this approach
is that it would keep the KVM-related information in one place, thus
making it easier to manage. However, it would lead to a more complex
implementation and, in the context of the general transition in Ganeti
from Python to Haskell, a simpler implementation is preferred.

Finally, it should be noted that the KVM runtime state file benefits
from automatic migration. That is, when an instance is migrated so is
the KVM state file. However, the instance shutdown detection for KVM
does not require this feature and, in fact, migrating the instance
shutdown state would be incorrect.

Further considerations
======================

There are potential race conditions between Ganeti and the KVM daemon;
however, in practice they seem unlikely. For example, the KVM daemon
needs to add and remove watches on the parent directories of the KVM
control directory until this directory is finally created. It is
possible that Ganeti creates this directory and a KVM instance before
the KVM daemon has a chance to add a watch to the KVM control directory,
thus causing this daemon to miss the ``inotify`` creation event for the
QMP socket.

There are other problems which arise from the limitations of
``inotify``. For example, if the KVM daemon is started after the first
Ganeti instance has been created, then ``inotify`` will not produce
any event for the creation of the QMP socket. This can happen, for
example, if the KVM daemon needs to be restarted or upgraded. As a
result, it might be necessary to have an additional mechanism that runs
at KVM daemon startup or at regular intervals to ensure that the
internal state of the KVM daemon is consistent with the actual contents
of the KVM control directory.

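Such a consistency check could look roughly like the following Python
sketch, which compares the set of QMP sockets the daemon is currently
connected to with the contents of the control directory. The names
``list_qmp_sockets`` and ``reconcile`` are hypothetical, and
representing the daemon state as a plain set of paths is an assumption
made for the sake of the example.

```python
import os


def list_qmp_sockets(control_dir):
    """Return the set of ``.qmp`` paths currently in control_dir."""
    try:
        names = os.listdir(control_dir)
    except FileNotFoundError:
        # The control directory may not exist yet.
        return set()
    return {os.path.join(control_dir, name)
            for name in names if name.endswith(".qmp")}


def reconcile(connected, control_dir):
    """Return (to_open, to_close) for the daemon's connection set.

    ``connected`` is the set of QMP socket paths the daemon believes
    it is connected to.
    """
    present = list_qmp_sockets(control_dir)
    return present - connected, connected - present
```

Running ``reconcile`` once at startup with an empty set would yield all
existing sockets in ``to_open``, which covers the restart and upgrade
cases mentioned above.
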
Another race condition occurs when Ganeti shuts down a KVM instance
using force. Ganeti uses ``TERM`` signals to stop KVM instances when
force is specified or ACPI is not enabled. However, as mentioned
before, ``TERM`` signals are interpreted by the KVM daemon as a user
shutdown. As a result, the KVM daemon creates a shutdown file which
then must be removed by Ganeti. The race condition occurs because the
KVM daemon might create the shutdown file after the hypervisor code that
tries to remove this file has already run. In practice, the race
condition seems unlikely because Ganeti stops the KVM instance in a
retry loop, which allows Ganeti to stop the instance and clean up its
runtime information.

It is possible to determine whether a process, in this particular case
the KVM process, was terminated by a ``TERM`` signal, using the `proc
connector and socket filters
<https://web.archive.org/web/20121025062848/http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/>`_.
The proc connector is a socket connected between a userspace process and
the kernel through the netlink protocol and can be used to receive
notifications of process events, and socket filters are a mechanism
for subscribing only to events that are relevant. There are several
`process events <http://lwn.net/Articles/157150/>`_ which can be
subscribed to; however, in this case, we are interested only in the exit
event, which carries information about the exit signal.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: