=================
Ganeti 2.2 design
=================

This document describes the major changes in Ganeti 2.2 compared to
the 2.1 version.

The 2.2 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.1, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Background
==========

Overview
========

Detailed design
===============

As for 2.1, we divide the 2.2 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
------------

Remote procedure call timeouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

The current RPC protocol used by Ganeti is based on HTTP. Every request
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``)
and doesn't return until the function called has returned. Parameters
and return values are encoded using JSON.

On the server side, ``ganeti-noded`` handles every incoming connection
in a separate process by forking just after accepting the connection.
This process exits after sending the response.

There is one major problem with this design: timeouts cannot be used on
a per-request basis. Neither client nor server knows how long a request
will take. Even if we might be able to group requests into different
categories (e.g. fast and slow), this is not reliable.

If a node has an issue or the network connection fails while a request
is being handled, the master daemon can wait for a long time for the
connection to time out (e.g. due to the operating system's underlying
TCP keep-alive packets or timeouts). While the settings for keep-alive
packets can be changed using Linux-specific socket options, we prefer to
use application-level timeouts because these cover both machine down and
unresponsive node daemon cases.

Proposed changes
++++++++++++++++

RPC glossary
^^^^^^^^^^^^

Function call ID
  Unique identifier returned by ``ganeti-noded`` after invoking a
  function.
Function process
  Process started by ``ganeti-noded`` to call actual (backend) function.

Protocol
^^^^^^^^

Initially we chose HTTP as our RPC protocol because there were existing
libraries, which, unfortunately, turned out to miss important features
(such as SSL certificate authentication) and we had to write our own.

This proposal can easily be implemented using HTTP, though it would
likely be more efficient and less complicated to use the LUXI protocol
already used to communicate between client tools and the Ganeti master
daemon. Switching to another protocol can occur at a later point. This
proposal should be implemented using HTTP as its underlying protocol.

The LUXI protocol currently contains two functions, ``WaitForJobChange``
and ``AutoArchiveJobs``, which can take a long time. They both support
a parameter to specify the timeout. This timeout is usually chosen as
roughly half of the socket timeout, guaranteeing a response before the
socket times out. After the specified amount of time,
``AutoArchiveJobs`` returns and reports the number of archived jobs.
``WaitForJobChange`` returns and reports a timeout. In both cases, the
functions can be called again.

A similar model can be used for the inter-node RPC protocol. In some
sense, the node daemon will implement a light variant of *"node daemon
jobs"*. When the function call is sent, it specifies an initial timeout.
If the function doesn't finish within this timeout, a response is sent
with a unique identifier, the function call ID. The client can then
choose to wait for the function to finish, again with a timeout.
Inter-node RPC calls would no longer block indefinitely and there
would be an implicit ping mechanism.

Request handling
^^^^^^^^^^^^^^^^

To support the protocol changes described above, the way the node daemon
handles requests will have to change. Instead of forking and handling
every connection in a separate process, there should be one child
process per function call and the master process will handle the
communication with clients and the function processes using asynchronous
I/O.

Function processes communicate with the parent process via stdio and
possibly their exit status. Every function process has a unique
identifier, though it shouldn't be the process ID only (PIDs can be
recycled and are prone to race conditions for this use case). The
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid``
is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the
current Unix timestamp with decimal places and ``random`` at least 16
random bits.

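
As an illustration, an identifier in the proposed format could be
assembled as in the following minimal sketch. The helper name
``make_function_call_id``, the six decimal places of the timestamp and
the hexadecimal rendering of the random part are assumptions made for
this example and are not part of the design.

```python
import os
import random
import time


def make_function_call_id(child_pid):
    """Build an ID of the form ppid:cpid:time:random (hypothetical helper)."""
    ppid = os.getpid()                  # PID of the process creating the ID
    now = "%.6f" % time.time()          # Unix timestamp with decimal places
    # 16 random bits from the operating system's random source
    rand = random.SystemRandom().getrandbits(16)
    return "%d:%d:%s:%04x" % (ppid, child_pid, now, rand)
```

Because the timestamp and the random part change between calls, a
recycled child PID alone is unlikely to produce the same identifier
twice.
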

The following operations will be supported:

``StartFunction(fn_name, fn_args, timeout)``
  Starts a function specified by ``fn_name`` with arguments in
  ``fn_args`` and waits up to ``timeout`` seconds for the function
  to finish. Fire-and-forget calls can be made by specifying a timeout
  of 0 seconds (e.g. for powercycling the node). Returns three values:
  the function call ID (if not finished), whether the function finished
  (or the timeout was reached) and the function's return value.
``WaitForFunction(fnc_id, timeout)``
  Waits up to ``timeout`` seconds for the function call to finish. The
  return value is the same as for ``StartFunction``.

In the future, ``StartFunction`` could support an additional parameter
to specify after how long the function process should be aborted.

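
The interplay of the two operations can be sketched as follows. This is
only a toy model under several assumptions: ``NodeDaemonSketch`` and
``call_with_polling`` are hypothetical names, threads stand in for the
function processes, the ``call-N`` identifiers are placeholders for the
ID format proposed above, and no HTTP transport is involved.

```python
import concurrent.futures
import itertools


class NodeDaemonSketch:
    """Toy stand-in for ganeti-noded; threads replace function processes."""

    def __init__(self, functions):
        self._functions = functions          # maps fn_name to a callable
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
        self._calls = {}                     # function call ID -> future
        self._counter = itertools.count(1)

    def StartFunction(self, fn_name, fn_args, timeout):
        fnc_id = "call-%d" % next(self._counter)   # placeholder ID format
        self._calls[fnc_id] = self._pool.submit(self._functions[fn_name],
                                                *fn_args)
        return self.WaitForFunction(fnc_id, timeout)

    def WaitForFunction(self, fnc_id, timeout):
        try:
            result = self._calls[fnc_id].result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # Not finished yet: hand the ID back so the caller can wait again
            return (fnc_id, False, None)
        del self._calls[fnc_id]
        return (None, True, result)


def call_with_polling(daemon, fn_name, fn_args, timeout):
    """Client side: start the function, then wait until it has finished."""
    (fnc_id, finished, result) = daemon.StartFunction(fn_name, fn_args,
                                                      timeout)
    while not finished:
        (fnc_id, finished, result) = daemon.WaitForFunction(fnc_id, timeout)
    return result
```

Every call returns within roughly ``timeout`` seconds, so a caller that
gets no answer at all can conclude that the node or its daemon is
unresponsive.
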

Simplified timing diagram::

  Master daemon        Node daemon                      Function process
   |
  Call function
  (timeout 10s) -----> Parse request and fork for ----> Start function
                       calling actual function, then     |
                       wait up to 10s for function to    |
                       finish                            |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
                        |                               Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
   |

.. TODO: Convert diagram above to graphviz/dot graphic

On process termination (e.g. after having been sent a ``SIGTERM`` or
``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all
function processes and wait for all of them to terminate.

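
A minimal sketch of this shutdown step, assuming the function processes
are tracked as ``subprocess.Popen`` objects (the design does not
prescribe this) and using the hypothetical helper name
``terminate_function_processes``:

```python
import signal
import subprocess


def terminate_function_processes(children):
    """Send SIGTERM to every running child, then wait for all of them.

    Returns the list of exit statuses; on POSIX a child killed by a
    signal reports the negative signal number.
    """
    for child in children:
        if child.poll() is None:             # child is still running
            child.send_signal(signal.SIGTERM)
    return [child.wait() for child in children]
```

In a daemon this helper would typically be called from the main loop
once a ``SIGTERM`` or ``SIGINT`` handler has flagged the shutdown.
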


Inter-cluster instance moves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

With the current design of Ganeti, moving whole instances between
different clusters involves a lot of manual work. There are several ways
to move instances, one of them being to export the instance, manually
copying all data to the new cluster before importing it again. Manual
changes to the instance's configuration, such as the IP address, may be
necessary in the new environment. The goal is to improve and automate
this process in Ganeti 2.2.

Proposed changes
++++++++++++++++

Authorization, Authentication and Security
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Until now, each Ganeti cluster was a self-contained entity and wouldn't
talk to other Ganeti clusters. Nodes within clusters only had to trust
the other nodes in the same cluster and the network used for replication
was trusted, too (hence the ability to use a separate, local network
for replication).

For inter-cluster instance transfers this model must be weakened. Nodes
in one cluster will have to talk to nodes in other clusters, sometimes
in other locations and, very importantly, via untrusted network
connections.

Various options have been considered for securing and authenticating the
data transfer from one machine to another. To reduce the risk of
accidentally overwriting data due to software bugs, authenticating the
arriving data was considered critical. Eventually we decided to use
socat's OpenSSL options (``OPENSSL:``, ``OPENSSL-LISTEN:`` et al), which
provide us with encryption, authentication and authorization when used
with separate keys and certificates.


Combinations of OpenSSH, GnuPG and Netcat were deemed too complex to set
up from within Ganeti. Any solution involving OpenSSH would require a
dedicated user with a home directory and likely automated modifications
to the user's ``$HOME/.ssh/authorized_keys`` file. When using Netcat,
GnuPG or another encryption method would be necessary to transfer the
data over an untrusted network. socat combines both in one program and
is already a dependency.

Each of the two clusters will have to generate an RSA key. The public
parts are exchanged between the clusters by a third party, such as an
administrator or a system interacting with Ganeti via the remote API
("third party" from here on). After receiving each other's public key,
the clusters can start talking to each other.

All encrypted connections must be verified on both sides. Neither side
may accept unverified certificates. The generated certificate should
only be valid for the time necessary to move the instance.


For additional protection of the instance data, the two clusters can
verify the certificates and destination information exchanged via the
third party by checking an HMAC signature using a key shared among the
involved clusters. By default this secret key will be a random string
unique to the cluster, generated by running SHA1 over 20 bytes read from
``/dev/urandom``. The administrator must synchronize the secrets
between clusters before instances can be moved. If the third party does
not know the secret, it can't forge the certificates or redirect the
data. Unless disabled by a new cluster parameter, verifying the HMAC
signatures must be mandatory. The HMAC signature for X509 certificates
will be prepended to the certificate similar to an RFC822 header and
only covers the certificate (from ``-----BEGIN CERTIFICATE-----`` to
``-----END CERTIFICATE-----``). The header name will be
``X-Ganeti-Signature`` and its value will have the format
``$salt/$hash`` (salt and hash separated by slash). The salt may only
contain characters in the range ``[a-zA-Z0-9]``.

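
The signing scheme described above could look roughly like the following
sketch. Several details are assumptions rather than part of the design:
the function names are hypothetical, HMAC-SHA1 is used as the hash
function, and the salt is simply prepended to the certificate text
before computing the HMAC.

```python
import hashlib
import hmac
import os
import re

HEADER = "X-Ganeti-Signature"
CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.S)


def generate_secret():
    """Default secret: SHA1 over 20 bytes from the OS random source."""
    return hashlib.sha1(os.urandom(20)).hexdigest()


def _digest(cert_body, secret, salt):
    # Assumption: the salt is prepended to the signed text
    return hmac.new(secret.encode("ascii"),
                    (salt + cert_body).encode("ascii"),
                    hashlib.sha1).hexdigest()


def sign_certificate(cert_pem, secret, salt):
    """Prepend an X-Ganeti-Signature header of the form salt/hash."""
    if not re.fullmatch(r"[a-zA-Z0-9]+", salt):
        raise ValueError("Salt may only contain characters in [a-zA-Z0-9]")
    match = CERT_RE.search(cert_pem)
    if not match:
        raise ValueError("No certificate found")
    body = match.group(0)       # only BEGIN..END CERTIFICATE is covered
    return "%s: %s/%s\n%s\n" % (HEADER, salt,
                                _digest(body, secret, salt), body)


def verify_certificate(signed_pem, secret):
    """Return True if the header matches the certificate and the secret."""
    (header, _, rest) = signed_pem.partition("\n")
    (name, _, value) = header.partition(": ")
    (salt, _, digest) = value.partition("/")
    match = CERT_RE.search(rest)
    if name != HEADER or not match:
        return False
    expected = _digest(match.group(0), secret, salt)
    return hmac.compare_digest(expected.encode("utf-8"),
                               digest.encode("utf-8"))
```

A certificate signed on one cluster verifies on another only if both
hold the same secret, which is why the secrets have to be synchronized
before a move.
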

On the web, the destination cluster would be equivalent to an HTTPS
server requiring verifiable client certificates. The browser would be
equivalent to the source cluster and must verify the server's
certificate while providing a client certificate to the server.

Copying data
^^^^^^^^^^^^

To simplify the implementation, we decided to operate at a block-device
level only, allowing us to easily support non-DRBD instance moves.

Inter-cluster instance moves will re-use the existing export and import
scripts supplied by instance OS definitions. Unlike simply copying the
raw data, this allows us to use filesystem-specific utilities to dump
only used parts of the disk and to exclude certain disks from the move.
Compression should be used to further reduce the amount of data
transferred.

The export script writes all data to stdout and the import script reads
it from stdin again. To avoid copying data and reduce disk space
consumption, everything is read from the disk and sent over the network
directly, where it'll be written to the new block device directly again.

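
The idea of streaming without an intermediate file can be illustrated
locally with two processes connected by a pipe. This is only a sketch:
the helper name ``stream`` is hypothetical, and in the actual design the
network transport (socat) and optional compression would sit between the
export and the import script.

```python
import subprocess


def stream(export_cmd, import_cmd):
    """Pipe the export command's stdout straight into the import command."""
    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, stdin=exporter.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy of the pipe so the exporter notices if the importer
    # exits early
    exporter.stdout.close()
    (output, _) = importer.communicate()
    return (exporter.wait(), importer.returncode, output)
```

Neither side stores the dump on disk; the data only ever exists in the
pipe between the two processes.
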

Workflow
^^^^^^^^

#. Third party tells source cluster to shut down instance, asks for the
   instance specification and for the public part of an encryption key

   - Instance information can already be retrieved using an existing API
     (``OpQueryInstanceData``).
   - An RSA encryption key and a corresponding self-signed X509
     certificate are generated using the "openssl" command. This key
     will be used to encrypt the data sent to the destination cluster.

     - Private keys never leave the cluster.
     - The public part (the X509 certificate) is signed using HMAC with
       salting and a secret shared between Ganeti clusters.

#. Third party tells destination cluster to create an instance with the
   same specifications as on source cluster and to prepare for an
   instance move with the key received from the source cluster and
   receives the public part of the destination's encryption key

   - The current API to create instances (``OpCreateInstance``) will be
     extended to support an import from a remote cluster.
   - A valid, unexpired X509 certificate signed with the destination
     cluster's secret will be required. By verifying the signature, we
     know the third party didn't modify the certificate.

     - The private keys never leave their cluster, hence the third party
       cannot decrypt or intercept the instance's data by modifying the
       IP address or port sent by the destination cluster.

   - The destination cluster generates another key and certificate,
     signs and sends it to the third party, who will have to pass it to
     the API for exporting an instance (``OpExportInstance``). This
     certificate is used to ensure we're sending the disk data to the
     correct destination cluster.
   - Once a disk can be imported, the API sends the destination
     information (IP address and TCP port) together with an HMAC
     signature to the third party.

#. Third party hands public part of the destination's encryption key
   together with all necessary information to source cluster and tells
   it to start the move

   - The existing API for exporting instances (``OpExportInstance``)
     will be extended to export instances to remote clusters.

#. Source cluster connects to destination cluster for each disk and
   transfers its data using the instance OS definition's export and
   import scripts

   - Before starting, the source cluster must verify the HMAC signature
     of the certificate and destination information (IP address and TCP
     port).
   - When connecting to the remote machine, strong certificate checks
     must be employed.

#. Due to the asynchronous nature of the whole process, the destination
   cluster checks whether all disks have been transferred every time
   after transferring a single disk; if so, it destroys the encryption
   key
#. After sending all disks, the source cluster destroys its key
#. Destination cluster runs OS definition's rename script to adjust
   instance settings if needed (e.g. IP address)
#. Destination cluster starts the instance if requested at the beginning
   by the third party
#. Source cluster removes the instance if requested

Instance move in pseudo code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. highlight:: python

The following pseudo code describes a script moving instances between
clusters and what happens on both clusters.

#. Script is started, gets the instance name and destination cluster::

    (instance_name, dest_cluster_name) = sys.argv[1:]

    # Get destination cluster object
    dest_cluster = db.FindCluster(dest_cluster_name)

    # Use database to find source cluster
    src_cluster = db.FindClusterByInstance(instance_name)

#. Script tells source cluster to stop instance::

    # Stop instance
    src_cluster.StopInstance(instance_name)

    # Get instance specification (memory, disk, etc.)
    inst_spec = src_cluster.GetInstanceInfo(instance_name)

    (src_key_name, src_cert) = src_cluster.CreateX509Certificate()


#. ``CreateX509Certificate`` on source cluster::

    # mkstemp returns an open file descriptor and the file's path
    (fd, key_file) = tempfile.mkstemp()
    os.close(fd)
    cert_file = "%s.cert" % key_file
    RunCmd(["/usr/bin/openssl", "req", "-new",
             "-newkey", "rsa:1024", "-days", "1",
             "-nodes", "-x509", "-batch",
             "-keyout", key_file, "-out", cert_file])

    plain_cert = utils.ReadFile(cert_file)

    # HMAC sign using secret key, this adds a "X-Ganeti-Signature"
    # header to the beginning of the certificate
    signed_cert = utils.SignX509Certificate(plain_cert,
      utils.ReadFile(constants.X509_SIGNKEY_FILE))

    # The certificate now looks like the following:
    #
    #   X-Ganeti-Signature: 1234/28676f0516c6ab68062b[…]
    #   -----BEGIN CERTIFICATE-----
    #   MIICsDCCAhmgAwIBAgI[…]
    #   -----END CERTIFICATE-----

    # Return name of key file and signed certificate in PEM format
    return (os.path.basename(key_file), signed_cert)


#. Script creates instance on destination cluster and waits for move to
   finish::

    dest_cluster.CreateInstance(mode=constants.REMOTE_IMPORT,
                                spec=inst_spec,
                                source_cert=src_cert)

    # Wait until destination cluster gives us its certificate and the
    # destination information for every disk
    dest_cert = None
    disk_info = [None] * len(inst_spec.disks)
    while dest_cert is None or None in disk_info:
      tmp = dest_cluster.WaitOutput()
      if isinstance(tmp, Certificate):
        dest_cert = tmp
      elif isinstance(tmp, DiskInfo):
        # DiskInfo contains destination address and port
        disk_info[tmp.index] = tmp

    # Tell source cluster to export disks
    for disk in disk_info:
      src_cluster.ExportDisk(instance_name, disk=disk,
                             key_name=src_key_name,
                             dest_cert=dest_cert)

    print ("Instance %s successfully moved to %s" %
           (instance_name, dest_cluster.name))


#. ``CreateInstance`` on destination cluster::

    # …

    if mode == constants.REMOTE_IMPORT:
      # Make sure certificate was not modified since it was generated by
      # source cluster (which must use the same secret)
      if (not utils.VerifySignedX509Cert(source_cert,
            utils.ReadFile(constants.X509_SIGNKEY_FILE))):
        raise Error("Certificate not signed with this cluster's secret")

      if utils.CheckExpiredX509Cert(source_cert):
        raise Error("X509 certificate is expired")

      source_cert_file = utils.WriteTempFile(source_cert)

      # See above for X509 certificate generation and signing
      (key_name, signed_cert) = CreateX509Certificate()

      SendToClient("x509-cert", signed_cert)

      for disk in instance.disks:
        # Start socat; the format arguments must be passed as one tuple
        RunCmd(("socat"
                " OPENSSL-LISTEN:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
                " stdout > /dev/disk…") %
               (port, GetRsaKeyPath(key_name, private=True),
                GetRsaKeyPath(key_name, private=False), source_cert_file))
        SendToClient("send-disk-to", disk, ip_address, port)

      DestroyX509Cert(key_name)

      RunRenameScript(instance_name)


#. ``ExportDisk`` on source cluster::

    # Make sure certificate was not modified since it was generated by
    # destination cluster (which must use the same secret)
    if (not utils.VerifySignedX509Cert(cert_pem,
          utils.ReadFile(constants.X509_SIGNKEY_FILE))):
      raise Error("Certificate not signed with this cluster's secret")

    if utils.CheckExpiredX509Cert(cert_pem):
      raise Error("X509 certificate is expired")

    dest_cert_file = utils.WriteTempFile(cert_pem)

    # Start socat
    RunCmd(("socat stdin"
            " OPENSSL:%s:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
            " < /dev/disk…") %
           (disk.host, disk.port,
            GetRsaKeyPath(key_name, private=True),
            GetRsaKeyPath(key_name, private=False), dest_cert_file))

    if instance.all_disks_done:
      DestroyX509Cert(key_name)
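
The signature check performed in both ``CreateInstance`` and
``ExportDisk`` can be pictured as a keyed hash over the certificate
text, computed with the secret shared by both clusters. The following
is only a sketch of that idea: the helper names ``sign_cert`` and
``verify_signed_cert``, the ``signature/certificate`` layout and the
choice of HMAC-SHA256 are assumptions made for illustration, and are
not the actual ``utils.VerifySignedX509Cert`` implementation.

```python
import hashlib
import hmac


def sign_cert(cert_pem, secret):
  """Prefix the certificate text with an HMAC computed using the secret.

  Hypothetical helper; the "signature/certificate" layout is assumed.
  """
  digest = hmac.new(secret, cert_pem.encode("ascii"), hashlib.sha256)
  return "%s/%s" % (digest.hexdigest(), cert_pem)


def verify_signed_cert(signed_cert, secret):
  """Return the certificate text if its HMAC matches, otherwise None."""
  (signature, sep, cert_pem) = signed_cert.partition("/")
  if not sep:
    # No separator found, so this was never produced by sign_cert
    return None
  expected = hmac.new(secret, cert_pem.encode("utf-8"),
                      hashlib.sha256).hexdigest()
  # compare_digest avoids leaking where the first mismatch occurs
  if hmac.compare_digest(signature.encode("utf-8"),
                         expected.encode("utf-8")):
    return cert_pem
  return None
```

A certificate signed on one cluster is only accepted by a cluster that
holds the same secret. Any change to the certificate text after signing
makes the check fail.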

.. highlight:: text

Miscellaneous notes
^^^^^^^^^^^^^^^^^^^

- A very similar system could also be used for instance exports within
  the same cluster. Currently OpenSSH is used, but it could be
  replaced by socat and SSL/TLS.
- During the design of intra-cluster instance moves we also discussed
  encrypting instance exports using GnuPG.
- While most instances should have exactly the same configuration as
  on the source cluster, setting them up with a different disk layout
  might be helpful in some use cases.
- A cleanup operation, similar to the one available for failed instance
  migrations, should be provided.
- ``ganeti-watcher`` should remove instances pending a move from another
  cluster after a certain amount of time. This takes care of failures
  somewhere in the process.
- RSA keys can be generated using the existing
  ``bootstrap.GenerateSelfSignedSslCert`` function, though it might be
  useful to not write both parts into a single file, requiring small
  changes to the function. The public part always starts with
  ``-----BEGIN CERTIFICATE-----`` and ends with ``-----END
  CERTIFICATE-----``.
- The source and destination cluster might be different when it comes
  to available hypervisors, kernels, etc. The destination cluster should
  refuse to accept an instance move if it can't fulfill an instance's
  requirements.
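
Because the public part is delimited by the ``BEGIN CERTIFICATE`` and
``END CERTIFICATE`` markers, a file containing both parts can be split
after the fact. The sketch below illustrates this with a hypothetical
``split_pem`` helper. It is not part of Ganeti, and it assumes the file
contains exactly one certificate.

```python
import re

# Matches one PEM certificate block, including its trailing newline
_CERT_RE = re.compile(
  r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----\n?", re.S)


def split_pem(combined_pem):
  """Split combined PEM text into (certificate, everything else).

  Hypothetical helper; assumes a single certificate in the input.
  """
  match = _CERT_RE.search(combined_pem)
  if match is None:
    raise ValueError("No certificate found in PEM data")
  cert = match.group(0)
  # Whatever surrounds the certificate is treated as the private part
  rest = combined_pem[:match.start()] + combined_pem[match.end():]
  return (cert, rest)
```

The two returned strings can then be written to separate files, with
stricter permissions on the one holding the private key.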


Feature changes
---------------

KVM Security
~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Currently all kvm processes run as root. Taking ownership of the
hypervisor process from inside a virtual machine would mean a full
compromise of the whole Ganeti cluster, knowledge of all Ganeti
authentication secrets, full access to all running instances, and the
option of subverting other basic services on the cluster (e.g. ssh).

Proposed changes
++++++++++++++++

We would like to decrease the attack surface available if a hypervisor
is compromised. We can do so by adding different features to Ganeti,
which will restrict the ability of a compromised hypervisor to subvert
the node, in the absence of a local privilege escalation attack.

Dropping privileges in kvm to a single user (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-runas`` option to kvm, we can make it drop privileges.
The user can be chosen by a hypervisor parameter, so that each instance
can have its own user, but by default they will all run under the same
one. It should be very easy to implement, and can easily be backported
to 2.1.X.
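
As a rough illustration of how little is needed, the sketch below
appends ``-runas`` to an already assembled kvm command line. The
function name ``build_kvm_cmd``, its parameters and the user name in
the usage note are hypothetical. How the hypervisor parameter is
actually named and read is left open here.

```python
def build_kvm_cmd(base_cmd, runas_user=None):
  """Return a copy of the kvm command line, adding -runas if requested.

  Hypothetical helper; runas_user stands in for the value of the
  proposed hypervisor parameter.
  """
  cmd = list(base_cmd)
  if runas_user:
    # kvm drops privileges to this user after its initial setup
    cmd.extend(["-runas", runas_user])
  return cmd
```

For example, ``build_kvm_cmd(["kvm", "-m", "512"], "kvm-user")`` yields
``["kvm", "-m", "512", "-runas", "kvm-user"]``, while leaving the user
unset returns the command line unchanged.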

This mode protects the Ganeti cluster from a subverted hypervisor, but
doesn't protect the instances from each other, unless care is taken
to specify a different user for each. This would prevent the worst
attacks, including:

- logging in to other nodes
- administering the Ganeti cluster
- subverting other services

But the following would remain possible:

- terminate other VMs (but not start them again, as that requires root
  privileges to set up networking) (unless different users are used)
- trace other VMs, and probably subvert them and access their data
  (unless different users are used)
- send network traffic from the node
- read unprotected data on the node filesystem

Running kvm in a chroot (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-chroot`` option to kvm, we can confine the kvm
process to its own (possibly empty) root directory. We need to set this
area up so that the instance disks and control sockets are accessible,
so it would require slightly more work at the Ganeti level.
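
The sketch below shows one possible shape for the Ganeti-side work:
create a private, empty directory per instance and pass it to kvm via
``-chroot``. The helper names and the directory layout are assumptions.
The harder part mentioned above, making disks and control sockets
reachable from inside the chroot, is deliberately not covered.

```python
import os
import tempfile


def prepare_chroot(parent_dir, instance_name):
  """Create an empty per-instance directory to be used as the chroot.

  Hypothetical helper; the naming scheme is an assumption.
  """
  # mkdtemp creates the directory accessible only by the creating user
  path = tempfile.mkdtemp(prefix="%s." % instance_name, dir=parent_dir)
  return os.path.abspath(path)


def add_chroot_option(kvm_cmd, chroot_dir):
  """Return a copy of the kvm command line with -chroot appended."""
  return list(kvm_cmd) + ["-chroot", chroot_dir]
```

The directory has to be removed again when the instance is stopped,
which a real implementation would need to handle as well.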

For an attacker, breaking out of a virtual machine into a chrooted
hypervisor would mean:

- far fewer options to find a local privilege escalation vector
- no ability to write local data, if the chroot is set up correctly
- no ability to read filesystem data on the host

It would still be possible, though, to:

- terminate other VMs
- trace other VMs, and possibly subvert them (if a tracer can be
  installed in the chroot)
- send network traffic from the node


Running kvm with a pool of users (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If, rather than passing a single user as a hypervisor parameter, we
have a pool of usable ones, we can dynamically choose a free one to use
and thus guarantee that each machine will be separate from the others,
without putting the burden of this on the cluster administrator.

This would make interference between machines impossible, and can
still be combined with the chroot benefits.
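
The allocation logic itself is simple, as the following sketch
suggests. The ``UserPool`` class is hypothetical and keeps its state in
memory only. A real implementation would need to record assignments on
the node, so that they survive daemon restarts and concurrent instance
startups.

```python
class UserPool(object):
  """Hand out one user per instance from a fixed pool (illustration)."""

  def __init__(self, users):
    self._free = list(users)
    self._used = {}

  def acquire(self, instance_name):
    """Return the user for an instance, assigning a free one if needed."""
    if instance_name in self._used:
      return self._used[instance_name]
    if not self._free:
      raise RuntimeError("No free user left in pool")
    user = self._free.pop(0)
    self._used[instance_name] = user
    return user

  def release(self, instance_name):
    """Return an instance's user to the pool once the instance stopped."""
    user = self._used.pop(instance_name, None)
    if user is not None:
      self._free.append(user)
```

With this approach the pool size bounds the number of instances a node
can run, so it would have to be chosen with the expected node capacity
in mind.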

Running iptables rules to limit network interaction (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These don't need to be handled by Ganeti, but we can ship examples. If
the users used to run VMs are blocked from sending some or all network
traffic, it becomes impossible for a compromised hypervisor to send
arbitrary data on the node network. This is especially useful when the
instance and the node network are separated (using ganeti-nbma or a
separate set of network interfaces), or when a separate replication
network is maintained. We need to experiment to see how much restriction
we can properly apply without limiting the instance's legitimate
traffic.
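
One candidate for such an example is the iptables ``owner`` match,
which selects locally generated packets by the user owning the sending
process. The sketch below only builds the rule as an argument list and
does not run it. The function name and the default of dropping all
traffic are assumptions, and whether such a blanket rule leaves
legitimate traffic intact is exactly what still needs experimenting.

```python
def build_owner_rule(user, target="DROP"):
  """Build an iptables rule acting on packets sent by the given user.

  Hypothetical helper; the owner match applies to locally generated
  packets, hence the OUTPUT chain.
  """
  return ["iptables", "-A", "OUTPUT",
          "-m", "owner", "--uid-owner", user,
          "-j", target]
```

The resulting list could be passed to a command runner by the
administrator, or printed with ``" ".join(...)`` to produce a line for
a shipped example script.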


Running kvm inside a container (even harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Recent Linux kernels support different process namespaces, which are
commonly used together with control groups. PIDs, users, filesystems
and even network interfaces can be separated. If we can set up Ganeti
to run kvm in a separate container, we could hide all host processes
from the hypervisor, so that they are not even visible if it gets
broken into. Most probably separating the network namespace would
require one extra hop in the host, through a veth interface, thus
reducing performance, so we may want to avoid that, and just rely on
iptables.

Implementation plan
+++++++++++++++++++

We will first implement dropping privileges for kvm processes as a
single user, and most probably backport it to 2.1. Then we'll ship
example iptables rules to show how the user can be limited in its
network activities. After that we'll implement chroot restriction for
kvm processes, and extend the user limitation to use a user pool.

Finally we'll look into namespaces and containers, although that might
slip until after the 2.2 release.

External interface changes
--------------------------

.. vim: set textwidth=72 :