=================
Ganeti 2.2 design
=================

This document describes the major changes in Ganeti 2.2 compared to
the 2.1 version.

The 2.2 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.1, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Background
==========

Overview
========

Detailed design
===============

As for 2.1, we divide the 2.2 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
------------

Remote procedure call timeouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

The current RPC protocol used by Ganeti is based on HTTP. Every request
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``)
and doesn't return until the called function has returned. Parameters
and return values are encoded using JSON.

On the server side, ``ganeti-noded`` handles every incoming connection
in a separate process by forking just after accepting the connection.
This process exits after sending the response.

There is one major problem with this design: timeouts cannot be used on
a per-request basis. Neither client nor server knows how long a request
will take. Even if we were able to group requests into different
categories (e.g. fast and slow), this would not be reliable.

If a node has an issue or the network connection fails while a request
is being handled, the master daemon can wait a long time for the
connection to time out (e.g. due to the operating system's underlying
TCP keep-alive packets or timeouts). While the settings for keep-alive
packets can be changed using Linux-specific socket options, we prefer
application-level timeouts because these cover both the machine-down
and the unresponsive-node-daemon case.
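
The preference for application-level timeouts can be illustrated with a
minimal sketch (plain Python ``socket`` code, not Ganeti's actual RPC
layer): a per-request deadline fires reliably regardless of any TCP
keep-alive tuning, and also catches a daemon that is up but hung::

  import socket

  # A connected pair of sockets stands in for an RPC connection.
  client, server = socket.socketpair()

  # An application-level timeout: this deadline applies to this one
  # request only, unlike kernel-wide TCP keep-alive settings.
  client.settimeout(0.1)

  try:
      # The "node daemon" (server side) never answers, simulating an
      # unresponsive daemon on a machine that is still reachable.
      client.recv(4096)
      result = "response received"
  except socket.timeout:
      result = "request timed out"

  client.close()
  server.close()
  print(result)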

Proposed changes
++++++++++++++++

RPC glossary
^^^^^^^^^^^^

Function call ID
  Unique identifier returned by ``ganeti-noded`` after invoking a
  function.
Function process
  Process started by ``ganeti-noded`` to call the actual (backend)
  function.

Protocol
^^^^^^^^

Initially we chose HTTP as our RPC protocol because there were existing
libraries which, unfortunately, turned out to miss important features
(such as SSL certificate authentication), so we had to write our own.

This proposal can easily be implemented using HTTP, though it would
likely be more efficient and less complicated to use the LUXI protocol
already used to communicate between client tools and the Ganeti master
daemon. Switching to another protocol can occur at a later point; for
now, this proposal should be implemented using HTTP as its underlying
protocol.

The LUXI protocol currently contains two functions, ``WaitForJobChange``
and ``AutoArchiveJobs``, which can take a long time. Both support a
parameter to specify a timeout. This timeout is usually chosen as
roughly half of the socket timeout, guaranteeing a response before the
socket times out. After the specified amount of time,
``AutoArchiveJobs`` returns and reports the number of archived jobs,
while ``WaitForJobChange`` returns and reports a timeout. In both cases
the functions can be called again.

A similar model can be used for the inter-node RPC protocol. In some
sense, the node daemon will implement a light variant of *"node daemon
jobs"*. When a function call is sent, it specifies an initial timeout.
If the function doesn't finish within this timeout, a response is sent
with a unique identifier, the function call ID. The client can then
choose to wait for the function to finish again, with another timeout.
Inter-node RPC calls would no longer block indefinitely, and there
would be an implicit ping mechanism.

Request handling
^^^^^^^^^^^^^^^^

To support the protocol changes described above, the way the node daemon
handles requests will have to change. Instead of forking and handling
every connection in a separate process, there should be one child
process per function call, and the master process will handle the
communication with clients and the function processes using
asynchronous I/O.

Function processes communicate with the parent process via stdio and
possibly their exit status. Every function process has a unique
identifier, though it shouldn't be the process ID alone (PIDs can be
recycled and are prone to race conditions for this use case). The
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid``
is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the
current Unix timestamp with decimal places and ``random`` at least 16
random bits.
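
As an illustration, an identifier in the proposed format could be
generated along these lines (a minimal sketch; ``_MakeFunctionId`` is a
hypothetical helper name, not an existing Ganeti function)::

  import os
  import random
  import time

  def _MakeFunctionId(child_pid):
      """Build a ``${ppid}:${cpid}:${time}:${random}`` identifier.

      Combining the parent PID, the child PID, a sub-second timestamp
      and at least 16 random bits makes collisions caused by PID
      recycling very unlikely.
      """
      return "%s:%s:%.6f:%04x" % (os.getpid(), child_pid,
                                  time.time(), random.randrange(1 << 16))

  # Example: an identifier for a (hypothetical) child process 4242
  print(_MakeFunctionId(4242))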

The following operations will be supported:

``StartFunction(fn_name, fn_args, timeout)``
  Starts the function specified by ``fn_name`` with arguments in
  ``fn_args`` and waits up to ``timeout`` seconds for the function
  to finish. Fire-and-forget calls can be made by specifying a timeout
  of 0 seconds (e.g. for powercycling the node). Returns three values:
  the function call ID (if not yet finished), whether the function
  finished (or timed out) and the function's return value.
``WaitForFunction(fnc_id, timeout)``
  Waits up to ``timeout`` seconds for the function call to finish. The
  return value is the same as for ``StartFunction``.

In the future, ``StartFunction`` could support an additional parameter
to specify after how long the function process should be aborted.
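
From the caller's perspective, the two operations combine into a simple
polling loop. The sketch below models the semantics in-process with
threads purely for illustration; the names mirror the proposed RPC
operations, but the implementation is hypothetical, not the actual
node-daemon code::

  import threading
  import time
  import uuid

  _calls = {}  # function call ID -> (thread, result holder)

  def StartFunction(fn, fn_args, timeout):
      """Start fn and wait up to timeout seconds (model of the RPC)."""
      holder = {}
      thread = threading.Thread(
          target=lambda: holder.update(value=fn(*fn_args)))
      thread.start()
      fn_id = str(uuid.uuid4())  # stand-in for the real ID format
      _calls[fn_id] = (thread, holder)
      return WaitForFunction(fn_id, timeout)

  def WaitForFunction(fn_id, timeout):
      thread, holder = _calls[fn_id]
      thread.join(timeout)
      if thread.is_alive():
          return (fn_id, False, None)  # not finished yet; call again
      return (fn_id, True, holder["value"])

  def slow_add(a, b):
      time.sleep(0.3)
      return a + b

  # Initial call with a short timeout, then poll until the call is
  # done; each round trip doubles as an implicit ping of the daemon.
  fn_id, finished, value = StartFunction(slow_add, (1, 2), 0.1)
  while not finished:
      fn_id, finished, value = WaitForFunction(fn_id, 0.1)
  print(value)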

Simplified timing diagram::

  Master daemon          Node daemon                     Function process
   |
  Call function
  (timeout 10s) -----> Parse request and fork for ----> Start function
                       calling actual function, then     |
                       wait up to 10s for function to    |
                       finish                            |
                                                         |
                         ...                            ...
                                                         |
  Examine return <----                                   |
  value and wait
  again -------------> Wait another 10s for function     |
                                                         |
                         ...                            ...
                                                         |
  Examine return <----                                   |
  value and wait
  again -------------> Wait another 10s for function     |
                                                         |
                         ...                            ...
                                                         |
                                                        Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
   |

.. TODO: Convert diagram above to graphviz/dot graphic

On process termination (e.g. after having been sent a ``SIGTERM`` or
``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all
function processes and wait for all of them to terminate.


Inter-cluster instance moves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

With the current design of Ganeti, moving whole instances between
different clusters involves a lot of manual work. There are several ways
to move instances, one of them being to export the instance, manually
copying all data to the new cluster before importing it again. Manual
changes to the instance's configuration, such as the IP address, may be
necessary in the new environment. The goal is to improve and automate
this process in Ganeti 2.2.

Proposed changes
++++++++++++++++

Authorization, Authentication and Security
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Until now, each Ganeti cluster was a self-contained entity and wouldn't
talk to other Ganeti clusters. Nodes within clusters only had to trust
the other nodes in the same cluster, and the network used for
replication was trusted, too (hence the ability to use a separate,
local network for replication).

For inter-cluster instance transfers this model must be weakened. Nodes
in one cluster will have to talk to nodes in other clusters, sometimes
in other locations and, most importantly, over untrusted network
connections.

Various options have been considered for securing and authenticating the
data transfer from one machine to another. To reduce the risk of
accidentally overwriting data due to software bugs, authenticating the
arriving data was considered critical. Eventually we decided to use
socat's OpenSSL options (``OPENSSL:``, ``OPENSSL-LISTEN:`` et al.),
which provide us with encryption, authentication and authorization when
used with separate keys and certificates.

Combinations of OpenSSH, GnuPG and Netcat were deemed too complex to set
up from within Ganeti. Any solution involving OpenSSH would require a
dedicated user with a home directory and likely automated modifications
to the user's ``$HOME/.ssh/authorized_keys`` file. When using Netcat,
GnuPG or another encryption method would be necessary to transfer the
data over an untrusted network. socat combines both in one program and
is already a dependency.

Each of the two clusters will have to generate an RSA key. The public
parts are exchanged between the clusters by a third party, such as an
administrator or a system interacting with Ganeti via the remote API
("third party" from here on). After receiving each other's public key,
the clusters can start talking to each other.

All encrypted connections must be verified on both sides. Neither side
may accept unverified certificates. The generated certificate should
only be valid for the time necessary to move the instance.

For additional protection of the instance data, the two clusters can
verify the certificates and destination information exchanged via the
third party by checking an HMAC signature using a key shared among the
involved clusters. By default this secret key will be a random string
unique to the cluster, generated by running SHA1 over 20 bytes read from
``/dev/urandom``, and the administrator must synchronize the secrets
between clusters before instances can be moved. If the third party does
not know the secret, it can't forge the certificates or redirect the
data. Unless disabled by a new cluster parameter, verifying the HMAC
signatures must be mandatory. The HMAC signature for X509 certificates
will be prepended to the certificate, similar to an RFC822 header, and
only covers the certificate (from ``-----BEGIN CERTIFICATE-----`` to
``-----END CERTIFICATE-----``). The header name will be
``X-Ganeti-Signature`` and its value will have the format
``$salt/$hash`` (salt and hash separated by a slash). The salt may only
contain characters in the range ``[a-zA-Z0-9]``.
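
The header scheme could be sketched as follows. This is illustrative
only: the function names are hypothetical (the real implementation
lives in Ganeti's ``utils`` module), and exactly how the salt is mixed
into the HMAC key is an assumption here::

  import hashlib
  import hmac
  import random
  import string

  _HEADER = "X-Ganeti-Signature"

  def SignCertificate(cert_pem, secret):
      """Prepend an RFC822-style HMAC header covering the certificate."""
      salt = "".join(random.choice(string.ascii_letters + string.digits)
                     for _ in range(8))
      digest = hmac.new((salt + secret).encode(), cert_pem.encode(),
                        hashlib.sha1).hexdigest()
      return "%s: %s/%s\n%s" % (_HEADER, salt, digest, cert_pem)

  def VerifyCertificate(signed_pem, secret):
      """Check the header; return the bare certificate or raise."""
      header, cert_pem = signed_pem.split("\n", 1)
      name, value = header.split(": ", 1)
      salt, digest = value.split("/", 1)
      if name != _HEADER or not salt.isalnum():
          raise ValueError("Malformed signature header")
      expected = hmac.new((salt + secret).encode(), cert_pem.encode(),
                          hashlib.sha1).hexdigest()
      if not hmac.compare_digest(digest, expected):
          raise ValueError("Certificate not signed with this secret")
      return cert_pem

A third party without the shared secret can relay the signed
certificate but cannot re-sign a forged one, which is exactly the
property the workflow below relies on.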

On the web, the destination cluster would be equivalent to an HTTPS
server requiring verifiable client certificates. The browser would be
equivalent to the source cluster and must verify the server's
certificate while providing a client certificate to the server.

Copying data
^^^^^^^^^^^^

To simplify the implementation, we decided to operate at a block-device
level only, allowing us to easily support non-DRBD instance moves.

Inter-cluster instance moves will re-use the existing export and import
scripts supplied by instance OS definitions. Unlike simply copying the
raw data, this allows using filesystem-specific utilities to dump only
the used parts of the disk and to exclude certain disks from the move.
Compression should be used to further reduce the amount of data
transferred.

The export script writes all data to stdout and the import script reads
it from stdin again. To avoid copying data and reduce disk space
consumption, everything is read from the disk and sent over the network
directly, where it's written to the new block device directly again.
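
The streaming model (export script → compression → network →
import script) can be illustrated with an ordinary process pipeline.
Here ``gzip`` stands in for the compression step and an in-memory
buffer stands in for the socat connection; the real pipeline would
stream between processes without buffering everything::

  import subprocess

  # Stand-in for an OS export script's output (written to stdout).
  disk_data = b"raw disk contents\x00" * 1000

  # Compress on the sending side; the compressed stream is what would
  # actually cross the network through socat.
  compress = subprocess.run(["gzip", "-c"], input=disk_data,
                            stdout=subprocess.PIPE, check=True)
  compressed = compress.stdout

  # On the receiving side, decompress and hand the data to the import
  # script, which writes it to the new block device.
  decompress = subprocess.run(["gzip", "-dc"], input=compressed,
                              stdout=subprocess.PIPE, check=True)

  assert decompress.stdout == disk_data
  assert len(compressed) < len(disk_data)  # compression saves bandwidth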

Workflow
^^^^^^^^

#. Third party tells source cluster to shut down instance, asks for the
   instance specification and for the public part of an encryption key

   - Instance information can already be retrieved using an existing API
     (``OpQueryInstanceData``).
   - An RSA encryption key and a corresponding self-signed X509
     certificate are generated using the "openssl" command. This key
     will be used to encrypt the data sent to the destination cluster.

     - Private keys never leave the cluster.
     - The public part (the X509 certificate) is signed using HMAC with
       salting and a secret shared between Ganeti clusters.

#. Third party tells destination cluster to create an instance with the
   same specifications as on the source cluster and to prepare for an
   instance move with the key received from the source cluster, and
   receives the public part of the destination's encryption key

   - The current API to create instances (``OpCreateInstance``) will be
     extended to support an import from a remote cluster.
   - A valid, unexpired X509 certificate signed with the destination
     cluster's secret will be required. By verifying the signature, we
     know the third party didn't modify the certificate.

     - The private keys never leave their cluster, hence the third party
       can not decrypt or intercept the instance's data by modifying the
       IP address or port sent by the destination cluster.

   - The destination cluster generates another key and certificate,
     signs and sends it to the third party, who will have to pass it to
     the API for exporting an instance (``OpExportInstance``). This
     certificate is used to ensure we're sending the disk data to the
     correct destination cluster.
   - Once a disk can be imported, the API sends the destination
     information (IP address and TCP port) together with an HMAC
     signature to the third party.

#. Third party hands the public part of the destination's encryption key
   together with all necessary information to the source cluster and
   tells it to start the move

   - The existing API for exporting instances (``OpExportInstance``)
     will be extended to export instances to remote clusters.

#. Source cluster connects to destination cluster for each disk and
   transfers its data using the instance OS definition's export and
   import scripts

   - Before starting, the source cluster must verify the HMAC signature
     of the certificate and destination information (IP address and TCP
     port).
   - When connecting to the remote machine, strong certificate checks
     must be employed.

#. Due to the asynchronous nature of the whole process, the destination
   cluster checks whether all disks have been transferred every time
   after transferring a single disk; if so, it destroys the encryption
   key
#. After sending all disks, the source cluster destroys its key
#. Destination cluster runs the OS definition's rename script to adjust
   instance settings if needed (e.g. IP address)
#. Destination cluster starts the instance if requested at the beginning
   by the third party
#. Source cluster removes the instance if requested

Instance move in pseudo code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. highlight:: python

The following pseudo code describes a script moving instances between
clusters and what happens on both clusters.

#. Script is started, gets the instance name and destination cluster::

     (instance_name, dest_cluster_name) = sys.argv[1:]

     # Get destination cluster object
     dest_cluster = db.FindCluster(dest_cluster_name)

     # Use database to find source cluster
     src_cluster = db.FindClusterByInstance(instance_name)

#. Script tells source cluster to stop instance::

     # Stop instance
     src_cluster.StopInstance(instance_name)

     # Get instance specification (memory, disk, etc.)
     inst_spec = src_cluster.GetInstanceInfo(instance_name)

     (src_key_name, src_cert) = src_cluster.CreateX509Certificate()

#. ``CreateX509Certificate`` on source cluster::

     key_file = mkstemp()
     cert_file = "%s.cert" % key_file
     RunCmd(["/usr/bin/openssl", "req", "-new",
             "-newkey", "rsa:1024", "-days", "1",
             "-nodes", "-x509", "-batch",
             "-keyout", key_file, "-out", cert_file])

     plain_cert = utils.ReadFile(cert_file)

     # HMAC sign using secret key; this adds an "X-Ganeti-Signature"
     # header to the beginning of the certificate
     signed_cert = utils.SignX509Certificate(plain_cert,
       utils.ReadFile(constants.X509_SIGNKEY_FILE))

     # The certificate now looks like the following:
     #
     #   X-Ganeti-Signature: 1234/28676f0516c6ab68062b[…]
     #   -----BEGIN CERTIFICATE-----
     #   MIICsDCCAhmgAwIBAgI[…]
     #   -----END CERTIFICATE-----

     # Return name of key file and signed certificate in PEM format
     return (os.path.basename(key_file), signed_cert)

#. Script creates instance on destination cluster and waits for the
   move to finish::

     dest_cluster.CreateInstance(mode=constants.REMOTE_IMPORT,
                                 spec=inst_spec,
                                 source_cert=src_cert)

     # Wait until destination cluster gives us its certificate and the
     # destination information for every disk
     dest_cert = None
     disk_info = {}
     while not (dest_cert and len(disk_info) >= len(inst_spec.disks)):
       tmp = dest_cluster.WaitOutput()
       if tmp is Certificate:
         dest_cert = tmp
       elif tmp is DiskInfo:
         # DiskInfo contains destination address and port
         disk_info[tmp.index] = tmp

     # Tell source cluster to export disks
     for disk in disk_info.values():
       src_cluster.ExportDisk(instance_name, disk=disk,
                              key_name=src_key_name,
                              dest_cert=dest_cert)

     print ("Instance %s successfully moved to %s" %
            (instance_name, dest_cluster.name))

#. ``CreateInstance`` on destination cluster::

     # …

     if mode == constants.REMOTE_IMPORT:
       # Make sure certificate was not modified since it was generated
       # by the source cluster (which must use the same secret)
       if (not utils.VerifySignedX509Cert(source_cert,
             utils.ReadFile(constants.X509_SIGNKEY_FILE))):
         raise Error("Certificate not signed with this cluster's secret")

       if utils.CheckExpiredX509Cert(source_cert):
         raise Error("X509 certificate is expired")

       source_cert_file = utils.WriteTempFile(source_cert)

       # See above for X509 certificate generation and signing
       (key_name, signed_cert) = CreateSignedX509Certificate()

       SendToClient("x509-cert", signed_cert)

       for disk in instance.disks:
         # Start socat
         RunCmd(("socat"
                 " OPENSSL-LISTEN:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
                 " stdout > /dev/disk…") %
                (port, GetRsaKeyPath(key_name, private=True),
                 GetRsaKeyPath(key_name, private=False),
                 source_cert_file))
         SendToClient("send-disk-to", disk, ip_address, port)

       DestroyX509Cert(key_name)

       RunRenameScript(instance_name)

#. ``ExportDisk`` on source cluster::

     # Make sure certificate was not modified since it was generated
     # by the destination cluster (which must use the same secret)
     if (not utils.VerifySignedX509Cert(cert_pem,
           utils.ReadFile(constants.X509_SIGNKEY_FILE))):
       raise Error("Certificate not signed with this cluster's secret")

     if utils.CheckExpiredX509Cert(cert_pem):
       raise Error("X509 certificate is expired")

     dest_cert_file = utils.WriteTempFile(cert_pem)

     # Start socat
     RunCmd(("socat stdin"
             " OPENSSL:%s:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
             " < /dev/disk…") %
            (disk.host, disk.port,
             GetRsaKeyPath(key_name, private=True),
             GetRsaKeyPath(key_name, private=False), dest_cert_file))

     if instance.all_disks_done:
       DestroyX509Cert(key_name)

.. highlight:: text

Miscellaneous notes
^^^^^^^^^^^^^^^^^^^

- A very similar system could also be used for instance exports within
  the same cluster. Currently OpenSSH is being used, but it could be
  replaced by socat and SSL/TLS.
- During the design of inter-cluster instance moves we also discussed
  encrypting instance exports using GnuPG.
- While most instances should have exactly the same configuration as
  on the source cluster, setting them up with a different disk layout
  might be helpful in some use cases.
- A cleanup operation, similar to the one available for failed instance
  migrations, should be provided.
- ``ganeti-watcher`` should remove instances pending a move from another
  cluster after a certain amount of time. This takes care of failures
  somewhere in the process.
- RSA keys can be generated using the existing
  ``bootstrap.GenerateSelfSignedSslCert`` function, though it might be
  useful not to write both parts into a single file, requiring small
  changes to the function. The public part always starts with
  ``-----BEGIN CERTIFICATE-----`` and ends with ``-----END
  CERTIFICATE-----``.
- The source and destination cluster might be different when it comes
  to available hypervisors, kernels, etc. The destination cluster should
  refuse to accept an instance move if it can't fulfill an instance's
  requirements.
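
Until ``bootstrap.GenerateSelfSignedSslCert`` is changed, the public
part could be extracted from a combined key+certificate file with a
small helper like this (an illustrative sketch; the helper name is
hypothetical)::

  _BEGIN = "-----BEGIN CERTIFICATE-----"
  _END = "-----END CERTIFICATE-----"

  def ExtractX509Certificate(pem_data):
      """Return only the certificate part of a combined key+cert file."""
      start = pem_data.index(_BEGIN)
      end = pem_data.index(_END, start) + len(_END)
      return pem_data[start:end] + "\n"

  combined = ("-----BEGIN RSA PRIVATE KEY-----\n[…]\n"
              "-----END RSA PRIVATE KEY-----\n"
              "-----BEGIN CERTIFICATE-----\n[…]\n"
              "-----END CERTIFICATE-----\n")
  cert = ExtractX509Certificate(combined)
  assert cert.startswith(_BEGIN)
  assert "PRIVATE KEY" not in cert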


Feature changes
---------------

KVM Security
~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Currently all kvm processes run as root. Taking ownership of the
hypervisor process, from inside a virtual machine, would mean a full
compromise of the whole Ganeti cluster, knowledge of all Ganeti
authentication secrets, full access to all running instances, and the
option of subverting other basic services on the cluster (e.g. ssh).

Proposed changes
++++++++++++++++

We would like to decrease the attack surface available if a hypervisor
is compromised. We can do so by adding features to Ganeti which
restrict a broken hypervisor's ability to subvert the node, in the
absence of a local privilege escalation attack.

Dropping privileges in kvm to a single user (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-runas`` option to kvm, we can make it drop privileges.
The user can be chosen via a hypervisor parameter, so that each instance
can have its own user, but by default they will all run under the same
one. This should be very easy to implement, and can easily be backported
to 2.1.X.
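
For illustration, the change amounts to appending two arguments to the
kvm command line. The sketch below uses hypothetical function and
parameter names, not Ganeti's actual hypervisor code::

  def BuildKvmArgs(instance_name, security_user=None):
      """Build a (heavily abbreviated) kvm argument list.

      If security_user is set, kvm drops its root privileges to that
      user after start-up via the -runas option.
      """
      args = ["kvm", "-name", instance_name]
      if security_user:
          args.extend(["-runas", security_user])
      return args

  print(BuildKvmArgs("instance1.example.com", security_user="ganeti-kvm"))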

This mode protects the Ganeti cluster from a subverted hypervisor, but
doesn't protect the instances from each other, unless care is taken
to specify a different user for each. This would prevent the worst
attacks, including:

- logging in to other nodes
- administering the Ganeti cluster
- subverting other services

But the following would remain an option:

- terminate other VMs (but not start them again, as that requires root
  privileges to set up networking) (unless different users are used)
- trace other VMs, and probably subvert them and access their data
  (unless different users are used)
- send network traffic from the node
- read unprotected data on the node filesystem

Running kvm in a chroot (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-chroot`` option to kvm, we can restrict the kvm
process to its own (possibly empty) root directory. We need to set this
area up so that the instance disks and control sockets are accessible,
so it would require slightly more work at the Ganeti level.

Breaking out into a chroot would mean:

- a lot fewer options to find a local privilege escalation vector
- the impossibility to write local data, if the chroot is set up
  correctly
- the impossibility to read filesystem data on the host

It would still be possible though to:

- terminate other VMs
- trace other VMs, and possibly subvert them (if a tracer can be
  installed in the chroot)
- send network traffic from the node

Running kvm with a pool of users (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If, rather than passing a single user as a hypervisor parameter, we have
a pool of usable ones, we can dynamically choose a free one to use and
thus guarantee that each machine will be separate from the others,
without putting the burden of this on the cluster administrator.

This would mean interference between machines would be impossible, and
can still be combined with the chroot benefits.

Running iptables rules to limit network interaction (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These don't need to be handled by Ganeti, but we can ship examples. If
the users used to run VMs were blocked from sending some or all network
traffic, it would become impossible for a broken-into hypervisor to
send arbitrary data on the node network. This is especially useful when
the instance and the node network are separated (using ganeti-nbma or a
separate set of network interfaces), or when a separate replication
network is maintained. We need to experiment to see how much
restriction we can properly apply without limiting the instances'
legitimate traffic.
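
As a flavour of what such shipped examples might look like, the sketch
below only builds the rule strings (using iptables' ``owner`` match);
the user name and allowed ports are placeholders, and actually applying
the rules would require running ``iptables`` as root::

  def ExampleIptablesRules(vm_user, allowed_tcp_ports=(3260,)):
      """Build example iptables rules restricting traffic from vm_user.

      Traffic to the listed TCP ports (e.g. for disk replication) stays
      allowed; everything else originating from the VM user is dropped.
      """
      rules = []
      for port in allowed_tcp_ports:
          rules.append("-A OUTPUT -m owner --uid-owner %s"
                       " -p tcp --dport %s -j ACCEPT" % (vm_user, port))
      rules.append("-A OUTPUT -m owner --uid-owner %s -j DROP" % vm_user)
      return rules

  for rule in ExampleIptablesRules("ganeti-kvm"):
      print("iptables %s" % rule)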


Running kvm inside a container (even harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Recent Linux kernels support different process namespaces through
control groups. PIDs, users, filesystems and even network interfaces can
be separated. If we can set up Ganeti to run kvm in a separate container
we could insulate all the host processes from being even visible if the
hypervisor gets broken into. Most probably, separating the network
namespace would require one extra hop in the host, through a veth
interface, thus reducing performance, so we may want to avoid that and
just rely on iptables.

Implementation plan
+++++++++++++++++++

We will first implement dropping privileges for kvm processes as a
single user, and most probably backport it to 2.1. Then we'll ship
example iptables rules to show how the user can be limited in its
network activities. After that we'll implement chroot restriction for
kvm processes, and extend the user limitation to use a user pool.

Finally we'll look into namespaces and containers, although that might
slip to after the 2.2 release.

External interface changes
--------------------------

.. vim: set textwidth=72 :