=============================
Improvements of Node Security
=============================

This document describes an enhancement of Ganeti's security by restricting
the distribution of security-sensitive data to the master and master
candidates only.

Note: In this document, we will use the term 'normal node' for a node that
is neither master nor master-candidate.

.. contents:: :depth: 4

Objective
=========

Up till 2.10, Ganeti distributes security-relevant keys to all nodes,
including nodes that are neither master nor master-candidates. Those
keys are the private and public SSH keys for node communication and the
SSL certificate and private key for RPC communication. The objective of
this design is to limit the set of nodes that can establish ssh and RPC
connections to the master and master candidates.

As pointed out in
`issue 377 <https://code.google.com/p/ganeti/issues/detail?id=377>`_, this
is a security risk. Since all nodes have these keys, compromising
any of those nodes would possibly give an attacker access to all other
machines in the cluster. Reducing the set of nodes that are able to
make ssh and RPC connections to the master and master candidates would
significantly reduce the risk simply because fewer machines would be a
valuable target for attackers.

Note: For bigger installations of Ganeti, it is advisable to run master
candidate nodes as non-vm-capable nodes. This would reduce the attack
surface for hypervisor exploitation.


Detailed design
===============


Current state and shortcomings
------------------------------

Currently (as of 2.10), all nodes hold the following information:

- the ssh host keys (public and private)
- the ssh root keys (public and private)
- node daemon certificate (the SSL client certificate and its
  corresponding private key)

Concerning ssh, this setup contains the following security issue. Since
all nodes of a cluster can ssh as root into any other cluster node, one
compromised node can harm all other nodes of a cluster.

Regarding the SSL encryption of the RPC communication with the node
daemon, we currently have the following setup. There is only one
certificate which is used as both client and server certificate. Besides
the SSL client verification, we check if the used client certificate is
the same as the certificate stored on the server.

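To illustrate, the status-quo check amounts to roughly the following
sketch (using pyOpenSSL; the helper name and calling convention are
illustrative, not Ganeti's actual code):

::

  import OpenSSL

  def peer_cert_is_ours(conn, own_cert_pem):
    # 'conn' is an OpenSSL.SSL.Connection after a completed handshake,
    # i.e. the peer has already passed the SSL client verification
    own_cert = OpenSSL.crypto.load_certificate(
      OpenSSL.crypto.FILETYPE_PEM, own_cert_pem)
    peer_cert = conn.get_peer_certificate()
    # accept only if the client presented the very same certificate
    return peer_cert.digest("sha1") == own_cert.digest("sha1")
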
This means that any node running a node daemon can also act as an RPC
client and use it to issue RPC calls to other cluster nodes. This in
turn means that any compromised node could be used to make RPC calls to
any node (including itself) to gain full control over VMs. This could
be used by an attacker to, for example, bring down the VMs or exploit
bugs in the virtualization stacks to gain access to the host machines
as well.


Proposal concerning SSH key distribution
----------------------------------------

We propose two improvements regarding the ssh keys:

#. Limit the distribution of the private ssh key to the master candidates.

#. Use different ssh key pairs for each master candidate.

We propose to limit the set of nodes holding the private root user SSH key
to the master and the master candidates. This way, the security risk would
be limited to a rather small set of nodes even though the cluster could
consist of many more nodes. The set of master candidates could be protected
better than the normal nodes (for example by residing in a DMZ) to enhance
security even more if the administrator so wishes. The following
sections describe in detail which Ganeti commands are affected by this
change and in what way.

Security would be increased even further if each master candidate got
its own ssh private/public key pair. This way, one can remove a
compromised master candidate from a cluster (including removing its
public key from all nodes' ``authorized_keys`` file) without having to
regenerate and distribute new ssh keys for all master candidates. (Even
though it is good practice to do that anyway, since other master
candidates might already have been compromised as well.) However, this
improvement was not part of the original feature request and increases
the complexity of node management even more. We therefore consider it a
second step in this design and will address it after the other parts of
this design are implemented.

The following sections describe in detail which Ganeti commands are affected
by the first part of the ssh-related improvements, limiting the key
distribution to master candidates only.


(Re-)Adding nodes to a cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

According to ``design-node-add.rst``, Ganeti transfers the ssh keys to every
node that gets added to the cluster.

We propose to change this procedure to treat master candidates and normal
nodes differently. For master candidates, the procedure would stay as it is.
For normal nodes, Ganeti would transfer the public and private ssh host
keys (as before), but only the public root key.

A normal node would not be able to connect via ssh to other nodes, but
the master (and potentially the master candidates) can connect to this node.

When readding a node that used to be in the cluster before, the ssh keys
would basically be handled in the same way, with the following additional
modification: if the node used to be a master or master-candidate node,
but will be a normal node after readding, Ganeti should make sure that
the private root key is deleted if it is still present on the node.

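As a sketch, the proposed distinction could look like this (all helper
and key names are illustrative, not Ganeti's actual code):

::

  def ssh_keys_to_transfer(is_master_candidate):
    # every node receives the ssh host keys and the public root key
    keys = ["host key (private)", "host key (public)", "root key (public)"]
    if is_master_candidate:
      # only master candidates also receive the private root key
      keys.append("root key (private)")
    return keys

  def cleanup_on_readd(node, will_be_master_candidate):
    if not will_be_master_candidate:
      # the node may still hold the private root key from its former role
      remove_private_root_key(node)  # hypothetical cleanup helper
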

Pro- and demoting a node to/from master candidate
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the role of a node is changed from 'normal' to 'master_candidate', the
master node should at that point copy the private root ssh key. When demoting
a node from master candidate to a normal node, the key that has been copied
there on promotion or addition should be removed again.

This affects the behavior of the following commands:

::

  gnt-node modify --master-candidate=yes
  gnt-node modify --master-candidate=no [--auto-promote]

If the node was already a master candidate before the command to promote
it was issued, Ganeti does not do anything.

Note that when you demote a node from master candidate to normal node, another
master-capable and normal node will be promoted to master candidate. For this
newly promoted node, the same changes apply as if it was explicitly promoted.

The same behavior should be ensured for the corresponding rapi command.

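A minimal sketch of the intended key handling, assuming hypothetical
helpers ``copy_private_root_key`` and ``remove_private_root_key``:

::

  def change_master_candidate_role(node, make_candidate):
    if make_candidate and not node.master_candidate:
      copy_private_root_key(node)      # promotion: hand out the key
    elif not make_candidate and node.master_candidate:
      remove_private_root_key(node)    # demotion: withdraw the key
    # if the node already has the requested role, nothing is done
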

Offlining and onlining a node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When offlining a node, it immediately loses its role as master or master
candidate as well. When it is onlined again, it will become a master
candidate again if it was one before. The handling of the keys should be done
in the same way as when the node is explicitly promoted or demoted to or from
master candidate. See the previous section for details.

This affects the commands:

::

  gnt-node modify --offline=yes
  gnt-node modify --offline=no [--auto-promote]

For offlining, the removal of the keys is particularly important, as the
detection of a compromised node might be the very reason for the offlining.
Of course we cannot guarantee that removal of the key is always successful,
because the node might not be reachable anymore. Even though it is a
best-effort operation, it is still an improvement over the status quo,
because currently Ganeti does not even try to remove any keys.

The same behavior should be ensured for the corresponding rapi command.

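The best-effort nature of the removal could be expressed like this
(a sketch; the helper name is hypothetical):

::

  import logging

  def remove_key_best_effort(node):
    try:
      remove_private_root_key(node)
    except Exception as err:
      # the node might be unreachable, possibly because it is compromised;
      # log the failure, but do not abort the offlining itself
      logging.warning("Could not remove private root key from %s: %s",
                      node.name, err)
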

Cluster verify
~~~~~~~~~~~~~~

To make sure the private root ssh key was not distributed to a normal
node, 'gnt-cluster verify' will be extended by a check for the key
on normal nodes. Additionally, it will check if the private key is
indeed present on master candidates.

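A sketch of this check (``private_root_key_present`` is a hypothetical
per-node probe):

::

  def verify_private_root_key_placement(nodes):
    errors = []
    for node in nodes:
      has_key = private_root_key_present(node)
      if node.master_candidate and not has_key:
        errors.append("%s: private root key missing" % node.name)
      elif not node.master_candidate and has_key:
        errors.append("%s: unexpected private root key" % node.name)
    return errors
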


Proposal regarding node daemon certificates
-------------------------------------------

Regarding the node daemon certificates, we propose the following changes
in the design.

- Instead of using the same certificate for all nodes as both server
  and client certificate, we generate a common server certificate (and
  the corresponding private key) for all nodes and a different client
  certificate (and the corresponding private key) for each node.
- In addition, we store a mapping of
  (node UUID, client certificate digest) in the cluster's configuration
  and ssconf for hosts that are master or master candidate.
  The client certificate digest is a hash of the client certificate.
  We suggest a 'sha1' hash here. We will call this mapping 'candidate map'
  from here on.
- The node daemon will be modified in a way that on an incoming RPC
  request, it first performs a client verification (same as before) to
  ensure that the requesting host is indeed the holder of the
  corresponding private key. Additionally, it compares the digest of
  the certificate of the incoming request to the respective entry of
  the candidate map. If the digest does not match the entry of the host
  in the mapping or is not included in the mapping at all, the SSL
  connection is refused. (A sketch of this check is given below.)

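The following sketch shows the proposed verification, assuming a
``candidate_map`` dictionary from node UUIDs to the expected digests
(all names are illustrative, not Ganeti's actual code):

::

  def accept_rpc_connection(conn, candidate_map):
    # 'conn' has already passed the usual SSL client verification, which
    # proves that the peer holds the private key for its certificate
    digest = conn.get_peer_certificate().digest("sha1")
    # refuse the connection unless the digest belongs to a current
    # master candidate
    return digest in candidate_map.values()
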
This design has the following advantages:

- A compromised normal node cannot issue RPC calls, because it will
  not be in the candidate map. (See the ``Drawbacks`` section regarding
  an indirect way of achieving this though.)
- A compromised master candidate would be able to issue RPC requests,
  but on detection of its compromised state, it can be removed from the
  cluster (and thus from the candidate map) without the need for
  redistribution of any certificates, because the other master candidates
  can continue using their own certificates. However, it is best
  practice to issue a complete key renewal even in this case, unless one
  can ensure that no actions compromising other nodes have already been
  carried out.
- A compromised node would not be able to use the other (possibly master
  candidate) nodes' information from the candidate map to issue RPCs,
  because the config just stores the digests and not the certificate
  itself.
- A compromised node would be able to obtain another node's certificate
  by waiting for incoming RPCs from this other node. However, the node
  cannot use the certificate to issue RPC calls, because the SSL client
  verification would require the node to hold the corresponding private
  key as well.

Drawbacks of this design:

- Complexity of node and certificate management will be increased (see
  following sections for details).
- If the candidate map is not distributed fast enough to all nodes after
  an update of the configuration, it might be possible to issue RPC calls
  from a compromised master candidate node that has been removed
  from the Ganeti cluster already. However, this is still a better
  situation than before and an inherent problem when one wants to
  distinguish between master candidates and normal nodes.
- A compromised master candidate would still be able to issue RPC calls,
  if it uses ssh to retrieve another master candidate's client
  certificate and the corresponding private SSL key. This is an issue
  even with the first part of the improved handling of ssh keys in this
  design (limiting ssh keys to master candidates), but it will be
  eliminated with the second part of the design (separate ssh keys for
  each master candidate).

Alternative proposals:

- Instead of generating a client certificate per node, one could think
  of just generating two different client certificates, one for normal
  nodes and one for master candidates. Noded could then just check if
  the requesting node has the master candidate certificate. The drawback
  of this proposal is that once one master candidate gets compromised,
  all master candidates would need to get a new certificate even if the
  compromised master candidate had not yet fetched the certificates
  from the other master candidates via ssh.
- In addition to our main proposal, one could think of including a
  piece of data (for example the node's host name or UUID) in the RPC
  call which is encrypted with the requesting node's private key. The
  node daemon could check if the datum can be decrypted using the node's
  certificate. However, this would provide functionality similar to
  SSL's built-in client verification while adding significant complexity
  to Ganeti's RPC protocol.

In the following sections, we describe how our design affects various
Ganeti operations.


Cluster initialization
~~~~~~~~~~~~~~~~~~~~~~

On cluster initialization, so far only the node daemon certificate was
created. With our design, two certificates (and corresponding keys)
need to be created: a server certificate to be distributed to all nodes,
and a client certificate only to be used by this particular node. In the
following, we use the term node daemon certificate for the server
certificate only.

In the cluster configuration, the candidate map is created. It is
populated with the respective entry for the master node. It is also
written to ssconf.

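A sketch of the initial map creation, assuming pyOpenSSL (the function
name is illustrative):

::

  import OpenSSL

  def initial_candidate_map(master_uuid, master_client_cert_pem):
    cert = OpenSSL.crypto.load_certificate(
      OpenSSL.crypto.FILETYPE_PEM, master_client_cert_pem)
    # the candidate map starts out with the master node's entry only
    return {master_uuid: cert.digest("sha1")}
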

(Re-)Adding nodes
~~~~~~~~~~~~~~~~~

When a node is added, the server certificate is copied to the node (as
before). Additionally, a new client certificate (and the corresponding
private key) is created on the new node to be used only by the new node
as client certificate.

If the new node is a master candidate, the candidate map is extended by
the new node's data. As before, the updated configuration is distributed
to all nodes (as complete configuration on the master candidates and
ssconf on all nodes). Note that distribution of the configuration after
adding a node is already implemented, since all nodes hold the list of
nodes in the cluster in ssconf anyway.

If the configuration for whatever reason already holds an entry for this
node, it will be overridden.

When readding a node, the procedure is the same as for adding a node.


Promotion and demotion of master candidates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a normal node gets promoted to be a master candidate, an entry has to
be added to the candidate map and the updated configuration has to be
distributed to all nodes. If there was already an entry for the node,
we override it.

On demotion of a master candidate, the node's entry in the candidate map
gets removed and the updated configuration gets redistributed.

The same procedure applies to onlining and offlining master candidates.

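A sketch of the candidate map update on promotion and demotion
(``distribute_config`` stands in for the actual redistribution step):

::

  def update_candidate_map(candidate_map, node_uuid, cert_digest, promote):
    if promote:
      # a pre-existing (stale) entry is simply overridden
      candidate_map[node_uuid] = cert_digest
    else:
      candidate_map.pop(node_uuid, None)
    distribute_config()  # hypothetical: push config and ssconf to all nodes
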

Cluster verify
~~~~~~~~~~~~~~

Cluster verify will be extended by the following checks:

- Whether each entry in the candidate map indeed corresponds to a master
  candidate.
- Whether the master candidates' certificate digests match their entries
  in the candidate map.
- Whether no node tries to use the certificate of another node. In
  particular, it is important to check that no normal node tries to
  use the certificate of a master candidate.

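These checks could look roughly like this (``client_cert_digest`` is a
hypothetical per-node probe returning the digest of the certificate the
node actually uses):

::

  def verify_candidate_map(candidate_map, nodes):
    errors = []
    seen = {}
    for node in nodes:
      digest = client_cert_digest(node)
      if digest in seen:
        errors.append("%s reuses the certificate of %s"
                      % (node.name, seen[digest]))
      seen[digest] = node.name
      if node.master_candidate and candidate_map.get(node.uuid) != digest:
        errors.append("%s: digest differs from candidate map" % node.name)
      if not node.master_candidate and node.uuid in candidate_map:
        errors.append("%s: normal node listed in candidate map" % node.name)
    return errors
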

Crypto renewal
~~~~~~~~~~~~~~

Currently, when the cluster's cryptographic tokens are renewed using the
``gnt-cluster renew-crypto`` command, the node daemon certificate is
renewed (among others). The option ``--new-cluster-certificate`` renews
the node daemon certificate only.

By adding an option ``--new-node-certificates`` we offer to renew the
client certificates. Whenever the client certificates are renewed, the
candidate map has to be updated and redistributed.

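For example, a cluster administrator could then run:

::

  gnt-cluster renew-crypto --new-node-certificates
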
If the candidate map for whatever reason becomes inconsistent (for example
due to inconsistent updates after a demotion or offlining), the user can
use this option to renew the client certificates and update the candidate
certificate map.


Further considerations
----------------------

Watcher
~~~~~~~

The watcher is a script that is run on all nodes at regular intervals. The
changes proposed in this design will not affect the watcher's implementation,
because it behaves differently on the master than on non-master nodes.

Only on the master does the watcher issue query calls, which would require
a client certificate whose digest is in the candidate mapping; this
requirement is met for the master node. On non-master nodes, the watcher's
only external communication is done via the ConfD protocol, which uses the
hmac key that is present on all nodes. Besides that, the watcher does not
make any ssh connections, and thus is not affected by the changes in ssh
key handling either.


Other Keys and Daemons
~~~~~~~~~~~~~~~~~~~~~~

Ganeti handles a couple of other keys/certificates that have not been mentioned
in this design so far. Also, daemons other than the ones mentioned so far
perform intra-cluster communication. Neither the keys nor the daemons will
be affected by this design, for several reasons:

- The hmac key used by ConfD (see ``design-2.1.rst``): the hmac key is still
  distributed to all nodes, because it was designed to be used for
  communicating with ConfD, which should be possible from all nodes.
  For example, the monitoring daemon, which runs on all nodes, uses it to
  retrieve information from ConfD. However, since communication with ConfD
  is read-only, a compromised node holding the hmac key does not enable an
  attacker to change the cluster's state.

- The WConfD daemon writes the configuration to all master candidates
  via RPC. Since it only runs on the master node, its ability to run
  RPC requests is maintained with this design.

- The rapi SSL key certificate and the rapi user/password file 'rapi_users'
  are already copied only to the master candidates (see ``design-2.1.rst``,
  Section ``Redistribute Config``).

- The spice certificates are still distributed to all nodes, since it should
  be possible to use spice to access VMs on any cluster node.

- The cluster domain secret is used for inter-cluster instance moves.
  Since instances can be moved from any normal node of the source cluster to
  any normal node of the destination cluster, the presence of this
  secret on all nodes is necessary.


Related and Future Work
~~~~~~~~~~~~~~~~~~~~~~~

Ganeti RPC calls are currently done without server verification.
Establishing server verification might be a desirable feature, but is
not part of this design.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: