=================
Ganeti 2.2 design
=================

This document describes the major changes in Ganeti 2.2 compared to
the 2.1 version.

The 2.2 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.1, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Background
==========

Overview
========

Detailed design
===============

As for 2.1, we divide the 2.2 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
------------

Remote procedure call timeouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

The current RPC protocol used by Ganeti is based on HTTP. Every request
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``)
and doesn't return until the called function has returned. Parameters
and return values are encoded using JSON.

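For illustration, the blocking pattern looks roughly like this from a
client's point of view (a minimal standard-library sketch; host, port
and payload are illustrative, and certificate handling is omitted)::

  import http.client
  import json

  conn = http.client.HTTPConnection("node1.example.com", 1811)
  conn.request("PUT", "/hooks_runner", body=json.dumps(["example-arg"]))
  # Blocks here until the remote function has returned:
  response = conn.getresponse()
  result = json.loads(response.read())
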
On the server side, ``ganeti-noded`` handles every incoming connection
in a separate process by forking just after accepting the connection.
This process exits after sending the response.

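Schematically, this is the classic fork-per-connection loop (a
simplified sketch; the actual daemon layers HTTP and SSL on top)::

  import os
  import socket

  def handle_request(conn):
    """Stand-in for the real parse/dispatch/respond logic."""
    conn.sendall(b"HTTP/1.0 200 OK\r\n\r\n")

  server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  server.bind(("", 1811))
  server.listen(5)
  while True:
    conn, _ = server.accept()
    if os.fork() == 0:     # child: serve exactly one request
      handle_request(conn)
      os._exit(0)          # exit after sending the response
    conn.close()           # parent: close its copy, keep accepting
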
There is one major problem with this design: timeouts cannot be used on
a per-request basis. Neither client nor server knows how long a call
will take. Even if we were able to group requests into different
categories (e.g. fast and slow), this would not be reliable.

If a node has an issue or the network connection fails while a request
is being handled, the master daemon can wait for a long time for the
connection to time out (e.g. due to the operating system's underlying
TCP keep-alive packets or timeouts). While the settings for keep-alive
packets can be changed using Linux-specific socket options, we prefer
to use application-level timeouts because these cover both the
machine-down and the unresponsive-node-daemon cases.

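For reference, these are the kinds of Linux-specific socket options
meant above (a sketch; the values are arbitrary, and this is exactly
the knob we prefer not to rely on)::

  import socket

  sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
  # Linux-specific tuning; not portable, and it only detects a dead
  # machine, not a hung node daemon:
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
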
Proposed changes
++++++++++++++++

RPC glossary
^^^^^^^^^^^^

Function call ID
  Unique identifier returned by ``ganeti-noded`` after invoking a
  function.
Function process
  Process started by ``ganeti-noded`` to call the actual (backend)
  function.

Protocol
^^^^^^^^

Initially we chose HTTP as our RPC protocol because there were existing
libraries, which, unfortunately, turned out to miss important features
(such as SSL certificate authentication) and we had to write our own.

This proposal can easily be implemented using HTTP, though it would
likely be more efficient and less complicated to use the LUXI protocol
already used to communicate between client tools and the Ganeti master
daemon. This proposal should nevertheless be implemented using HTTP as
its underlying protocol; switching to another protocol can occur at a
later point.

The LUXI protocol currently contains two functions, ``WaitForJobChange``
and ``AutoArchiveJobs``, which can take a long time. Both support a
parameter to specify a timeout, which is usually chosen as roughly half
of the socket timeout, guaranteeing a response before the socket times
out. After the specified amount of time, ``AutoArchiveJobs`` returns and
reports the number of archived jobs, while ``WaitForJobChange`` returns
and reports a timeout. In both cases, the functions can simply be
called again.

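The resulting client-side pattern is a simple poll loop (a sketch with
hypothetical names; ``client.call`` is not the actual LUXI API)::

  SOCKET_TIMEOUT = 60.0
  CALL_TIMEOUT = SOCKET_TIMEOUT / 2  # answer before the socket expires

  def wait_for_job_change(client, job_id, prev_state):
    while True:
      result = client.call("WaitForJobChange", job_id, prev_state,
                           timeout=CALL_TIMEOUT)
      if result != "timeout":
        return result
      # Application-level timeout: the peer has proven it is alive,
      # so simply ask again.
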
A similar model can be used for the inter-node RPC protocol. In some
sense, the node daemon will implement a light variant of *"node daemon
jobs"*. When a function call is sent, it specifies an initial timeout.
If the function doesn't finish within this timeout, a response is sent
with a unique identifier, the function call ID. The client can then
choose to keep waiting for the function to finish, again with a
timeout. Inter-node RPC calls would no longer block indefinitely, and
there would be an implicit ping mechanism.

Request handling
^^^^^^^^^^^^^^^^

To support the protocol changes described above, the way the node daemon
handles requests will have to change. Instead of forking and handling
every connection in a separate process, there should be one child
process per function call, while the master process handles the
communication with clients and the function processes using
asynchronous I/O.

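A sketch of such a master-process event loop, multiplexing the
listening socket, client connections and function process pipes with
``select()`` (bookkeeping and dispatch are omitted)::

  import select

  def serve(listener, clients, fn_pipes):
    # clients: open client sockets; fn_pipes: stdout pipes of running
    # function processes (hypothetical bookkeeping).
    while True:
      readable, _, _ = select.select([listener] + clients + fn_pipes,
                                     [], [])
      for source in readable:
        if source is listener:
          clients.append(listener.accept()[0])  # new client connection
        elif source in fn_pipes:
          pass  # collect output/result from a function process
        else:
          pass  # read and dispatch a client request
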
Function processes communicate with the parent process via stdio and
possibly their exit status. Every function process has a unique
identifier, though it shouldn't be the process ID alone (PIDs can be
recycled and are prone to race conditions for this use case). The
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid``
is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the
current Unix timestamp with decimal places and ``random`` at least 16
random bits.

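A sketch of generating such an identifier (a hypothetical helper, not
the final implementation)::

  import os
  import random
  import time

  def make_fn_process_id(child_pid):
    # ${ppid}:${cpid}:${time}:${random}
    return "%d:%d:%.6f:%d" % (
      os.getpid(),             # ppid: the ganeti-noded process
      child_pid,               # cpid: the forked function process
      time.time(),             # Unix timestamp with decimal places
      random.getrandbits(16))  # at least 16 random bits
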
The following operations will be supported (a client-side usage sketch
follows the list):

``StartFunction(fn_name, fn_args, timeout)``
  Starts a function specified by ``fn_name`` with arguments in
  ``fn_args`` and waits up to ``timeout`` seconds for the function
  to finish. Fire-and-forget calls can be made by specifying a timeout
  of 0 seconds (e.g. for powercycling the node). Returns three values:
  function call ID (if not finished), whether the function finished (or
  timed out) and the function's return value.
``WaitForFunction(fnc_id, timeout)``
  Waits up to ``timeout`` seconds for the function call to finish.
  Return value same as ``StartFunction``.

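A sketch of the resulting master-side calling pattern; ``rpc_call`` is
a hypothetical stand-in for the HTTP transport::

  def rpc_call(node, operation, *args):
    """Send ``operation`` to ``node`` and return the decoded
    (fn_id, finished, result) triple (stubbed for illustration)."""
    raise NotImplementedError

  def call_function(node, fn_name, fn_args, timeout=10.0):
    # Initial call; with timeout=0 this becomes fire-and-forget.
    fn_id, finished, result = rpc_call(node, "StartFunction",
                                       fn_name, fn_args, timeout)
    while not finished:
      # Still running: wait again. Every round trip also acts as an
      # implicit ping of the node daemon.
      fn_id, finished, result = rpc_call(node, "WaitForFunction",
                                         fn_id, timeout)
    return result
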
In the future, ``StartFunction`` could support an additional parameter
to specify after how long the function process should be aborted.

Simplified timing diagram::

  Master daemon        Node daemon                      Function process
   |
  Call function
  (timeout 10s) -----> Parse request and fork for ----> Start function
                       calling actual function, then     |
                       wait up to 10s for function to    |
                       finish                            |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
                        |                               Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
   |

.. TODO: Convert diagram above to graphviz/dot graphic

On process termination (e.g. after having been sent a ``SIGTERM`` or
``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all
function processes and wait for all of them to terminate.

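A sketch of that shutdown path, assuming the daemon keeps the PIDs of
its function processes in a set (illustrative only)::

  import os
  import signal

  function_pids = set()  # updated as function processes come and go

  def shutdown(signum, frame):
    for pid in function_pids:
      os.kill(pid, signal.SIGTERM)  # ask every function process to stop
    for pid in function_pids:
      os.waitpid(pid, 0)            # then wait for all of them to exit

  signal.signal(signal.SIGTERM, shutdown)
  signal.signal(signal.SIGINT, shutdown)
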
Feature changes
---------------

KVM Security
~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Currently all kvm processes run as root. Taking ownership of the
hypervisor process from inside a virtual machine would therefore mean a
full compromise of the whole Ganeti cluster: knowledge of all Ganeti
authentication secrets, full access to all running instances, and the
option of subverting other basic services on the cluster (e.g. SSH).

Proposed changes
++++++++++++++++

We would like to decrease the attack surface available if a hypervisor
is compromised. We can do so by adding features to Ganeti that restrict
what a broken hypervisor can do to subvert the node, in the absence of
a local privilege escalation attack.

Dropping privileges in kvm to a single user (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-runas`` option to kvm, we can make it drop privileges.
The user can be chosen via a hypervisor parameter, so that each instance
can have its own user, but by default they will all run under the same
one. This should be very easy to implement, and can easily be backported
to 2.1.X.

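A sketch of the corresponding command-line change; the parameter name
``security_user`` and its default are assumptions, not the final
hypervisor parameter::

  def kvm_runas_args(hvparams):
    # One shared default user; a per-instance user can be configured
    # via the (hypothetical) security_user hypervisor parameter.
    user = hvparams.get("security_user", "ganeti-kvm")
    return ["-runas", user]
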
This mode protects the Ganeti cluster from a subverted hypervisor, but
doesn't protect the instances from each other, unless care is taken to
specify a different user for each. It would prevent the worst attacks,
including:

- logging in to other nodes
- administering the Ganeti cluster
- subverting other services

But the following would remain possible:

- terminating other VMs (though not starting them again, as that
  requires root privileges to set up networking), unless different
  users are used
- tracing other VMs, and probably subverting them and accessing their
  data, unless different users are used
- sending network traffic from the node
- reading unprotected data on the node filesystem

Running kvm in a chroot (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-chroot`` option to kvm, we can restrict the kvm
process to its own (possibly empty) root directory. We need to set this
area up so that the instance disks and control sockets are accessible,
so it would require slightly more work at the Ganeti level.

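A sketch of that setup work; the paths are illustrative, not Ganeti's
actual layout::

  import os

  def kvm_chroot_args(instance_name):
    root = "/var/run/ganeti/kvm-chroot/%s" % instance_name
    if not os.path.isdir(root):
      os.makedirs(root)  # the (possibly empty) root directory
    # Instance disks and control sockets still have to be made
    # reachable from inside this directory (e.g. via bind mounts).
    return ["-chroot", root]
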
Breaking out of a VM into a chrooted hypervisor would mean:

- far fewer options for finding a local privilege escalation vector
- no way to write local data, if the chroot is set up correctly
- no way to read filesystem data on the host

It would still be possible, though, to:

- terminate other VMs
- trace other VMs, and possibly subvert them (if a tracer can be
  installed in the chroot)
- send network traffic from the node

Running kvm with a pool of users (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If, rather than passing a single user as a hypervisor parameter, we
have a pool of usable ones, we can dynamically choose a free one to use
and thus guarantee that each machine will be separate from the others,
without putting the burden of this on the cluster administrator.

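A sketch of picking a free user from such a pool; the naming scheme
and the ``pgrep``-based liveness check are assumptions::

  import pwd
  import subprocess

  def pick_free_user(pool_size=20):
    for i in range(pool_size):
      user = "ganeti-kvm-%02d" % i
      uid = pwd.getpwnam(user).pw_uid
      # pgrep exits non-zero when no process runs under this user,
      # i.e. the user is free to run a new instance.
      if subprocess.call(["pgrep", "-u", str(uid)],
                         stdout=subprocess.DEVNULL) != 0:
        return user
    raise RuntimeError("KVM user pool exhausted")
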
This would make interference between machines impossible, and it can
still be combined with the chroot benefits.

Running iptables rules to limit network interaction (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These don't need to be handled by Ganeti, but we can ship examples. If
the users used to run VMs were blocked from sending some or all network
traffic, it would become impossible for a compromised hypervisor to
send arbitrary data on the node network. This is especially useful when
the instance and the node network are separated (using ganeti-nbma or a
separate set of network interfaces), or when a separate replication
network is maintained. We need to experiment to see how much
restriction we can properly apply without limiting legitimate instance
traffic.

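As an illustration, a shipped example could contain a rule along these
lines (the user name is hypothetical). Since bridged instance traffic
carries no local socket owner, only traffic generated by the kvm
process itself would be affected::

  # Prevent the pool user running kvm from opening new outbound
  # connections; legitimate guest traffic is bridged and does not
  # match the owner module.
  iptables -A OUTPUT -m owner --uid-owner ganeti-kvm -m state --state NEW -j DROP
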
Running kvm inside a container (even harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Recent Linux kernels support separate process namespaces through
control groups: PIDs, users, filesystems and even network interfaces
can be separated. If we can set up Ganeti to run kvm in a separate
container, we could insulate all host processes from even being visible
if the hypervisor gets broken into. Most probably, separating the
network namespace would require one extra hop in the host, through a
veth interface, thus reducing performance, so we may want to avoid that
and just rely on iptables.

Implementation plan
+++++++++++++++++++

We will first implement dropping privileges for kvm processes as a
single user, and most probably backport it to 2.1. Then we'll ship
example iptables rules to show how the user can be limited in its
network activities. After that we'll implement chroot restriction for
kvm processes, and extend the user limitation to use a user pool.

Finally, we'll look into namespaces and containers, although that work
might slip to after the 2.2 release.

External interface changes
--------------------------

.. vim: set textwidth=72 :