=================
Ganeti 2.2 design
=================

This document describes the major changes in Ganeti 2.2 compared to
the 2.1 version.

The 2.2 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.1, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Background
==========

Overview
========

Detailed design
===============

As for 2.1, we divide the 2.2 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
------------

Master Daemon Scaling improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Currently the Ganeti master daemon is based on four sets of threads:

- The main thread (1 thread) just accepts connections on the master
  socket
- The client worker pool (16 threads) handles those connections,
  one thread per connected socket, parses luxi requests, and sends data
  back to the clients
- The job queue worker pool (25 threads) executes the actual jobs
  submitted by the clients
- The rpc worker pool (10 threads) interacts with the nodes via
  http-based-rpc

This means that every masterd currently runs 52 threads to do its job.
Being able to reduce the number of thread sets would make the master's
architecture a lot simpler. Moreover, having fewer threads can help
decrease lock contention, log pollution and memory usage.
Also, with the current architecture, masterd suffers from quite a few
scalability issues:

Core daemon connection handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since the 16 client worker threads handle one connection each, it's very
easy to exhaust them, by just connecting to masterd 16 times and not
sending any data. While we could perhaps make those pools resizable,
increasing the number of threads won't help with lock contention, nor
will it improve the handling of long-running operations, where the
client must be informed that everything is proceeding and that it
doesn't need to time out.

Wait for job change
^^^^^^^^^^^^^^^^^^^

The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client
thread block on its job for a relatively long time. This is another easy
way to exhaust the 16 client threads, and a place where clients often
time out. Moreover, this operation worsens the job queue lock
contention (see below).

Job Queue lock
^^^^^^^^^^^^^^

The job queue lock is quite heavily contended, and certain easily
reproducible workloads show that it's very easy to put masterd in
trouble: for example, running ~15 background instance reinstall jobs
results in a master daemon that, even without having exhausted the
client worker threads, can't answer simple job list requests or
submit more jobs.

Currently the job queue lock is an exclusive non-fair lock insulating
the following job queue methods (called by the client workers):

  - AddNode
  - RemoveNode
  - SubmitJob
  - SubmitManyJobs
  - WaitForJobChanges
  - CancelJob
  - ArchiveJob
  - AutoArchiveJobs
  - QueryJobs
  - Shutdown

Moreover, the job queue lock is acquired outside of the job queue in two
other classes:

  - jqueue._JobQueueWorker (in RunTask) before executing the opcode, after
    finishing its execution and when handling an exception.
  - jqueue._OpExecCallbacks (in NotifyStart and Feedback) when the
    processor (mcpu.Processor) is about to start working on the opcode
    (after acquiring the necessary locks) and when any data is sent back
    via the feedback function.

Of those the major critical points are:

  - Submit[Many]Job, QueryJobs, WaitForJobChanges, which can easily slow
    down and block client threads to the point of making the respective
    clients time out.
  - The code paths in NotifyStart, Feedback, and RunTask, which slow
    down job processing between clients and otherwise non-related jobs.

To increase the pain:

  - WaitForJobChanges is a bad offender because it's implemented with a
    notified condition which awakes waiting threads, which then try to
    acquire the global lock again
  - Many should-be-fast code paths are slowed down by replicating the
    change to remote nodes, and thus waiting, with the lock held, on
    remote rpcs to complete (starting, finishing, and submitting jobs)

Proposed changes
++++++++++++++++

In order to be able to interact with the master daemon even when it's
under heavy load, and to make it simpler to add core functionality
(such as an asynchronous rpc client) we propose three subsequent levels
of changes to the master core architecture.

After making these changes we'll be able to re-evaluate the size of our
thread pool, if we see that we can make most threads in the client
worker pool always idle. In the future we should also investigate making
the rpc client asynchronous as well, so that we can make masterd a lot
smaller in number of threads, and memory size, and thus also easier to
understand, debug, and scale.

Connection handling
^^^^^^^^^^^^^^^^^^^

We'll move the main thread of ganeti-masterd to asyncore, so that it can
share the mainloop code with all other Ganeti daemons. Then all luxi
clients will be asyncore clients, and I/O to/from them will be handled
by the master thread asynchronously. Data will be read from the client
sockets as it becomes available, and kept in a buffer. When a
complete message is found, it's passed to a client worker thread for
parsing and processing. The client worker thread is responsible for
serializing the reply, which can then be sent asynchronously by the main
thread on the socket.
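
The buffering step described above can be sketched independently of
asyncore. This is only an illustration: the class and function names are
hypothetical, and the end-of-message byte is an assumption made for the
example, not something this document specifies.

```python
import queue


class MessageBuffer:
    """Accumulates raw bytes and splits them into complete messages."""

    def __init__(self, terminator=b"\x03"):
        # Assumption: messages end with a single terminator byte.
        self._terminator = terminator
        self._pending = b""

    def feed(self, data):
        """Add newly read bytes and return the list of completed messages."""
        self._pending += data
        # Everything before the last terminator is complete; the rest waits.
        *complete, self._pending = self._pending.split(self._terminator)
        return complete


def on_readable(buf, data, work_queue):
    """Would be called by the main loop when a client socket has data."""
    for message in buf.feed(data):
        # Parsing and processing are left to a client worker thread.
        work_queue.put(message)


work_queue = queue.Queue()
buf = MessageBuffer()
on_readable(buf, b'{"method": "Que', work_queue)     # partial, nothing queued
on_readable(buf, b'ryJobs"}\x03{"met', work_queue)   # completes one message
```

The main thread never parses a request here. It only moves complete
messages to the queue, which is what keeps it responsive when clients
send data slowly or not at all.
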

Wait for job change
^^^^^^^^^^^^^^^^^^^

The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be
subscription-based, so that the executing thread doesn't have to be
waiting for the changes to arrive. Threads producing messages (job queue
executors) will make sure that when there is a change another thread is
awakened and delivers it to the waiting clients. This can be either a
dedicated "wait for job changes" thread or pool, or one of the client
workers, depending on what's easier to implement. In either case the
main asyncore thread will only be involved in pushing the actual
data, and not in fetching/serializing it.

Other features to look at when implementing this code are:

  - Possibility not to need the job lock to know which updates to push:
    if the thread producing the data pushed a copy of the update for the
    waiting clients, the thread sending it won't need to acquire the
    lock again to fetch the actual data.
  - Possibility to signal clients about to time out, when no update has
    been received, not to despair and to keep waiting (luxi level
    keepalive).
  - Possibility to defer updates if they are too frequent, providing
    them at a maximum rate (lower priority).
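
A minimal sketch of the subscription idea, including the first point of
the list (pushing a copy of the update so that the delivering side does
not need the job lock), might look like this. ``JobChangeHub`` and its
methods are hypothetical names chosen for the example. Keepalives and
rate limiting are not covered.

```python
import copy
import threading


class JobChangeHub:
    """Keeps per-job subscriber lists (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()   # protects only the subscriber map
        self._subscribers = {}

    def subscribe(self, job_id, deliver):
        """Register a callable that will receive updates for job_id."""
        with self._lock:
            self._subscribers.setdefault(job_id, []).append(deliver)

    def publish(self, job_id, update):
        """Called by the thread that produced the change."""
        with self._lock:
            targets = list(self._subscribers.get(job_id, ()))
        for deliver in targets:
            # Each subscriber gets its own copy, so whoever sends it to the
            # client never has to read the live job data again.
            deliver(copy.deepcopy(update))


received = []
hub = JobChangeHub()
hub.subscribe(42, received.append)
status = {"status": "running", "log": ["step 1"]}
hub.publish(42, status)
status["log"].append("step 2")   # later changes do not alter the copy
```

In a real daemon ``deliver`` would more likely hand the copy to the
thread that writes to the client socket. Appending to a list just keeps
the example observable.
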

Job Queue lock
^^^^^^^^^^^^^^

In order to decrease the job queue lock contention, we will change the
code paths in the following ways, initially:

  - A per-job lock will be introduced. All operations affecting only one
    job (for example feedback, starting/finishing notifications,
    subscribing to or watching a job) will only require the job lock.
    This should be a leaf lock, but if a situation arises in which it
    must be acquired together with the global job queue lock, the global
    one must always be acquired last (for the global section).
  - The locks will be converted to a sharedlock. Any read-only operation
    will be able to proceed in parallel.
  - During remote update (which happens already per-job) we'll drop the
    job lock level to shared mode, so that activities reading the lock
    (for example job change notifications or QueryJobs calls) will be
    able to proceed in parallel.
  - The wait for job changes improvements proposed above will be
    implemented.
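
The lock ordering from the first point can be illustrated with plain
``threading.Lock`` objects. This is a simplified sketch: the class and
method names are made up for the example, and exclusive locks stand in
for the shared locks mentioned above, since Python's ``threading``
module does not provide a shared lock.

```python
import threading


class Job:
    def __init__(self, job_id):
        self.id = job_id
        self.lock = threading.Lock()   # per-job lock
        self.log = []


class JobQueue:
    def __init__(self):
        self.lock = threading.Lock()   # global job queue lock
        self.jobs = {}

    def add_job(self, job):
        with self.lock:
            self.jobs[job.id] = job

    def feedback(self, job, message):
        """Affects a single job, so only the per-job lock is taken."""
        with job.lock:
            job.log.append(message)

    def finish_and_remove(self, job):
        """Needs both locks: the job lock first, the global lock last."""
        with job.lock:
            job.log.append("finished")
            with self.lock:
                del self.jobs[job.id]


q = JobQueue()
j = Job(1)
q.add_job(j)
q.feedback(j, "copying disk")
q.finish_and_remove(j)
```

Always taking the global lock last means two threads can never hold the
two locks in opposite order, which rules out that kind of deadlock.
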

In the future other improvements may include splitting off some of the
work (e.g. replication of a job to remote nodes) to a separate thread pool
or asynchronous thread, not tied to the code path for answering client
requests or the one executing the "real" work. This can be discussed
again after we have used the more granular job queue in production and
tested its benefits.


Remote procedure call timeouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

The current RPC protocol used by Ganeti is based on HTTP. Every request
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``)
and doesn't return until the function called has returned. Parameters
and return values are encoded using JSON.

On the server side, ``ganeti-noded`` handles every incoming connection
in a separate process by forking just after accepting the connection.
This process exits after sending the response.

There is one major problem with this design: timeouts cannot be used on
a per-request basis. Neither client nor server knows how long a request
will take. Even if we might be able to group requests into different
categories (e.g. fast and slow), this is not reliable.

If a node has an issue or the network connection fails while a request
is being handled, the master daemon can wait for a long time for the
connection to time out (e.g. due to the operating system's underlying
TCP keep-alive packets or timeouts). While the settings for keep-alive
packets can be changed using Linux-specific socket options, we prefer to
use application-level timeouts because these cover both machine down and
unresponsive node daemon cases.

Proposed changes
++++++++++++++++

RPC glossary
^^^^^^^^^^^^

Function call ID
  Unique identifier returned by ``ganeti-noded`` after invoking a
  function.
Function process
  Process started by ``ganeti-noded`` to call actual (backend) function.

Protocol
^^^^^^^^

Initially we chose HTTP as our RPC protocol because there were existing
libraries, which, unfortunately, turned out to miss important features
(such as SSL certificate authentication) and we had to write our own.

This proposal can easily be implemented using HTTP, though it would
likely be more efficient and less complicated to use the LUXI protocol
already used to communicate between client tools and the Ganeti master
daemon. Switching to another protocol can occur at a later point. For
now, this proposal should be implemented using HTTP as its underlying
protocol.

The LUXI protocol currently contains two functions, ``WaitForJobChange``
and ``AutoArchiveJobs``, which can take a long time. They both support
a parameter to specify the timeout. This timeout is usually chosen as
roughly half of the socket timeout, guaranteeing a response before the
socket times out. After the specified amount of time,
``AutoArchiveJobs`` returns and reports the number of archived jobs.
``WaitForJobChange`` returns and reports a timeout. In both cases, the
functions can be called again.
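
The "half of the socket timeout, then call again" pattern can be
sketched as a small client-side loop. The ``call`` argument and the
``"timeout"`` marker are hypothetical stand-ins for the example and do
not reflect the actual LUXI client API.

```python
def wait_with_retries(call, socket_timeout):
    """Repeat a timed call until it reports something other than a timeout.

    ``call`` is a hypothetical callable that takes a timeout in seconds and
    returns either the string "timeout" or the actual result.
    """
    # Half the socket timeout, so the answer arrives before the socket
    # itself gives up.
    call_timeout = socket_timeout / 2.0
    attempts = 0
    while True:
        attempts += 1
        result = call(call_timeout)
        if result != "timeout":
            return result, attempts


# Fake server-side function: times out twice, then reports a change.
answers = iter(["timeout", "timeout", {"status": "success"}])
seen_timeouts = []


def fake_call(timeout):
    seen_timeouts.append(timeout)
    return next(answers)


result, attempts = wait_with_retries(fake_call, socket_timeout=60)
```
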

A similar model can be used for the inter-node RPC protocol. In some
sense, the node daemon will implement a light variant of *"node daemon
jobs"*. When the function call is sent, it specifies an initial timeout.
If the function didn't finish within this timeout, a response is sent
with a unique identifier, the function call ID. The client can then
choose to wait for the function to finish again with a timeout.
Inter-node RPC calls would no longer be blocking indefinitely and there
would be an implicit ping-mechanism.

Request handling
^^^^^^^^^^^^^^^^

To support the protocol changes described above, the way the node daemon
handles requests will have to change. Instead of forking and handling
every connection in a separate process, there should be one child
process per function call and the master process will handle the
communication with clients and the function processes using asynchronous
I/O.

Function processes communicate with the parent process via stdio and
possibly their exit status. Every function process has a unique
identifier, though it shouldn't be the process ID only (PIDs can be
recycled and are prone to race conditions for this use case). The
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid``
is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the
current Unix timestamp with decimal places and ``random`` at least 16
random bits.
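
As an illustration of the proposed identifier format, the following
sketch builds such an ID. The function name is hypothetical. The number
of decimal places and the hexadecimal rendering of the random part are
assumptions, as the text above fixes only the four fields and their
order.

```python
import os
import random
import time


def make_function_call_id(child_pid):
    """Build an ID of the form ppid:cpid:time:random."""
    ppid = os.getpid()                     # PID of the parent (daemon) process
    timestamp = "%.6f" % time.time()       # Unix timestamp with decimal places
    rand = random.SystemRandom().getrandbits(16)   # 16 random bits
    return "%d:%d:%s:%04x" % (ppid, child_pid, timestamp, rand)


fnc_id = make_function_call_id(child_pid=12345)
parts = fnc_id.split(":")
```
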

The following operations will be supported:

``StartFunction(fn_name, fn_args, timeout)``
  Starts a function specified by ``fn_name`` with arguments in
  ``fn_args`` and waits up to ``timeout`` seconds for the function
  to finish. Fire-and-forget calls can be made by specifying a timeout
  of 0 seconds (e.g. for powercycling the node). Returns three values:
  function call ID (if not finished), whether function finished (or
  timeout) and the function's return value.
``WaitForFunction(fnc_id, timeout)``
  Waits up to ``timeout`` seconds for function call to finish. Return
  value same as ``StartFunction``.

In the future, ``StartFunction`` could support an additional parameter
to specify after how long the function process should be aborted.
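
The semantics of the two operations can be sketched in a few lines. To
keep the example short and self-contained, worker threads stand in for
function processes, the function is passed as a callable instead of by
name, and the call ID is a simple counter. All of these are
simplifications for illustration, not part of the proposal.

```python
import itertools
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


class FunctionRunner:
    """Stand-in for the node daemon side (threads replace processes)."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._calls = {}
        self._counter = itertools.count(1)

    def start_function(self, fn, fn_args, timeout):
        fnc_id = "call-%d" % next(self._counter)   # simplified call ID
        self._calls[fnc_id] = self._pool.submit(fn, *fn_args)
        return self.wait_for_function(fnc_id, timeout)

    def wait_for_function(self, fnc_id, timeout):
        """Return (call ID if not finished, finished flag, return value)."""
        future = self._calls[fnc_id]
        try:
            value = future.result(timeout=timeout)
        except TimeoutError:
            return fnc_id, False, None
        del self._calls[fnc_id]
        return None, True, value


def slow_add(a, b):
    time.sleep(0.3)
    return a + b


runner = FunctionRunner()
fnc_id, finished, value = runner.start_function(slow_add, (2, 3), timeout=0.05)
polls = 0
while not finished:
    # Every round trip doubles as a liveness check of the remote side.
    polls += 1
    fnc_id, finished, value = runner.wait_for_function(fnc_id, timeout=0.05)
```

A timeout of 0 in ``start_function`` returns immediately with the call
ID, which corresponds to the fire-and-forget case.
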

Simplified timing diagram::

  Master daemon        Node daemon                      Function process
   |
  Call function
  (timeout 10s) -----> Parse request and fork for ----> Start function
                       calling actual function, then     |
                       wait up to 10s for function to    |
                       finish                            |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
                        |                               Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
   |

.. TODO: Convert diagram above to graphviz/dot graphic

On process termination (e.g. after having been sent a ``SIGTERM`` or
``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all
function processes and wait for all of them to terminate.
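
The shutdown behaviour could be sketched as follows. The helper name is
hypothetical, and short-lived Python child processes stand in for
function processes. In the daemon, this would presumably be called from
the ``SIGTERM``/``SIGINT`` handling path.

```python
import subprocess
import sys


def terminate_function_processes(children):
    """Send SIGTERM to every child, then wait for all of them to exit."""
    for child in children:
        child.terminate()          # sends SIGTERM on POSIX systems
    return [child.wait() for child in children]


# Two long-running children standing in for function processes.
children = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]
exit_codes = terminate_function_processes(children)
```

Signalling all children first and waiting afterwards lets them shut
down in parallel instead of one after the other.
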


Inter-cluster instance moves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

With the current design of Ganeti, moving whole instances between
different clusters involves a lot of manual work. There are several ways
to move instances, one of them being to export the instance, manually
copy all data to the new cluster and then import it again. Manual
changes to the instance's configuration, such as the IP address, may be
necessary in the new environment. The goal is to improve and automate
this process in Ganeti 2.2.

Proposed changes
++++++++++++++++

Authorization, Authentication and Security
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Until now, each Ganeti cluster was a self-contained entity and wouldn't
talk to other Ganeti clusters. Nodes within clusters only had to trust
the other nodes in the same cluster and the network used for replication
was trusted, too (hence the ability to use a separate, local network
for replication).

For inter-cluster instance transfers this model must be weakened. Nodes
in one cluster will have to talk to nodes in other clusters, sometimes
in other locations and, very importantly, via untrusted network
connections.

Various options have been considered for securing and authenticating the
data transfer from one machine to another. To reduce the risk of
accidentally overwriting data due to software bugs, authenticating the
arriving data was considered critical. Eventually we decided to use
socat's OpenSSL options (``OPENSSL:``, ``OPENSSL-LISTEN:`` et al), which
provide us with encryption, authentication and authorization when used
with separate keys and certificates.

Combinations of OpenSSH, GnuPG and Netcat were deemed too complex to set
up from within Ganeti. Any solution involving OpenSSH would require a
dedicated user with a home directory and likely automated modifications
to the user's ``$HOME/.ssh/authorized_keys`` file. When using Netcat,
GnuPG or another encryption method would be necessary to transfer the
data over an untrusted network. socat combines both in one program and
is already a dependency.

Each of the two clusters will have to generate an RSA key. The public
parts are exchanged between the clusters by a third party, such as an
administrator or a system interacting with Ganeti via the remote API
("third party" from here on). After receiving each other's public key,
the clusters can start talking to each other.

All encrypted connections must be verified on both sides. Neither side
may accept unverified certificates. The generated certificate should
only be valid for the time necessary to move the instance.

For additional protection of the instance data, the two clusters can
verify the certificates and destination information exchanged via the
third party by checking an HMAC signature using a key shared among the
involved clusters. By default this secret key will be a random string
unique to the cluster, generated by running SHA1 over 20 bytes read from
``/dev/urandom``, and the administrator must synchronize the secrets
between clusters before instances can be moved. If the third party does
not know the secret, it can't forge the certificates or redirect the
data. Unless disabled by a new cluster parameter, verifying the HMAC
signatures must be mandatory. The HMAC signature for X509 certificates
will be prepended to the certificate similar to an RFC822 header and
only covers the certificate (from ``-----BEGIN CERTIFICATE-----`` to
``-----END CERTIFICATE-----``). The header name will be
``X-Ganeti-Signature`` and its value will have the format
``$salt/$hash`` (salt and hash separated by slash). The salt may only
contain characters in the range ``[a-zA-Z0-9]``.
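
The signing scheme can be sketched with Python's ``hmac`` module. The
text above fixes the header name, the ``$salt/$hash`` layout and the
salt alphabet. How the salt enters the HMAC, the SHA1 digest for the
HMAC itself, the salt length and all function names are assumptions
made only for this example.

```python
import hashlib
import hmac
import os
import re
import secrets
import string

_SALT_CHARS = string.ascii_letters + string.digits
_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.S)


def generate_cluster_secret():
    """SHA1 over 20 random bytes, as described for the default secret."""
    return hashlib.sha1(os.urandom(20)).hexdigest().encode("ascii")


def _digest(key, salt, cert):
    # Assumption: the salt is prepended to the signed certificate text.
    return hmac.new(key, (salt + cert).encode("ascii"), hashlib.sha1).hexdigest()


def sign_certificate(key, cert_pem):
    cert = _CERT_RE.search(cert_pem).group(0)   # only BEGIN..END is covered
    salt = "".join(secrets.choice(_SALT_CHARS) for _ in range(8))
    return "X-Ganeti-Signature: %s/%s\n%s" % (salt, _digest(key, salt, cert), cert)


def verify_certificate(key, signed_pem):
    header, _, rest = signed_pem.partition("\n")
    match = re.fullmatch(r"X-Ganeti-Signature: ([a-zA-Z0-9]+)/([0-9a-f]+)", header)
    cert = _CERT_RE.search(rest)
    if not match or not cert:
        return False
    expected = _digest(key, match.group(1), cert.group(0))
    return hmac.compare_digest(expected, match.group(2))


key = generate_cluster_secret()
# Placeholder body; a real certificate would contain base64 data here.
pem = "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----"
signed = sign_certificate(key, pem)
```

A third party that relays ``signed`` without knowing ``key`` cannot
replace the certificate, because ``verify_certificate`` would then fail
on the receiving cluster.
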

On the web, the destination cluster would be equivalent to an HTTPS
server requiring verifiable client certificates. The browser would be
equivalent to the source cluster and must verify the server's
certificate while providing a client certificate to the server.

Copying data
^^^^^^^^^^^^
437 5b2069a9 Michael Hanselmann
438 5b2069a9 Michael Hanselmann
To simplify the implementation, we decided to operate at a block-device
439 5b2069a9 Michael Hanselmann
level only, allowing us to easily support non-DRBD instance moves.
440 5b2069a9 Michael Hanselmann
441 5b2069a9 Michael Hanselmann
Intra-cluster instance moves will re-use the existing export and import
442 5b2069a9 Michael Hanselmann
scripts supplied by instance OS definitions. Unlike simply copying the
443 5b2069a9 Michael Hanselmann
raw data, this allows to use filesystem-specific utilities to dump only
444 5b2069a9 Michael Hanselmann
used parts of the disk and to exclude certain disks from the move.
445 5b2069a9 Michael Hanselmann
Compression should be used to further reduce the amount of data
446 5b2069a9 Michael Hanselmann
transferred.
447 5b2069a9 Michael Hanselmann
448 5b2069a9 Michael Hanselmann
The export scripts writes all data to stdout and the import script reads
449 5b2069a9 Michael Hanselmann
it from stdin again. To avoid copying data and reduce disk space
450 5b2069a9 Michael Hanselmann
consumption, everything is read from the disk and sent over the network
451 5b2069a9 Michael Hanselmann
directly, where it'll be written to the new block device directly again.
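
The streaming idea can be illustrated with a small sketch. The file-like
objects stand in for the export script's stdout, the network connection
and the import script's stdin; the function names and the choice of
zlib are assumptions, as the design only asks for some form of
compression.

.. code-block:: python

    import io
    import zlib


    def stream_compressed(source, sink, chunk_size=64 * 1024):
        """Copy source to sink chunk by chunk, compressing on the fly."""
        compressor = zlib.compressobj()
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            sink.write(compressor.compress(chunk))
        # Emit whatever the compressor still holds in its buffer
        sink.write(compressor.flush())


    def stream_decompressed(source, sink, chunk_size=64 * 1024):
        """Copy source to sink chunk by chunk, decompressing on the fly."""
        decompressor = zlib.decompressobj()
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            sink.write(decompressor.decompress(chunk))
        sink.write(decompressor.flush())


    # Usage: in-memory buffers play the role of disk and network
    disk = io.BytesIO(b"ganeti" * 100000)
    wire = io.BytesIO()
    stream_compressed(disk, wire)

At no point does either side hold more than one chunk of the disk
contents in memory, and nothing is written to a temporary file.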

Workflow
^^^^^^^^

#. Third party tells source cluster to shut down instance, asks for the
   instance specification and for the public part of an encryption key

   - Instance information can already be retrieved using an existing API
     (``OpQueryInstanceData``).
   - An RSA encryption key and a corresponding self-signed X509
     certificate are generated using the "openssl" command. This key
     will be used to encrypt the data sent to the destination cluster.

     - Private keys never leave the cluster.
     - The public part (the X509 certificate) is signed using HMAC with
       salting and a secret shared between Ganeti clusters.

#. Third party tells destination cluster to create an instance with the
   same specifications as on source cluster and to prepare for an
   instance move with the key received from the source cluster; in
   return it receives the public part of the destination's encryption
   key

   - The current API to create instances (``OpCreateInstance``) will be
     extended to support an import from a remote cluster.
   - A valid, unexpired X509 certificate signed with the destination
     cluster's secret will be required. By verifying the signature, we
     know the third party didn't modify the certificate.

     - The private keys never leave their cluster, hence the third party
       cannot decrypt or intercept the instance's data by modifying the
       IP address or port sent by the destination cluster.

   - The destination cluster generates another key and certificate,
     signs and sends it to the third party, who will have to pass it to
     the API for exporting an instance (``OpExportInstance``). This
     certificate is used to ensure we're sending the disk data to the
     correct destination cluster.
   - Once a disk can be imported, the API sends the destination
     information (IP address and TCP port) together with an HMAC
     signature to the third party.

#. Third party hands public part of the destination's encryption key
   together with all necessary information to source cluster and tells
   it to start the move

   - The existing API for exporting instances (``OpExportInstance``)
     will be extended to export instances to remote clusters.

#. Source cluster connects to destination cluster for each disk and
   transfers its data using the instance OS definition's export and
   import scripts

   - Before starting, the source cluster must verify the HMAC signature
     of the certificate and destination information (IP address and TCP
     port).
   - When connecting to the remote machine, strong certificate checks
     must be employed.

#. Due to the asynchronous nature of the whole process, the destination
   cluster checks whether all disks have been transferred every time
   after transferring a single disk; if so, it destroys the encryption
   key
#. After sending all disks, the source cluster destroys its key
#. Destination cluster runs OS definition's rename script to adjust
   instance settings if needed (e.g. IP address)
#. Destination cluster starts the instance if requested at the beginning
   by the third party
#. Source cluster removes the instance if requested
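
The bookkeeping the destination cluster does after each disk transfer
can be sketched as follows. The class name and the ``destroy_key``
callback are hypothetical and only illustrate the "check after every
disk" rule; they are not part of any existing Ganeti API.

.. code-block:: python

    class DiskTransferTracker(object):
        """Destroys the encryption key once every disk has arrived."""

        def __init__(self, disk_count, destroy_key):
            self._pending = set(range(disk_count))
            self._destroy_key = destroy_key

        def disk_done(self, index):
            """Mark one disk as transferred; return True if all are done."""
            if index not in self._pending:
                raise ValueError("disk %d is not pending" % index)
            self._pending.remove(index)
            if not self._pending:
                # Transfers finish in arbitrary order, so the key is only
                # destroyed when the last outstanding disk is reported
                self._destroy_key()
            return not self._pending


    # Usage: two disks finishing out of order
    events = []
    tracker = DiskTransferTracker(2, lambda: events.append("key destroyed"))
    tracker.disk_done(1)
    tracker.disk_done(0)

Because a disk index can only be reported once, the key is destroyed
exactly once, regardless of the order in which the transfers complete.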

Instance move in pseudo code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. highlight:: python

The following pseudo code describes a script moving instances between
clusters and what happens on both clusters.

#. Script is started, gets the instance name and destination cluster::

    (instance_name, dest_cluster_name) = sys.argv[1:]

    # Get destination cluster object
    dest_cluster = db.FindCluster(dest_cluster_name)

    # Use database to find source cluster
    src_cluster = db.FindClusterByInstance(instance_name)

#. Script tells source cluster to stop instance::

    # Stop instance
    src_cluster.StopInstance(instance_name)

    # Get instance specification (memory, disk, etc.)
    inst_spec = src_cluster.GetInstanceInfo(instance_name)

    (src_key_name, src_cert) = src_cluster.CreateX509Certificate()

#. ``CreateX509Certificate`` on source cluster::

    key_file = mkstemp()
    cert_file = "%s.cert" % key_file
    RunCmd(["/usr/bin/openssl", "req", "-new",
            "-newkey", "rsa:1024", "-days", "1",
            "-nodes", "-x509", "-batch",
            "-keyout", key_file, "-out", cert_file])

    plain_cert = utils.ReadFile(cert_file)

    # HMAC sign using secret key, this adds a "X-Ganeti-Signature"
    # header to the beginning of the certificate
    signed_cert = utils.SignX509Certificate(plain_cert,
      utils.ReadFile(constants.X509_SIGNKEY_FILE))

    # The certificate now looks like the following:
    #
    #   X-Ganeti-Signature: 1234/28676f0516c6ab68062b[…]
    #   -----BEGIN CERTIFICATE-----
    #   MIICsDCCAhmgAwIBAgI[…]
    #   -----END CERTIFICATE-----

    # Return name of key file and signed certificate in PEM format
    return (os.path.basename(key_file), signed_cert)

#. Script creates instance on destination cluster and waits for move to
   finish::

    dest_cluster.CreateInstance(mode=constants.REMOTE_IMPORT,
                                spec=inst_spec,
                                source_cert=src_cert)

    # Wait until destination cluster gives us its certificate and the
    # destination information for every disk
    dest_cert = None
    disk_info = {}
    while not (dest_cert and len(disk_info) == len(inst_spec.disks)):
      tmp = dest_cluster.WaitOutput()
      if isinstance(tmp, Certificate):
        dest_cert = tmp
      elif isinstance(tmp, DiskInfo):
        # DiskInfo contains destination address and port
        disk_info[tmp.index] = tmp

    # Tell source cluster to export disks
    for (_, disk) in sorted(disk_info.items()):
      src_cluster.ExportDisk(instance_name, disk=disk,
                             key_name=src_key_name,
                             dest_cert=dest_cert)

    print ("Instance %s successfully moved to %s" %
           (instance_name, dest_cluster.name))

#. ``CreateInstance`` on destination cluster::

    # …

    if mode == constants.REMOTE_IMPORT:
      # Make sure certificate was not modified since it was generated by
      # source cluster (which must use the same secret)
      if (not utils.VerifySignedX509Cert(source_cert,
            utils.ReadFile(constants.X509_SIGNKEY_FILE))):
        raise Error("Certificate not signed with this cluster's secret")

      if utils.CheckExpiredX509Cert(source_cert):
        raise Error("X509 certificate is expired")

      source_cert_file = utils.WriteTempFile(source_cert)

      # See above for X509 certificate generation and signing
      (key_name, signed_cert) = CreateX509Certificate()

      SendToClient("x509-cert", signed_cert)

      for disk in instance.disks:
        # Start socat
        RunCmd(("socat"
                " OPENSSL-LISTEN:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
                " stdout > /dev/disk…") %
               (port,
                GetRsaKeyPath(key_name, private=True),
                GetRsaKeyPath(key_name, private=False),
                source_cert_file))
        SendToClient("send-disk-to", disk, ip_address, port)

      DestroyX509Cert(key_name)

      RunRenameScript(instance_name)

#. ``ExportDisk`` on source cluster::

    # Make sure certificate was not modified since it was generated by
    # destination cluster (which must use the same secret)
    if (not utils.VerifySignedX509Cert(dest_cert,
          utils.ReadFile(constants.X509_SIGNKEY_FILE))):
      raise Error("Certificate not signed with this cluster's secret")

    if utils.CheckExpiredX509Cert(dest_cert):
      raise Error("X509 certificate is expired")

    dest_cert_file = utils.WriteTempFile(dest_cert)

    # Start socat
    RunCmd(("socat stdin"
            " OPENSSL:%s:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
            " < /dev/disk…") %
           (disk.host, disk.port,
            GetRsaKeyPath(key_name, private=True),
            GetRsaKeyPath(key_name, private=False),
            dest_cert_file))

    if instance.all_disks_done:
      DestroyX509Cert(key_name)

.. highlight:: text
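
The expiry check performed by ``utils.CheckExpiredX509Cert`` in the
pseudo code is not spelled out. As a rough sketch, assuming the
certificate's ``notAfter`` value is available in the textual form
printed by ``openssl x509 -noout -enddate`` (without the ``notAfter=``
prefix), it could look like this; ``is_expired`` is a hypothetical
helper.

.. code-block:: python

    import ssl
    import time


    def is_expired(not_after, now=None):
        """Return True if the notAfter time is not in the future."""
        if now is None:
            now = time.time()
        # cert_time_to_seconds expects e.g. "Jan  5 09:34:43 2018 GMT"
        return ssl.cert_time_to_seconds(not_after) <= now


    # Usage: a certificate that ran out in January 2018
    expired = is_expired("Jan  5 09:34:43 2018 GMT")

Since the certificates in this design are generated with a validity of
one day, such a check bounds the time window in which a leaked
certificate could be replayed.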

Miscellaneous notes
^^^^^^^^^^^^^^^^^^^

- A very similar system could also be used for instance exports within
  the same cluster. OpenSSH is currently used for this, but it could be
  replaced by socat and SSL/TLS.
- During the design of inter-cluster instance moves we also discussed
  encrypting instance exports using GnuPG.
- While most instances should have exactly the same configuration as
  on the source cluster, setting them up with a different disk layout
  might be helpful in some use-cases.
- A cleanup operation, similar to the one available for failed instance
  migrations, should be provided.
- ``ganeti-watcher`` should remove instances pending a move from another
  cluster after a certain amount of time. This takes care of failures
  somewhere in the process.
- RSA keys can be generated using the existing
  ``bootstrap.GenerateSelfSignedSslCert`` function, though it might be
  useful to not write both parts into a single file, requiring small
  changes to the function. The public part always starts with
  ``-----BEGIN CERTIFICATE-----`` and ends with ``-----END
  CERTIFICATE-----``.
- The source and destination cluster might be different when it comes
  to available hypervisors, kernels, etc. The destination cluster should
  refuse to accept an instance move if it can't fulfill an instance's
  requirements.
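
The timeout-based cleanup suggested for ``ganeti-watcher`` could be
sketched as below. The function name, the mapping of instance names to
start timestamps and the timeout value are all assumptions made for
illustration; the design does not define how pending moves are
recorded.

.. code-block:: python

    import time


    def stale_moves(pending, timeout, now=None):
        """Return the names of instances whose move has been pending
        for longer than ``timeout`` seconds.

        ``pending`` maps instance names to the time (seconds since the
        epoch) at which the import was prepared.
        """
        if now is None:
            now = time.time()
        return sorted(name for (name, started) in pending.items()
                      if now - started > timeout)


    # Usage: with a ten minute timeout only "inst1" is overdue
    overdue = stale_moves({"inst1": 100.0, "inst2": 900.0},
                          timeout=600, now=1000.0)

The watcher would then remove the returned instances on the destination
cluster.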

Feature changes
---------------

KVM Security
~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Currently all kvm processes run as root. Taking ownership of the
hypervisor process, from inside a virtual machine, would mean a full
compromise of the whole Ganeti cluster, knowledge of all Ganeti
authentication secrets, full access to all running instances, and the
option of subverting other basic services on the cluster (e.g. ssh).

Proposed changes
++++++++++++++++

We would like to decrease the attack surface available if a hypervisor
is compromised. We can do so by adding different features to Ganeti
which, in the absence of a local privilege escalation attack, restrict
what a compromised hypervisor can do to subvert the node.

Dropping privileges in kvm to a single user (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-runas`` option to kvm, we can make it drop privileges.
The user can be chosen by a hypervisor parameter, so that each instance
can have its own user, but by default they will all run under the same
one. It should be very easy to implement, and can easily be backported
to 2.1.X.
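
A minimal sketch of how the option could be appended to the kvm command
line follows. The helper name, the base arguments and the user name
``kvm-guest`` are hypothetical; only the ``-runas`` option itself comes
from the text above.

.. code-block:: python

    def kvm_runas_args(base_args, user=None):
        """Return a copy of base_args, adding -runas if a user is set."""
        args = list(base_args)
        if user:
            # kvm switches to this user after its privileged setup
            args.extend(["-runas", user])
        return args


    # Usage: the user would come from a hypervisor parameter
    cmd = kvm_runas_args(["kvm", "-m", "512"], user="kvm-guest")

Leaving the parameter unset keeps the current behaviour, which makes
the change easy to introduce without affecting existing instances.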

This mode protects the Ganeti cluster from a subverted hypervisor, but
doesn't protect the instances from each other, unless care is taken
to specify a different user for each. This would prevent the worst
attacks, including:

- logging in to other nodes
- administering the Ganeti cluster
- subverting other services

The following would still be possible:

- terminating other VMs (but not starting them again, as that requires
  root privileges to set up networking), unless different users are
  used
- tracing other VMs, and probably subverting them and accessing their
  data, unless different users are used
- sending network traffic from the node
- reading unprotected data on the node filesystem

Running kvm in a chroot (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By passing the ``-chroot`` option to kvm, we can restrict the kvm
process to its own (possibly empty) root directory. We need to set this
area up so that the instance disks and control sockets are accessible,
so it would require slightly more work at the Ganeti level.

Breaking out into a chroot would mean:

- far fewer options to find a local privilege escalation vector
- the impossibility to write local data, if the chroot is set up
  correctly
- the impossibility to read filesystem data on the host

It would still be possible though to:

- terminate other VMs
- trace other VMs, and possibly subvert them (if a tracer can be
  installed in the chroot)
- send network traffic from the node

Running kvm with a pool of users (slightly harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If, rather than passing a single user as a hypervisor parameter, we have
a pool of usable ones, we can dynamically choose a free one to use and
thus guarantee that each machine will be separate from the others,
without putting the burden of this on the cluster administrator.

This would make interference between machines impossible, and it can
still be combined with the chroot benefits.
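
Choosing a free user could be as simple as the following sketch. The
function and the user names are hypothetical, and how Ganeti would
learn which users are currently in use on a node is left open here.

.. code-block:: python

    def pick_free_user(pool, in_use):
        """Return the first user of the pool not running an instance."""
        busy = set(in_use)
        for user in pool:
            if user not in busy:
                return user
        raise LookupError("no free user left in the pool")


    # Usage: "kvm0" already runs an instance, so "kvm1" is chosen
    user = pick_free_user(["kvm0", "kvm1", "kvm2"], in_use=["kvm0"])

The pool must be at least as large as the maximum number of instances
running on a node, otherwise starting an instance has to fail.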

Running iptables rules to limit network interaction (easy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These don't need to be handled by Ganeti, but we can ship examples. If
the users used to run VMs are blocked from sending some or all network
traffic, it becomes impossible for a compromised hypervisor to send
arbitrary data on the node network, which is especially useful when the
instance and the node network are separated (using ganeti-nbma or a
separate set of network interfaces), or when a separate replication
network is maintained. We need to experiment to see how much restriction
we can properly apply without limiting the instance's legitimate
traffic.
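
One possible shape for such an example rule is sketched below, using
the iptables ``owner`` match on the ``OUTPUT`` chain. The helper and
the user name are hypothetical, and dropping *all* locally generated
traffic of that user is merely the strictest variant; as noted above,
the right level of restriction still needs experimentation.

.. code-block:: python

    def drop_output_rule(user):
        """Build an iptables command dropping packets sent by ``user``."""
        return ["iptables", "-A", "OUTPUT",
                "-m", "owner", "--uid-owner", user,
                "-j", "DROP"]


    # Usage: the resulting list could be passed to a command runner
    rule = drop_output_rule("kvm-guest")

The command is only built here, not executed, so the sketch can be
tried without root privileges.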

Running kvm inside a container (even harder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Recent Linux kernels support different process namespaces through
control groups. PIDs, users, filesystems and even network interfaces can
be separated. If we can set up Ganeti to run kvm in a separate container
we could hide all the host processes from the hypervisor, so that they
are not even visible if it gets broken into. Most probably separating
the network namespace would require one extra hop in the host, through
a veth interface, thus reducing performance, so we may want to avoid
that, and just rely on iptables.

Implementation plan
+++++++++++++++++++

We will first implement dropping privileges for kvm processes as a
single user, and most probably backport it to 2.1. Then we'll ship
example iptables rules to show how the user can be limited in its
network activities. After that we'll implement chroot restriction for
kvm processes, and extend the user limitation to use a user pool.

Finally we'll look into namespaces and containers, although that might
slip until after the 2.2 release.

External interface changes
--------------------------

.. vim: set textwidth=72 :