=================
Ganeti 2.2 design
=================

This document describes the major changes in Ganeti 2.2 compared to
the 2.1 version.

The 2.2 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.1, in a timely fashion.

.. contents:: :depth: 4

As for 2.1 we divide the 2.2 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)


Core changes
============

Master Daemon Scaling improvements
----------------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently the Ganeti master daemon is based on four sets of threads:

- The main thread (1 thread) just accepts connections on the master
  socket
- The client worker pool (16 threads) handles those connections,
  one thread per connected socket, parses luxi requests, and sends data
  back to the clients
- The job queue worker pool (25 threads) executes the actual jobs
  submitted by the clients
- The rpc worker pool (10 threads) interacts with the nodes via
  http-based-rpc

This means that every masterd currently runs 52 threads to do its job.
Being able to reduce the number of thread sets would make the master's
architecture a lot simpler. Moreover, having fewer threads can help
decrease lock contention, log pollution and memory usage.
Also, with the current architecture, masterd suffers from quite a few
scalability issues:

Core daemon connection handling
+++++++++++++++++++++++++++++++

Since the 16 client worker threads handle one connection each, it's very
easy to exhaust them, by just connecting to masterd 16 times and not
sending any data. While we could perhaps make those pools resizable,
increasing the number of threads won't help with lock contention, nor
with better handling of long-running operations (making sure the client
is informed that everything is proceeding and doesn't need to time out).

Wait for job change
+++++++++++++++++++

The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client
thread block on its job for a relatively long time. This is another easy
way to exhaust the 16 client threads, and a place where clients often
time out. Moreover, this operation increases job queue lock contention
(see below).

Job Queue lock
++++++++++++++

The job queue lock is quite heavily contended, and certain easily
reproducible workloads show that it's very easy to put masterd in
trouble: for example, running ~15 background instance reinstall jobs
results in a master daemon that, even without having exhausted the
client worker threads, can't answer simple job list requests, or
submit more jobs.

Currently the job queue lock is an exclusive non-fair lock insulating
the following job queue methods (called by the client workers):

- AddNode
- RemoveNode
- SubmitJob
- SubmitManyJobs
- WaitForJobChanges
- CancelJob
- ArchiveJob
- AutoArchiveJobs
- QueryJobs
- Shutdown

Moreover the job queue lock is acquired outside of the job queue in two
other classes:

- jqueue._JobQueueWorker (in RunTask) before executing the opcode, after
  finishing its execution and when handling an exception.
- jqueue._OpExecCallbacks (in NotifyStart and Feedback) when the
  processor (mcpu.Processor) is about to start working on the opcode
  (after acquiring the necessary locks) and when any data is sent back
  via the feedback function.

Of those the major critical points are:

- Submit[Many]Job, QueryJobs, WaitForJobChanges, which can easily slow
  down and block client threads, up to the point where the respective
  clients time out.
- The code paths in NotifyStart, Feedback, and RunTask, which slow
  down job processing between clients and otherwise unrelated jobs.

To increase the pain:

- WaitForJobChanges is a bad offender because it's implemented with a
  notified condition which wakes up the waiting threads, which then try
  to acquire the global lock again
- Many should-be-fast code paths are slowed down by replicating the
  change to remote nodes, and thus waiting, with the lock held, on
  remote rpcs to complete (starting, finishing, and submitting jobs)

Proposed changes
~~~~~~~~~~~~~~~~

In order to be able to interact with the master daemon even when it's
under heavy load, and to make it simpler to add core functionality
(such as an asynchronous rpc client) we propose three subsequent levels
of changes to the master core architecture.

After making these changes we'll be able to re-evaluate the size of our
thread pool, if we see that we can make most threads in the client
worker pool always idle. In the future we should also investigate making
the rpc client asynchronous as well, so that we can make masterd a lot
smaller in number of threads, and memory size, and thus also easier to
understand, debug, and scale.

Connection handling
+++++++++++++++++++

We'll move the main thread of ganeti-masterd to asyncore, so that it can
share the mainloop code with all other Ganeti daemons. Then all luxi
clients will be asyncore clients, and I/O to/from them will be handled
by the master thread asynchronously. Data will be read from the client
sockets as it becomes available, and kept in a buffer; when a complete
message is found, it's passed to a client worker thread for parsing and
processing. The client worker thread is responsible for serializing the
reply, which can then be sent asynchronously by the main thread on the
socket.
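
A minimal sketch of the intended main-thread behaviour, using asyncore's
``asynchat`` helper for the per-connection buffering (the terminator
byte and the ``AddTask``/``send_reply`` worker-pool interface are
assumptions for illustration, not the final API)::

  import asyncore
  import asynchat
  import socket

  class LuxiClientHandler(asynchat.async_chat):
    """Buffers data from one luxi client, hands off complete messages."""

    def __init__(self, sock, worker_pool):
      asynchat.async_chat.__init__(self, sock)
      self.set_terminator("\3")  # assumed luxi end-of-message marker
      self._inbuf = []
      self._worker_pool = worker_pool

    def collect_incoming_data(self, data):
      # Called from the asyncore mainloop as data becomes available
      self._inbuf.append(data)

    def found_terminator(self):
      # A complete message arrived; parsing and processing happen in a
      # client worker thread, never in the main thread
      message = "".join(self._inbuf)
      self._inbuf = []
      self._worker_pool.AddTask(self, message)

    def send_reply(self, serialized_reply):
      # Called by the worker thread with the already-serialized reply;
      # the actual write is performed asynchronously by the main thread
      self.push(serialized_reply + "\3")

  class MasterSocketServer(asyncore.dispatcher):
    """The main thread: only accepts connections on the master socket."""

    def __init__(self, path, worker_pool):
      asyncore.dispatcher.__init__(self)
      self._worker_pool = worker_pool
      self.create_socket(socket.AF_UNIX, socket.SOCK_STREAM)
      self.bind(path)
      self.listen(5)

    def handle_accept(self):
      connection = self.accept()
      if connection is not None:
        LuxiClientHandler(connection[0], self._worker_pool)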

Wait for job change
+++++++++++++++++++

The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be
subscription-based, so that the executing thread doesn't have to be
waiting for the changes to arrive. Threads producing messages (job queue
executors) will make sure that when there is a change another thread is
awakened and delivers it to the waiting clients. This can be either a
dedicated "wait for job changes" thread or pool, or one of the client
workers, depending on what's easier to implement. In either case the
main asyncore thread will only be involved in pushing the actual data,
and not in fetching/serializing it.
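
A rough sketch of such a subscription mechanism (class and method names
are purely illustrative, and the delivery could equally happen in one of
the client worker threads)::

  import json
  import threading

  class JobChangeNotifier(object):
    """Delivers job updates to subscribed luxi clients.

    Job queue executors call PushUpdate() with a copy of the changed job
    data, so the delivering thread never needs to re-acquire the job
    queue lock to fetch it.
    """

    def __init__(self):
      self._lock = threading.Lock()
      self._subscribers = {}  # job_id -> list of client handlers

    def Subscribe(self, job_id, client):
      self._lock.acquire()
      try:
        self._subscribers.setdefault(job_id, []).append(client)
      finally:
        self._lock.release()

    def PushUpdate(self, job_id, update_copy):
      # Called by the thread producing the change (a job executor)
      self._lock.acquire()
      try:
        clients = list(self._subscribers.get(job_id, []))
      finally:
        self._lock.release()
      serialized = json.dumps(update_copy)
      for client in clients:
        # send_reply() only queues the data; the asyncore main thread
        # performs the actual push on the socket
        client.send_reply(serialized)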

Other features to look at when implementing this code are:

- Possibility not to need the job lock to know which updates to push:
  if the thread producing the data pushes a copy of the update for the
  waiting clients, the thread sending it won't need to acquire the
  lock again to fetch the actual data.
- Possibility to signal clients which are about to time out, when no
  update has been received, not to despair and to keep waiting (luxi
  level keepalive).
- Possibility to defer updates if they are too frequent, providing
  them at a maximum rate (lower priority).

Job Queue lock
++++++++++++++

In order to decrease the job queue lock contention, we will change the
code paths in the following ways, initially:

- A per-job lock will be introduced. All operations affecting only one
  job (for example feedback, starting/finishing notifications,
  subscribing to or watching a job) will only require the job lock.
  This should be a leaf lock, but if a situation arises in which it
  must be acquired together with the global job queue lock the global
  one must always be acquired last (for the global section).
- The locks will be converted to a sharedlock. Any read-only operation
  will be able to proceed in parallel.
- During remote update (which happens already per-job) we'll drop the
  job lock level to shared mode, so that activities reading the job
  (for example job change notifications or QueryJobs calls) will be
  able to proceed in parallel.
- The wait for job changes improvements proposed above will be
  implemented.

In the future other improvements may include splitting off some of the
work (eg replication of a job to remote nodes) to a separate thread pool
or asynchronous thread, not tied with the code path for answering client
requests or the one executing the "real" work. This can be discussed
again after we've used the more granular job queue in production and
tested its benefits.
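
To illustrate the intended locking rules, a sketch of the remote-update
path (the lock, method and rpc names here are illustrative only)::

  def ReplicateJob(job, rpc):
    """Replicates a single job to the remote nodes.

    Only the per-job lock is needed; it is held exclusively while the
    job object is serialized, then downgraded to shared mode for the
    slow remote rpc, so that readers (QueryJobs, job change
    notifications) can proceed in parallel.  If the global job queue
    lock were ever needed as well, it would have to be acquired last,
    as described above.
    """
    job.lock.acquire()        # exclusive mode
    try:
      serialized = job.Serialize()
      job.lock.downgrade()    # shared mode for the slow remote part
      rpc.UpdateJobFile(job.id, serialized)
    finally:
      job.lock.release()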


Remote procedure call timeouts
------------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The current RPC protocol used by Ganeti is based on HTTP. Every request
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``)
and doesn't return until the function called has returned. Parameters
and return values are encoded using JSON.

On the server side, ``ganeti-noded`` handles every incoming connection
in a separate process by forking just after accepting the connection.
This process exits after sending the response.

There is one major problem with this design: timeouts can not be used on
a per-request basis. Neither client nor server knows how long it will
take. Even if we might be able to group requests into different
categories (e.g. fast and slow), this is not reliable.

If a node has an issue or the network connection fails while a request
is being handled, the master daemon can wait for a long time for the
connection to time out (e.g. due to the operating system's underlying
TCP keep-alive packets or timeouts). While the settings for keep-alive
packets can be changed using Linux-specific socket options, we prefer to
use application-level timeouts because these cover both machine down and
unresponsive node daemon cases.

Proposed changes
~~~~~~~~~~~~~~~~

RPC glossary
++++++++++++

Function call ID
  Unique identifier returned by ``ganeti-noded`` after invoking a
  function.
Function process
  Process started by ``ganeti-noded`` to call actual (backend) function.

Protocol
++++++++

Initially we chose HTTP as our RPC protocol because there were existing
libraries, which, unfortunately, turned out to miss important features
(such as SSL certificate authentication) and we had to write our own.

This proposal can easily be implemented using HTTP, though it would
likely be more efficient and less complicated to use the LUXI protocol
already used to communicate between client tools and the Ganeti master
daemon. Switching to another protocol can occur at a later point. This
proposal should be implemented using HTTP as its underlying protocol.

The LUXI protocol currently contains two functions, ``WaitForJobChange``
and ``AutoArchiveJobs``, which can take a longer time. They both support
a parameter to specify the timeout. This timeout is usually chosen as
roughly half of the socket timeout, guaranteeing a response before the
socket times out. After the specified amount of time,
``AutoArchiveJobs`` returns and reports the number of archived jobs.
``WaitForJobChange`` returns and reports a timeout. In both cases, the
functions can be called again.

A similar model can be used for the inter-node RPC protocol. In some
sense, the node daemon will implement a light variant of *"node daemon
jobs"*. When the function call is sent, it specifies an initial timeout.
If the function didn't finish within this timeout, a response is sent
with a unique identifier, the function call ID. The client can then
choose to wait for the function to finish again with a timeout.
Inter-node RPC calls would no longer be blocking indefinitely and there
would be an implicit ping-mechanism.

Request handling
++++++++++++++++

To support the protocol changes described above, the way the node daemon
handles requests will have to change. Instead of forking and handling
every connection in a separate process, there should be one child
process per function call and the master process will handle the
communication with clients and the function processes using asynchronous
I/O.

Function processes communicate with the parent process via stdio and
possibly their exit status. Every function process has a unique
identifier, though it shouldn't be the process ID only (PIDs can be
recycled and are prone to race conditions for this use case). The
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid``
is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the
current Unix timestamp with decimal places and ``random`` at least 16
random bits.
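
As an illustration, such an identifier could be generated roughly like
this (a sketch, not the final implementation)::

  import os
  import random
  import time

  def _MakeFunctionCallId(child_pid):
    """Returns an ID of the form ``${ppid}:${cpid}:${time}:${random}``.

    ``child_pid`` is the PID of the freshly forked function process.
    """
    return "%s:%s:%.6f:%x" % (os.getpid(),            # ganeti-noded PID
                              child_pid,              # function process PID
                              time.time(),            # timestamp w/ decimals
                              random.getrandbits(64)) # >= 16 random bits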

The following operations will be supported:

``StartFunction(fn_name, fn_args, timeout)``
  Starts a function specified by ``fn_name`` with arguments in
  ``fn_args`` and waits up to ``timeout`` seconds for the function
  to finish. Fire-and-forget calls can be made by specifying a timeout
  of 0 seconds (e.g. for powercycling the node). Returns three values:
  function call ID (if not finished), whether function finished (or
  timeout) and the function's return value.
``WaitForFunction(fnc_id, timeout)``
  Waits up to ``timeout`` seconds for function call to finish. Return
  value same as ``StartFunction``.

In the future, ``StartFunction`` could support an additional parameter
to specify after how long the function process should be aborted.
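
From the master daemon's point of view, an RPC could then be wrapped
roughly as follows (``node_rpc`` is a hypothetical transport object
exposing the two operations above)::

  def CallNodeFunction(node_rpc, fn_name, fn_args, timeout=10):
    """Calls a node function, re-waiting until it has finished.

    Every StartFunction/WaitForFunction round trip doubles as an
    implicit ping: if the node stops answering, the HTTP request itself
    fails quickly instead of blocking indefinitely.
    """
    (fn_id, finished, result) = node_rpc.StartFunction(fn_name, fn_args,
                                                       timeout)
    while not finished:
      (fn_id, finished, result) = node_rpc.WaitForFunction(fn_id, timeout)
    return result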

Simplified timing diagram::

  Master daemon          Node daemon                        Function process
    |
  Call function
  (timeout 10s) -----> Parse request and fork for    ----> Start function
                       calling actual function, then         |
                       wait up to 10s for function to        |
                       finish                                |
                         |                                   |
                        ...                                 ...
                         |                                   |
  Examine return <----   |                                   |
  value and wait                                             |
  again -------------> Wait another 10s for function         |
                         |                                   |
                        ...                                 ...
                         |                                   |
  Examine return <----   |                                   |
  value and wait                                             |
  again -------------> Wait another 10s for function         |
                         |                                   |
                        ...                                 ...
                         |                                   |
                         |                             Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
    |

.. TODO: Convert diagram above to graphviz/dot graphic

On process termination (e.g. after having been sent a ``SIGTERM`` or
``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all
function processes and wait for all of them to terminate.
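
A sketch of that termination handling (assuming ``ganeti-noded`` keeps
the PIDs of its running function processes in a set)::

  import os
  import signal

  _function_pids = set()  # PIDs of currently running function processes

  def _TerminationHandler(signum, frame):
    # Forward the termination request to all function processes...
    for pid in _function_pids:
      try:
        os.kill(pid, signal.SIGTERM)
      except OSError:
        pass  # process has already exited
    # ...and wait for each of them before exiting ourselves
    for pid in _function_pids:
      try:
        os.waitpid(pid, 0)
      except OSError:
        pass  # already reaped elsewhere

  signal.signal(signal.SIGTERM, _TerminationHandler)
  signal.signal(signal.SIGINT, _TerminationHandler)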


Inter-cluster instance moves
----------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the current design of Ganeti, moving whole instances between
different clusters involves a lot of manual work. There are several ways
to move instances, one of them being to export the instance, manually
copying all data to the new cluster before importing it again. Manual
changes to the instance's configuration, such as the IP address, may be
necessary in the new environment. The goal is to improve and automate
this process in Ganeti 2.2.

Proposed changes
~~~~~~~~~~~~~~~~

Authorization, Authentication and Security
++++++++++++++++++++++++++++++++++++++++++

Until now, each Ganeti cluster was a self-contained entity and wouldn't
talk to other Ganeti clusters. Nodes within clusters only had to trust
the other nodes in the same cluster and the network used for replication
was trusted, too (hence the ability to use a separate, local network
for replication).

For inter-cluster instance transfers this model must be weakened. Nodes
in one cluster will have to talk to nodes in other clusters, sometimes
in other locations and, most importantly, via untrusted network
connections.

Various options have been considered for securing and authenticating the
data transfer from one machine to another. To reduce the risk of
accidentally overwriting data due to software bugs, authenticating the
arriving data was considered critical. Eventually we decided to use
socat's OpenSSL options (``OPENSSL:``, ``OPENSSL-LISTEN:`` et al), which
provide us with encryption, authentication and authorization when used
with separate keys and certificates.

Combinations of OpenSSH, GnuPG and Netcat were deemed too complex to set
up from within Ganeti. Any solution involving OpenSSH would require a
dedicated user with a home directory and likely automated modifications
to the user's ``$HOME/.ssh/authorized_keys`` file. When using Netcat,
GnuPG or another encryption method would be necessary to transfer the
data over an untrusted network. socat combines both in one program and
is already a dependency.

Each of the two clusters will have to generate an RSA key. The public
parts are exchanged between the clusters by a third party, such as an
administrator or a system interacting with Ganeti via the remote API
("third party" from here on). After receiving each other's public key,
the clusters can start talking to each other.

All encrypted connections must be verified on both sides. Neither side
may accept unverified certificates. The generated certificate should
only be valid for the time necessary to move the instance.

For additional protection of the instance data, the two clusters can
verify the certificates and destination information exchanged via the
third party by checking an HMAC signature using a key shared among the
involved clusters. By default this secret key will be a random string
unique to the cluster, generated by running SHA1 over 20 bytes read from
``/dev/urandom``, and the administrator must synchronize the secrets
between clusters before instances can be moved. If the third party does
not know the secret, it can't forge the certificates or redirect the
data. Unless disabled by a new cluster parameter, verifying the HMAC
signatures must be mandatory. The HMAC signature for X509 certificates
will be prepended to the certificate similar to an :rfc:`822` header and
only covers the certificate (from ``-----BEGIN CERTIFICATE-----`` to
``-----END CERTIFICATE-----``). The header name will be
``X-Ganeti-Signature`` and its value will have the format
``$salt/$hash`` (salt and hash separated by slash). The salt may only
contain characters in the range ``[a-zA-Z0-9]``.
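
As a sketch of how such a signed certificate could be produced (the
helper below is illustrative; in particular, exactly how the salt is
mixed into the HMAC is an implementation detail)::

  import hashlib
  import hmac
  import random
  import string

  def SignCertificate(cert_pem, secret_key):
    """Prepends an ``X-Ganeti-Signature`` header to a PEM certificate."""
    # The salt is restricted to [a-zA-Z0-9], as required above
    salt = "".join(random.choice(string.ascii_letters + string.digits)
                   for _ in range(16))
    # The HMAC covers only the certificate text, salted with the salt
    digest = hmac.new(secret_key, salt + cert_pem, hashlib.sha1).hexdigest()
    return "X-Ganeti-Signature: %s/%s\n%s" % (salt, digest, cert_pem)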

On the web, the destination cluster would be equivalent to an HTTPS
server requiring verifiable client certificates. The browser would be
equivalent to the source cluster and must verify the server's
certificate while providing a client certificate to the server.

Copying data
++++++++++++

To simplify the implementation, we decided to operate at a block-device
level only, allowing us to easily support non-DRBD instance moves.

Inter-cluster instance moves will re-use the existing export and import
scripts supplied by instance OS definitions. Unlike simply copying the
raw data, this allows using filesystem-specific utilities to dump only
used parts of the disk and to exclude certain disks from the move.
Compression should be used to further reduce the amount of data
transferred.

The export script writes all data to stdout and the import script reads
it from stdin again. To avoid copying data and reduce disk space
consumption, everything is read from the disk and sent over the network
directly, where it'll be written to the new block device directly again.
441 |
|
442 |
Workflow |
443 |
++++++++ |
444 |
|
445 |
#. Third party tells source cluster to shut down instance, asks for the |
446 |
instance specification and for the public part of an encryption key |
447 |
|
448 |
- Instance information can already be retrieved using an existing API |
449 |
(``OpQueryInstanceData``). |
450 |
- An RSA encryption key and a corresponding self-signed X509 |
451 |
certificate is generated using the "openssl" command. This key will |
452 |
be used to encrypt the data sent to the destination cluster. |
453 |
|
454 |
- Private keys never leave the cluster. |
455 |
- The public part (the X509 certificate) is signed using HMAC with |
456 |
salting and a secret shared between Ganeti clusters. |
457 |
|
458 |
#. Third party tells destination cluster to create an instance with the |
459 |
same specifications as on source cluster and to prepare for an |
460 |
instance move with the key received from the source cluster and |
461 |
receives the public part of the destination's encryption key |
462 |
|
463 |
- The current API to create instances (``OpCreateInstance``) will be |
464 |
extended to support an import from a remote cluster. |
465 |
- A valid, unexpired X509 certificate signed with the destination |
466 |
cluster's secret will be required. By verifying the signature, we |
467 |
know the third party didn't modify the certificate. |
468 |
|
469 |
- The private keys never leave their cluster, hence the third party |
470 |
can not decrypt or intercept the instance's data by modifying the |
471 |
IP address or port sent by the destination cluster. |
472 |
|
473 |
- The destination cluster generates another key and certificate, |
474 |
signs and sends it to the third party, who will have to pass it to |
475 |
the API for exporting an instance (``OpExportInstance``). This |
476 |
certificate is used to ensure we're sending the disk data to the |
477 |
correct destination cluster. |
478 |
- Once a disk can be imported, the API sends the destination |
479 |
information (IP address and TCP port) together with an HMAC |
480 |
signature to the third party. |
481 |
|
482 |
#. Third party hands public part of the destination's encryption key |
483 |
together with all necessary information to source cluster and tells |
484 |
it to start the move |
485 |
|
486 |
- The existing API for exporting instances (``OpExportInstance``) |
487 |
will be extended to export instances to remote clusters. |
488 |
|
489 |
#. Source cluster connects to destination cluster for each disk and |
490 |
transfers its data using the instance OS definition's export and |
491 |
import scripts |
492 |
|
493 |
- Before starting, the source cluster must verify the HMAC signature |
494 |
of the certificate and destination information (IP address and TCP |
495 |
port). |
496 |
- When connecting to the remote machine, strong certificate checks |
497 |
must be employed. |
498 |
|
499 |
#. Due to the asynchronous nature of the whole process, the destination |
500 |
cluster checks whether all disks have been transferred every time |
501 |
after transferring a single disk; if so, it destroys the encryption |
502 |
key |
503 |
#. After sending all disks, the source cluster destroys its key |
504 |
#. Destination cluster runs OS definition's rename script to adjust |
505 |
instance settings if needed (e.g. IP address) |
506 |
#. Destination cluster starts the instance if requested at the beginning |
507 |
by the third party |
508 |
#. Source cluster removes the instance if requested |
509 |
|
510 |
Instance move in pseudo code |
511 |
++++++++++++++++++++++++++++ |
512 |
|
513 |
.. highlight:: python |
514 |
|
515 |
The following pseudo code describes a script moving instances between |
516 |
clusters and what happens on both clusters. |
517 |
|
518 |
#. Script is started, gets the instance name and destination cluster:: |
519 |
|
520 |
(instance_name, dest_cluster_name) = sys.argv[1:] |
521 |
|
522 |
# Get destination cluster object |
523 |
dest_cluster = db.FindCluster(dest_cluster_name) |
524 |
|
525 |
# Use database to find source cluster |
526 |
src_cluster = db.FindClusterByInstance(instance_name) |
527 |
|
528 |
#. Script tells source cluster to stop instance:: |
529 |
|
530 |
# Stop instance |
531 |
src_cluster.StopInstance(instance_name) |
532 |
|
533 |
# Get instance specification (memory, disk, etc.) |
534 |
inst_spec = src_cluster.GetInstanceInfo(instance_name) |
535 |
|
536 |
(src_key_name, src_cert) = src_cluster.CreateX509Certificate() |
537 |
|
538 |
#. ``CreateX509Certificate`` on source cluster:: |
539 |
|
540 |
key_file = mkstemp() |
541 |
cert_file = "%s.cert" % key_file |
542 |
RunCmd(["/usr/bin/openssl", "req", "-new", |
543 |
"-newkey", "rsa:1024", "-days", "1", |
544 |
"-nodes", "-x509", "-batch", |
545 |
"-keyout", key_file, "-out", cert_file]) |
546 |
|
547 |
plain_cert = utils.ReadFile(cert_file) |
548 |
|
549 |
# HMAC sign using secret key, this adds a "X-Ganeti-Signature" |
550 |
# header to the beginning of the certificate |
551 |
signed_cert = utils.SignX509Certificate(plain_cert, |
552 |
utils.ReadFile(constants.X509_SIGNKEY_FILE)) |
553 |
|
554 |
# The certificate now looks like the following: |
555 |
# |
556 |
# X-Ganeti-Signature: $1234$28676f0516c6ab68062b[โฆ] |
557 |
# -----BEGIN CERTIFICATE----- |
558 |
# MIICsDCCAhmgAwIBAgI[โฆ] |
559 |
# -----END CERTIFICATE----- |
560 |
|
561 |
# Return name of key file and signed certificate in PEM format |
562 |
return (os.path.basename(key_file), signed_cert) |

#. Script creates instance on destination cluster and waits for move to
   finish::

     dest_cluster.CreateInstance(mode=constants.REMOTE_IMPORT,
                                 spec=inst_spec,
                                 source_cert=src_cert)

     # Wait until destination cluster gives us its certificate and the
     # destination information for every disk
     dest_cert = None
     disk_info = []
     while not (dest_cert and len(disk_info) >= len(inst_spec.disks)):
       tmp = dest_cluster.WaitOutput()
       if tmp is Certificate:
         dest_cert = tmp
       elif tmp is DiskInfo:
         # DiskInfo contains destination address and port
         disk_info[tmp.index] = tmp

     # Tell source cluster to export disks
     for disk in disk_info:
       src_cluster.ExportDisk(instance_name, disk=disk,
                              key_name=src_key_name,
                              dest_cert=dest_cert)

     print ("Instance %s successfully moved to %s" %
            (instance_name, dest_cluster.name))

#. ``CreateInstance`` on destination cluster::

     # …

     if mode == constants.REMOTE_IMPORT:
       # Make sure certificate was not modified since it was generated by
       # source cluster (which must use the same secret)
       if (not utils.VerifySignedX509Cert(source_cert,
             utils.ReadFile(constants.X509_SIGNKEY_FILE))):
         raise Error("Certificate not signed with this cluster's secret")

       if utils.CheckExpiredX509Cert(source_cert):
         raise Error("X509 certificate is expired")

       source_cert_file = utils.WriteTempFile(source_cert)

       # See above for X509 certificate generation and signing
       (key_name, signed_cert) = CreateSignedX509Certificate()

       SendToClient("x509-cert", signed_cert)

       for disk in instance.disks:
         # Start socat
         RunCmd(("socat"
                 " OPENSSL-LISTEN:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
                 " stdout > /dev/disk…") %
                (port, GetRsaKeyPath(key_name, private=True),
                 GetRsaKeyPath(key_name, private=False), src_cert_file))
         SendToClient("send-disk-to", disk, ip_address, port)

       DestroyX509Cert(key_name)

       RunRenameScript(instance_name)

#. ``ExportDisk`` on source cluster::

     # Make sure certificate was not modified since it was generated by
     # destination cluster (which must use the same secret)
     if (not utils.VerifySignedX509Cert(cert_pem,
           utils.ReadFile(constants.X509_SIGNKEY_FILE))):
       raise Error("Certificate not signed with this cluster's secret")

     if utils.CheckExpiredX509Cert(cert_pem):
       raise Error("X509 certificate is expired")

     dest_cert_file = utils.WriteTempFile(cert_pem)

     # Start socat
     RunCmd(("socat stdin"
             " OPENSSL:%s:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
             " < /dev/disk…") %
            (disk.host, disk.port,
             GetRsaKeyPath(key_name, private=True),
             GetRsaKeyPath(key_name, private=False), dest_cert_file))

     if instance.all_disks_done:
       DestroyX509Cert(key_name)

.. highlight:: text

Miscellaneous notes
+++++++++++++++++++

- A very similar system could also be used for instance exports within
  the same cluster. Currently OpenSSH is being used, but could be
  replaced by socat and SSL/TLS.
- During the design of inter-cluster instance moves we also discussed
  encrypting instance exports using GnuPG.
- While most instances should have exactly the same configuration as
  on the source cluster, setting them up with a different disk layout
  might be helpful in some use-cases.
- A cleanup operation, similar to the one available for failed instance
  migrations, should be provided.
- ``ganeti-watcher`` should remove instances pending a move from another
  cluster after a certain amount of time. This takes care of failures
  somewhere in the process.
- RSA keys can be generated using the existing
  ``bootstrap.GenerateSelfSignedSslCert`` function, though it might be
  useful to not write both parts into a single file, requiring small
  changes to the function. The public part always starts with
  ``-----BEGIN CERTIFICATE-----`` and ends with ``-----END
  CERTIFICATE-----``.
- The source and destination cluster might be different when it comes
  to available hypervisors, kernels, etc. The destination cluster should
  refuse to accept an instance move if it can't fulfill an instance's
  requirements.


Privilege separation
--------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All Ganeti daemons are run under the user root. This is not ideal from a
security perspective: if any of the daemons is exploited, the attacker
gains full root access to the system.

In order to overcome this situation we'll allow Ganeti to run its
daemons under different users and a dedicated group. This will also
allow some useful side effects, like letting users run some ``gnt-*``
commands if they are in the same group.

Implementation
~~~~~~~~~~~~~~

For Ganeti 2.2 the implementation will be focused on the RAPI daemon
only. This involves changes to ``daemons.py`` so it's possible to drop
privileges when daemonizing the process. This will, however, be a
short-term solution, to be replaced by a privilege drop already on
daemon startup in Ganeti 2.3.

It also needs changes in the master daemon to create the socket with new
permissions/owners to allow RAPI access. There will be no other
permission/owner changes in the file structure, as the RAPI daemon is
started with root permissions. At startup it will read all needed files
and then drop privileges before contacting the master daemon.
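
A minimal sketch of such a privilege drop (the user and group names are
illustrative only)::

  import grp
  import os
  import pwd

  def DropPrivileges(username="gnt-rapi", groupname="gnt-daemons"):
    """Switches the current process to a non-root user and group.

    Must be called while still running as root, after all files needing
    root permissions have been read, and before talking to masterd.
    """
    uid = pwd.getpwnam(username).pw_uid
    gid = grp.getgrnam(groupname).gr_gid
    os.setgroups([])  # drop supplementary groups first
    os.setgid(gid)    # group must be changed before the user
    os.setuid(uid)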
708 |
|
709 |
|
710 |
Feature changes |
711 |
=============== |
712 |
|
713 |
KVM Security |
714 |
------------ |
715 |
|
716 |
Current state and shortcomings |
717 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
718 |
|
719 |
Currently all kvm processes run as root. Taking ownership of the |
720 |
hypervisor process, from inside a virtual machine, would mean a full |
721 |
compromise of the whole Ganeti cluster, knowledge of all Ganeti |
722 |
authentication secrets, full access to all running instances, and the |
723 |
option of subverting other basic services on the cluster (eg: ssh). |
724 |
|
725 |
Proposed changes |
726 |
~~~~~~~~~~~~~~~~ |
727 |
|
728 |
We would like to decrease the surface of attack available if an |
729 |
hypervisor is compromised. We can do so adding different features to |
730 |
Ganeti, which will allow restricting the broken hypervisor |
731 |
possibilities, in the absence of a local privilege escalation attack, to |
732 |
subvert the node. |
733 |
|
734 |
Dropping privileges in kvm to a single user (easy) |
735 |
++++++++++++++++++++++++++++++++++++++++++++++++++ |
736 |
|
737 |
By passing the ``-runas`` option to kvm, we can make it drop privileges. |
738 |
The user can be chosen by an hypervisor parameter, so that each instance |
739 |
can have its own user, but by default they will all run under the same |
740 |
one. It should be very easy to implement, and can easily be backported |
741 |
to 2.1.X. |
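
As a sketch, the hypervisor code would only need to append one extra
option to the kvm command line (the ``security_user`` parameter name
below is purely illustrative)::

  def _GetKvmRunasArgs(hvparams):
    """Returns the extra kvm arguments used to drop privileges.

    The user comes from a (hypothetical) hypervisor parameter, so an
    instance can override the cluster-wide default.
    """
    user = hvparams.get("security_user") or "ganeti-kvm"
    return ["-runas", user]

  # Example usage: kvm_cmd.extend(_GetKvmRunasArgs(hvparams))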
742 |
|
743 |
This mode protects the Ganeti cluster from a subverted hypervisor, but |
744 |
doesn't protect the instances between each other, unless care is taken |
745 |
to specify a different user for each. This would prevent the worst |
746 |
attacks, including: |
747 |
|
748 |
- logging in to other nodes |
749 |
- administering the Ganeti cluster |
750 |
- subverting other services |
751 |
|
752 |
But the following would remain an option: |
753 |
|
754 |
- terminate other VMs (but not start them again, as that requires root |
755 |
privileges to set up networking) (unless different users are used) |
756 |
- trace other VMs, and probably subvert them and access their data |
757 |
(unless different users are used) |
758 |
- send network traffic from the node |
759 |
- read unprotected data on the node filesystem |

Running kvm in a chroot (slightly harder)
+++++++++++++++++++++++++++++++++++++++++

By passing the ``-chroot`` option to kvm, we can restrict the kvm
process in its own (possibly empty) root directory. We need to set this
area up so that the instance disks and control sockets are accessible,
so it would require slightly more work at the Ganeti level.

Breaking out in a chroot would mean:

- a lot less options to find a local privilege escalation vector
- the impossibility to write local data, if the chroot is set up
  correctly
- the impossibility to read filesystem data on the host

It would still be possible though to:

- terminate other VMs
- trace other VMs, and possibly subvert them (if a tracer can be
  installed in the chroot)
- send network traffic from the node


Running kvm with a pool of users (slightly harder)
++++++++++++++++++++++++++++++++++++++++++++++++++

If rather than passing a single user as a hypervisor parameter, we have
a pool of usable ones, we can dynamically choose a free one to use and
thus guarantee that each machine will be separate from the others,
without putting the burden of this on the cluster administrator.

This would mean interfering between machines would be impossible, and
can still be combined with the chroot benefits.

Running iptables rules to limit network interaction (easy)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

These don't need to be handled by Ganeti, but we can ship examples. If
the users used to run VMs were blocked from sending some or all network
traffic, it would become impossible for a compromised hypervisor to
send arbitrary data on the node network, which is especially useful
when the instance and the node network are separated (using ganeti-nbma
or a separate set of network interfaces), or when a separate replication
network is maintained. We need to experiment to see how much restriction
we can properly apply, without limiting the instance's legitimate
traffic.


Running kvm inside a container (even harder)
++++++++++++++++++++++++++++++++++++++++++++

Recent linux kernels support different process namespaces through
control groups. PIDs, users, filesystems and even network interfaces can
be separated. If we can set up ganeti to run kvm in a separate container
we could insulate all the host processes from being even visible if the
hypervisor gets broken into. Most probably separating the network
namespace would require one extra hop in the host, through a veth
interface, thus reducing performance, so we may want to avoid that, and
just rely on iptables.

Implementation plan
~~~~~~~~~~~~~~~~~~~

We will first implement dropping privileges for kvm processes as a
single user, and most probably backport it to 2.1. Then we'll ship
example iptables rules to show how the user can be limited in its
network activities. After that we'll implement chroot restriction for
kvm processes, and extend the user limitation to use a user pool.

Finally we'll look into namespaces and containers, although that might
slip after the 2.2 release.


External interface changes
==========================


OS API
------

The OS variants implementation in Ganeti 2.1 didn't prove to be useful
enough to alleviate the need to hack around the Ganeti API in order to
provide flexible OS parameters.

As such, for Ganeti 2.2 we will provide support for arbitrary OS
parameters. However, since OSes are not registered in Ganeti, but
instead discovered at runtime, the interface is not entirely
straightforward.

Furthermore, to support the system administrator in keeping OSes
properly in sync across the nodes of a cluster, Ganeti will also verify
(if it exists) the consistency of a new ``os_version`` file.

These changes to the OS API will bump the API version to 20.


OS version
~~~~~~~~~~

A new ``os_version`` file will be supported by Ganeti. This file is not
required, but if it exists, its contents will be checked for consistency
across nodes. The file should hold only one line of text (any extra data
will be discarded), and its contents will be shown in the OS information
and diagnose commands.

It is recommended that OS authors change the contents of this file for
any changes; at a minimum, modifications that change the behaviour of
import/export scripts must increase the version, since they break
intra-cluster migration.

Parameters
~~~~~~~~~~

The interface between Ganeti and the OS scripts will be based on
environment variables, and as such the parameters and their values will
need to be valid in this context.

Names
+++++

The parameter names will be declared in a new file, ``parameters.list``,
together with a one-line documentation (whitespace-separated). Example::

  $ cat parameters.list
  ns1 Specifies the first name server to add to /etc/resolv.conf
  extra_packages Specifies additional packages to install
  rootfs_size Specifies the root filesystem size (the rest will be left unallocated)
  track Specifies the distribution track, one of 'stable', 'testing' or 'unstable'

As seen above, the documentation can be separated from the names via
multiple spaces/tabs.

The parameter names as read from the file will be used for the command
line interface in lowercased form; as such, there shouldn't be any two
parameters which differ in case only.

Values
++++++

The values of the parameters are, from Ganeti's point of view,
completely freeform. If a given parameter has, from the OS' point of
view, a fixed set of valid values, these should be documented as such
and verified by the OS, but Ganeti will not handle such parameters
specially.

An empty value must be handled identically as a missing parameter. In
other words, the validation script should only test for non-empty
values, and not for declared versus undeclared parameters.

Furthermore, each parameter should have an (internal to the OS) default
value, that will be used if not passed from Ganeti. More precisely, it
should be possible for any parameter to specify a value that will have
the same effect as not passing the parameter, and in no case should the
absence of a parameter be treated as an exceptional case (outside the
value space).


Environment variables
^^^^^^^^^^^^^^^^^^^^^

The parameters will be exposed in the environment upper-case and
prefixed with the string ``OSP_``. For example, a parameter declared in
the ``parameters.list`` file as ``ns1`` will appear in the environment
as the variable ``OSP_NS1``.
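
A sketch of the corresponding conversion (illustrative helper, with a
value taken from the example above)::

  def OSParamsToEnv(osparams):
    """Converts OS parameters into OSP_-prefixed environment variables.

    For example, {"ns1": "192.0.2.1"} becomes {"OSP_NS1": "192.0.2.1"}.
    """
    return dict(("OSP_%s" % name.upper(), value)
                for (name, value) in osparams.items())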

Validation
++++++++++

For the purpose of parameter name/value validation, the OS scripts
*must* provide an additional script, named ``verify``. This script will
be called with the argument ``parameters``, and all the parameters will
be passed in via environment variables, as described above.

The script should signal success or failure via its exit code, and
show explanatory messages either on its standard output or standard
error. These messages will be passed on to the master, and stored in
the OpCode result/error message.

The parameters must be constructed to be independent of the instance
specifications. In general, the validation script will only be called
with the parameter variables set, but not with the normal per-instance
variables, in order for Ganeti to be able to validate default parameters
too, when they change. Validation will only be performed on one cluster
node, and it will be up to the ganeti administrator to keep the OS
scripts in sync between all nodes.

Instance operations
+++++++++++++++++++

The parameters will be passed, as described above, to all the other
instance operations (creation, import, export). Ideally, these scripts
will not abort with parameter validation errors, if the ``verify``
script has verified them correctly.

Note: when changing an instance's OS type, any OS parameters defined at
instance level will be kept as-is. If the parameters differ between the
new and the old OS, the user should manually remove/update them as
needed.

Declaration and modification
++++++++++++++++++++++++++++

Since the OSes are not registered in Ganeti, we will only make a 'weak'
link between the parameters as declared in Ganeti and the actual OSes
existing on the cluster.

It will be possible to declare parameters either globally, per cluster
(where they are indexed per OS/variant), or individually, per
instance. The declaration of parameters will not be tied to currently
existing OSes. When specifying a parameter, if the OS exists, it will be
validated; if not, then it will simply be stored as-is.

A special note is that it will not be possible to 'unset' at instance
level a parameter that is declared globally. Instead, at instance level
the parameter should be given an explicit value, or the default value as
explained above.

CLI interface
+++++++++++++

The modification of global (default) parameters will be done via the
``gnt-os`` command, and the per-instance parameters via the
``gnt-instance`` command. Both these commands will take an additional
``--os-parameters`` or ``-O`` flag that specifies the parameters in the
familiar comma-separated, key=value format. For removing a parameter, a
``-key`` syntax will be used, e.g.::

  # initial modification
  $ gnt-instance modify -O use_dhcp=true instance1
  # later revert (to the cluster default, or the OS default if not
  # defined at cluster level)
  $ gnt-instance modify -O -use_dhcp instance1

Internal storage
++++++++++++++++

Internally, the OS parameters will be stored in a new ``osparams``
attribute. The global parameters will be stored on the cluster object,
and the value of this attribute will be a dictionary indexed by OS name
(this also accepts an OS+variant name, which will override a simple OS
name, see below), with the values being key/value dictionaries. For the
instances, the value will be directly the key/value dictionary.

Overriding rules
++++++++++++++++

Any instance-specific parameters will override any variant-specific
parameters, which in turn will override any global parameters. The
global parameters, in turn, override the built-in defaults (of the OS
scripts).
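
A sketch of the resulting lookup order (the attribute layout and helper
name are illustrative)::

  def GetEffectiveOSParams(cluster_osparams, os_name, variant,
                           inst_osparams):
    """Merges OS parameters according to the overriding rules above.

    Later updates win: OS-wide cluster defaults are overridden by
    OS+variant cluster defaults, which are overridden by instance-level
    parameters.  Anything left unset simply falls back to the OS
    scripts' built-in defaults.
    """
    params = {}
    params.update(cluster_osparams.get(os_name, {}))
    if variant:
      params.update(cluster_osparams.get("%s+%s" % (os_name, variant), {}))
    params.update(inst_osparams)
    return params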


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: