=================
Ganeti 2.2 design
=================

This document describes the major changes in Ganeti 2.2 compared to
the 2.1 version.

The 2.2 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.1, in a timely fashion.

.. contents:: :depth: 4

As for 2.1 we divide the 2.2 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)


Core changes
============

Master Daemon Scaling improvements
----------------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently the Ganeti master daemon is based on four sets of threads:

- The main thread (1 thread) just accepts connections on the master
  socket
- The client worker pool (16 threads) handles those connections,
  one thread per connected socket, parses luxi requests, and sends data
  back to the clients
- The job queue worker pool (25 threads) executes the actual jobs
  submitted by the clients
- The rpc worker pool (10 threads) interacts with the nodes via
  http-based-rpc

This means that every masterd currently runs 52 threads to do its job.
Being able to reduce the number of thread sets would make the master's
architecture a lot simpler. Moreover, having fewer threads can help
decrease lock contention, log pollution and memory usage.
Also, with the current architecture, masterd suffers from quite a few
scalability issues:

Core daemon connection handling
+++++++++++++++++++++++++++++++

Since the 16 client worker threads handle one connection each, it's very
easy to exhaust them, by just connecting to masterd 16 times and not
sending any data. While we could perhaps make those pools resizable,
increasing the number of threads won't help with lock contention, nor
with better handling of long running operations, where we need to make
sure the client is informed that everything is proceeding and doesn't
need to time out.

Wait for job change
+++++++++++++++++++

The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client
thread block on its job for a relatively long time. This is another easy
way to exhaust the 16 client threads, and a place where clients often
time out. Moreover, this operation increases the job queue lock
contention (see below).

Job Queue lock
++++++++++++++

The job queue lock is quite heavily contended, and certain easily
reproducible workloads show that it's very easy to put masterd in
trouble: for example, running ~15 background instance reinstall jobs
results in a master daemon that, even without having exhausted the
client worker threads, can't answer simple job list requests, or
submit more jobs.

Currently the job queue lock is an exclusive non-fair lock protecting
the following job queue methods (called by the client workers):

  - AddNode
  - RemoveNode
  - SubmitJob
  - SubmitManyJobs
  - WaitForJobChanges
  - CancelJob
  - ArchiveJob
  - AutoArchiveJobs
  - QueryJobs
  - Shutdown

Moreover the job queue lock is acquired outside of the job queue in two
other classes:

  - jqueue._JobQueueWorker (in RunTask) before executing the opcode,
    after finishing its execution and when handling an exception.
  - jqueue._OpExecCallbacks (in NotifyStart and Feedback) when the
    processor (mcpu.Processor) is about to start working on the opcode
    (after acquiring the necessary locks) and when any data is sent back
    via the feedback function.

Of those the major critical points are:

  - Submit[Many]Job, QueryJobs, WaitForJobChanges, which can easily slow
    down and block client threads up to making the respective clients
    time out.
  - The code paths in NotifyStart, Feedback, and RunTask, which slow
    down job processing between clients and otherwise non-related jobs.

To increase the pain:

  - WaitForJobChanges is a bad offender because it's implemented with a
    notified condition which wakes waiting threads, which then try to
    acquire the global lock again
  - Many should-be-fast code paths are slowed down by replicating the
    change to remote nodes, and thus waiting, with the lock held, on
    remote rpcs to complete (starting, finishing, and submitting jobs)

Proposed changes
~~~~~~~~~~~~~~~~

In order to be able to interact with the master daemon even when it's
under heavy load, and to make it simpler to add core functionality
(such as an asynchronous rpc client) we propose three subsequent levels
of changes to the master core architecture.

After making this change we'll be able to re-evaluate the size of our
thread pool, if we see that we can make most threads in the client
worker pool always idle. In the future we should also investigate making
the rpc client asynchronous as well, so that we can make masterd a lot
smaller in number of threads, and memory size, and thus also easier to
understand, debug, and scale.

Connection handling
+++++++++++++++++++

We'll move the main thread of ganeti-masterd to asyncore, so that it can
share the mainloop code with all other Ganeti daemons. Then all luxi
clients will be asyncore clients, and I/O to/from them will be handled
by the master thread asynchronously. Data will be read from the client
sockets as it becomes available and kept in a buffer; when a complete
message is found, it's passed to a client worker thread for parsing and
processing. The client worker thread is responsible for serializing the
reply, which can then be sent asynchronously by the main thread on the
socket.
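
As a rough sketch of the intended flow (illustrative only, not the
actual masterd code; the end-of-message marker, the ``workerpool``
object and all names are assumptions made for the example, and
thread-safety of the reply path is glossed over)::

  import asyncore
  import socket

  EOM = "\3"  # hypothetical end-of-message marker

  class ClientHandler(asyncore.dispatcher_with_send):
    def __init__(self, sock, workerpool):
      asyncore.dispatcher_with_send.__init__(self, sock)
      self._buffer = ""
      self._workerpool = workerpool

    def handle_read(self):
      # Read whatever is available and keep it in a buffer
      self._buffer += self.recv(4096)
      # Hand every complete message over to a worker thread
      while EOM in self._buffer:
        (msg, self._buffer) = self._buffer.split(EOM, 1)
        self._workerpool.AddTask(msg, self._send_reply)

    def _send_reply(self, serialized_reply):
      # Called with the reply already serialized by the worker thread;
      # the dispatcher's output buffer takes care of the actual sending
      self.send(serialized_reply + EOM)

  class MasterServer(asyncore.dispatcher):
    def __init__(self, address, workerpool):
      asyncore.dispatcher.__init__(self)
      self._workerpool = workerpool
      self.create_socket(socket.AF_UNIX, socket.SOCK_STREAM)
      self.bind(address)
      self.listen(5)

    def handle_accept(self):
      connection = self.accept()
      if connection is not None:
        ClientHandler(connection[0], self._workerpool)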

Wait for job change
+++++++++++++++++++

The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be
subscription-based, so that the executing thread doesn't have to be
waiting for the changes to arrive. Threads producing messages (job queue
executors) will make sure that when there is a change another thread is
awakened and delivers it to the waiting clients. This can be either a
dedicated "wait for job changes" thread or pool, or one of the client
workers, depending on what's easier to implement. In either case the
main asyncore thread will only be involved in pushing of the actual
data, and not in fetching/serializing it.

Other features to look at, when implementing this code, are:

  - Possibility not to need the job lock to know which updates to push:
    if the thread producing the data pushes a copy of the update for the
    waiting clients, the thread sending it won't need to acquire the
    lock again to fetch the actual data (see the sketch below).
  - Possibility to signal clients that are about to time out, when no
    update has been received, not to despair and to keep waiting (luxi
    level keepalive).
  - Possibility to defer updates if they are too frequent, providing
    them at a maximum rate (lower priority).
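
A minimal sketch of the subscription idea (illustrative only; class and
function names are made up for the example and are not the actual
jqueue code)::

  import threading
  import Queue

  class JobChangeNotifier(object):
    def __init__(self):
      self._lock = threading.Lock()
      self._subscribers = {}  # client_id -> queue of pending updates

    def Subscribe(self, client_id):
      """Called by a client worker handling REQ_WAIT_FOR_JOB_CHANGE."""
      updates = Queue.Queue()
      self._lock.acquire()
      try:
        self._subscribers[client_id] = updates
      finally:
        self._lock.release()
      return updates

    def Publish(self, update):
      """Called by the job executor, which already holds the job lock.

      A copy of the update is pushed to every subscriber, so the thread
      delivering it later never needs the job (or queue) lock.

      """
      self._lock.acquire()
      try:
        queues = self._subscribers.values()
      finally:
        self._lock.release()
      for updates in queues:
        updates.put(update)

  def DeliverUpdates(updates, send_fn, keepalive_fn, timeout=30):
    """Run by the delivering thread; needs no job queue lock."""
    while True:
      try:
        update = updates.get(timeout=timeout)
      except Queue.Empty:
        # Nothing changed: tell the client to keep waiting instead of
        # timing out (luxi-level keepalive)
        keepalive_fn()
      else:
        send_fn(update)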

Job Queue lock
++++++++++++++

In order to decrease the job queue lock contention, we will change the
code paths in the following ways, initially:

  - A per-job lock will be introduced. All operations affecting only one
    job (for example feedback, starting/finishing notifications,
    subscribing to or watching a job) will only require the job lock.
    This should be a leaf lock, but if a situation arises in which it
    must be acquired together with the global job queue lock the global
    one must always be acquired last (for the global section).
  - The locks will be converted to a sharedlock. Any read-only operation
    will be able to proceed in parallel.
  - During remote update (which happens already per-job) we'll drop the
    job lock level to shared mode, so that activities reading the job
    (for example job change notifications or QueryJobs calls) will be
    able to proceed in parallel.
  - The wait for job changes improvements proposed above will be
    implemented.

In the future other improvements may include splitting off some of the
work (e.g. replication of a job to remote nodes) to a separate thread
pool or asynchronous thread, not tied with the code path for answering
client requests or the one executing the "real" work. This can be
discussed again after we have used the more granular job queue in
production and tested its benefits.
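
As an illustration of the third point (a sketch only: it assumes a
shared/exclusive lock object offering ``acquire(shared=...)``,
``downgrade()`` and ``release()``, and all other names are made up)::

  def ReplicateJobUpdate(job, queue):
    # Exclusive section: mutate the in-memory job object and serialize it
    job.lock.acquire(shared=0)
    try:
      serialized = job.Serialize()
      # Drop the lock level to shared mode for the slow part, so that
      # readers (job change notifications, QueryJobs) can proceed in
      # parallel while the rpc to the remote nodes is in flight
      job.lock.downgrade()
      queue.ReplicateJobToRemoteNodes(job.id, serialized)
    finally:
      job.lock.release()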


Remote procedure call timeouts
------------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The current RPC protocol used by Ganeti is based on HTTP. Every request
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``)
and doesn't return until the function called has returned. Parameters
and return values are encoded using JSON.

On the server side, ``ganeti-noded`` handles every incoming connection
in a separate process by forking just after accepting the connection.
This process exits after sending the response.

There is one major problem with this design: timeouts can not be used on
a per-request basis. Neither client nor server knows how long it will
take. Even if we might be able to group requests into different
categories (e.g. fast and slow), this is not reliable.

If a node has an issue or the network connection fails while a request
is being handled, the master daemon can wait for a long time for the
connection to time out (e.g. due to the operating system's underlying
TCP keep-alive packets or timeouts). While the settings for keep-alive
packets can be changed using Linux-specific socket options, we prefer to
use application-level timeouts because these cover both machine down and
unresponsive node daemon cases.

Proposed changes
~~~~~~~~~~~~~~~~

RPC glossary
++++++++++++

Function call ID
  Unique identifier returned by ``ganeti-noded`` after invoking a
  function.
Function process
  Process started by ``ganeti-noded`` to call actual (backend) function.

Protocol
++++++++

Initially we chose HTTP as our RPC protocol because there were existing
libraries, which, unfortunately, turned out to miss important features
(such as SSL certificate authentication) and we had to write our own.

This proposal can easily be implemented using HTTP, though it would
likely be more efficient and less complicated to use the LUXI protocol
already used to communicate between client tools and the Ganeti master
daemon. Switching to another protocol can occur at a later point. This
proposal should be implemented using HTTP as its underlying protocol.

The LUXI protocol currently contains two functions, ``WaitForJobChange``
and ``AutoArchiveJobs``, which can take a longer time. They both support
a parameter to specify the timeout. This timeout is usually chosen as
roughly half of the socket timeout, guaranteeing a response before the
socket times out. After the specified amount of time,
``AutoArchiveJobs`` returns and reports the number of archived jobs.
``WaitForJobChange`` returns and reports a timeout. In both cases, the
functions can be called again.

A similar model can be used for the inter-node RPC protocol. In some
sense, the node daemon will implement a light variant of *"node daemon
jobs"*. When the function call is sent, it specifies an initial timeout.
If the function didn't finish within this timeout, a response is sent
with a unique identifier, the function call ID. The client can then
choose to wait for the function to finish again with a timeout.
Inter-node RPC calls would no longer be blocking indefinitely and there
would be an implicit ping-mechanism.

Request handling
++++++++++++++++

To support the protocol changes described above, the way the node daemon
handles requests will have to change. Instead of forking and handling
every connection in a separate process, there should be one child
process per function call and the master process will handle the
communication with clients and the function processes using asynchronous
I/O.

Function processes communicate with the parent process via stdio and
possibly their exit status. Every function process has a unique
identifier, though it shouldn't be the process ID only (PIDs can be
recycled and are prone to race conditions for this use case). The
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid``
is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the
current Unix timestamp with decimal places and ``random`` at least 16
random bits.
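
A possible way to build such an identifier (a sketch only; the helper
name is made up)::

  import os
  import random
  import time

  def _NewFunctionCallId(child_pid):
    # ${ppid}:${cpid}:${time}:${random}
    return "%d:%d:%.6f:%d" % (os.getpid(), child_pid, time.time(),
                              random.SystemRandom().getrandbits(16))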

The following operations will be supported:

``StartFunction(fn_name, fn_args, timeout)``
  Starts a function specified by ``fn_name`` with arguments in
  ``fn_args`` and waits up to ``timeout`` seconds for the function
  to finish. Fire-and-forget calls can be made by specifying a timeout
  of 0 seconds (e.g. for powercycling the node). Returns three values:
  function call ID (if not finished), whether function finished (or
  timeout) and the function's return value.
``WaitForFunction(fnc_id, timeout)``
  Waits up to ``timeout`` seconds for function call to finish. Return
  value same as ``StartFunction``.

In the future, ``StartFunction`` could support an additional parameter
to specify after how long the function process should be aborted.
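
From the master daemon's point of view, a call could then look roughly
like this (a sketch only; the ``node.StartFunction``/``WaitForFunction``
wrappers are assumptions for the example)::

  def CallNodeFunction(node, fn_name, fn_args, step_timeout=10):
    (call_id, finished, result) = node.StartFunction(fn_name, fn_args,
                                                     step_timeout)
    while not finished:
      # Every iteration returns within step_timeout seconds, so a dead
      # node or a broken connection is noticed quickly (implicit ping)
      (call_id, finished, result) = node.WaitForFunction(call_id,
                                                         step_timeout)
    return result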

Simplified timing diagram::

  Master daemon        Node daemon                      Function process
   |
  Call function
  (timeout 10s) -----> Parse request and fork for ----> Start function
                       calling actual function, then     |
                       wait up to 10s for function to    |
                       finish                            |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
                        |                               Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
   |

.. TODO: Convert diagram above to graphviz/dot graphic

On process termination (e.g. after having been sent a ``SIGTERM`` or
``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all
function processes and wait for all of them to terminate.


Inter-cluster instance moves
----------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the current design of Ganeti, moving whole instances between
different clusters involves a lot of manual work. There are several ways
to move instances, one of them being to export the instance, manually
copying all data to the new cluster before importing it again. Manual
changes to the instance's configuration, such as the IP address, may be
necessary in the new environment. The goal is to improve and automate
this process in Ganeti 2.2.

Proposed changes
~~~~~~~~~~~~~~~~

Authorization, Authentication and Security
++++++++++++++++++++++++++++++++++++++++++

Until now, each Ganeti cluster was a self-contained entity and wouldn't
talk to other Ganeti clusters. Nodes within clusters only had to trust
the other nodes in the same cluster and the network used for replication
was trusted, too (hence the ability to use a separate, local network
for replication).

For inter-cluster instance transfers this model must be weakened. Nodes
in one cluster will have to talk to nodes in other clusters, sometimes
in other locations and, very importantly, via untrusted network
connections.

Various options have been considered for securing and authenticating the
data transfer from one machine to another. To reduce the risk of
accidentally overwriting data due to software bugs, authenticating the
arriving data was considered critical. Eventually we decided to use
socat's OpenSSL options (``OPENSSL:``, ``OPENSSL-LISTEN:`` et al), which
provide us with encryption, authentication and authorization when used
with separate keys and certificates.

Combinations of OpenSSH, GnuPG and Netcat were deemed too complex to set
up from within Ganeti. Any solution involving OpenSSH would require a
dedicated user with a home directory and likely automated modifications
to the user's ``$HOME/.ssh/authorized_keys`` file. When using Netcat,
GnuPG or another encryption method would be necessary to transfer the
data over an untrusted network. socat combines both in one program and
is already a dependency.

Each of the two clusters will have to generate an RSA key. The public
parts are exchanged between the clusters by a third party, such as an
administrator or a system interacting with Ganeti via the remote API
("third party" from here on). After receiving each other's public key,
the clusters can start talking to each other.

All encrypted connections must be verified on both sides. Neither side
may accept unverified certificates. The generated certificate should
only be valid for the time necessary to move the instance.

For additional protection of the instance data, the two clusters can
verify the certificates and destination information exchanged via the
third party by checking an HMAC signature using a key shared among the
involved clusters. By default this secret key will be a random string
unique to the cluster, generated by running SHA1 over 20 bytes read from
``/dev/urandom``, and the administrator must synchronize the secrets
between clusters before instances can be moved. If the third party does
not know the secret, it can't forge the certificates or redirect the
data. Unless disabled by a new cluster parameter, verifying the HMAC
signatures must be mandatory. The HMAC signature for X509 certificates
will be prepended to the certificate similar to an :rfc:`822` header and
only covers the certificate (from ``-----BEGIN CERTIFICATE-----`` to
``-----END CERTIFICATE-----``). The header name will be
``X-Ganeti-Signature`` and its value will have the format
``$salt/$hash`` (salt and hash separated by slash). The salt may only
contain characters in the range ``[a-zA-Z0-9]``.
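
The signing and verification could look roughly like this (a sketch
only: the helper names are made up and exactly how the salt enters the
HMAC is an implementation detail; here it is simply prepended to the
signed text)::

  import hashlib
  import hmac
  import re

  _SALT_RE = re.compile(r"^[a-zA-Z0-9]+$")

  def SignX509Certificate(pem, key, salt):
    if not _SALT_RE.match(salt):
      raise ValueError("Invalid salt")
    # The HMAC only covers the certificate itself
    hashed = hmac.new(key, salt + pem, hashlib.sha1).hexdigest()
    return "X-Ganeti-Signature: %s/%s\n%s" % (salt, hashed, pem)

  def VerifySignedX509Certificate(signed_pem, key):
    (header, pem) = signed_pem.split("\n", 1)
    (salt, hashed) = header.split(":", 1)[1].strip().split("/")
    expected = hmac.new(key, salt + pem, hashlib.sha1).hexdigest()
    return (hashed == expected, pem)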

On the web, the destination cluster would be equivalent to an HTTPS
server requiring verifiable client certificates. The browser would be
equivalent to the source cluster and must verify the server's
certificate while providing a client certificate to the server.

Copying data
++++++++++++

To simplify the implementation, we decided to operate at a block-device
level only, allowing us to easily support non-DRBD instance moves.

Intra-cluster instance moves will re-use the existing export and import
scripts supplied by instance OS definitions. Unlike simply copying the
raw data, this allows the use of filesystem-specific utilities to dump
only used parts of the disk and to exclude certain disks from the move.
Compression should be used to further reduce the amount of data
transferred.

The export script writes all data to stdout and the import script reads
it from stdin again. To avoid copying data and reduce disk space
consumption, everything is read from the disk and sent over the network
directly, where it'll be written to the new block device directly again.

Workflow
++++++++

#. Third party tells source cluster to shut down instance, asks for the
   instance specification and for the public part of an encryption key

   - Instance information can already be retrieved using an existing API
     (``OpQueryInstanceData``).
   - An RSA encryption key and a corresponding self-signed X509
     certificate are generated using the "openssl" command. This key
     will be used to encrypt the data sent to the destination cluster.

     - Private keys never leave the cluster.
     - The public part (the X509 certificate) is signed using HMAC with
       salting and a secret shared between Ganeti clusters.

#. Third party tells destination cluster to create an instance with the
   same specifications as on source cluster and to prepare for an
   instance move with the key received from the source cluster and
   receives the public part of the destination's encryption key

   - The current API to create instances (``OpCreateInstance``) will be
     extended to support an import from a remote cluster.
   - A valid, unexpired X509 certificate signed with the destination
     cluster's secret will be required. By verifying the signature, we
     know the third party didn't modify the certificate.

     - The private keys never leave their cluster, hence the third party
       can not decrypt or intercept the instance's data by modifying the
       IP address or port sent by the destination cluster.

   - The destination cluster generates another key and certificate,
     signs and sends it to the third party, who will have to pass it to
     the API for exporting an instance (``OpExportInstance``). This
     certificate is used to ensure we're sending the disk data to the
     correct destination cluster.
   - Once a disk can be imported, the API sends the destination
     information (IP address and TCP port) together with an HMAC
     signature to the third party.

#. Third party hands public part of the destination's encryption key
   together with all necessary information to source cluster and tells
   it to start the move

   - The existing API for exporting instances (``OpExportInstance``)
     will be extended to export instances to remote clusters.

#. Source cluster connects to destination cluster for each disk and
   transfers its data using the instance OS definition's export and
   import scripts

   - Before starting, the source cluster must verify the HMAC signature
     of the certificate and destination information (IP address and TCP
     port).
   - When connecting to the remote machine, strong certificate checks
     must be employed.

#. Due to the asynchronous nature of the whole process, the destination
   cluster checks whether all disks have been transferred every time
   after transferring a single disk; if so, it destroys the encryption
   key
#. After sending all disks, the source cluster destroys its key
#. Destination cluster runs OS definition's rename script to adjust
   instance settings if needed (e.g. IP address)
#. Destination cluster starts the instance if requested at the beginning
   by the third party
#. Source cluster removes the instance if requested

    
510
Instance move in pseudo code
511
++++++++++++++++++++++++++++
512

    
513
.. highlight:: python
514

    
515
The following pseudo code describes a script moving instances between
516
clusters and what happens on both clusters.
517

    
518
#. Script is started, gets the instance name and destination cluster::
519

    
520
    (instance_name, dest_cluster_name) = sys.argv[1:]
521

    
522
    # Get destination cluster object
523
    dest_cluster = db.FindCluster(dest_cluster_name)
524

    
525
    # Use database to find source cluster
526
    src_cluster = db.FindClusterByInstance(instance_name)
527

    
528
#. Script tells source cluster to stop instance::
529

    
530
    # Stop instance
531
    src_cluster.StopInstance(instance_name)
532

    
533
    # Get instance specification (memory, disk, etc.)
534
    inst_spec = src_cluster.GetInstanceInfo(instance_name)
535

    
536
    (src_key_name, src_cert) = src_cluster.CreateX509Certificate()
537

    
538
#. ``CreateX509Certificate`` on source cluster::
539

    
540
    key_file = mkstemp()
541
    cert_file = "%s.cert" % key_file
542
    RunCmd(["/usr/bin/openssl", "req", "-new",
543
             "-newkey", "rsa:1024", "-days", "1",
544
             "-nodes", "-x509", "-batch",
545
             "-keyout", key_file, "-out", cert_file])
546

    
547
    plain_cert = utils.ReadFile(cert_file)
548

    
549
    # HMAC sign using secret key, this adds a "X-Ganeti-Signature"
550
    # header to the beginning of the certificate
551
    signed_cert = utils.SignX509Certificate(plain_cert,
552
      utils.ReadFile(constants.X509_SIGNKEY_FILE))
553

    
554
    # The certificate now looks like the following:
555
    #
556
    #   X-Ganeti-Signature: $1234$28676f0516c6ab68062b[…]
557
    #   -----BEGIN CERTIFICATE-----
558
    #   MIICsDCCAhmgAwIBAgI[…]
559
    #   -----END CERTIFICATE-----
560

    
561
    # Return name of key file and signed certificate in PEM format
562
    return (os.path.basename(key_file), signed_cert)
563

    
564
#. Script creates instance on destination cluster and waits for move to
565
   finish::
566

    
567
    dest_cluster.CreateInstance(mode=constants.REMOTE_IMPORT,
568
                                spec=inst_spec,
569
                                source_cert=src_cert)
570

    
571
    # Wait until destination cluster gives us its certificate
572
    dest_cert = None
573
    disk_info = []
574
    while not (dest_cert and len(disk_info) < len(inst_spec.disks)):
575
      tmp = dest_cluster.WaitOutput()
576
      if tmp is Certificate:
577
        dest_cert = tmp
578
      elif tmp is DiskInfo:
579
        # DiskInfo contains destination address and port
580
        disk_info[tmp.index] = tmp
581

    
582
    # Tell source cluster to export disks
583
    for disk in disk_info:
584
      src_cluster.ExportDisk(instance_name, disk=disk,
585
                             key_name=src_key_name,
586
                             dest_cert=dest_cert)
587

    
588
    print ("Instance %s sucessfully moved to %s" %
589
           (instance_name, dest_cluster.name))
590

    
591
#. ``CreateInstance`` on destination cluster::
592

    
593
    # …
594

    
595
    if mode == constants.REMOTE_IMPORT:
596
      # Make sure certificate was not modified since it was generated by
597
      # source cluster (which must use the same secret)
598
      if (not utils.VerifySignedX509Cert(source_cert,
599
            utils.ReadFile(constants.X509_SIGNKEY_FILE))):
600
        raise Error("Certificate not signed with this cluster's secret")
601

    
602
      if utils.CheckExpiredX509Cert(source_cert):
603
        raise Error("X509 certificate is expired")
604

    
605
      source_cert_file = utils.WriteTempFile(source_cert)
606

    
607
      # See above for X509 certificate generation and signing
608
      (key_name, signed_cert) = CreateSignedX509Certificate()
609

    
610
      SendToClient("x509-cert", signed_cert)
611

    
612
      for disk in instance.disks:
613
        # Start socat
614
        RunCmd(("socat"
615
                " OPENSSL-LISTEN:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
616
                " stdout > /dev/disk…") %
617
               port, GetRsaKeyPath(key_name, private=True),
618
               GetRsaKeyPath(key_name, private=False), src_cert_file)
619
        SendToClient("send-disk-to", disk, ip_address, port)
620

    
621
      DestroyX509Cert(key_name)
622

    
623
      RunRenameScript(instance_name)
624

    
625
#. ``ExportDisk`` on source cluster::
626

    
627
    # Make sure certificate was not modified since it was generated by
628
    # destination cluster (which must use the same secret)
629
    if (not utils.VerifySignedX509Cert(cert_pem,
630
          utils.ReadFile(constants.X509_SIGNKEY_FILE))):
631
      raise Error("Certificate not signed with this cluster's secret")
632

    
633
    if utils.CheckExpiredX509Cert(cert_pem):
634
      raise Error("X509 certificate is expired")
635

    
636
    dest_cert_file = utils.WriteTempFile(cert_pem)
637

    
638
    # Start socat
639
    RunCmd(("socat stdin"
640
            " OPENSSL:%s:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
641
            " < /dev/disk…") %
642
           disk.host, disk.port,
643
           GetRsaKeyPath(key_name, private=True),
644
           GetRsaKeyPath(key_name, private=False), dest_cert_file)
645

    
646
    if instance.all_disks_done:
647
      DestroyX509Cert(key_name)
648

    
649
.. highlight:: text

Miscellaneous notes
+++++++++++++++++++

- A very similar system could also be used for instance exports within
  the same cluster. Currently OpenSSH is being used, but could be
  replaced by socat and SSL/TLS.
- During the design of inter-cluster instance moves we also discussed
  encrypting instance exports using GnuPG.
- While most instances should have exactly the same configuration as
  on the source cluster, setting them up with a different disk layout
  might be helpful in some use-cases.
- A cleanup operation, similar to the one available for failed instance
  migrations, should be provided.
- ``ganeti-watcher`` should remove instances pending a move from another
  cluster after a certain amount of time. This takes care of failures
  somewhere in the process.
- RSA keys can be generated using the existing
  ``bootstrap.GenerateSelfSignedSslCert`` function, though it might be
  useful to not write both parts into a single file, requiring small
  changes to the function. The public part always starts with
  ``-----BEGIN CERTIFICATE-----`` and ends with ``-----END
  CERTIFICATE-----``.
- The source and destination cluster might be different when it comes
  to available hypervisors, kernels, etc. The destination cluster should
  refuse to accept an instance move if it can't fulfill an instance's
  requirements.


Privilege separation
--------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All Ganeti daemons are run under the user root. This is not ideal from a
security perspective, as a successful exploit of any daemon gives the
attacker full access to the system.

In order to overcome this situation we'll allow Ganeti to run its
daemons under different users and a dedicated group. This will also have
some useful side effects, such as letting users in the same group run
some ``gnt-*`` commands.

Implementation
~~~~~~~~~~~~~~

For Ganeti 2.2 the implementation will be focused on the RAPI daemon
only. This involves changes to ``daemons.py`` so it's possible to drop
privileges when daemonizing the process. This will, however, be a
short-term solution, to be replaced in Ganeti 2.3 by dropping privileges
already at daemon startup.

It also needs changes in the master daemon to create the socket with new
permissions/owners to allow RAPI access. There will be no other
permission/owner changes in the file structure, as the RAPI daemon is
started with root permissions. During startup it will read all needed
files and then drop privileges before contacting the master daemon.
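
A minimal sketch of the privilege drop (illustrative only, not the
actual ``daemons.py`` code; the user and group names are assumptions)::

  import grp
  import os
  import pwd

  def DropPrivileges(username="ganeti-rapi", groupname="ganeti"):
    if os.getuid() != 0:
      return
    uid = pwd.getpwnam(username).pw_uid
    gid = grp.getgrnam(groupname).gr_gid
    # Order matters: give up the supplementary groups and the gid while
    # we still have the rights to do so, then give up the uid
    os.setgroups([gid])
    os.setgid(gid)
    os.setuid(uid)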


Feature changes
===============

KVM Security
------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all kvm processes run as root. Taking ownership of the
hypervisor process, from inside a virtual machine, would mean a full
compromise of the whole Ganeti cluster, knowledge of all Ganeti
authentication secrets, full access to all running instances, and the
option of subverting other basic services on the cluster (e.g. ssh).

Proposed changes
~~~~~~~~~~~~~~~~

We would like to decrease the attack surface available if a hypervisor
is compromised. We can do so by adding different features to Ganeti
which restrict what a broken hypervisor can do to subvert the node, in
the absence of a local privilege escalation attack.

Dropping privileges in kvm to a single user (easy)
++++++++++++++++++++++++++++++++++++++++++++++++++

By passing the ``-runas`` option to kvm, we can make it drop privileges.
The user can be chosen by a hypervisor parameter, so that each instance
can have its own user, but by default they will all run under the same
one. It should be very easy to implement, and can easily be backported
to 2.1.X.
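
As a rough illustration (the hypervisor parameter name and the helper
are made up for the example, not the final interface)::

  def _AddSecurityOptions(kvm_cmd, hvparams):
    username = hvparams.get("security_user")
    if username:
      # Make the kvm process drop root privileges right after startup
      kvm_cmd.extend(["-runas", username])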

This mode protects the Ganeti cluster from a subverted hypervisor, but
doesn't protect the instances from each other, unless care is taken to
specify a different user for each. This would prevent the worst attacks,
including:

- logging in to other nodes
- administering the Ganeti cluster
- subverting other services

But the following would remain an option:

- terminate other VMs (but not start them again, as that requires root
  privileges to set up networking) (unless different users are used)
- trace other VMs, and probably subvert them and access their data
  (unless different users are used)
- send network traffic from the node
- read unprotected data on the node filesystem

Running kvm in a chroot (slightly harder)
+++++++++++++++++++++++++++++++++++++++++

By passing the ``-chroot`` option to kvm, we can restrict the kvm
process in its own (possibly empty) root directory. We need to set this
area up so that the instance disks and control sockets are accessible,
so it would require slightly more work at the Ganeti level.

Breaking out of the VM into a chrooted kvm process would mean:

- far fewer options to find a local privilege escalation vector
- the impossibility to write local data, if the chroot is set up
  correctly
- the impossibility to read filesystem data on the host

It would still be possible though to:

- terminate other VMs
- trace other VMs, and possibly subvert them (if a tracer can be
  installed in the chroot)
- send network traffic from the node

    
783

    
784
Running kvm with a pool of users (slightly harder)
785
++++++++++++++++++++++++++++++++++++++++++++++++++
786

    
787
If rather than passing a single user as an hypervisor parameter, we have
788
a pool of useable ones, we can dynamically choose a free one to use and
789
thus guarantee that each machine will be separate from the others,
790
without putting the burden of this on the cluster administrator.
791

    
792
This would mean interfering between machines would be impossible, and
793
can still be combined with the chroot benefits.
794

    
795
Running iptables rules to limit network interaction (easy)
796
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
797

    
798
These don't need to be handled by Ganeti, but we can ship examples. If
799
the users used to run VMs would be blocked from sending some or all
800
network traffic, it would become impossible for a broken into hypervisor
801
to send arbitrary data on the node network, which is especially useful
802
when the instance and the node network are separated (using ganeti-nbma
803
or a separate set of network interfaces), or when a separate replication
804
network is maintained. We need to experiment to see how much restriction
805
we can properly apply, without limiting the instance legitimate traffic.
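
An example of the kind of rule we could ship (illustrative only; the
user name is arbitrary and the exact rule set needs the experimentation
mentioned above)::

  # Drop any IP traffic generated directly by the account used to run
  # the kvm processes.  The instances' own traffic is not affected, as
  # it leaves through the tap interfaces rather than through sockets
  # owned by this user.
  iptables -A OUTPUT -m owner --uid-owner ganeti-kvm -j DROP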


Running kvm inside a container (even harder)
++++++++++++++++++++++++++++++++++++++++++++

Recent Linux kernels support different process namespaces through
control groups. PIDs, users, filesystems and even network interfaces can
be separated. If we can set up Ganeti to run kvm in a separate container
we could insulate all the host processes from even being visible if the
hypervisor gets broken into. Most probably separating the network
namespace would require one extra hop in the host, through a veth
interface, thus reducing performance, so we may want to avoid that, and
just rely on iptables.

Implementation plan
~~~~~~~~~~~~~~~~~~~

We will first implement dropping privileges for kvm processes as a
single user, and most probably backport it to 2.1. Then we'll ship
example iptables rules to show how the user can be limited in its
network activities. After that we'll implement chroot restriction for
kvm processes, and extend the user limitation to use a user pool.

Finally we'll look into namespaces and containers, although that might
slip after the 2.2 release.


External interface changes
==========================


OS API
------

The OS variants implementation in Ganeti 2.1 didn't prove to be useful
enough to alleviate the need to hack around the Ganeti API in order to
provide flexible OS parameters.

As such, for Ganeti 2.2 we will provide support for arbitrary OS
parameters. However, since OSes are not registered in Ganeti, but
instead discovered at runtime, the interface is not entirely
straightforward.

Furthermore, to support the system administrator in keeping OSes
properly in sync across the nodes of a cluster, Ganeti will also verify
the consistency of a new ``os_version`` file (if it exists).

These changes to the OS API will bump the API version to 20.


OS version
~~~~~~~~~~

A new ``os_version`` file will be supported by Ganeti. This file is not
required, but if existing, its contents will be checked for consistency
across nodes. The file should hold only one line of text (any extra data
will be discarded), and its contents will be shown in the OS information
and diagnose commands.

It is recommended that OS authors update the contents of this file for
any changes; at a minimum, modifications that change the behaviour of
import/export scripts must increase the version, since they break
intra-cluster migration.

Parameters
~~~~~~~~~~

The interface between Ganeti and the OS scripts will be based on
environment variables, and as such the parameters and their values will
need to be valid in this context.

Names
+++++

The parameter names will be declared in a new file, ``parameters.list``,
together with a one-line documentation (whitespace-separated). Example::

  $ cat parameters.list
  ns1    Specifies the first name server to add to /etc/resolv.conf
  extra_packages  Specifies additional packages to install
  rootfs_size     Specifies the root filesystem size (the rest will be left unallocated)
  track  Specifies the distribution track, one of 'stable', 'testing' or 'unstable'

As seen above, the documentation can be separated from the names via
multiple spaces/tabs.

The parameter names as read from the file will be used for the command
line interface in lowercased form; as such, there shouldn't be any two
parameters which differ in case only.

Values
++++++

The values of the parameters are, from Ganeti's point of view,
completely freeform. If a given parameter has, from the OS' point of
view, a fixed set of valid values, these should be documented as such
and verified by the OS, but Ganeti will not handle such parameters
specially.

An empty value must be handled identically as a missing parameter. In
other words, the validation script should only test for non-empty
values, and not for declared versus undeclared parameters.

Furthermore, each parameter should have an (internal to the OS) default
value, that will be used if not passed from Ganeti. More precisely, it
should be possible for any parameter to specify a value that will have
the same effect as not passing the parameter, and in no case should
the absence of a parameter be treated as an exceptional case (outside
the value space).


Environment variables
^^^^^^^^^^^^^^^^^^^^^

The parameters will be exposed in the environment upper-cased and
prefixed with the string ``OSP_``. For example, a parameter declared in
the ``parameters.list`` file as ``ns1`` will appear in the environment
as the variable ``OSP_NS1``.
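
A small illustration of the intended mapping (not the final
implementation; the helper name is made up)::

  def _OSParamsToEnv(osparams):
    return dict(("OSP_%s" % name.upper(), str(value))
                for (name, value) in osparams.items())

  # _OSParamsToEnv({"ns1": "192.0.2.1"}) == {"OSP_NS1": "192.0.2.1"}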

Validation
++++++++++

For the purpose of parameter name/value validation, the OS scripts
*must* provide an additional script, named ``verify``. This script will
be called with the argument ``parameters``, and all the parameters will
be passed in via environment variables, as described above.

The script should signify success/failure based on its exit code, and
show explanatory messages either on its standard output or standard
error. These messages will be passed on to the master, and stored in
the OpCode result/error message.

The parameters must be constructed to be independent of the instance
specifications. In general, the validation script will only be called
with the parameter variables set, but not with the normal per-instance
variables, in order for Ganeti to be able to validate default parameters
too, when they change. Validation will only be performed on one cluster
node, and it will be up to the Ganeti administrator to keep the OS
scripts in sync between all nodes.

Instance operations
+++++++++++++++++++

The parameters will be passed, as described above, to all the other
instance operations (creation, import, export). Ideally, these scripts
will not abort with parameter validation errors, if the ``verify``
script has verified them correctly.

Note: when changing an instance's OS type, any OS parameters defined at
instance level will be kept as-is. If the parameters differ between the
new and the old OS, the user should manually remove/update them as
needed.

Declaration and modification
++++++++++++++++++++++++++++

Since the OSes are not registered in Ganeti, we will only make a 'weak'
link between the parameters as declared in Ganeti and the actual OSes
existing on the cluster.

It will be possible to declare parameters either globally, per cluster
(where they are indexed per OS/variant), or individually, per
instance. The declaration of parameters will not be tied to currently
existing OSes. When specifying a parameter, if the OS exists, it will be
validated; if not, then it will simply be stored as-is.

A special note is that it will not be possible to 'unset' at instance
level a parameter that is declared globally. Instead, at instance level
the parameter should be given an explicit value, or the default value as
explained above.

CLI interface
+++++++++++++

The modification of global (default) parameters will be done via the
``gnt-os`` command, and the per-instance parameters via the
``gnt-instance`` command. Both these commands will take an additional
``--os-parameters`` or ``-O`` flag that specifies the parameters in the
familiar comma-separated, key=value format. For removing a parameter, a
``-key`` syntax will be used, e.g.::

  # initial modification
  $ gnt-instance modify -O use_dhcp=true instance1
  # later revert (to the cluster default, or the OS default if not
  # defined at cluster level)
  $ gnt-instance modify -O -use_dhcp instance1

Internal storage
++++++++++++++++

Internally, the OS parameters will be stored in a new ``osparams``
attribute. The global parameters will be stored on the cluster object,
and the value of this attribute will be a dictionary indexed by OS name
(this also accepts an OS+variant name, which will override a simple OS
name, see below), with the values being the name/value dictionaries of
parameters. For the instances, the value will be directly the name/value
dictionary.
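
For example (illustrative values only; the OS and variant names are made
up)::

  cluster.osparams = {
    "debootstrap": {"ns1": "192.0.2.1", "track": "stable"},
    # an OS+variant key overrides the plain OS name for that variant
    "debootstrap+secure": {"track": "testing"},
    }

  instance.osparams = {"rootfs_size": "10G"}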

Overriding rules
++++++++++++++++

Any instance-specific parameters will override any variant-specific
parameters, which in turn will override any global parameters. The
global parameters, in turn, override the built-in defaults (of the OS
scripts).
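
A sketch of the resulting resolution order (illustrative only; the
helper name is made up)::

  def GetOsParams(cluster, instance, os_name, variant):
    params = {}  # built-in OS defaults apply to any name left unset
    params.update(cluster.osparams.get(os_name, {}))
    params.update(cluster.osparams.get("%s+%s" % (os_name, variant), {}))
    params.update(instance.osparams)
    return params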


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: