=================
Ganeti 2.2 design
=================

This document describes the major changes in Ganeti 2.2 compared to
the 2.1 version.

The 2.2 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.1, in a timely fashion.

.. contents:: :depth: 4

As for 2.1 we divide the 2.2 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)


Core changes
============

Master Daemon Scaling improvements
----------------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently the Ganeti master daemon is based on four sets of threads:

- The main thread (1 thread) just accepts connections on the master
  socket
- The client worker pool (16 threads) handles those connections,
  one thread per connected socket, parses luxi requests, and sends data
  back to the clients
- The job queue worker pool (25 threads) executes the actual jobs
  submitted by the clients
- The rpc worker pool (10 threads) interacts with the nodes via
  http-based-rpc

This means that every masterd currently runs 52 threads to do its job.
Being able to reduce the number of thread sets would make the master's
architecture a lot simpler. Moreover, having fewer threads can help
decrease lock contention, log pollution and memory usage.
Also, with the current architecture, masterd suffers from quite a few
scalability issues:

Core daemon connection handling
+++++++++++++++++++++++++++++++

Since the 16 client worker threads handle one connection each, it's very
easy to exhaust them by just connecting to masterd 16 times and not
sending any data. While we could perhaps make those pools resizable,
increasing the number of threads won't help with lock contention, nor
with handling long-running operations better, i.e. making sure the
client is informed that everything is proceeding and doesn't need to
time out.

Wait for job change
+++++++++++++++++++

The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client
thread block on its job for a relatively long time. This is another easy
way to exhaust the 16 client threads, and a place where clients often
time out; moreover, this operation adds to the job queue lock contention
(see below).

Job Queue lock
++++++++++++++

The job queue lock is quite heavily contended, and certain easily
reproducible workloads show that it's very easy to put masterd in
trouble: for example, running ~15 background instance reinstall jobs
results in a master daemon that, even without having exhausted the
client worker threads, can't answer simple job list requests or
submit more jobs.

Currently the job queue lock is an exclusive non-fair lock insulating
the following job queue methods (called by the client workers):

  - AddNode
  - RemoveNode
  - SubmitJob
  - SubmitManyJobs
  - WaitForJobChanges
  - CancelJob
  - ArchiveJob
  - AutoArchiveJobs
  - QueryJobs
  - Shutdown

Moreover the job queue lock is acquired outside of the job queue in two
other classes:

  - jqueue._JobQueueWorker (in RunTask) before executing the opcode, after
    finishing its execution and when handling an exception.
  - jqueue._OpExecCallbacks (in NotifyStart and Feedback) when the
    processor (mcpu.Processor) is about to start working on the opcode
    (after acquiring the necessary locks) and when any data is sent back
    via the feedback function.

Of those, the major critical points are:

  - Submit[Many]Job, QueryJobs, WaitForJobChanges, which can easily slow
    down and block client threads up to making the respective clients
    time out.
  - The code paths in NotifyStart, Feedback, and RunTask, which slow
    down job processing between clients and otherwise non-related jobs.

To increase the pain:

  - WaitForJobChanges is a bad offender because it's implemented with a
    notified condition which wakes waiting threads, which then try to
    acquire the global lock again
  - Many should-be-fast code paths are slowed down by replicating the
    change to remote nodes, and thus waiting, with the lock held, on
    remote rpcs to complete (starting, finishing, and submitting jobs)

Proposed changes
~~~~~~~~~~~~~~~~

In order to be able to interact with the master daemon even when it's
under heavy load, and to make it simpler to add core functionality
(such as an asynchronous rpc client) we propose three subsequent levels
of changes to the master core architecture.

After making this change we'll be able to re-evaluate the size of our
thread pool, if we see that we can make most threads in the client
worker pool always idle. In the future we should also investigate making
the rpc client asynchronous as well, so that we can make masterd a lot
smaller in number of threads, and memory size, and thus also easier to
understand, debug, and scale.

Connection handling
+++++++++++++++++++

We'll move the main thread of ganeti-masterd to asyncore, so that it can
share the mainloop code with all other Ganeti daemons. Then all luxi
clients will be asyncore clients, and I/O to/from them will be handled
by the master thread asynchronously. Data will be read from the client
sockets as it becomes available, and kept in a buffer, then when a
complete message is found, it's passed to a client worker thread for
parsing and processing. The client worker thread is responsible for
serializing the reply, which can then be sent asynchronously by the main
thread on the socket.
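
A minimal sketch of this dispatch model, using the standard ``asyncore``
module, is shown below; the ``LuxiConnection`` class, the worker pool
interface and the use of ``\3`` as message terminator are illustrative
assumptions, not the actual Ganeti code::

  import asyncore

  EOM = "\3"  # assumed end-of-message marker

  class LuxiConnection(asyncore.dispatcher):
    """One connected luxi client, driven by the single asyncore loop."""

    def __init__(self, sock, workerpool):
      asyncore.dispatcher.__init__(self, sock)
      self.workerpool = workerpool
      self.inbuf = ""
      self.outbuf = ""

    def handle_read(self):
      # Buffer incoming data; hand off only complete messages
      self.inbuf += self.recv(4096)
      while EOM in self.inbuf:
        msg, self.inbuf = self.inbuf.split(EOM, 1)
        # Parsing and processing happen in a client worker thread
        self.workerpool.AddTask(self, msg)

    def writable(self):
      return bool(self.outbuf)

    def handle_write(self):
      sent = self.send(self.outbuf)
      self.outbuf = self.outbuf[sent:]

    def SendReply(self, serialized_reply):
      # Called by the worker thread with the already-serialized reply; a
      # real implementation would also have to guard this buffer and
      # wake up the select() loop, which is omitted here
      self.outbuf += serialized_reply + EOM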

Wait for job change
+++++++++++++++++++

The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be
subscription-based, so that the executing thread doesn't have to be
waiting for the changes to arrive. Threads producing messages (job queue
executors) will make sure that when there is a change another thread is
awakened and delivers it to the waiting clients. This can be either a
dedicated "wait for job changes" thread or pool, or one of the client
workers, depending on what's easier to implement. In either case the
main asyncore thread will only be involved in pushing the actual
data, and not in fetching/serializing it.
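
A minimal sketch of such a subscription mechanism is shown below; the
names are illustrative and do not correspond to the actual Ganeti code.
The producer pushes a copy of each update into per-subscriber queues, so
the delivering thread never has to touch the job queue lock::

  import threading
  import Queue

  class JobChangeNotifier(object):
    """Fans job changes out to subscribed client connections."""

    def __init__(self):
      self._lock = threading.Lock()
      self._subscribers = {}  # job_id -> list of Queue.Queue

    def Subscribe(self, job_id):
      q = Queue.Queue()
      with self._lock:
        self._subscribers.setdefault(job_id, []).append(q)
      return q

    def Publish(self, job_id, update):
      # Called by the job executors whenever a job changes; "update" is
      # already a serialized copy, so consumers don't need to acquire
      # any lock to fetch the actual data
      with self._lock:
        queues = list(self._subscribers.get(job_id, []))
      for q in queues:
        q.put(update)

  def DeliverChanges(notifier, job_id, send_fn, keepalive_interval=50):
    """Waits for updates and pushes them to one waiting client."""
    q = notifier.Subscribe(job_id)
    while True:
      try:
        update = q.get(timeout=keepalive_interval)
      except Queue.Empty:
        # No update for a while: tell the client to keep waiting
        # (the luxi-level keepalive mentioned below)
        send_fn({"keepalive": True})
        continue
      send_fn(update)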

Other features to look at when implementing this code are:

  - Possibility not to need the job lock to know which updates to push:
    if the thread producing the data pushes a copy of the update for the
    waiting clients, the thread sending it won't need to acquire the
    lock again to fetch the actual data.
  - Possibility to signal clients that are about to time out, when no
    update has been received, that they should keep waiting (a
    luxi-level keepalive).
  - Possibility to defer updates if they are too frequent, providing
    them at a maximum rate (lower priority).

Job Queue lock
++++++++++++++

In order to decrease the job queue lock contention, we will change the
code paths in the following ways, initially:

  - A per-job lock will be introduced. All operations affecting only one
    job (for example feedback, starting/finishing notifications,
    subscribing to or watching a job) will only require the job lock.
    This should be a leaf lock, but if a situation arises in which it
    must be acquired together with the global job queue lock the global
    one must always be acquired last (for the global section).
  - The locks will be converted to a sharedlock. Any read-only operation
    will be able to proceed in parallel.
  - During remote update (which happens already per-job) we'll drop the
    job lock level to shared mode, so that activities reading the lock
    (for example job change notifications or QueryJobs calls) will be
    able to proceed in parallel.
  - The wait for job changes improvements proposed above will be
    implemented.

In the future other improvements may include splitting off some of the
work (e.g. replication of a job to remote nodes) to a separate thread
pool or asynchronous thread, not tied to the code path for answering
client requests or the one executing the "real" work. This can be
discussed again after we have used the more granular job queue in
production and tested its benefits.
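
As an illustration of the per-job lock idea described above, the sketch
below uses a plain ``threading.Lock`` as a stand-in for the per-job
(shared) lock; all names, including the replication callback, are
hypothetical::

  import threading

  class QueuedJob(object):
    def __init__(self, job_id):
      self.id = job_id
      self.log = []
      # Per-job leaf lock: sufficient for feedback, start/finish
      # notifications and per-job queries; the global queue lock is
      # only needed for whole-queue operations (submit, archive, ...)
      self.lock = threading.Lock()

  def AppendFeedback(job, message, replicate_fn):
    # Serialize the change under the per-job lock only, so unrelated
    # jobs and whole-queue operations are not blocked
    with job.lock:
      job.log.append(message)
      serialized = list(job.log)
    # The remote replication runs outside the exclusive section; the
    # design above goes further and holds a *shared* job lock here, so
    # concurrent readers of this very job are not blocked either
    replicate_fn(job.id, serialized)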


Inter-cluster instance moves
----------------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the current design of Ganeti, moving whole instances between
different clusters involves a lot of manual work. There are several ways
to move instances, one of them being to export the instance, manually
copying all data to the new cluster before importing it again. Manual
changes to the instance's configuration, such as the IP address, may be
necessary in the new environment. The goal is to improve and automate
this process in Ganeti 2.2.

Proposed changes
~~~~~~~~~~~~~~~~

Authorization, Authentication and Security
++++++++++++++++++++++++++++++++++++++++++

Until now, each Ganeti cluster was a self-contained entity and wouldn't
talk to other Ganeti clusters. Nodes within clusters only had to trust
the other nodes in the same cluster and the network used for replication
was trusted, too (hence the ability to use a separate, local network
for replication).

For inter-cluster instance transfers this model must be weakened. Nodes
in one cluster will have to talk to nodes in other clusters, sometimes
in other locations and, most importantly, via untrusted network
connections.

Various options have been considered for securing and authenticating the
data transfer from one machine to another. To reduce the risk of
accidentally overwriting data due to software bugs, authenticating the
arriving data was considered critical. Eventually we decided to use
socat's OpenSSL options (``OPENSSL:``, ``OPENSSL-LISTEN:`` et al), which
provide us with encryption, authentication and authorization when used
with separate keys and certificates.

Combinations of OpenSSH, GnuPG and Netcat were deemed too complex to set
up from within Ganeti. Any solution involving OpenSSH would require a
dedicated user with a home directory and likely automated modifications
to the user's ``$HOME/.ssh/authorized_keys`` file. When using Netcat,
GnuPG or another encryption method would be necessary to transfer the
data over an untrusted network. socat combines both in one program and
is already a dependency.

Each of the two clusters will have to generate an RSA key. The public
parts are exchanged between the clusters by a third party, such as an
administrator or a system interacting with Ganeti via the remote API
("third party" from here on). After receiving each other's public key,
the clusters can start talking to each other.

All encrypted connections must be verified on both sides. Neither side
may accept unverified certificates. The generated certificate should
only be valid for the time necessary to move the instance.

For additional protection of the instance data, the two clusters can
verify the certificates and destination information exchanged via the
third party by checking an HMAC signature using a key shared among the
involved clusters. By default this secret key will be a random string
unique to the cluster, generated by running SHA1 over 20 bytes read from
``/dev/urandom``, and the administrator must synchronize the secrets
between clusters before instances can be moved. If the third party does
not know the secret, it can't forge the certificates or redirect the
data. Unless disabled by a new cluster parameter, verifying the HMAC
signatures must be mandatory. The HMAC signature for X509 certificates
will be prepended to the certificate similarly to an :rfc:`822` header
and only covers the certificate (from ``-----BEGIN CERTIFICATE-----`` to
``-----END CERTIFICATE-----``). The header name will be
``X-Ganeti-Signature`` and its value will have the format
``$salt/$hash`` (salt and hash separated by slash). The salt may only
contain characters in the range ``[a-zA-Z0-9]``.
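
A small sketch of this signing scheme, using Python's standard ``hmac``
and ``hashlib`` modules, is given below; the exact way the salt enters
the HMAC computation is an implementation detail not fixed by this
design, and the helper names are illustrative only::

  import hmac
  import hashlib

  _HEADER = "X-Ganeti-Signature"

  def SignCertificate(cert_pem, secret, salt):
    """Prepend an HMAC signature header to a PEM certificate."""
    if not salt.isalnum():
      raise ValueError("Invalid salt")
    digest = hmac.new(secret, salt + cert_pem, hashlib.sha1).hexdigest()
    return "%s: %s/%s\n%s" % (_HEADER, salt, digest, cert_pem)

  def VerifyCertificate(signed_pem, secret):
    """Check the signature header and return the bare PEM text."""
    header, _, cert_pem = signed_pem.partition("\n")
    name, _, value = header.partition(": ")
    salt, _, digest = value.partition("/")
    expected = hmac.new(secret, salt + cert_pem, hashlib.sha1).hexdigest()
    # A real implementation should use a constant-time comparison here
    if name != _HEADER or digest != expected:
      raise ValueError("Invalid or missing signature")
    return cert_pem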

On the web, the destination cluster would be equivalent to an HTTPS
server requiring verifiable client certificates. The browser would be
equivalent to the source cluster and must verify the server's
certificate while providing a client certificate to the server.

Copying data
++++++++++++

To simplify the implementation, we decided to operate at a block-device
level only, allowing us to easily support non-DRBD instance moves.

Inter-cluster instance moves will re-use the existing export and import
scripts supplied by instance OS definitions. Unlike simply copying the
raw data, this allows one to use filesystem-specific utilities to dump
only used parts of the disk and to exclude certain disks from the move.
Compression should be used to further reduce the amount of data
transferred.

The export script writes all data to stdout and the import script reads
it from stdin again. To avoid copying data and reduce disk space
consumption, everything is read from the disk and sent over the network
directly, where it is written directly to the new block device again.

Workflow
++++++++

#. Third party tells source cluster to shut down instance, asks for the
   instance specification and for the public part of an encryption key

   - Instance information can already be retrieved using an existing API
     (``OpInstanceQueryData``).
   - An RSA encryption key and a corresponding self-signed X509
     certificate are generated using the "openssl" command. This key
     will be used to encrypt the data sent to the destination cluster.

     - Private keys never leave the cluster.
     - The public part (the X509 certificate) is signed using HMAC with
       salting and a secret shared between Ganeti clusters.

#. Third party tells destination cluster to create an instance with the
   same specifications as on the source cluster and to prepare for an
   instance move with the key received from the source cluster, and
   receives the public part of the destination's encryption key

   - The current API to create instances (``OpInstanceCreate``) will be
     extended to support an import from a remote cluster.
   - A valid, unexpired X509 certificate signed with the destination
     cluster's secret will be required. By verifying the signature, we
     know the third party didn't modify the certificate.

     - The private keys never leave their cluster, hence the third party
       cannot decrypt or intercept the instance's data by modifying the
       IP address or port sent by the destination cluster.

   - The destination cluster generates another key and certificate,
     signs and sends it to the third party, who will have to pass it to
     the API for exporting an instance (``OpBackupExport``). This
     certificate is used to ensure we're sending the disk data to the
     correct destination cluster.
   - Once a disk can be imported, the API sends the destination
     information (IP address and TCP port) together with an HMAC
     signature to the third party.

#. Third party hands the public part of the destination's encryption key
   together with all necessary information to the source cluster and
   tells it to start the move

   - The existing API for exporting instances (``OpBackupExport``)
     will be extended to export instances to remote clusters.

#. Source cluster connects to the destination cluster for each disk and
   transfers its data using the instance OS definition's export and
   import scripts

   - Before starting, the source cluster must verify the HMAC signature
     of the certificate and destination information (IP address and TCP
     port).
   - When connecting to the remote machine, strong certificate checks
     must be employed.

#. Due to the asynchronous nature of the whole process, the destination
   cluster checks after every transferred disk whether all disks have
   been transferred; if so, it destroys the encryption key
#. After sending all disks, the source cluster destroys its key
#. Destination cluster runs the OS definition's rename script to adjust
   instance settings if needed (e.g. IP address)
#. Destination cluster starts the instance if requested at the beginning
   by the third party
#. Source cluster removes the instance if requested

Instance move in pseudo code
++++++++++++++++++++++++++++

.. highlight:: python

The following pseudo code describes a script moving instances between
clusters and what happens on both clusters.

#. Script is started, gets the instance name and destination cluster::

    (instance_name, dest_cluster_name) = sys.argv[1:]

    # Get destination cluster object
    dest_cluster = db.FindCluster(dest_cluster_name)

    # Use database to find source cluster
    src_cluster = db.FindClusterByInstance(instance_name)

#. Script tells source cluster to stop instance::

    # Stop instance
    src_cluster.StopInstance(instance_name)

    # Get instance specification (memory, disk, etc.)
    inst_spec = src_cluster.GetInstanceInfo(instance_name)

    (src_key_name, src_cert) = src_cluster.CreateX509Certificate()

#. ``CreateX509Certificate`` on source cluster::

    (key_fd, key_file) = mkstemp()
    cert_file = "%s.cert" % key_file
    RunCmd(["/usr/bin/openssl", "req", "-new",
            "-newkey", "rsa:1024", "-days", "1",
            "-nodes", "-x509", "-batch",
            "-keyout", key_file, "-out", cert_file])

    plain_cert = utils.ReadFile(cert_file)

    # HMAC sign using secret key, this adds an "X-Ganeti-Signature"
    # header to the beginning of the certificate
    signed_cert = utils.SignX509Certificate(plain_cert,
      utils.ReadFile(constants.X509_SIGNKEY_FILE))

    # The certificate now looks like the following:
    #
    #   X-Ganeti-Signature: 1234/28676f0516c6ab68062b[…]
    #   -----BEGIN CERTIFICATE-----
    #   MIICsDCCAhmgAwIBAgI[…]
    #   -----END CERTIFICATE-----

    # Return name of key file and signed certificate in PEM format
    return (os.path.basename(key_file), signed_cert)

#. Script creates instance on destination cluster and waits for move to
   finish::

    dest_cluster.CreateInstance(mode=constants.REMOTE_IMPORT,
                                spec=inst_spec,
                                source_cert=src_cert)

    # Wait until destination cluster gives us its certificate and the
    # destination information for every disk
    dest_cert = None
    disk_info = {}
    while not (dest_cert and len(disk_info) == len(inst_spec.disks)):
      tmp = dest_cluster.WaitOutput()
      if isinstance(tmp, Certificate):
        dest_cert = tmp
      elif isinstance(tmp, DiskInfo):
        # DiskInfo contains destination address and port
        disk_info[tmp.index] = tmp

    # Tell source cluster to export disks
    for disk in disk_info.values():
      src_cluster.ExportDisk(instance_name, disk=disk,
                             key_name=src_key_name,
                             dest_cert=dest_cert)

    print ("Instance %s successfully moved to %s" %
           (instance_name, dest_cluster.name))

#. ``CreateInstance`` on destination cluster::

    # …

    if mode == constants.REMOTE_IMPORT:
      # Make sure certificate was not modified since it was generated by
      # source cluster (which must use the same secret)
      if (not utils.VerifySignedX509Cert(source_cert,
            utils.ReadFile(constants.X509_SIGNKEY_FILE))):
        raise Error("Certificate not signed with this cluster's secret")

      if utils.CheckExpiredX509Cert(source_cert):
        raise Error("X509 certificate is expired")

      source_cert_file = utils.WriteTempFile(source_cert)

      # See above for X509 certificate generation and signing
      (key_name, signed_cert) = CreateSignedX509Certificate()

      SendToClient("x509-cert", signed_cert)

      for disk in instance.disks:
        # Start socat
        RunCmd(("socat"
                " OPENSSL-LISTEN:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
                " stdout > /dev/disk…") %
               (port, GetRsaKeyPath(key_name, private=True),
                GetRsaKeyPath(key_name, private=False), source_cert_file))
        SendToClient("send-disk-to", disk, ip_address, port)

      DestroyX509Cert(key_name)

      RunRenameScript(instance_name)

#. ``ExportDisk`` on source cluster::

    # Make sure certificate was not modified since it was generated by
    # destination cluster (which must use the same secret)
    if (not utils.VerifySignedX509Cert(cert_pem,
          utils.ReadFile(constants.X509_SIGNKEY_FILE))):
      raise Error("Certificate not signed with this cluster's secret")

    if utils.CheckExpiredX509Cert(cert_pem):
      raise Error("X509 certificate is expired")

    dest_cert_file = utils.WriteTempFile(cert_pem)

    # Start socat
    RunCmd(("socat stdin"
            " OPENSSL:%s:%s,…,key=%s,cert=%s,cafile=%s,verify=1"
            " < /dev/disk…") %
           (disk.host, disk.port,
            GetRsaKeyPath(key_name, private=True),
            GetRsaKeyPath(key_name, private=False), dest_cert_file))

    if instance.all_disks_done:
      DestroyX509Cert(key_name)

.. highlight:: text

Miscellaneous notes
+++++++++++++++++++

- A very similar system could also be used for instance exports within
  the same cluster. Currently OpenSSH is being used, but could be
  replaced by socat and SSL/TLS.
- During the design of inter-cluster instance moves we also discussed
  encrypting instance exports using GnuPG.
- While most instances should have exactly the same configuration as
  on the source cluster, setting them up with a different disk layout
  might be helpful in some use-cases.
- A cleanup operation, similar to the one available for failed instance
  migrations, should be provided.
- ``ganeti-watcher`` should remove instances pending a move from another
  cluster after a certain amount of time. This takes care of failures
  somewhere in the process.
- RSA keys can be generated using the existing
  ``bootstrap.GenerateSelfSignedSslCert`` function, though it might be
  useful not to write both parts into a single file, requiring small
  changes to the function. The public part always starts with
  ``-----BEGIN CERTIFICATE-----`` and ends with ``-----END
  CERTIFICATE-----``.
- The source and destination cluster might be different when it comes
  to available hypervisors, kernels, etc. The destination cluster should
  refuse to accept an instance move if it can't fulfill an instance's
  requirements.


Privilege separation
--------------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All Ganeti daemons are run under the user root. This is not ideal from a
security perspective, as an attacker who exploits any daemon gains full
access to the system.

In order to overcome this situation we'll allow Ganeti to run its
daemons under different users and a dedicated group. This will also have
some useful side effects, like letting users run some ``gnt-*`` commands
if they are in the same group.

Implementation
~~~~~~~~~~~~~~

For Ganeti 2.2 the implementation will be focused on the RAPI daemon
only. This involves changes to ``daemons.py`` so it's possible to drop
privileges when daemonizing the process. This will, however, be a
short-term solution, to be replaced by a privilege drop right at daemon
startup in Ganeti 2.3.

It also needs changes in the master daemon to create the socket with new
permissions/owners to allow RAPI access. There will be no other
permission/owner changes in the file structure, as the RAPI daemon is
started with root permissions. During startup it will read all needed
files and then drop privileges before contacting the master daemon.
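
Such a privilege drop typically boils down to a few calls; the sketch
below shows the usual pattern (the user and group names are
site-specific and the helper is not the actual Ganeti code)::

  import os
  import pwd
  import grp

  def DropPrivileges(username, groupname):
    """Irreversibly give up root privileges.

    Must be called after all root-only setup (reading keys and
    certificates, binding sockets) has been done, and before the
    daemon starts talking to masterd.
    """
    uid = pwd.getpwnam(username).pw_uid
    gid = grp.getgrnam(groupname).gr_gid
    os.setgroups([])  # drop supplementary groups first
    os.setgid(gid)    # group before user, otherwise setgid() fails
    os.setuid(uid)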


Feature changes
===============

KVM Security
------------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all kvm processes run as root. Taking ownership of the
hypervisor process, from inside a virtual machine, would mean a full
compromise of the whole Ganeti cluster, knowledge of all Ganeti
authentication secrets, full access to all running instances, and the
option of subverting other basic services on the cluster (e.g. ssh).

Proposed changes
~~~~~~~~~~~~~~~~

We would like to decrease the attack surface available if a hypervisor
is compromised. We can do so by adding different features to Ganeti,
which will limit a compromised hypervisor's ability to subvert the node,
in the absence of a local privilege escalation attack.

Dropping privileges in kvm to a single user (easy)
++++++++++++++++++++++++++++++++++++++++++++++++++

By passing the ``-runas`` option to kvm, we can make it drop privileges.
The user can be chosen by a hypervisor parameter, so that each instance
can have its own user, but by default they will all run under the same
one. It should be very easy to implement, and can easily be backported
to 2.1.X.
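
To give an idea of what this means at the hypervisor level, the sketch
below builds the extra kvm arguments; the parameter names
(``security_user``, ``use_chroot``) and the chroot path are purely
illustrative and not the final Ganeti hypervisor parameters::

  def BuildKvmSecurityArgs(instance_name, hvparams):
    """Return extra kvm command line arguments for privilege dropping."""
    args = []
    # Run the kvm process as an unprivileged user instead of root
    user = hvparams.get("security_user") or "ganeti-kvm"
    args.extend(["-runas", user])
    if hvparams.get("use_chroot"):
      # See the chroot section below; the directory must contain
      # everything kvm still needs after startup
      args.extend(["-chroot",
                   "/var/run/ganeti/kvm-chroot/%s" % instance_name])
    return args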

This mode protects the Ganeti cluster from a subverted hypervisor, but
doesn't protect instances from each other, unless care is taken to
specify a different user for each. This would prevent the worst attacks,
including:

- logging in to other nodes
- administering the Ganeti cluster
- subverting other services

But it would still be possible to:

- terminate other VMs (but not start them again, as that requires root
  privileges to set up networking) (unless different users are used)
- trace other VMs, and probably subvert them and access their data
  (unless different users are used)
- send network traffic from the node
- read unprotected data on the node filesystem

    
618
Running kvm in a chroot (slightly harder)
619
+++++++++++++++++++++++++++++++++++++++++
620

    
621
By passing the ``-chroot`` option to kvm, we can restrict the kvm
622
process in its own (possibly empty) root directory. We need to set this
623
area up so that the instance disks and control sockets are accessible,
624
so it would require slightly more work at the Ganeti level.
625

    
626
Breaking out in a chroot would mean:
627

    
628
- a lot less options to find a local privilege escalation vector
629
- the impossibility to write local data, if the chroot is set up
630
  correctly
631
- the impossibility to read filesystem data on the host
632

    
633
It would still be possible though to:
634

    
635
- terminate other VMs
636
- trace other VMs, and possibly subvert them (if a tracer can be
637
  installed in the chroot)
638
- send network traffic from the node
639

    
640

    
641
Running kvm with a pool of users (slightly harder)
642
++++++++++++++++++++++++++++++++++++++++++++++++++
643

    
644
If rather than passing a single user as an hypervisor parameter, we have
645
a pool of useable ones, we can dynamically choose a free one to use and
646
thus guarantee that each machine will be separate from the others,
647
without putting the burden of this on the cluster administrator.
648

    
649
This would mean interfering between machines would be impossible, and
650
can still be combined with the chroot benefits.

Running iptables rules to limit network interaction (easy)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

These don't need to be handled by Ganeti, but we can ship examples. If
the users used to run VMs were blocked from sending some or all network
traffic, it would become impossible for a compromised hypervisor to send
arbitrary data on the node network, which is especially useful when the
instance and the node network are separated (using ganeti-nbma or a
separate set of network interfaces), or when a separate replication
network is maintained. We need to experiment to see how much restriction
we can properly apply, without limiting the instances' legitimate
traffic.


Running kvm inside a container (even harder)
++++++++++++++++++++++++++++++++++++++++++++

Recent Linux kernels support different process namespaces through
control groups. PIDs, users, filesystems and even network interfaces can
be separated. If we can set up ganeti to run kvm in a separate container
we could insulate all the host processes from even being visible if the
hypervisor gets broken into. Most probably separating the network
namespace would require one extra hop in the host, through a veth
interface, thus reducing performance, so we may want to avoid that, and
just rely on iptables.

Implementation plan
~~~~~~~~~~~~~~~~~~~

We will first implement dropping privileges for kvm processes as a
single user, and most probably backport it to 2.1. Then we'll ship
example iptables rules to show how the user can be limited in its
network activities. After that we'll implement chroot restriction for
kvm processes, and extend the user limitation to use a user pool.

Finally we'll look into namespaces and containers, although that might
slip after the 2.2 release.

New OS states
-------------

Separate from the OS external changes, described below, we'll add some
internal changes to the OS handling.

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are two issues related to the handling of the OSes.

First, it's impossible to disable an OS for new instances, since that
will also break reinstallations and renames of existing instances. To
phase out an OS definition, without actually having to modify the OS
scripts, it would be ideal to be able to restrict new installations but
keep the rest of the functionality available.

Second, ``gnt-instance reinstall --select-os`` shows all the OSes
available on the clusters. Some OSes might exist only for debugging and
diagnosis, and not for end-user availability. For this, it would be
useful to "hide" a set of OSes, but keep them otherwise functional.

Proposed changes
~~~~~~~~~~~~~~~~

Two new cluster-level attributes will be added, holding the list of OSes
hidden from the user and, respectively, the list of OSes which are
blacklisted from new installations.

These lists will be modifiable via ``gnt-os modify`` (implemented via
``OpClusterSetParams``), such that even not-yet-existing OSes can be
preseeded into a given state.

Hidden OSes are fully functional, except that they are not returned in
the default OS list (as computed via ``OpOsDiagnose``), unless the
hidden state is requested.

Blacklisted OSes are also not shown (unless the blacklisted state is
requested), and they are additionally prevented from being used for
installation via ``OpInstanceCreate`` (in create mode).

Both these attributes are per-OS, not per-variant. Thus they apply to
all of an OS' variants, and it's impossible to blacklist or hide just
one variant. Further improvements might allow a given OS variant to be
blacklisted, as opposed to whole OSes.
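
A sketch of how these attributes could be applied (the function names
are illustrative, not the actual logical unit code)::

  def FilterOsList(os_names, hidden_os, blacklisted_os,
                   show_hidden=False, show_blacklisted=False):
    """Compute the OS list returned by a diagnose-style query."""
    result = []
    for name in os_names:
      if name in hidden_os and not show_hidden:
        continue
      if name in blacklisted_os and not show_blacklisted:
        continue
      result.append(name)
    return result

  def CheckOsCreate(os_name, blacklisted_os):
    """Refuse new installations of blacklisted OSes (create mode only)."""
    if os_name in blacklisted_os:
      raise ValueError("OS '%s' is blacklisted for new installations" %
                       os_name)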

External interface changes
==========================


OS API
------

The OS variants implementation in Ganeti 2.1 didn't prove to be useful
enough to alleviate the need to hack around the Ganeti API in order to
provide flexible OS parameters.

As such, for Ganeti 2.2 we will provide support for arbitrary OS
parameters. However, since OSes are not registered in Ganeti, but
instead discovered at runtime, the interface is not entirely
straightforward.

Furthermore, to support the system administrator in keeping OSes
properly in sync across the nodes of a cluster, Ganeti will also verify
(if it exists) the consistency of a new ``os_version`` file.

These changes to the OS API will bump the API version to 20.


OS version
~~~~~~~~~~

A new ``os_version`` file will be supported by Ganeti. This file is not
required, but if present, its contents will be checked for consistency
across nodes. The file should hold only one line of text (any extra data
will be discarded), and its contents will be shown in the OS information
and diagnose commands.

It is recommended that OS authors update the contents of this file for
any change; at a minimum, modifications that change the behaviour of the
import/export scripts must increase the version, since they break
intra-cluster migration.
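
A sketch of reading this file on a node (the helper name and location
are illustrative)::

  import os

  def ReadOsVersion(os_dir):
    """Return the version string of an OS definition, or None."""
    path = os.path.join(os_dir, "os_version")
    if not os.path.exists(path):
      return None  # the file is optional
    with open(path) as fd:
      first_line = fd.readline().strip()
    # Anything beyond the first line is discarded
    return first_line or None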

Parameters
~~~~~~~~~~

The interface between Ganeti and the OS scripts will be based on
environment variables, and as such the parameters and their values will
need to be valid in this context.

Names
+++++

The parameter names will be declared in a new file, ``parameters.list``,
together with a one-line documentation (whitespace-separated). Example::

  $ cat parameters.list
  ns1    Specifies the first name server to add to /etc/resolv.conf
  extra_packages  Specifies additional packages to install
  rootfs_size     Specifies the root filesystem size (the rest will be left unallocated)
  track  Specifies the distribution track, one of 'stable', 'testing' or 'unstable'

As seen above, the documentation can be separated from the names by
multiple spaces/tabs.

The parameter names as read from the file will be used for the command
line interface in lowercased form; as such, there shouldn't be any two
parameters which differ in case only.

    
798
Values
799
++++++
800

    
801
The values of the parameters are, from Ganeti's point of view,
802
completely freeform. If a given parameter has, from the OS' point of
803
view, a fixed set of valid values, these should be documented as such
804
and verified by the OS, but Ganeti will not handle such parameters
805
specially.
806

    
807
An empty value must be handled identically as a missing parameter. In
808
other words, the validation script should only test for non-empty
809
values, and not for declared versus undeclared parameters.
810

    
811
Furthermore, each parameter should have an (internal to the OS) default
812
value, that will be used if not passed from Ganeti. More precisely, it
813
should be possible for any parameter to specify a value that will have
814
the same effect as not passing the parameter, and no in no case should
815
the absence of a parameter be treated as an exceptional case (outside
816
the value space).
817

    
818

    
819
Environment variables
820
^^^^^^^^^^^^^^^^^^^^^
821

    
822
The parameters will be exposed in the environment upper-case and
823
prefixed with the string ``OSP_``. For example, a parameter declared in
824
the 'parameters' file as ``ns1`` will appear in the environment as the
825
variable ``OSP_NS1``.
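
A sketch of the corresponding conversion (the helper name is
illustrative)::

  def OsParamsToEnv(osparams):
    """Convert OS parameters into OSP_* environment variables."""
    env = {}
    for name, value in osparams.items():
      # Values are passed through as-is and must already be strings
      env["OSP_%s" % name.upper()] = value
    return env

  # e.g. {"ns1": "192.0.2.1"} becomes {"OSP_NS1": "192.0.2.1"}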

Validation
++++++++++

For the purpose of parameter name/value validation, the OS scripts
*must* provide an additional script, named ``verify``. This script will
be called with the argument ``parameters``, and all the parameters will
be passed in via environment variables, as described above.

The script should signify success/failure based on its exit code, and
show explanatory messages either on its standard output or standard
error. These messages will be passed on to the master, and stored in
the OpCode result/error message.

The parameters must be constructed to be independent of the instance
specifications. In general, the validation script will only be called
with the parameter variables set, but not with the normal per-instance
variables, in order for Ganeti to be able to validate default parameters
too, when they change. Validation will only be performed on one cluster
node, and it will be up to the ganeti administrator to keep the OS
scripts in sync between all nodes.
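
A sketch of how the invocation of ``verify`` could look on the Ganeti
side (the helper is illustrative and error handling is simplified)::

  import os
  import subprocess

  def RunOsVerify(verify_path, osparams):
    """Run an OS definition's verify script on a set of parameters."""
    env = dict(os.environ)
    for name, value in osparams.items():
      env["OSP_%s" % name.upper()] = value
    proc = subprocess.Popen([verify_path, "parameters"], env=env,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    if proc.returncode != 0:
      # The script's messages are propagated into the opcode error
      raise RuntimeError("OS parameter validation failed: %s" % output)
    return output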

Instance operations
+++++++++++++++++++

The parameters will be passed, as described above, to all the other
instance operations (creation, import, export). Ideally, these scripts
will not abort with parameter validation errors if the ``verify``
script has verified them correctly.

Note: when changing an instance's OS type, any OS parameters defined at
instance level will be kept as-is. If the parameters differ between the
new and the old OS, the user should manually remove/update them as
needed.

Declaration and modification
++++++++++++++++++++++++++++

Since the OSes are not registered in Ganeti, we will only make a 'weak'
link between the parameters as declared in Ganeti and the actual OSes
existing on the cluster.

It will be possible to declare parameters either globally, per cluster
(where they are indexed per OS/variant), or individually, per
instance. The declaration of parameters will not be tied to currently
existing OSes. When specifying a parameter, if the OS exists, it will be
validated; if not, then it will simply be stored as-is.

A special note is that it will not be possible to 'unset' at instance
level a parameter that is declared globally. Instead, at instance level
the parameter should be given an explicit value, or the default value as
explained above.

CLI interface
+++++++++++++

The modification of global (default) parameters will be done via the
``gnt-os`` command, and the per-instance parameters via the
``gnt-instance`` command. Both these commands will take an additional
``--os-parameters`` or ``-O`` flag that specifies the parameters in the
familiar comma-separated, key=value format. For removing a parameter, a
``-key`` syntax will be used, e.g.::

  # initial modification
  $ gnt-instance modify -O use_dhcp=true instance1
  # later revert (to the cluster default, or the OS default if not
  # defined at cluster level)
  $ gnt-instance modify -O -use_dhcp instance1

Internal storage
++++++++++++++++

Internally, the OS parameters will be stored in a new ``osparams``
attribute. The global parameters will be stored on the cluster object,
and the value of this attribute will be a dictionary indexed by OS name
(this also accepts an OS+variant name, which will override a simple OS
name, see below), with the key/name dictionaries as values. For the
instances, the value will be directly the key/name dictionary.

Overriding rules
++++++++++++++++

Any instance-specific parameters will override any variant-specific
parameters, which in turn will override any global parameters. The
global parameters, in turn, override the built-in defaults (of the OS
scripts).
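
The merging can be summarized by a small sketch (names are illustrative;
the built-in OS defaults are applied by the scripts themselves and
therefore do not appear here)::

  def EffectiveOsParams(cluster_osparams, os_name, variant, inst_osparams):
    """Merge OS parameters according to the overriding rules."""
    params = {}
    # Cluster-level entry for the plain OS name, if any
    params.update(cluster_osparams.get(os_name, {}))
    # A cluster-level OS+variant entry overrides the plain OS entry
    if variant:
      params.update(cluster_osparams.get("%s+%s" % (os_name, variant), {}))
    # Instance-level parameters have the final word; anything left
    # unset falls back to the OS scripts' built-in defaults
    params.update(inst_osparams)
    return params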


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: