==================================
Design for import/export version 2
==================================

    
.. contents:: :depth: 4

    
Current state and shortcomings
------------------------------

    
Ganeti 2.2 introduced :doc:`inter-cluster instance moves <design-2.2>`
and replaced the import/export mechanism with the same technology. It
has since become clear that the chosen implementation is too complicated
and can be difficult to debug.

    
The old implementation is henceforth called "version 1". It used
``socat`` in combination with a rather complex tree of ``bash`` and
Python utilities to move instances between clusters and import/export
them inside the cluster. Due to protocol limitations, the master daemon
starts a daemon on the involved nodes and then keeps polling a status
file for updates. A non-trivial number of timeouts ensures that jobs
don't freeze.

    
In version 1, the destination node would start a daemon listening on a
random TCP port. Upon receiving the destination information, the source
node would temporarily stop the instance, create snapshots, and start
exporting the data by connecting to the destination. The random TCP port
was chosen by the operating system by binding the socket to port 0.
While this is a somewhat elegant solution, it causes problems in setups
with restricted connectivity (e.g. iptables rules that only allow known
ports).

    
Another issue encountered was with dual-stack IPv6 setups. ``socat`` can
only listen on one protocol, IPv4 or IPv6, at a time. The connecting
node cannot simply resolve the DNS name; it must be told the exact IP
address.

    
Instance OS definitions can provide custom import/export scripts. They
worked well in the early days when a filesystem was usually created
directly on the block device. Around Ganeti 2.0 there was a transition
to using partitions on the block devices. Import/export scripts could no
longer use simple ``dump`` and ``restore`` commands, but usually ended
up doing raw data dumps.

    
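The port selection used by version 1 can be illustrated with a minimal
sketch. It is not taken from the version 1 code; the function name and
the loopback default are assumptions made for illustration only.

```python
import socket


def listen_on_random_port(host="127.0.0.1"):
    """Bind a listening TCP socket to port 0 and return (socket, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the operating system to pick any free port
    sock.bind((host, 0))
    sock.listen(1)
    # getsockname() reveals which port was actually assigned
    return sock, sock.getsockname()[1]
```

Because the port is only known after ``bind()`` returns, a firewall rule
for it cannot be prepared in advance, which is the connectivity problem
described above.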

    
Proposed changes
----------------

    
In version 2 the active side is swapped: unlike in version 1, the
destination node will connect to the source. This design assumes the
following design documents have been implemented:

    
- :doc:`design-x509-ca`
- :doc:`design-http-server`

    
The following design is mostly targeted at inter-cluster instance
moves. Intra-cluster import and export use the same technology, but do
so in a less complicated way (e.g. by reusing the node daemon
certificate, as in version 1).

    
Support for instance OS import/export scripts, which have been in Ganeti
since the beginning, will be dropped with this design. Should the need
arise, they can be re-added later.

    
Software requirements
+++++++++++++++++++++

    
- HTTP client: cURL/pycURL (already used for inter-node RPC and RAPI
  client)
- Authentication: X509 certificates (server and client)

    
Transport
+++++++++

    
Instead of a home-grown, mostly raw protocol, the widely used HTTP
protocol will be used. Ganeti already uses HTTP for its :doc:`Remote API
<rapi>` and inter-node communication. Encryption and authentication will
be implemented using SSL and X509 certificates.

    
SSL certificates
++++++++++++++++

    
The source machine will identify connecting clients by their SSL
certificate. Unknown certificates will be refused.

    
Version 1 created a new self-signed certificate per instance
import/export, allowing the certificate to be used as a Certificate
Authority (CA). This worked by means of starting a new ``socat``
instance per instance import/export.

    
Under the version 2 model, a continuously running HTTP server will be
used. This rules out the use of self-signed certificates for
authentication, as the CA needs to be the same for all issued
certificates.

    
See the :doc:`separate design document for more details on how the
certificate authority will be implemented <design-x509-ca>`.

    
Local imports/exports will, like version 1, use the node daemon's
certificate/key. Doing so allows the verification of local connections.
The client's certificate can be exported to the CGI/FastCGI handler
using lighttpd's ``ssl.verifyclient.exportcert`` setting. If a
cluster-local import/export is being done, the handler verifies whether
the certificate used matches the local node daemon key.

    
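The check for cluster-local connections could look roughly like the
following sketch. It assumes the handler has both certificates available
as PEM text (for instance the exported client certificate from an
environment variable such as ``SSL_CLIENT_CERT``, which is an assumption
about the server setup). The function names are hypothetical and
comparing SHA-256 fingerprints is only one possible way to do the match.

```python
import hashlib
import hmac
import ssl


def cert_fingerprint(pem_text):
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    # Converting to DER first makes the result independent of PEM
    # line wrapping and surrounding whitespace
    der = ssl.PEM_cert_to_DER_cert(pem_text)
    return hashlib.sha256(der).hexdigest()


def is_local_node_cert(client_pem, node_daemon_pem):
    """Check whether the client presented the node daemon certificate."""
    return hmac.compare_digest(cert_fingerprint(client_pem),
                               cert_fingerprint(node_daemon_pem))
```

A handler would call ``is_local_node_cert()`` only for transfers that
were prepared as cluster-local and refuse the request if it returns
``False``.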

    
Source
++++++

    
The source can be the same physical machine as the destination, another
node in the same cluster, or a node in another cluster. A
physical-to-virtual migration mechanism could be implemented as an
alternative source.

    
In the case of a traditional import, the source is usually a file on the
source machine. For exports and remote imports, the source is an
instance's raw disk data. In all cases the transported data is opaque to
Ganeti.

    
All nodes of a cluster will run an instance of Lighttpd. The
configuration is automatically generated when starting Ganeti. The HTTP
server is configured to listen on IPv4 and IPv6 simultaneously.
Imports/exports will use a dedicated TCP port, similar to the Remote
API.

    
See the separate :ref:`HTTP server design document
<http-srv-shortcomings>` for why Ganeti's existing, built-in HTTP server
is not a good choice.

    
The source cluster is provided with an X509 Certificate Signing Request
(CSR) for a key private to the destination cluster.

    
After shutting down the instance, creating snapshots and restarting the
instance, the master will sign the destination's X509 certificate using
the :doc:`X509 CA <design-x509-ca>` once per instance disk. Instead of
using another identifier, the certificate's serial number (:ref:`never
reused <x509-ca-serial>`) and fingerprint are used to identify incoming
requests. Once ready, the master will call an RPC method on the source
node and provide it with the input information (e.g. file paths or block
devices) and the certificate identities.

    
The RPC method will write the identities to a place accessible by the
HTTP request handler, generate unique transfer IDs and return them to
the master. The transfer ID could be a filename containing the
certificate's serial number, fingerprint and some disk information. The
file containing the per-transfer information is signed using the node
daemon key and the signature written to a separate file.

    
Once everything is in place, the master sends the certificates, the data
and notification URLs (which include the transfer IDs) and the public
part of the source's CA to the job submitter. Like in version 1,
everything will be signed using the cluster domain secret.

    
Upon receiving a request, the handler verifies the identity and
continues to stream the instance data. The serial number and fingerprint
contained in the transfer ID should be matched with the certificate
used. If a cluster-local import/export was requested, the remote's
certificate is verified with the local node daemon key. The signature of
the information file from which the handler takes the path of the block
device (and more) is verified using the local node daemon certificate.
There are two options for handling requests, :ref:`CGI
<lighttpd-cgi-opt>` and :ref:`FastCGI <lighttpd-fastcgi-opt>`.

    
To wait for all requests to finish, the master calls another RPC method.
The destination should notify the source once it's done with downloading
the data. Since this notification may never arrive (e.g. network
issues), an additional timeout needs to be used.

    
There is no good way to avoid polling as the HTTP requests will be
handled asynchronously in another process. Once (and if) it is
implemented, :ref:`RPC feedback <rpc-feedback>` could be used to combine
the two RPC methods.

    
Upon completion of the transfer requests, the instance is removed if
requested.

    
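The per-transfer bookkeeping described in this section can be sketched
as follows. Everything here is illustrative: the ID layout, the JSON
file format and the function names are hypothetical, and an HMAC-SHA256
over the file contents stands in for whatever signature scheme based on
the node daemon key is eventually chosen.

```python
import hashlib
import hmac
import json
import os


def make_transfer_id(serial, fingerprint, disk_index):
    """Combine serial number, fingerprint and disk index into one ID."""
    cleaned = fingerprint.replace(":", "").lower()
    return "%d-%s-disk%d" % (serial, cleaned, disk_index)


def write_transfer_info(directory, transfer_id, info, key):
    """Write the per-transfer information and a detached signature."""
    payload = json.dumps(info, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    with open(os.path.join(directory, transfer_id), "wb") as fd:
        fd.write(payload)
    # The signature lives in a separate file next to the information file
    with open(os.path.join(directory, transfer_id + ".sig"), "w") as fd:
        fd.write(signature)


def read_transfer_info(directory, transfer_id, key):
    """Return the verified information; raise ValueError if tampered."""
    with open(os.path.join(directory, transfer_id), "rb") as fd:
        payload = fd.read()
    with open(os.path.join(directory, transfer_id + ".sig")) as fd:
        signature = fd.read().strip()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch for transfer %s" % transfer_id)
    return json.loads(payload.decode("utf-8"))
```

The RPC method would call ``write_transfer_info()`` while preparing the
export, and the request handler would call ``read_transfer_info()``
before opening the block device named in the file.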

    
.. _lighttpd-cgi-opt:

    
Option 1: CGI
~~~~~~~~~~~~~

    
While easier to implement, this option requires either the HTTP server
to run as "root" or a so-called SUID binary that elevates the started
process to run as "root".

    
The export data can be sent directly to the HTTP server without any
further processing.

    
.. _lighttpd-fastcgi-opt:

    
Option 2: FastCGI
~~~~~~~~~~~~~~~~~

    
Unlike plain CGI, FastCGI scripts are run separately from the webserver.
The webserver talks to them via a Unix socket. Webserver and scripts can
run as separate users. Unlike for CGI, there are almost no bootstrap
costs attached to each request.

    
The FastCGI protocol requires data to be sent in length-prefixed
packets, something which wouldn't be very efficient to do in Python for
large amounts of data (instance imports/exports can be hundreds of
gigabytes). For this reason the proposal is to use a wrapper program
written in C (e.g. `fcgiwrap
<http://nginx.localdomain.pl/wiki/FcgiWrap>`_) and to write the handler
like an old-style CGI program with standard input/output. If data should
be copied from a file, ``cat``, ``dd`` or ``socat`` can be used (see
note about :ref:`sendfile(2)/splice(2) with Python <python-sendfile>`).

    
The bootstrap cost associated with starting a Python interpreter for
a disk export is expected to be negligible.

    
The `spawn-fcgi <http://cgit.stbuehler.de/gitosis/spawn-fcgi/about/>`_
program will be used to start the CGI wrapper as "root".

    
FastCGI is, in the author's opinion, the better choice as it allows user
separation. As a first implementation step the export handler can be run
as a standard CGI program. User separation can be implemented as a
second step.

    
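Under either option the handler can be written as an old-style CGI
program that emits headers and then copies the disk data to standard
output. The sketch below shows the idea in pure Python. The function
names and the 1 MB default block size are assumptions, and identity
verification is left out entirely.

```python
BLOCK_SIZE = 1024 * 1024  # hypothetical default, not a measured optimum


def stream_disk(source, output, block_size=BLOCK_SIZE):
    """Copy source to output in fixed-size blocks; return bytes written."""
    total = 0
    while True:
        block = source.read(block_size)
        if not block:
            break
        output.write(block)
        total += len(block)
    output.flush()
    return total


def handle_request(path, output):
    """Emit CGI response headers followed by the raw contents of path."""
    output.write(b"Content-Type: application/octet-stream\r\n")
    output.write(b"Cache-Control: no-cache\r\n")
    # An empty line separates the headers from the body
    output.write(b"\r\n")
    with open(path, "rb") as source:
        return stream_disk(source, output)
```

When run under a CGI wrapper, ``output`` would be the binary standard
output stream (``sys.stdout.buffer`` in Python 3) and ``path`` would
come from the verified per-transfer information.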

    
Destination
+++++++++++

    
The destination can be the same physical machine as the source, another
node in the same cluster, or a node in another cluster. While not
considered in this design document, instances could be exported from the
cluster by implementing an external client for exports.

    
For traditional exports the destination is usually a file on the
destination machine. For imports and remote exports, the destination is
an instance's disks. All transported data is opaque to Ganeti.

    
Before an import can be started, an RSA key and corresponding
Certificate Signing Request (CSR) must be generated using the new opcode
``OpInstanceImportPrepare``. The returned information is signed using
the cluster domain secret. The RSA key backing the CSR must not leave
the destination cluster. After being passed through a third party, the
source cluster will generate signed certificates from the CSR.

    
Once the request for creating the instance arrives at the master daemon,
it'll create the instance and call an RPC method on the instance's
primary node to download all data. The RPC method does not return until
the transfer has completed or failed (see :ref:`EXP_SIZE_FD
<exp-size-fd>` and :ref:`RPC feedback <rpc-feedback>`).

    
The node will use pycURL to connect to the source machine and identify
itself with the signed certificate received. pycURL will be configured
to write directly to a file descriptor pointing to either a regular file
or block device. The file descriptor needs to point to the correct
offset for resuming downloads.

    
Using cURL's multi interface, more than one transfer can be made at the
same time. While parallel transfers are used by the version 1
import/export, it can be decided at a later time whether to use them in
version 2 too. More investigation is necessary to determine whether
``CURLOPT_MAXCONNECTS`` is enough to limit the number of connections or
whether more logic is necessary.

    
If a transfer fails before it's finished (e.g. timeout or network
issues) it should be retried using an exponential backoff delay. The
opcode submitter can specify for how long the transfer should be
retried.

    
At the end of a transfer, successful or not, the source cluster must be
notified. At the same time the RSA key needs to be destroyed.

    
Support for HTTP proxies can be implemented by setting
``CURLOPT_PROXY``. Proxies could be used for moving instances in/out of
restricted network environments or across protocol borders (e.g. IPv4
networks unable to talk to IPv6 networks).

    
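The retry behaviour described in this section could be structured along
the lines of the following sketch. The function and parameter names and
the default delays are assumptions. ``transfer`` stands for a callable
wrapping a single download attempt; the clock and sleep functions are
injectable so the logic can be exercised without waiting.

```python
import time


def retry_with_backoff(transfer, max_duration, initial_delay=1.0,
                       factor=2.0, max_delay=60.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call transfer() until it succeeds or max_duration seconds passed.

    transfer must return True on success and False on a retryable
    failure. Returns True if any attempt succeeded.
    """
    deadline = clock() + max_duration
    delay = initial_delay
    while True:
        if transfer():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the limit given by the opcode submitter
        sleep(min(delay, remaining))
        # Grow the delay exponentially, up to a fixed ceiling
        delay = min(delay * factor, max_delay)
```

With the defaults shown, failed attempts are followed by pauses of 1, 2,
4, 8, ... seconds until either an attempt succeeds or the overall limit
is reached.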

    
The big picture for instance moves
----------------------------------

    
#. ``OpInstanceImportPrepare`` (destination cluster)

  Create RSA key and CSR (certificate signing request), return signed
  with cluster domain secret.

    
#. ``OpBackupPrepare`` (source cluster)

  Becomes a no-op in version 2, but see :ref:`backwards-compat`.

    
#. ``OpBackupExport`` (source cluster)

  - Receives destination cluster's CSR, verifies signature using
    cluster domain secret.
  - Creates certificates using CSR and :doc:`cluster CA
    <design-x509-ca>`, one for each disk
  - Stop instance, create snapshots, start instance
  - Prepare HTTP resources on node
  - Send certificates, URLs and CA certificate to job submitter using
    feedback mechanism
  - Wait for all transfers to finish or fail (with timeout)
  - Remove snapshots

    
#. ``OpInstanceCreate`` (destination cluster)

  - Receives certificates signed by source cluster, verifies
    certificates and URLs using cluster domain secret

    Note that the parameters should be implemented in a generic way
    allowing future extensions, e.g. to download disk images from a
    public, remote server. The cluster domain secret allows Ganeti to
    check data received from a third party, but since this won't work
    with such extensions, other checks will have to be designed.

  - Create block devices
  - Download every disk from source, verified using remote's CA and
    authenticated using signed certificates
  - Destroy RSA key and certificates
  - Start instance

    
.. TODO: separate create from import?

    
.. _impexp2-http-resources:

    
HTTP resources on source
------------------------

    
The HTTP resources listed below will be made available by the source
machine. The transfer ID is generated while preparing the export and is
unique per disk and instance. No caching should be used, and the server
should set the ``Pragma`` (HTTP/1.0) and ``Cache-Control`` (HTTP/1.1)
headers accordingly.

    
``GET /transfers/[transfer_id]/contents``
  Dump disk contents. Important request headers:

  ``Accept`` (:rfc:`2616`, section 14.1)
    Specify preferred media types. Only one type is supported in the
    initial implementation:

    ``application/octet-stream``
      Request raw disk content.

    If support for more media types were to be implemented in the
    future, the "q" parameter used for "indicating a relative quality
    factor" needs to be used. In the meantime parameters need to be
    expected, but can be ignored.

    If support for OS scripts were to be re-added in the future, the
    MIME type ``application/x-ganeti-instance-export`` is hereby
    reserved for disk dumps using an export script.

    If the source cannot satisfy the request, the response status code
    will be 406 (Not Acceptable). Successful requests will specify the
    used media type using the ``Content-Type`` header. Unless exactly
    one media type is requested, the client must handle the different
    response types.

    
  ``Accept-Encoding`` (:rfc:`2616`, section 14.3)
    Specify desired content coding. Supported are ``identity`` for
    uncompressed data, ``gzip`` for compressed data and ``*`` for any.
    The response will include a ``Content-Encoding`` header with the
    actual coding used. If the client specifies an unknown coding, the
    response status code will be 406 (Not Acceptable).

    If the client specifically needs compressed data (see
    :ref:`impexp2-compression`) but only gets ``identity``, it can
    either compress locally or abort the request.

    
  ``Range`` (:rfc:`2616`, section 14.35)
    Raw disk dumps can be resumed using this header (e.g. after a
    network issue).

    If this header was given in the request and the source supports
    resuming, the status code of the response will be 206 (Partial
    Content) and it'll include the ``Content-Range`` header as per
    :rfc:`2616`. If it does not support resuming or the request did not
    specify a range, the status code will be 200 (OK).

    Only a single byte range is supported. cURL does not support
    ``multipart/byteranges`` responses by itself. Even if they could be
    somehow implemented, doing so would be of doubtful benefit for
    import/export.

    For raw data dumps, handling ranges is straightforward: only the
    requested range is dumped.

    cURL will fail with the error code ``CURLE_RANGE_ERROR`` if a
    request included a range but the server can't handle it. The request
    must be retried without a range.

    
``POST /transfers/[transfer_id]/done``
  Use this resource to notify the source when the transfer has finished
  (even if not successful). The status code will be 204 (No Content).

    
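How the source might decide between 200 and 206 for the ``Range`` header
is sketched below. The function name is hypothetical. The sketch accepts
only the single-range forms ``bytes=first-`` and ``bytes=first-last``
and ignores anything else, so the whole disk is sent. For a start
position beyond the end of the data it returns 416 as defined by
:rfc:`2616`, a case the text above does not cover.

```python
import re

# Single byte range only; suffix ranges ("bytes=-500") and lists of
# ranges do not match and are therefore ignored
_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def plan_response(range_header, total_size):
    """Return (status, headers) for the given Range header value."""
    full = (200, {"Content-Length": str(total_size)})
    match = _RANGE_RE.match(range_header.strip()) if range_header else None
    if not match:
        return full
    first = int(match.group(1))
    if match.group(2):
        last = int(match.group(2))
        if last < first:
            # Syntactically invalid range, treated as if absent
            return full
    else:
        last = total_size - 1
    if first >= total_size:
        return 416, {"Content-Range": "bytes */%d" % total_size}
    last = min(last, total_size - 1)
    return 206, {
        "Content-Range": "bytes %d-%d/%d" % (first, last, total_size),
        "Content-Length": str(last - first + 1),
    }
```

For a 206 response the handler would seek to ``first`` on the block
device and stream ``last - first + 1`` bytes.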

    
Code samples
------------

    
pycURL to file
++++++++++++++

    
.. highlight:: python

    
The following code sample shows how to write downloaded data directly to
a file without pumping it through Python::

    
  import pycurl

  curl = pycurl.Curl()
  curl.setopt(pycurl.URL, "http://www.google.com/")
  # Open in binary mode so that the body is written unmodified
  with open("googlecom.html", "wb") as output:
    curl.setopt(pycurl.WRITEDATA, output)
    curl.perform()
  curl.close()

    
This works equally well if the file descriptor is a pipe to another
process.

    
.. _backwards-compat:

    
Backwards compatibility
-----------------------

    
.. _backwards-compat-v1:

    
Version 1
+++++++++

    
The old inter-cluster import/export implementation described in the
:doc:`Ganeti 2.2 design document <design-2.2>` will be supported for at
least one minor (2.x) release. Intra-cluster imports/exports will use
the new version right away.

    
.. _exp-size-fd:

    
``EXP_SIZE_FD``
+++++++++++++++

    
Together with the improved import/export infrastructure, Ganeti 2.2
allowed instance export scripts to report the expected data size. This
was then used to provide the user with an estimated remaining time.
Version 2 no longer supports OS import/export scripts and therefore
``EXP_SIZE_FD`` is no longer needed.

    
.. _impexp2-compression:

    
Compression
+++++++++++

    
Version 1 used explicit compression using ``gzip`` for transporting
data, but the dumped files didn't use any compression. Version 2 will
allow the destination to specify which encoding should be used. This way
the transported data is already compressed and can be directly used by
the client (see :ref:`impexp2-http-resources`). The cURL option
``CURLOPT_ENCODING`` can be used to set the ``Accept-Encoding`` header.
cURL will not decompress received data when
``CURLOPT_HTTP_CONTENT_DECODING`` is set to zero (if another HTTP client
library were used which doesn't support disabling transparent
compression, a custom content-coding type could be defined, e.g.
``x-ganeti-gzip``).

    
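On the source side, selecting the content coding from the
``Accept-Encoding`` header could be as simple as the following sketch.
The function name is hypothetical, and quality parameters (``q=...``)
are deliberately ignored, which a complete implementation of
:rfc:`2616` would not do.

```python
SUPPORTED_CODINGS = ("identity", "gzip")


def choose_coding(accept_encoding, preferred="identity"):
    """Return the content coding to use, or None for 406 (Not Acceptable)."""
    if not accept_encoding:
        # Without the header any coding is acceptable to the client
        return preferred
    for item in accept_encoding.split(","):
        # Drop parameters such as ";q=0.5"; they are ignored in this sketch
        name = item.split(";", 1)[0].strip().lower()
        if name == "*":
            return preferred
        if name in SUPPORTED_CODINGS:
            return name
    return None
```

The value returned would be sent back in the ``Content-Encoding``
response header, and ``None`` would be translated into status code 406.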

    
Notes
-----

    
The HTTP/1.1 protocol (:rfc:`2616`) defines trailing headers for chunked
transfers in section 3.6.1. This could be used to transfer a checksum at
the end of an import/export. cURL supports trailing headers since
version 7.14.1. Lighttpd doesn't seem to support them for FastCGI, but
they appear to be usable in combination with an NPH CGI (No Parsed
Headers).

    
.. _lighttp-sendfile:

    
Lighttpd allows FastCGI applications to send the special headers
``X-Sendfile`` and ``X-Sendfile2`` (the latter with a range). Using
these headers applications can send response headers and tell the
webserver to serve a regular file stored on the file system as the
response body. The webserver will then take care of sending that file.
Unfortunately this mechanism is restricted to regular files and cannot
be used for data from programs, neither directly nor via named pipes,
without writing to a file first. The latter is not an option as instance
data can be very large. Theoretically ``X-Sendfile`` could be used for
sending the input for a file-based instance import, but that'd require
the webserver to run as "root".

    
.. _python-sendfile:

    
Python does not include interfaces for the ``sendfile(2)`` or
``splice(2)`` system calls. The latter can be useful for faster copying
of data between file descriptors. There are some 3rd-party modules (e.g.
http://pypi.python.org/pypi/py-sendfile/) and discussions
(http://bugs.python.org/issue10882) for including support for
``sendfile(2)``, but the latter is certainly not going to happen for the
Python versions supported by Ganeti. Calling the function using the
``ctypes`` module might be possible.

    
Performance considerations
--------------------------

    
The design described above was confirmed to be one of the better choices
in terms of download performance with bigger block sizes. All numbers
were gathered on the same physical machine with a single CPU and 1 GB of
RAM while downloading 2 GB of zeros read from ``/dev/zero``. ``wget``
(version 1.10.2) was used as the client, ``lighttpd`` (version 1.4.28)
as the server. The numbers in the first line of each row are in
megabytes per second. The second line of numbers in each row is the CPU
time spent in userland and in the kernel, respectively (measured for the
CGI/FastCGI program using ``time -v``).

    
::

    
  ----------------------------------------------------------------------
  Block size                      4 KB    64 KB   128 KB    1 MB    4 MB
  ======================================================================
  Plain CGI script reading          83      174      180     122     120
  from ``/dev/zero``
                               0.6/3.9  0.1/2.4  0.1/2.2 0.0/1.9 0.0/2.1
  ----------------------------------------------------------------------
  FastCGI with ``fcgiwrap``,        86      167      170     177     174
  ``dd`` reading from
  ``/dev/zero``                  1.1/5  0.5/2.9  0.5/2.7 0.7/3.1 0.7/2.8
  ----------------------------------------------------------------------
  FastCGI with ``fcgiwrap``,        68      146      150     170     170
  Python script copying from
  ``/dev/zero`` to stdout
                               1.3/5.1  0.8/3.7  0.7/3.3 0.9/2.9   0.8/3
  ----------------------------------------------------------------------
  FastCGI, Python script using      31       48       47       5       1
  ``flup`` library (version
  1.0.2) reading from
  ``/dev/zero``
                              23.5/9.8 14.3/8.5   16.1/8       -       -
  ----------------------------------------------------------------------

    
It should be mentioned that the ``flup`` library is not implemented in
the most efficient way, but even with some changes it doesn't get much
faster. It is fine for small amounts of data, but not for huge
transfers.

    
Other considered solutions
--------------------------

    
Another possible solution considered was to use ``socat`` like version 1
did. Due to the changing model, a large part of the code would've
required a rewrite anyway, while still not fixing all shortcomings. For
example, ``socat`` could still listen on only one protocol, IPv4 or
IPv6. Running two separate instances might have fixed that, but it'd get
more complicated. Using an existing HTTP server will provide us with a
number of other benefits as well, such as easier user separation between
server and backend.

    
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: