==================================
Design for import/export version 2
==================================

.. contents:: :depth: 4

Current state and shortcomings
------------------------------

Ganeti 2.2 introduced :doc:`inter-cluster instance moves <design-2.2>`
and replaced the import/export mechanism with the same technology. It
has since become apparent that the chosen implementation is too
complicated and can be difficult to debug.

The old implementation is henceforth called "version 1". It used
``socat`` in combination with a rather complex tree of ``bash`` and
Python utilities to move instances between clusters and import/export
them inside the cluster. Due to protocol limitations, the master daemon
starts a daemon on the involved nodes and then keeps polling a status
file for updates. A non-trivial number of timeouts ensures that jobs
don't freeze.

In version 1, the destination node would start a daemon listening on a
random TCP port. Upon receiving the destination information, the source
node would temporarily stop the instance, create snapshots, and start
exporting the data by connecting to the destination. The random TCP port
is chosen by the operating system by binding the socket to port 0.
While this is a somewhat elegant solution, it causes problems in setups
with restricted connectivity (e.g. iptables).

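The port 0 technique mentioned above can be illustrated with a minimal
sketch. It is not taken from the version 1 code (which used ``socat``);
the function name ``listen_on_ephemeral_port`` is hypothetical. It shows
why firewall rules are hard to write for this scheme: the port is only
known after the socket has been bound.

```python
import socket


def listen_on_ephemeral_port(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, chosen_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the operating system to pick any free port.
    sock.bind((host, 0))
    sock.listen(1)
    # Only getsockname() reveals which port was actually assigned.
    return sock, sock.getsockname()[1]
```

The chosen port then has to be communicated to the connecting side,
which is the "destination information" referred to above.
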
Another issue encountered was with dual-stack IPv6 setups. ``socat`` can
only listen on one protocol, IPv4 or IPv6, at a time. The connecting
node cannot simply resolve the DNS name, but must be told the exact
IP address.

Instance OS definitions can provide custom import/export scripts. They
worked well in the early days when a filesystem was usually created
directly on the block device. Around Ganeti 2.0 there was a transition
to using partitions on the block devices. Import/export scripts could
no longer use simple ``dump`` and ``restore`` commands, but usually
ended up doing raw data dumps.


Proposed changes
----------------

Unlike in version 1, in version 2 the destination node will connect to
the source, i.e. the active side is swapped. This design assumes the
following design documents have been implemented:

- :doc:`design-x509-ca`
- :doc:`design-http-server`

The following design is mostly targeted at inter-cluster instance
moves. Intra-cluster import and export use the same technology, but do
so in a less complicated way (e.g. reusing the node daemon certificate
as in version 1).

Support for instance OS import/export scripts, which have been in Ganeti
since the beginning, will be dropped with this design. Should the need
arise, they can be re-added later.


Software requirements
+++++++++++++++++++++

- HTTP client: cURL/pycURL (already used for inter-node RPC and RAPI
  client)
- Authentication: X509 certificates (server and client)


Transport
+++++++++

Instead of a home-grown, mostly raw protocol, the widely used HTTP
protocol will be used. Ganeti already uses HTTP for its :doc:`Remote API
<rapi>` and inter-node communication. Encryption and authentication will
be implemented using SSL and X509 certificates.


SSL certificates
++++++++++++++++

The source machine will identify connecting clients by their SSL
certificate. Unknown certificates will be refused.

Version 1 created a new self-signed certificate per instance
import/export, allowing the certificate to be used as a Certificate
Authority (CA). This worked by means of starting a new ``socat``
instance per instance import/export.

Under the version 2 model, a continuously running HTTP server will be
used. This rules out the use of self-signed certificates for
authentication, as the CA needs to be the same for all issued
certificates.

See the :doc:`separate design document for more details on how the
certificate authority will be implemented <design-x509-ca>`.

Local imports/exports will, like version 1, use the node daemon's
certificate/key. Doing so allows the verification of local connections.
The client's certificate can be exported to the CGI/FastCGI handler
using lighttpd's ``ssl.verifyclient.exportcert`` setting. If a
cluster-local import/export is being done, the handler verifies whether
the certificate used matches the local node daemon key.


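As an illustration of the cluster-local check described above, a handler
could compare the exported client certificate with the node daemon
certificate. This is only a sketch under assumptions: the environment
variable name ``SSL_CLIENT_CERT`` should be verified against the
lighttpd documentation, the comparison by SHA-256 fingerprint is one
possible approach, and both function names are hypothetical.

```python
import hashlib
import hmac
import ssl


def cert_fingerprint(pem_cert):
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def client_matches_node_cert(environ, node_cert_pem):
    """Check whether the client presented the local node daemon certificate.

    `environ` is the CGI environment; the variable name used below is an
    assumption about what the web server exports.
    """
    client_pem = environ.get("SSL_CLIENT_CERT")
    if not client_pem:
        return False
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(cert_fingerprint(client_pem),
                               cert_fingerprint(node_cert_pem))
```

In a CGI program this would be called as
``client_matches_node_cert(os.environ, node_cert_pem)``.
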
Source
++++++

The source can be the same physical machine as the destination, another
node in the same cluster, or a node in another cluster. A
physical-to-virtual migration mechanism could be implemented as an
alternative source.

In the case of a traditional import, the source is usually a file on the
source machine. For exports and remote imports, the source is an
instance's raw disk data. In all cases the transported data is opaque to
Ganeti.

All nodes of a cluster will run an instance of Lighttpd. The
configuration is automatically generated when starting Ganeti. The HTTP
server is configured to listen on IPv4 and IPv6 simultaneously.
Imports/exports will use a dedicated TCP port, similar to the Remote
API.

See the separate :ref:`HTTP server design document
<http-srv-shortcomings>` for why Ganeti's existing, built-in HTTP server
is not a good choice.

The source cluster is provided with an X509 Certificate Signing Request
(CSR) for a key private to the destination cluster.

After shutting down the instance, creating snapshots and restarting the
instance, the master will sign the destination's X509 certificate using
the :doc:`X509 CA <design-x509-ca>` once per instance disk. Instead of
using another identifier, the certificate's serial number (:ref:`never
reused <x509-ca-serial>`) and fingerprint are used to identify incoming
requests. Once ready, the master will call an RPC method on the source
node and provide it with the input information (e.g. file paths or block
devices) and the certificate identities.

The RPC method will write the identities to a place accessible by the
HTTP request handler, generate unique transfer IDs and return them to
the master. The transfer ID could be a filename containing the
certificate's serial number, fingerprint and some disk information. The
file containing the per-transfer information is signed using the node
daemon key, and the signature is written to a separate file.

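The per-transfer file and its detached signature could look roughly like
the following sketch. Everything in it is an assumption made for
illustration: the transfer ID layout, the JSON payload, the ``.json`` and
``.sig`` suffixes, and the use of HMAC-SHA256 with a shared key as a
stand-in for "signed using the node daemon key".

```python
import hashlib
import hmac
import json
import os


def make_transfer_id(serial, fingerprint, disk_index):
    """Build a filename-safe ID from certificate serial, fingerprint, disk."""
    return "%x-%s-disk%d" % (serial, fingerprint.replace(":", "").lower(),
                             disk_index)


def write_transfer_info(directory, transfer_id, info, key):
    """Write the info file plus a detached signature; return the info path."""
    payload = json.dumps(info, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    info_path = os.path.join(directory, transfer_id + ".json")
    with open(info_path, "wb") as fh:
        fh.write(payload)
    with open(info_path + ".sig", "w") as fh:
        fh.write(signature)
    return info_path


def read_transfer_info(info_path, key):
    """Verify the detached signature before trusting the file contents."""
    with open(info_path, "rb") as fh:
        payload = fh.read()
    with open(info_path + ".sig") as fh:
        signature = fh.read().strip()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch for %s" % info_path)
    return json.loads(payload.decode("utf-8"))
```

The request handler would call ``read_transfer_info`` and refuse to
serve any transfer whose information file fails verification.
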
Once everything is in place, the master sends the certificates, the data
and notification URLs (which include the transfer IDs) and the public
part of the source's CA to the job submitter. As in version 1,
everything will be signed using the cluster domain secret.

Upon receiving a request, the handler verifies the identity and
proceeds to stream the instance data. The serial number and fingerprint
contained in the transfer ID should be matched with the certificate
used. If a cluster-local import/export was requested, the remote's
certificate is verified with the local node daemon key. The signature of
the information file from which the handler takes the path of the block
device (and more) is verified using the local node daemon certificate.
There are two options for handling requests, :ref:`CGI
<lighttpd-cgi-opt>` and :ref:`FastCGI <lighttpd-fastcgi-opt>`.

To wait for all requests to finish, the master calls another RPC method.
The destination should notify the source once it's done with downloading
the data. Since this notification may never arrive (e.g. network
issues), an additional timeout needs to be used.

There is no good way to avoid polling as the HTTP requests will be
handled asynchronously in another process. If and when it is
implemented, :ref:`RPC feedback <rpc-feedback>` could be used to combine
the two RPC methods.

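The waiting step, polling combined with a timeout, can be sketched as
follows. This is an illustration only: ``wait_for_transfers`` and its
``is_done`` callback are hypothetical names, and how completion is
actually recorded (e.g. by the ``done`` notification handler) is left
open.

```python
import time


def wait_for_transfers(pending, is_done, timeout, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until every transfer in `pending` is done or `timeout` expires.

    Returns the set of transfer IDs that are still unfinished.
    """
    remaining = set(pending)
    deadline = clock() + timeout
    while remaining:
        # Drop every transfer the handler has marked as finished.
        remaining = {tid for tid in remaining if not is_done(tid)}
        if not remaining or clock() >= deadline:
            break
        sleep(interval)
    return remaining
```

A non-empty return value means the timeout was hit before all
notifications arrived.
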
Upon completion of the transfer requests, the instance is removed if
requested.


.. _lighttpd-cgi-opt:

Option 1: CGI
~~~~~~~~~~~~~

While easier to implement, this option requires the HTTP server to
either run as "root" or a so-called SUID binary to elevate the started
process to run as "root".

The export data can be sent directly to the HTTP server without any
further processing.


.. _lighttpd-fastcgi-opt:

Option 2: FastCGI
~~~~~~~~~~~~~~~~~

Unlike plain CGI, FastCGI scripts are run separately from the webserver.
The webserver talks to them via a Unix socket. Webserver and scripts can
run as separate users. Unlike for CGI, there are almost no bootstrap
costs attached to each request.

The FastCGI protocol requires data to be sent in length-prefixed
packets, something which wouldn't be very efficient to do in Python for
large amounts of data (instance imports/exports can be hundreds of
gigabytes). For this reason the proposal is to use a wrapper program
written in C (e.g. `fcgiwrap
<http://nginx.localdomain.pl/wiki/FcgiWrap>`_) and to write the handler
like an old-style CGI program with standard input/output. If data should
be copied from a file, ``cat``, ``dd`` or ``socat`` can be used (see
note about :ref:`sendfile(2)/splice(2) with Python <python-sendfile>`).

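To make the "length-prefixed packets" concrete, the sketch below frames
a response body as FastCGI ``FCGI_STDOUT`` records (an 8-byte header
followed by at most 65535 bytes of content). It is shown only to
illustrate the per-chunk work a pure Python handler would have to do;
the design proposes delegating this to a C wrapper instead. The function
name is hypothetical and padding is left at zero for simplicity.

```python
import struct

FCGI_VERSION_1 = 1
FCGI_STDOUT = 6
FCGI_MAX_CONTENT = 0xFFFF  # contentLength is a 16-bit field


def iter_stdout_records(request_id, data):
    """Yield FastCGI STDOUT records carrying `data`, then the end marker."""
    for offset in range(0, len(data), FCGI_MAX_CONTENT):
        chunk = data[offset:offset + FCGI_MAX_CONTENT]
        # Header: version, type, request ID, content length, padding
        # length, one reserved byte.
        header = struct.pack("!BBHHBx", FCGI_VERSION_1, FCGI_STDOUT,
                             request_id, len(chunk), 0)
        yield header + chunk
    # An empty record of the same type terminates the stream.
    yield struct.pack("!BBHHBx", FCGI_VERSION_1, FCGI_STDOUT,
                      request_id, 0, 0)
```

For a disk of several hundred gigabytes this loop would run millions of
times, which is the inefficiency the paragraph above refers to.
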
The bootstrap cost associated with starting a Python interpreter for
a disk export is expected to be negligible.

The `spawn-fcgi <http://cgit.stbuehler.de/gitosis/spawn-fcgi/about/>`_
program will be used to start the CGI wrapper as "root".

FastCGI is, in the author's opinion, the better choice as it allows user
separation. As a first implementation step the export handler can be run
as a standard CGI program. User separation can be implemented as a
second step.


Destination
+++++++++++

The destination can be the same physical machine as the source, another
node in the same cluster, or a node in another cluster. While not
considered in this design document, instances could be exported from the
cluster by implementing an external client for exports.

For traditional exports the destination is usually a file on the
destination machine. For imports and remote exports, the destination is
an instance's disks. All transported data is opaque to Ganeti.

Before an import can be started, an RSA key and corresponding
Certificate Signing Request (CSR) must be generated using the new opcode
``OpInstanceImportPrepare``. The returned information is signed using
the cluster domain secret. The RSA key backing the CSR must not leave
the destination cluster. After the CSR has been passed through a third
party, the source cluster will generate signed certificates from it.

Once the request for creating the instance arrives at the master daemon,
it'll create the instance and call an RPC method on the instance's
primary node to download all data. The RPC method does not return until
the transfer has completed or failed (see :ref:`EXP_SIZE_FD
<exp-size-fd>` and :ref:`RPC feedback <rpc-feedback>`).

The node will use pycURL to connect to the source machine and identify
itself with the signed certificate received. pycURL will be configured
to write directly to a file descriptor pointing to either a regular file
or block device. The file descriptor needs to point to the correct
offset for resuming downloads.

Using cURL's multi interface, more than one transfer can be made at the
same time. While parallel transfers are used by the version 1
import/export, it can be decided at a later time whether to use them in
version 2 too. More investigation is necessary to determine whether
``CURLOPT_MAXCONNECTS`` is enough to limit the number of connections or
whether more logic is necessary.

If a transfer fails before it's finished (e.g. timeout or network
issues), it should be retried using an exponential backoff delay. The
opcode submitter can specify for how long the transfer should be
retried.

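A retry loop with exponential backoff and an overall time limit might
look like the sketch below. The names, the default delays and the choice
of ``IOError`` as the retryable error are assumptions made for
illustration; the design does not specify them.

```python
import time


def retry_with_backoff(attempt, max_duration, initial_delay=1.0, factor=2.0,
                       max_delay=60.0, clock=time.monotonic,
                       sleep=time.sleep):
    """Call `attempt` until it succeeds or `max_duration` seconds elapse.

    The wait between attempts starts at `initial_delay` and is multiplied
    by `factor` after every failure, capped at `max_delay`.
    """
    deadline = clock() + max_duration
    delay = initial_delay
    while True:
        try:
            return attempt()
        except IOError:
            remaining = deadline - clock()
            if remaining <= 0:
                # Out of time: propagate the last error to the caller.
                raise
            sleep(min(delay, remaining))
            delay = min(delay * factor, max_delay)
```

Here ``max_duration`` corresponds to the retry period chosen by the
opcode submitter.
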
At the end of a transfer, successful or not, the source cluster must be
notified. At the same time the RSA key needs to be destroyed.

Support for HTTP proxies can be implemented by setting
``CURLOPT_PROXY``. Proxies could be used for moving instances in/out of
restricted network environments or across protocol borders (e.g. IPv4
networks unable to talk to IPv6 networks).


The big picture for instance moves
----------------------------------

#. ``OpInstanceImportPrepare`` (destination cluster)

   Create RSA key and CSR (certificate signing request), return signed
   with cluster domain secret.

#. ``OpBackupPrepare`` (source cluster)

   Becomes a no-op in version 2, but see :ref:`backwards-compat`.

#. ``OpBackupExport`` (source cluster)

   - Receives destination cluster's CSR, verifies signature using
     cluster domain secret.
   - Creates certificates using CSR and :doc:`cluster CA
     <design-x509-ca>`, one for each disk
   - Stop instance, create snapshots, start instance
   - Prepare HTTP resources on node
   - Send certificates, URLs and CA certificate to job submitter using
     feedback mechanism
   - Wait for all transfers to finish or fail (with timeout)
   - Remove snapshots

#. ``OpInstanceCreate`` (destination cluster)

   - Receives certificates signed by source cluster, verifies
     certificates and URLs using cluster domain secret

     Note that the parameters should be implemented in a generic way
     allowing future extensions, e.g. to download disk images from a
     public, remote server. The cluster domain secret allows Ganeti to
     check data received from a third party, but since this won't work
     with such extensions, other checks will have to be designed.

   - Create block devices
   - Download every disk from source, verified using remote's CA and
     authenticated using signed certificates
   - Destroy RSA key and certificates
   - Start instance

.. TODO: separate create from import?


.. _impexp2-http-resources:

HTTP resources on source
------------------------

The HTTP resources listed below will be made available by the source
machine. The transfer ID is generated while preparing the export and is
unique per disk and instance. No caching should be used, and the server
sets the ``Pragma`` (HTTP/1.0) and ``Cache-Control`` (HTTP/1.1) headers
accordingly.

``GET /transfers/[transfer_id]/contents``
  Dump disk contents. Important request headers:

  ``Accept`` (:rfc:`2616`, section 14.1)
    Specify preferred media types. Only one type is supported in the
    initial implementation:

    ``application/octet-stream``
      Request raw disk content.

    If support for more media types were to be implemented in the
    future, the "q" parameter used for "indicating a relative quality
    factor" needs to be used. In the meantime parameters need to be
    expected, but can be ignored.

    If support for OS scripts were to be re-added in the future, the
    MIME type ``application/x-ganeti-instance-export`` is hereby
    reserved for disk dumps using an export script.

    If the source cannot satisfy the request, the response status code
    will be 406 (Not Acceptable). Successful requests will specify the
    used media type using the ``Content-Type`` header. Unless only
    exactly one media type is requested, the client must handle the
    different response types.

  ``Accept-Encoding`` (:rfc:`2616`, section 14.3)
    Specify desired content coding. Supported are ``identity`` for
    uncompressed data, ``gzip`` for compressed data and ``*`` for any.
    The response will include a ``Content-Encoding`` header with the
    actual coding used. If the client specifies an unknown coding, the
    response status code will be 406 (Not Acceptable).

    If the client specifically needs compressed data (see
    :ref:`impexp2-compression`) but only gets ``identity``, it can
    either compress locally or abort the request.

  ``Range`` (:rfc:`2616`, section 14.35)
    Raw disk dumps can be resumed using this header (e.g. after a
    network issue).

    If this header was given in the request and the source supports
    resuming, the status code of the response will be 206 (Partial
    Content) and it'll include the ``Content-Range`` header as per
    :rfc:`2616`. If it does not support resuming or the request did not
    specify a range, the status code will be 200 (OK).

    Only a single byte range is supported. cURL does not support
    ``multipart/byteranges`` responses by itself. Even if they could be
    somehow implemented, doing so would be of doubtful benefit for
    import/export.

    For raw data dumps, handling ranges is straightforward: only the
    requested range is dumped.

    cURL will fail with the error code ``CURLE_RANGE_ERROR`` if a
    request included a range but the server can't handle it. The request
    must be retried without a range.

``POST /transfers/[transfer_id]/done``
  Use this resource to notify the source when the transfer is finished
  (even if not successful). The status code will be 204 (No Content).


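The ``Range`` handling described above implies a small amount of
client-side logic: build a single-range request header and, from the
response, work out at which offset the body starts. The following is a
sketch under assumptions; both function names are hypothetical and only
the single byte range form is handled.

```python
import re

# Matches e.g. "bytes 1024-2047/4096" or "bytes 1024-2047/*".
_CONTENT_RANGE_RE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def range_header(offset):
    """Return the Range header value for resuming at `offset`."""
    return "bytes=%d-" % offset


def body_start_offset(status, content_range, requested_offset):
    """Return the disk offset at which the response body must be written."""
    if status == 200:
        # The server ignored the range and sends the disk from the start.
        return 0
    if status == 206:
        match = _CONTENT_RANGE_RE.match(content_range or "")
        if not match:
            raise ValueError("missing or malformed Content-Range")
        first = int(match.group(1))
        if first != requested_offset:
            raise ValueError("server answered with a different range")
        return first
    raise ValueError("unexpected status code %d" % status)
```

The destination would seek its file descriptor to the returned offset
before writing, so that a 200 response after a failed resume does not
corrupt the target.
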
Code samples
------------

pycURL to file
++++++++++++++

.. highlight:: python

The following code sample shows how to write downloaded data directly to
a file without pumping it through Python::

  import pycurl

  # Open in binary mode; the response body is written to the file as-is.
  with open("googlecom.html", "wb") as fh:
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, "http://www.google.com/")
    curl.setopt(pycurl.WRITEDATA, fh)
    curl.perform()
    curl.close()

This works equally well if the file descriptor is a pipe to another
process.


.. _backwards-compat:

Backwards compatibility
-----------------------

.. _backwards-compat-v1:

Version 1
+++++++++

The old inter-cluster import/export implementation described in the
:doc:`Ganeti 2.2 design document <design-2.2>` will be supported for at
least one minor (2.x) release. Intra-cluster imports/exports will use
the new version right away.


.. _exp-size-fd:

``EXP_SIZE_FD``
+++++++++++++++

Together with the improved import/export infrastructure, Ganeti 2.2
allowed instance export scripts to report the expected data size. This
was then used to provide the user with an estimated remaining time.
Version 2 no longer supports OS import/export scripts and therefore
``EXP_SIZE_FD`` is no longer needed.


443 1375e1d9 Michael Hanselmann
.. _impexp2-compression:
444 1375e1d9 Michael Hanselmann
445 1375e1d9 Michael Hanselmann
Compression
446 1375e1d9 Michael Hanselmann
+++++++++++
447 1375e1d9 Michael Hanselmann
448 1375e1d9 Michael Hanselmann
Version 1 used explicit compression using ``gzip`` for transporting
449 1375e1d9 Michael Hanselmann
data, but the dumped files didn't use any compression. Version 2 will
450 1375e1d9 Michael Hanselmann
allow the destination to specify which encoding should be used. This way
451 1375e1d9 Michael Hanselmann
the transported data is already compressed and can be directly used by
452 1375e1d9 Michael Hanselmann
the client (see :ref:`impexp2-http-resources`). The cURL option
453 1375e1d9 Michael Hanselmann
``CURLOPT_ENCODING`` can be used to set the ``Accept-Encoding`` header.
454 1375e1d9 Michael Hanselmann
cURL will not decompress received data when
455 1375e1d9 Michael Hanselmann
``CURLOPT_HTTP_CONTENT_DECODING`` is set to zero (if another HTTP client
456 1375e1d9 Michael Hanselmann
library were used which doesn't support disabling transparent
457 1375e1d9 Michael Hanselmann
compression, a custom content-coding type could be defined, e.g.
458 1375e1d9 Michael Hanselmann
``x-ganeti-gzip``).
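
On the server side, honouring the destination's choice amounts to
evaluating the ``Accept-Encoding`` header. The following Python sketch
is not part of the design; it only illustrates the idea. The function
name ``choose_encoding`` and the list of supported codings are
assumptions made for this example, and the q-value handling is
simplified compared to :rfc:`2616`, section 14.3.

::

  def choose_encoding(accept_encoding,
                      supported=("x-ganeti-gzip", "gzip")):
      """Return the preferred supported content-coding.

      Falls back to "identity" (no compression) if the client accepts
      none of the supported codings.
      """
      best = None
      best_q = 0.0
      for item in accept_encoding.split(","):
          parts = [p.strip() for p in item.split(";")]
          coding = parts[0].lower()
          if not coding:
              continue
          q = 1.0  # default quality if no "q=" parameter is given
          for param in parts[1:]:
              if param.lower().startswith("q="):
                  try:
                      q = float(param[2:])
                  except ValueError:
                      q = 0.0  # treat malformed values as "not wanted"
          # Strict comparison: on equal quality the first one listed wins
          if coding in supported and q > best_q:
              best, best_q = coding, q
      return best or "identity"

A destination sending ``Accept-Encoding: gzip;q=0.5, x-ganeti-gzip``
would be served ``x-ganeti-gzip`` by this sketch, while a client sending
only ``deflate`` would receive uncompressed data.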


Notes
-----

The HTTP/1.1 protocol (:rfc:`2616`) defines trailing headers for chunked
transfers in section 3.6.1. This could be used to transfer a checksum at
the end of an import/export. cURL supports trailing headers since
version 7.14.1. Lighttpd doesn't seem to support them for FastCGI, but
they appear to be usable in combination with an NPH CGI (No Parsed
Headers).
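
To illustrate the wire format, the following Python 3 sketch encodes a
sequence of data blocks as a chunked body and appends a checksum as a
trailing header. It is an illustration only: the header name
``X-Ganeti-Checksum`` and the choice of SHA-1 are assumptions, not part
of this design. The response headers would additionally have to carry
``Transfer-Encoding: chunked`` and announce the trailer with
``Trailer: X-Ganeti-Checksum``.

::

  import hashlib


  def chunked_with_checksum(blocks, trailer="X-Ganeti-Checksum"):
      """Yield an HTTP/1.1 chunked body followed by a SHA-1 trailer."""
      digest = hashlib.sha1()
      for block in blocks:
          if not block:
              continue  # a zero-length chunk would terminate the body
          digest.update(block)
          # chunk = size in hex, CRLF, data, CRLF
          yield b"%x\r\n" % len(block) + block + b"\r\n"
      # last-chunk, then the trailing header, then the final CRLF
      yield b"0\r\n"
      yield ("%s: %s\r\n" % (trailer, digest.hexdigest())).encode("ascii")
      yield b"\r\n"

The receiver would compute the same digest over the decoded data and
compare it with the trailer value once the transfer has finished.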

.. _lighttp-sendfile:

Lighttpd allows FastCGI applications to send the special headers
``X-Sendfile`` and ``X-Sendfile2`` (the latter with a range). Using
these headers, applications can send response headers and tell the
webserver to serve a regular file stored on the file system as the
response body. The webserver will then take care of sending that file.
Unfortunately this mechanism is restricted to regular files and cannot
be used for data from programs, neither directly nor via named pipes,
without writing to a file first. The latter is not an option as instance
data can be very large. Theoretically ``X-Sendfile`` could be used for
sending the input for a file-based instance import, but that'd require
the webserver to run as "root".

.. _python-sendfile:

Python does not include interfaces for the ``sendfile(2)`` or
``splice(2)`` system calls. The latter can be useful for faster copying
of data between file descriptors. There are some third-party modules
(e.g. http://pypi.python.org/pypi/py-sendfile/) and discussions
(http://bugs.python.org/issue10882) for including support for
``sendfile(2)``, but the latter is certainly not going to happen for the
Python versions supported by Ganeti. Calling the function using the
``ctypes`` module might be possible.
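
A rough sketch of the ``ctypes`` approach is shown below. It is
untested in the context of Ganeti and rests on several assumptions: it
targets Linux only, it assumes that ``off_t`` has the same size as a C
``long`` for the plain ``sendfile`` symbol, and the helper name
``copy_with_sendfile`` is made up for this example.

::

  import ctypes
  import os

  # Load the symbols of the running process, which include the C library
  _libc = ctypes.CDLL(None, use_errno=True)

  # ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
  _sendfile = _libc.sendfile
  _sendfile.argtypes = [ctypes.c_int, ctypes.c_int,
                        ctypes.POINTER(ctypes.c_long), ctypes.c_size_t]
  _sendfile.restype = ctypes.c_ssize_t


  def copy_with_sendfile(out_fd, in_fd, count, blocksize=1 << 20):
      """Copy up to count bytes from the start of in_fd to out_fd.

      Returns the number of bytes actually copied.
      """
      offset = ctypes.c_long(0)  # updated by the kernel on each call
      remaining = count
      while remaining > 0:
          sent = _sendfile(out_fd, in_fd, ctypes.byref(offset),
                           min(remaining, blocksize))
          if sent < 0:
              err = ctypes.get_errno()
              raise OSError(err, os.strerror(err))
          if sent == 0:
              break  # end of input reached
          remaining -= sent
      return count - remaining

As with ``X-Sendfile``, the input descriptor has to refer to something
like a regular file; the sketch does not help with data coming from a
program or a pipe.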


Performance considerations
--------------------------

The design described above was confirmed to be one of the better choices
in terms of download performance with bigger block sizes. All numbers
were gathered on the same physical machine with a single CPU and 1 GB of
RAM while downloading 2 GB of zeros read from ``/dev/zero``. ``wget``
(version 1.10.2) was used as the client, ``lighttpd`` (version 1.4.28)
as the server. The first line of each row gives the throughput in
megabytes per second. The second line of each row gives the CPU time
spent in userland and system, respectively (measured for the CGI/FastCGI
program using ``time -v``).

::

  ----------------------------------------------------------------------
  Block size                      4 KB    64 KB   128 KB    1 MB    4 MB
  ======================================================================
  Plain CGI script reading          83      174      180     122     120
  from ``/dev/zero``
                               0.6/3.9  0.1/2.4  0.1/2.2 0.0/1.9 0.0/2.1
  ----------------------------------------------------------------------
  FastCGI with ``fcgiwrap``,        86      167      170     177     174
  ``dd`` reading from
  ``/dev/zero``                  1.1/5  0.5/2.9  0.5/2.7 0.7/3.1 0.7/2.8
  ----------------------------------------------------------------------
  FastCGI with ``fcgiwrap``,        68      146      150     170     170
  Python script copying from
  ``/dev/zero`` to stdout
                               1.3/5.1  0.8/3.7  0.7/3.3  0.9/2.9  0.8/3
  ----------------------------------------------------------------------
  FastCGI, Python script using      31       48       47       5       1
  ``flup`` library (version
  1.0.2) reading from
  ``/dev/zero``
                              23.5/9.8 14.3/8.5   16.1/8       -       -
  ----------------------------------------------------------------------


It should be mentioned that the ``flup`` library is not implemented in
the most efficient way, but even with some changes it doesn't get much
faster. It is fine for small amounts of data, but not for huge
transfers.
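
The scripts used for the measurements are not part of this document.
As an illustration of what the "Python script copying from
``/dev/zero`` to stdout" case involves, a block-wise copy loop could
look like the following sketch. The function name ``copy_blocks`` and
its parameters are assumptions for this example, not the code that was
benchmarked.

::

  def copy_blocks(src, dst, blocksize, total):
      """Copy total bytes from src to dst in reads of at most blocksize.

      Returns the number of bytes copied, which is smaller than total
      if the source is exhausted early.
      """
      copied = 0
      while copied < total:
          # A read may return fewer bytes than requested, hence the loop
          data = src.read(min(blocksize, total - copied))
          if not data:
              break
          dst.write(data)
          copied += len(data)
      return copied

In a CGI or ``fcgiwrap`` setting such a function would be called with
``/dev/zero`` opened in binary mode as the source and the binary
standard output as the destination, after the response headers have
been written. The block size corresponds to the columns of the table
above.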


Other considered solutions
--------------------------

Another possible solution considered was to use ``socat`` like version 1
did. Due to the changing model, a large part of the code would've
required a rewrite anyway, while still not fixing all shortcomings. For
example, ``socat`` could still listen on only one protocol, IPv4 or
IPv6. Running two separate instances might have fixed that, but it'd get
more complicated. Using an existing HTTP server will provide us with a
number of other benefits as well, such as easier user separation between
server and backend.


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: