.. _pithos:

File/Object Storage Service (Pithos)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Overview
========

Pithos is the Object/File Storage component of Synnefo. Users upload files to
Pithos using the Web UI, the command-line client, or native syncing clients.
It is a thin layer mapping user files to content-addressable blocks, which are
then stored on a storage backend. Files are split into blocks of fixed size,
and each block is hashed independently to produce a unique identifier, so each
file is represented by a sequence of block names (a hashmap). In this way,
Pithos deduplicates file data: blocks shared among files are stored only once.

The current implementation uses 4 MB blocks hashed with SHA256. Content-based
addressing also enables efficient two-way file syncing for all Pithos clients
(e.g. the ``kamaki`` command-line client or the native Windows/Mac OS
clients). To upload an updated version of a file, the client hashes all blocks
of the file and asks the server to create a new version for this block
sequence. The server returns an error reply listing the missing blocks. The
client may then upload each block one by one and retry the file creation.
Similarly, whenever a file has changed on the server, the client can ask for
its list of blocks and download only the modified ones.

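The block-splitting scheme can be sketched in a few lines of Python. The 4 MB
block size and SHA256 hash follow the current implementation described above;
the helper name is ours:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks, as in the current implementation

def compute_hashmap(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks and hash each independently."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

# Two files sharing their first block need to store that block only once.
a = compute_hashmap(b"x" * BLOCK_SIZE + b"tail-a")
b = compute_hashmap(b"x" * BLOCK_SIZE + b"tail-b")
assert a[0] == b[0]   # identical block, identical hash: deduplicated
assert a[1] != b[1]   # differing tails produce distinct blocks
```
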
Pithos runs at the cloud layer and exposes the OpenStack Object Storage API to
the outside world, with custom extensions for syncing. Any client that speaks
to OpenStack Swift can also be used to store objects in a Pithos deployment.
The process of mapping user files to hashed objects is independent of the
actual storage backend, which the administrator selects through pluggable
drivers. Currently, Pithos has drivers for two storage backends:

* files on a shared filesystem, e.g., NFS, Lustre, GPFS or GlusterFS
* objects on a Ceph/RADOS cluster.

Whatever the storage backend, it is responsible for storing objects reliably,
without any connection to the cloud APIs or to the hashing operations.


OpenStack extensions
====================

The major extensions to the OpenStack API are:

* The use of block-based storage in lieu of an object-based one.
  OpenStack stores objects, which may be files, but this is not
  necessary; large files (larger than 5 GB), for instance, must be
  stored as a series of distinct objects accompanied by a manifest.
  Pithos stores blocks, so objects can be of unlimited size.
* Permissions on individual files and folders. Note that folders
  do not exist in the OpenStack API, but are simulated by
  appropriate conventions, an approach we have kept in Pithos to
  avoid incompatibility.
* Fully-versioned objects.
* Metadata-based queries. Users are free to set metadata on their
  objects, and they can list objects meeting metadata criteria.
* Policies, such as whether to enable object versioning and whether to
  enforce quotas. This is particularly important for shared object
  containers, since a user may want to avoid running out of space
  because collaborators write to the shared storage.
* Partial upload and download based on HTTP request
  headers and parameters.
* Object updates, where the data may even come from other objects
  already stored in Pithos. This allows users to compose objects from
  other objects without uploading data.
* UUIDs assigned to all objects on creation, which can be
  used to reference them regardless of their path location.

Pithos Design
=============

Pithos is built on a layered architecture. The Pithos server speaks HTTP with
the outside world. The HTTP operations implement an extended OpenStack Object
Storage API. The back end is a library meant to be used by internal code and
other front ends. For instance, apart from implementing the OpenStack Object
Storage API in Pithos, the back end library is also used in our implementation
of the OpenStack Image Service API. Moreover, the back end library allows the
specification of different namespaces for metadata, so that the same object
can be viewed by different front end APIs with different sets of metadata.
Hence the same object can be viewed as a file in Pithos, with one set of
metadata, or as an image with a different set of metadata, in our
implementation of the OpenStack Image Service.

The data component provides storage of blocks and the information needed to
retrieve them, while the metadata component is a database of nodes and
permissions. In the current implementation, data is saved to the filesystem
and metadata to an SQL database.

Block-based Storage for the Client
----------------------------------

Since an object is saved as a set of blocks in Pithos, object
operations are no longer required to refer to the whole object. We can
handle parts of objects as needed when uploading, downloading, or
copying and moving data.

In particular, a client, provided it has access permissions, can
download data from Pithos by issuing a ``GET`` request on an
object. If the request includes the ``hashmap`` parameter, then the
request refers to a hashmap, that is, a set containing the
object's block hashes. The reply is of the form::

  {"block_hash": "sha1",
   "hashes": ["7295c41da03d7f916440b98e32c4a2a39351546c", ...],
   "block_size": 131072,
   "bytes": 242}

The client can then compare this hashmap with the hashmap computed from the
local file. Any missing parts can be downloaded with ``GET`` requests carrying
an additional ``Range`` header that contains the hashes of the blocks to be
retrieved. The integrity of the file can be checked against the
``X-Object-Hash`` header, returned by the server and containing the root
Merkle hash of the object's hashmap.

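Given such a reply, a syncing client only needs to fetch the blocks it does
not already hold somewhere in the local file. A minimal sketch, with
placeholder hash values:

```python
def blocks_to_download(server_hashes, local_hashes):
    """Blocks listed by the server that appear nowhere in the local file."""
    have = set(local_hashes)
    return [(i, h) for i, h in enumerate(server_hashes) if h not in have]

# Hypothetical hashmaps: only the middle block changed on the server.
server = ["aa", "bb", "cc"]
local = ["aa", "xx", "cc"]
assert blocks_to_download(server, local) == [(1, "bb")]
```
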
When uploading a file to Pithos, only the missing blocks are submitted to the
server, using the following algorithm:

* Calculate the hash value for each block of the object to be
  uploaded.
* Send a hashmap ``PUT`` request for the object. This is a
  ``PUT`` request with a ``hashmap`` request parameter appended
  to it. If the parameter is not present, the object's data (or part
  of it) is provided with the request. If the parameter is present,
  the object's hashmap is provided with the request.
* If the server responds with status 201 (Created), the blocks are
  already on the server and we do not need to do anything more.
* If the server responds with status 409 (Conflict), the server's
  response body contains the hashes of the blocks that do not exist on
  the server. Then, for each hash value in the server's response (or for
  all hashes together), send a ``POST`` request to the server with the
  block's data.

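The steps above can be sketched as a client-side loop. Here ``put_hashmap``
and ``post_block`` stand in for the real HTTP calls (a ``PUT`` with the
``hashmap`` parameter and a ``POST`` of block data); they are illustrative
assumptions, not the actual kamaki or Pithos API:

```python
def upload(put_hashmap, post_block, blocks):
    """blocks: list of (hash, data) pairs for the object being uploaded."""
    hashes = [h for h, _ in blocks]
    status, missing = put_hashmap(hashes)     # hashmap PUT
    if status == 201:                         # Created: nothing more to do
        return
    assert status == 409                      # Conflict: body lists missing hashes
    data = dict(blocks)
    for h in missing:
        post_block(h, data[h])                # send only what the server lacks
    status, _ = put_hashmap(hashes)           # retry the hashmap PUT
    assert status == 201

# A toy in-memory "server" to exercise the loop: block "aa" is already stored.
known, posted = {"aa"}, []
def fake_put(hashes):
    missing = [h for h in hashes if h not in known]
    return (201, []) if not missing else (409, missing)
def fake_post(h, data):
    known.add(h)
    posted.append(h)

upload(fake_put, fake_post, [("aa", b"one"), ("bb", b"two")])
assert posted == ["bb"]   # only the missing block was transferred
```
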
In effect, we deduplicate data based on their block hashes, transparently to
the users. This results in perceived instantaneous uploads when the material
is already present in Pithos storage.

Block-based Storage Processing
------------------------------

Hashmaps themselves are saved in blocks. All blocks are persisted to storage
using content-based addressing. It follows that to read a file, Pithos
performs the following operations:

* The client issues a request to get a file, via HTTP ``GET``.
* The API front end asks the back end for the metadata
  of the object.
* The back end checks the permissions of the object and, if they
  allow access to it, returns the object's metadata.
* The front end evaluates any HTTP headers (such as
  ``If-Modified-Since``, ``If-Match``, etc.).
* If the preconditions are met, the API front end requests
  the object's hashmap from the back end (hashmaps are indexed by the
  full path).
* The back end reads the object's hashmap from the underlying storage
  and returns it to the API front end.
* Depending on the HTTP ``Range`` header, the
  API front end asks the back end for the required blocks, giving
  their corresponding hashes.
* The back end fetches the blocks from the underlying storage and
  passes them to the API front end, which returns them to the client.

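The read path above can be condensed into a sketch, with the metadata
database, the hashmap store, and the block store modeled as plain
dictionaries. All names here are illustrative, not the real back end API;
range handling returns the covering blocks, and trimming to the exact byte
range is omitted:

```python
def get_object(meta_db, hashmaps, blockstore, user, path, byte_range=None):
    meta = meta_db[path]                      # back end returns the metadata
    if user not in meta["readers"]:           # permission check
        raise PermissionError(path)
    hashes = hashmaps[path]                   # hashmaps are indexed by full path
    if byte_range is not None:                # honour an HTTP Range header
        start, stop = byte_range              # bytes [start, stop)
        bs = meta["block_size"]
        hashes = hashes[start // bs:(stop - 1) // bs + 1]
    return b"".join(blockstore[h] for h in hashes)

blockstore = {"h0": b"aaaa", "h1": b"bbbb", "h2": b"cc"}
meta_db = {"acct/cont/obj": {"readers": {"alice"}, "block_size": 4}}
hashmaps = {"acct/cont/obj": ["h0", "h1", "h2"]}
assert get_object(meta_db, hashmaps, blockstore, "alice", "acct/cont/obj") == b"aaaabbbbcc"
assert get_object(meta_db, hashmaps, blockstore, "alice", "acct/cont/obj", (4, 8)) == b"bbbb"
```
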
Saving data from the client to the server is done in several different ways.

First, a regular HTTP ``PUT`` is the reverse of the HTTP ``GET``. The client
sends the full object to the API front end. The API front end splits the
object into blocks. It sends each block to the back end, which calculates its
hash and saves it to storage. When the hashmap is complete, the API front end
commands the back end to create a new object with the created hashmap and any
associated metadata.

Second, the client may send to the API front end a hashmap and any associated
metadata, with a specially formatted HTTP ``PUT`` that uses an appropriate URL
parameter. In this case, if the back end can find the requested blocks, the
object is created as before; otherwise the back end reports the list of
missing blocks, which is passed back to the client. The client may then send
the missing blocks by issuing an HTTP ``POST`` and retry the HTTP ``PUT`` for
the hashmap. This allows for very fast uploads, since no real data upload
takes place if the blocks are already in data storage.

Copying objects does not involve data copying, but is performed by associating
the object's hashmap with the new path. Moving objects, as in OpenStack, is a
copy followed by a delete, again with no real data being moved.

Updates to an existing object, which are not offered by OpenStack, are
implemented by issuing an HTTP ``POST`` request that includes the offset and
the length of the data. The API front end requests from the back end the
hashmap of the existing object. Depending on the offset of the update (whether
it falls within block boundaries or not), the front end asks the back end to
update or create new blocks. At the end, the front end saves the updated
hashmap. It is also possible to pass a parameter to the HTTP ``POST`` to
specify that the data will come from another object, instead of being uploaded
by the client.

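The block arithmetic behind such an update can be sketched as follows. This
is a simplified model that stays within the object's current size; growing
the object and the exact server-side API are out of scope, and the helper
names are ours:

```python
import hashlib

def apply_update(hashmap, blockstore, offset, data, block_size):
    """Splice `data` into the object at `offset`, rewriting affected blocks."""
    end = offset + len(data)
    first, last = offset // block_size, (end - 1) // block_size
    # Reassemble the affected region, splice in the new data, re-split it.
    region = b"".join(blockstore[hashmap[i]] for i in range(first, last + 1))
    lo = offset - first * block_size
    region = region[:lo] + data + region[lo + len(data):]
    for i in range(first, last + 1):
        chunk = region[(i - first) * block_size:(i - first + 1) * block_size]
        h = hashlib.sha256(chunk).hexdigest()
        blockstore[h] = chunk                 # content-addressed: store by hash
        hashmap[i] = h
    return hashmap

# A tiny object with 4-byte blocks for illustration.
blocks = {}
def put(c):
    h = hashlib.sha256(c).hexdigest()
    blocks[h] = c
    return h

hm = [put(b"aaaa"), put(b"bbbb")]
apply_update(hm, blocks, 2, b"XY", 4)         # falls within block 0
assert blocks[hm[0]] == b"aaXY" and blocks[hm[1]] == b"bbbb"
apply_update(hm, blocks, 3, b"ZZ", 4)         # straddles blocks 0 and 1
assert blocks[hm[0]] == b"aaXZ" and blocks[hm[1]] == b"Zbbb"
```

An update that only touches one block leaves the other block hashes, and thus
the stored blocks, untouched.
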
Pithos Back End Nodes
---------------------

Pithos organizes entities in a tree hierarchy, with one tree node per path
entry (see Figure). Nodes can be accounts, containers, and objects. A user may
have multiple accounts, each account may have multiple containers, and each
container may have multiple objects. An object may have multiple versions, and
each version of an object has properties (a set of fixed metadata, like size
and mtime) and arbitrary metadata.

.. image:: images/pithos-backend-nodes.png

The tree hierarchy has up to three levels, since, following the OpenStack API,
everything is stored as an object in a container. The notion of folders or
directories is implemented through conventions that simulate
pseudo-hierarchical folders. In particular, object names that contain the
forward slash character, together with an accompanying marker object that has
``Content-Type: application/directory`` in its metadata, can be treated as
directories by Pithos clients. Each node corresponds to a unique path, and we
keep its parent in the account/container/object hierarchy (that is, all
objects have a container as their parent).

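The pseudo-directory convention can be illustrated in a few lines; the object
names and helper functions below are ours, not part of the API:

```python
# Objects within a container: a slash-named object plus a marker object
# typed application/directory form a pseudo-directory.
objects = {
    "photos":          {"Content-Type": "application/directory"},  # marker
    "photos/2024.jpg": {"Content-Type": "image/jpeg"},
    "notes.txt":       {"Content-Type": "text/plain"},
}

def is_directory(name):
    return objects.get(name, {}).get("Content-Type") == "application/directory"

def list_dir(name):
    """Objects 'inside' a pseudo-directory, selected by name prefix."""
    prefix = name + "/"
    return [o for o in objects if o.startswith(prefix)]

assert is_directory("photos") and not is_directory("notes.txt")
assert list_dir("photos") == ["photos/2024.jpg"]
```
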
Pithos Back End Versions
------------------------

For each object version we keep the root Merkle hash of the object it refers
to, the size of the object, the last modification time, the user that
modified the file, and its cluster. A version belongs to one of the following
three clusters (see Figure):

* normal, which contains the current versions
* history, which contains the previous versions of an object
* deleted, which contains objects that have been deleted

.. image:: images/pithos-backend-versions.png

This versioning allows Pithos to offer its users time-based listings of the
contents of their accounts. In effect, it also allows them to take their
containers back in time. Conceptually, this is implemented by taking a
vertical line in the Figure and presenting to the user the state on the left
side of the line.

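Taking that vertical line amounts to selecting, per object, the most recent
version whose modification time is not after the chosen point in time. A
sketch under the assumption that a version in the deleted cluster hides the
object; the tuple layout is illustrative, not the real schema:

```python
def state_at(versions, t):
    """versions: list of (mtime, cluster, root_hash) tuples, oldest first."""
    visible = [v for v in versions if v[0] <= t]
    if not visible:
        return None                # object did not exist yet at time t
    mtime, cluster, root = visible[-1]
    return None if cluster == "deleted" else root

history = [(10, "history", "h1"), (20, "history", "h2"), (30, "normal", "h3")]
assert state_at(history, 25) == "h2"   # the version current at t=25
assert state_at(history, 35) == "h3"   # the current (normal) version
assert state_at(history, 5) is None    # before the object existed
```
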
Pithos Back End Permissions
---------------------------

Pithos recognizes read and write permissions, which can be granted to
individual users or groups of users. Groups are collections of users, created
at the account level by users themselves, and are flat: a group cannot
contain or reference another group. Ownership of a file cannot be delegated.

Pithos also recognizes a "public" permission, which means that the object is
readable by all. When an object is made public, it is assigned a URL that can
be used to access the object from outside Pithos, even by non-Pithos users.

Permissions can be assigned to objects, which may be actual files or
directories. When listing objects, the back end uses the permissions as
filters for what to display, so that users see only objects to which they
have access. Depending on the type of the object, the filter may be exact
(plain object) or a prefix (like ``path/*`` for a directory). When accessing
objects, the same rules are used to decide whether to allow the user to read
or modify the object or directory. If no permissions apply to a specific
object, the back end searches for permissions on the closest directory
sharing a common prefix with the object.
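
The fallback rule can be sketched as a walk up the path: if an object carries
no permissions of its own, try the closest directory sharing a common prefix.
The permission encoding below is an illustrative assumption:

```python
def effective_perms(perms, path):
    """perms maps object or directory paths to permission sets."""
    while True:
        if path in perms:
            return perms[path]
        if "/" not in path:
            return set()                  # nothing applies anywhere
        path = path.rsplit("/", 1)[0]     # try the parent directory next

perms = {"docs": {"read:alice"}, "docs/draft.txt": {"read:bob"}}
assert effective_perms(perms, "docs/draft.txt") == {"read:bob"}
assert effective_perms(perms, "docs/plan.txt") == {"read:alice"}  # inherited
assert effective_perms(perms, "misc/x") == set()
```
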