.. _pithos:
2 305dbce0 Constantinos Venetsanopoulos
3 305dbce0 Constantinos Venetsanopoulos
File/Object Storage Service (Pithos)
4 305dbce0 Constantinos Venetsanopoulos
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Overview
========

Pithos is the Object/File Storage component of Synnefo. Users upload files to
Pithos using either the Web UI, the command-line client, or native syncing
clients. It is a thin layer mapping user files to content-addressable blocks,
which are then stored on a storage backend. Files are split into blocks of
fixed size, which are hashed independently to create a unique identifier for
each block, so each file is represented by a sequence of block names (a
hashmap). This way, Pithos provides deduplication of file data; blocks
shared among files are only stored once.

The current implementation uses 4MB blocks hashed with SHA256. Content-based
addressing also enables efficient two-way file syncing that can be used by all
Pithos clients (e.g. the ``kamaki`` command-line client or the native
Windows/Mac OS clients). Whenever someone wishes to upload an updated version
of a file, the client hashes all blocks of the file and then asks the server
to create a new version for this block sequence. If any of the blocks are not
yet stored, the server returns an error reply with a list of the missing
blocks. The client may then upload each missing block one by one, and retry
file creation. Similarly, whenever a file has been changed on the server, the
client can ask for its list of blocks and only download the modified ones.
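
As an illustration of the block hashing described above, the following minimal
sketch splits a byte string into fixed-size blocks and hashes each block with
SHA256. The names ``BLOCK_SIZE`` and ``compute_hashmap`` are hypothetical and
are not taken from the Pithos code base; the actual implementation may apply
additional processing to each block before hashing it.

.. code-block:: python

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # 4MB, as in the current implementation


    def compute_hashmap(data, block_size=BLOCK_SIZE):
        """Return the hex SHA256 digests of consecutive fixed-size blocks."""
        return [
            hashlib.sha256(data[offset:offset + block_size]).hexdigest()
            for offset in range(0, len(data), block_size)
        ]


    # Two files that share their first block also share its hash,
    # so that block needs to be stored only once.
    map_a = compute_hashmap(b"a" * 8 + b"b" * 8, block_size=8)
    map_b = compute_hashmap(b"a" * 8 + b"c" * 8, block_size=8)
    assert map_a[0] == map_b[0] and map_a[1] != map_b[1]

A tiny block size is used in the example only to keep it readable.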

Pithos runs at the cloud layer and exposes the OpenStack Object Storage API to
the outside world, with custom extensions for syncing. Any client speaking to
OpenStack Swift can also be used to store objects in a Pithos deployment. The
process of mapping user files to hashed objects is independent of the actual
storage backend, which is selectable by the administrator using pluggable
drivers. Currently, Pithos has drivers for two storage backends:

 * files on a shared filesystem, e.g., NFS, Lustre, GPFS or GlusterFS
 * objects on a Ceph/RADOS cluster.

Whatever the storage backend, it is responsible for storing objects reliably,
without any connection to the cloud APIs or to the hashing operations.

OpenStack extensions
====================

The major extensions to the OpenStack API are:

* The use of block-based storage in lieu of an object-based one.
  OpenStack stores objects, which may be whole files but need not
  be: large files (larger than 5GB), for instance, must be
  stored as a series of distinct objects accompanied by a manifest.
  Pithos stores blocks, so objects can be of unlimited size.
* Permissions on individual files and folders. Note that folders
  do not exist in the OpenStack API, but are simulated by
  appropriate conventions, an approach we have kept in Pithos to
  avoid incompatibility.
* Fully-versioned objects.
* Metadata-based queries. Users are free to set metadata on their
  objects, and they can list objects meeting metadata criteria.
* Policies, such as whether to enable object versioning and whether
  to enforce quotas. This is particularly important for sharing object
  containers, since the user may want to avoid running out of space
  because of collaborators writing in the shared storage.
* Partial upload and download based on HTTP request
  headers and parameters.
* Object updates, where data may even come from other objects
  already stored in Pithos. This allows users to compose objects from
  other objects without uploading data.
* All objects are assigned UUIDs on creation, which can be
  used to reference them regardless of their path location.

Pithos Design
=============

Pithos is built on a layered architecture. The Pithos server speaks HTTP with
the outside world. The HTTP operations implement an extended OpenStack Object
Storage API.  The back end is a library meant to be used by internal code and
other front ends. For instance, the back end library, apart from being used in
Pithos for implementing the OpenStack Object Storage API, is also used in our
implementation of the OpenStack Image Service API. Moreover, the back end
library allows specification of different namespaces for metadata, so that the
same object can be viewed by different front end APIs with different sets of
metadata. Hence the same object can be viewed as a file in Pithos, with one set
of metadata, or as an image with a different set of metadata, in our
implementation of the OpenStack Image Service.

The data component provides storage of blocks and the information needed to
retrieve them, while the metadata component is a database of nodes and
permissions. In the current implementation, data is saved to the filesystem and
metadata in an SQL database.

Block-based Storage for the Client
----------------------------------

Since an object is saved as a set of blocks in Pithos, object
operations are no longer required to refer to the whole object. We can
handle parts of objects as needed when uploading, downloading, or
copying and moving data.

In particular, a client, provided it has access permissions, can
download data from Pithos by issuing a ``GET`` request on an
object. If the request includes the ``hashmap`` parameter, then the
request refers to a hashmap, that is, the ordered list of the
object's block hashes. The reply is of the form::

    {"block_hash": "sha1",
     "hashes": ["7295c41da03d7f916440b98e32c4a2a39351546c", ...],
     "block_size": 131072,
     "bytes": 242}

The client can then compare the hashmap with the hashmap computed from the
local file. Any missing parts can be downloaded with ``GET`` requests with an
additional ``Range`` header selecting the byte ranges of the blocks to be
retrieved. The integrity of the file can be checked against the
``X-Object-Hash`` header, returned by the server and containing the root Merkle
hash of the object's hashmap.
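
The comparison step can be sketched as follows. This is only an illustration:
it assumes that the ``Range`` header carries standard HTTP byte ranges, that
the reply has already been parsed into a dictionary with the keys shown above,
and that blocks are compared position by position. The function names
``missing_ranges`` and ``range_header`` are hypothetical.

.. code-block:: python

    def missing_ranges(remote, local_hashes):
        """Return inclusive (first_byte, last_byte) pairs to download."""
        block_size = remote["block_size"]
        total = remote["bytes"]
        ranges = []
        for index, block_hash in enumerate(remote["hashes"]):
            if index < len(local_hashes) and local_hashes[index] == block_hash:
                continue  # this block is already present locally
            start = index * block_size
            end = min(start + block_size, total) - 1  # last block may be short
            ranges.append((start, end))
        return ranges


    def range_header(ranges):
        """Format byte ranges as the value of an HTTP Range header."""
        return "bytes=" + ",".join("%d-%d" % pair for pair in ranges)


    remote = {"block_hash": "sha1", "hashes": ["h0", "h1", "h2"],
              "block_size": 4, "bytes": 10}
    print(range_header(missing_ranges(remote, ["h0", "xx"])))  # bytes=4-7,8-9

The placeholder hashes ``h0``, ``h1`` and ``h2`` stand in for real digests.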

When uploading a file to Pithos, only the missing blocks will be submitted to
the server, with the following algorithm:

* Calculate the hash value for each block of the object to be
  uploaded.
* Send a hashmap ``PUT`` request for the object. This is a
  ``PUT`` request with a ``hashmap`` request parameter appended
  to it. If the parameter is not present, the object's data (or part
  of it) is provided with the request. If the parameter is present,
  the object hashmap is provided with the request.
* If the server responds with status 201 (Created), the blocks are
  already on the server and we do not need to do anything more.
* If the server responds with status 409 (Conflict), the server's
  response body contains the hashes of the blocks that do not exist on
  the server. Then, for each hash value in the server's response (or all
  hashes together) send a ``POST`` request to the server with the
  block's data, and retry the hashmap ``PUT``.
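
The steps above can be sketched against an in-memory stand-in for the server.
Everything in this example is hypothetical: ``FakeServer``, its methods and
``upload`` only mimic the 201/409 exchange and do not reproduce the real
Pithos HTTP interface or its block hashing details.

.. code-block:: python

    import hashlib


    class FakeServer:
        """Hypothetical in-memory stand-in for the Pithos HTTP API."""

        def __init__(self):
            self.blocks = {}   # block hash -> block data
            self.objects = {}  # object path -> list of block hashes

        def put_hashmap(self, path, hashes):
            # dict.fromkeys() drops repeated hashes while keeping order.
            missing = [h for h in dict.fromkeys(hashes)
                       if h not in self.blocks]
            if missing:
                return 409, missing  # Conflict: report the missing blocks
            self.objects[path] = list(hashes)
            return 201, []           # Created

        def post_block(self, data):
            self.blocks[hashlib.sha256(data).hexdigest()] = data


    def upload(server, path, data, block_size):
        """Upload data and return the number of blocks actually sent."""
        blocks = [data[i:i + block_size]
                  for i in range(0, len(data), block_size)]
        hashes = [hashlib.sha256(b).hexdigest() for b in blocks]
        status, missing = server.put_hashmap(path, hashes)
        if status == 201:
            return 0  # every block was already stored
        by_hash = dict(zip(hashes, blocks))
        for block_hash in missing:
            server.post_block(by_hash[block_hash])
        status, _ = server.put_hashmap(path, hashes)  # retry the hashmap PUT
        assert status == 201
        return len(missing)


    server = FakeServer()
    print(upload(server, "pithos/a.txt", b"aaaabbbb", 4))  # 2
    print(upload(server, "pithos/b.txt", b"aaaacccc", 4))  # 1
    print(upload(server, "pithos/c.txt", b"ccccbbbb", 4))  # 0

The third upload transfers no block data at all, because both of its blocks
were stored by the earlier uploads.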

In effect, we are deduplicating data based on their block hashes, transparently
to the users. This results in seemingly instantaneous uploads when the material
is already present in Pithos storage.

Block-based Storage Processing
------------------------------

Hashmaps themselves are saved in blocks. All blocks are persisted to storage
using content-based addressing. It follows that to read a file, Pithos performs
the following operations:

* The client issues a request to get a file, via HTTP ``GET``.
* The API front end asks the back end for the metadata
  of the object.
* The back end checks the permissions of the object and, if they
  allow access to it, returns the object's metadata.
* The front end evaluates any conditional HTTP headers (such as
  ``If-Modified-Since``, ``If-Match``, etc.).
* If the preconditions are met, the API front end requests
  the object's hashmap from the back end (hashmaps are indexed by the
  full path).
* The back end reads the object's hashmap from the underlying
  storage and returns it to the API front end.
* Depending on the HTTP ``Range`` header, the
  API front end asks the back end for the required blocks, giving
  their corresponding hashes.
* The back end fetches the blocks from the underlying storage and
  passes them to the API front end, which returns them to the client.
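
The last two steps amount to mapping a byte range onto block indices. A
minimal sketch follows, assuming an inclusive byte range, a hashmap held as a
list of hashes, and a dictionary standing in for block storage. The names
``blocks_for_range`` and ``read_range`` are hypothetical.

.. code-block:: python

    def blocks_for_range(first_byte, last_byte, block_size):
        """Return the indices of blocks overlapping an inclusive byte range."""
        if first_byte < 0 or last_byte < first_byte:
            raise ValueError("invalid byte range")
        return list(range(first_byte // block_size,
                          last_byte // block_size + 1))


    def read_range(hashmap, store, first_byte, last_byte, block_size):
        """Fetch only the needed blocks and cut out the requested bytes."""
        indices = blocks_for_range(first_byte, last_byte, block_size)
        data = b"".join(store[hashmap[i]] for i in indices)
        skip = first_byte - indices[0] * block_size
        return data[skip:skip + (last_byte - first_byte + 1)]


    store = {"h0": b"abcd", "h1": b"efgh", "h2": b"ij"}
    hashmap = ["h0", "h1", "h2"]
    print(read_range(hashmap, store, 2, 5, 4))  # b'cdef'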

Saving data from the client to the server can be done in several ways.

First, a regular HTTP ``PUT`` is the reverse of the HTTP ``GET``.  The client
sends the full object to the API front end.  The API front end splits the
object into blocks. It sends each block to the back end, which calculates its
hash and saves it to storage. When the hashmap is complete, the API front end
commands the back end to create a new object with the created hashmap and any
associated metadata.

Secondly, the client may send to the API front end a hashmap and any associated
metadata, with a specially formatted HTTP ``PUT``, using an appropriate URL
parameter. In this case, if the back end can find the requested blocks, the
object will be created as previously; otherwise it will report back the list of
missing blocks, which will be passed back to the client. The client may then
send the missing blocks by issuing an HTTP ``POST`` and then retry the HTTP
``PUT`` for the hashmap. This allows for very fast uploads, since it may happen
that no real data uploading takes place, if the blocks are already in data
storage.

Copying objects does not involve data copying, but is performed by associating
the object's hashmap with the new path. Moving objects, as in OpenStack, is a
copy followed by a delete, again with no real data being moved.
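
To make this concrete, here is a minimal sketch in which a dictionary maps
object paths to hashmaps. The dictionary and the functions ``copy_object`` and
``move_object`` are hypothetical simplifications; the real back end also
handles versions, metadata and permissions.

.. code-block:: python

    def copy_object(objects, src, dst):
        """Associate the hashmap of ``src`` with the new path ``dst``."""
        objects[dst] = list(objects[src])  # hashes are copied, blocks are not


    def move_object(objects, src, dst):
        """A move is a copy followed by a delete of the source path."""
        if src != dst:
            copy_object(objects, src, dst)
            del objects[src]


    objects = {"pithos/a.txt": ["h0", "h1"]}
    copy_object(objects, "pithos/a.txt", "pithos/b.txt")
    move_object(objects, "pithos/a.txt", "pithos/c.txt")
    print(sorted(objects))  # ['pithos/b.txt', 'pithos/c.txt']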

Updates to an existing object, which are not offered by OpenStack, are
implemented by issuing an HTTP ``POST`` request including the offset and the
length of the data. The API front end requests from the back end the hashmap of
the existing object. Depending on the offset of the update (whether it falls
within block boundaries or not) the front end will ask the back end to update
existing blocks or create new ones. At the end, the front end will save the
updated hashmap. It is also possible to pass a parameter to HTTP ``POST`` to
specify that the data will come from another object, instead of being uploaded
by the client.
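
The block bookkeeping behind such an update can be sketched as follows. This
is an illustration under stated assumptions, not the Pithos implementation:
blocks live in a dictionary keyed by SHA256 digest, the offset does not exceed
the current object size, and the helper names ``put_block`` and
``update_object`` are hypothetical.

.. code-block:: python

    import hashlib


    def put_block(store, block):
        """Store a block under its SHA256 digest and return the digest."""
        digest = hashlib.sha256(block).hexdigest()
        store[digest] = block
        return digest


    def update_object(hashmap, store, offset, new_data, block_size):
        """Write new_data at offset and return the updated hashmap.

        Assumes 0 <= offset <= current object size. Only blocks that
        overlap the written range are rebuilt; others keep their hashes.
        """
        if not new_data:
            return list(hashmap)
        first = offset // block_size
        last = (offset + len(new_data) - 1) // block_size
        # Reassemble the affected region from the existing blocks, if any.
        old = b"".join(store[h] for h in hashmap[first:last + 1])
        start = offset - first * block_size
        region = old[:start] + new_data + old[start + len(new_data):]
        new_hashes = [
            put_block(store, region[i:i + block_size])
            for i in range(0, len(region), block_size)
        ]
        return hashmap[:first] + new_hashes + hashmap[last + 1:]


    store = {}
    hashmap = [put_block(store, b) for b in (b"abcd", b"efgh", b"ij")]
    updated = update_object(hashmap, store, 2, b"XYZ", 4)
    print(b"".join(store[h] for h in updated))  # b'abXYZfghij'

In this sketch, taking the data from another object would simply mean reading
``new_data`` from that object's blocks instead of from the request body.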

Pithos Back End Nodes
---------------------

Pithos organizes entities in a tree hierarchy, with one tree node per path
entry (see Figure). Nodes can be accounts, containers, and objects. A user may
have multiple accounts, each account may have multiple containers, and each
container may have multiple objects. An object may have multiple versions, and
each version of an object has properties (a set of fixed metadata, like size
and mtime) and arbitrary metadata.

.. image:: images/pithos-backend-nodes.png

The tree hierarchy has up to three levels, since, following the OpenStack API,
everything is stored as an object in a container.  The notion of folders or
directories is implemented through conventions that simulate
pseudo-hierarchical folders. In particular, object names that contain the
forward slash character and have an accompanying marker object with a
``Content-Type: application/directory`` as part of their metadata can be
treated as directories by Pithos clients. Each node corresponds to a unique
path, and we keep its parent in the account/container/object hierarchy (that
is, all objects have a container as their parent).
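
A client could interpret this convention roughly as in the sketch below. It
assumes a container listing is available as a dictionary mapping object names
to content types. The functions ``is_directory`` and ``list_directory`` are
hypothetical and only illustrate the naming convention.

.. code-block:: python

    def is_directory(objects, name):
        """True if ``name`` is a directory marker object."""
        return objects.get(name) == "application/directory"


    def list_directory(objects, directory):
        """Return the immediate children of a pseudo-directory.

        ``directory`` is a path without a trailing slash, or "" for the
        top level of the container.
        """
        prefix = directory + "/" if directory else ""
        children = set()
        for name in objects:
            if name.startswith(prefix) and name != prefix:
                rest = name[len(prefix):]
                children.add(rest.split("/", 1)[0])
        return sorted(children)


    objects = {
        "notes.txt": "text/plain",
        "photos": "application/directory",
        "photos/2012": "application/directory",
        "photos/2012/a.jpg": "image/jpeg",
        "photos/b.jpg": "image/jpeg",
    }
    print(list_directory(objects, "photos"))  # ['2012', 'b.jpg']
    print(list_directory(objects, ""))        # ['notes.txt', 'photos']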

Pithos Back End Versions
------------------------

For each object version we keep the root Merkle hash of the object it refers
to, the size of the object, the last modification time and the user that
modified the file, and its cluster. A version belongs to one of the following
three clusters (see Figure):

  * normal, which are the current versions
  * history, which contain the previous versions of an object
  * deleted, which contain objects that have been deleted

.. image:: images/pithos-backend-versions.png

This versioning allows Pithos to offer its users time-based content listings
of their accounts. In effect, this also allows them to take their containers
back in time. This is implemented conceptually by taking a vertical line in the
Figure and presenting to the user the state on the left side of the line.
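
The "vertical line" idea can be sketched with a deliberately simplified model
in which every version is a ``(path, mtime, deleted)`` tuple. This tuple
layout and the function ``list_at`` are assumptions made for illustration;
they do not reflect how the back end actually stores versions or clusters.

.. code-block:: python

    def list_at(versions, when):
        """Return the paths that were visible at time ``when``.

        For each path, the most recent version not newer than ``when``
        decides whether the object is listed.
        """
        latest = {}
        for path, mtime, deleted in versions:
            if mtime > when:
                continue  # to the right of the line: not yet happened
            if path not in latest or mtime >= latest[path][0]:
                latest[path] = (mtime, deleted)
        return sorted(p for p, (_, deleted) in latest.items() if not deleted)


    versions = [
        ("a.txt", 1, False),  # created
        ("b.txt", 2, False),  # created
        ("b.txt", 4, True),   # deleted
        ("a.txt", 5, False),  # updated
    ]
    print(list_at(versions, 3))  # ['a.txt', 'b.txt']
    print(list_at(versions, 6))  # ['a.txt']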

Pithos Back End Permissions
---------------------------

Pithos recognizes read and write permissions, which can be granted to
individual users or groups of users. Groups are collections of users, created
at the account level by users themselves, and are flat: a group cannot contain
or reference another group. Ownership of a file cannot be delegated.

Pithos also recognizes a "public" permission, which means that the object is
readable by all. When an object is made public, it is assigned a URL that can
be used to access the object from outside Pithos even by non-Pithos users.

Permissions can be assigned to objects, which may be actual files or
directories. When listing objects, the back end uses the permissions as filters
for what to display, so that users will see only objects to which they have
access. Depending on the type of the object, the filter may be exact (plain
object), or a prefix (like ``path/*`` for a directory). When accessing objects,
the same rules are used to decide whether to allow the user to read or modify
the object or directory. If no permissions apply to a specific object, the back
end searches for permissions on the closest directory sharing a common prefix
with the object.
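
The lookup rule in the last sentence can be sketched as a walk up the path.
The data layout (a dictionary from paths to sets of user names) and the
function ``can_access`` are hypothetical. Groups, the "public" permission and
the owner's implicit rights are left out to keep the example short.

.. code-block:: python

    def can_access(permissions, user, path, access):
        """Decide whether ``user`` has ``access`` ("read" or "write").

        ``permissions`` maps an object or directory path to a dictionary
        such as {"read": {"alice"}, "write": set()}.
        """
        candidate = path
        while True:
            if candidate in permissions:
                # The closest path with permissions decides the outcome.
                return user in permissions[candidate].get(access, set())
            if "/" not in candidate:
                return False  # no permissions found on any prefix
            candidate = candidate.rsplit("/", 1)[0]  # enclosing directory


    permissions = {
        "docs": {"read": {"alice", "bob"}, "write": {"alice"}},
        "docs/private.txt": {"read": {"alice"}},
    }
    print(can_access(permissions, "bob", "docs/report.txt", "read"))   # True
    print(can_access(permissions, "bob", "docs/private.txt", "read"))  # False
    print(can_access(permissions, "bob", "docs/report.txt", "write"))  # False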