Revision dad708b4 docs/pithos.rst
b/docs/pithos.rst | ||
---|---|---|
1 |
.. _pithos: |
|
1 |
Object Storage Service (Pithos+) |
|
2 |
================================ |
|
2 | 3 |
|
3 |
File Storage Service (pithos+) |
|
4 |
Pithos+ is an online storage service based on the OpenStack Object |
|
5 |
Storage API with several important extensions. It uses a |
|
6 |
block-based mechanism to allow users to upload, download, and share |
|
7 |
files, keep different versions of a file, and attach policies to them. |
|
8 |
It follows a layered, modular implementation. Pithos+ was designed to |
|
9 |
be used as a storage service by the total set of the Greek research |
|
10 |
and academic community (counting tens of thousands of users) but is |
|
11 |
free and open to use by anybody, under a BSD 2-clause license. |
|
12 |
|
|
13 |
A presentation of Pithos+ features and architecture is :download:`here <pithos-plus.pdf>`. |
|
14 |
|
|
15 |
Introduction |
|
16 |
------------ |
|
17 |
|
|
18 |
In 2008 the Greek Research and Technology Network (GRNET) decided |
|
19 |
to offer an online storage service to the Greek research and academic |
|
20 |
community. The service, called Pithos, was implemented in 2008-2009, |
|
21 |
and was made available in spring 2009. It now has more than |
|
22 |
12,000 users. |
|
23 |
|
|
24 |
In 2011 GRNET decided to offer a new, evolved online storage |
|
25 |
service, to be called Pithos+. Pithos+ is designed to address the |
|
26 |
main requirements expressed by the Pithos users in the first two years of |
|
27 |
operation: |
|
28 |
|
|
29 |
* Provide both a web-based client and native desktop clients for |
|
30 |
the most common operating systems. |
|
31 |
* Allow not only uploading, downloading, and sharing, but also |
|
32 |
synchronization capabilities so that users are able to select folders |
|
33 |
and have them synchronized automatically with their online accounts. |
|
34 |
* Allow uploading of large files, regardless of browser |
|
35 |
capabilities (depending on the version, browsers may place a 2 |
|
36 |
GBytes upload limit). |
|
37 |
* Improve upload speed; not an issue as long as the user is on a |
|
38 |
computer connected to the GRNET backbone, but it becomes important |
|
39 |
over ADSL connections. |
|
40 |
* Allow access by |
|
41 |
non-Shibboleth (http://shibboleth.internet2.edu/) |
|
42 |
accounts. Pithos delegates user authentication to the Greek |
|
43 |
Shibboleth federation, in which all research and academic |
|
44 |
institutions belong. However, it is desirable to have the option to |
|
45 |
open up Pithos to non-Shibboleth authenticated users as well. |
|
46 |
* Use open standards as far as possible. |
|
47 |
|
|
48 |
In what follows we describe the main features of Pithos+, the elements |
|
49 |
of its design and the capabilities it affords. We touch on related |
|
50 |
work and we provide some discussion on our experiences and thoughts on |
|
51 |
the future. |
|
52 |
|
|
53 |
Pithos+ Features |
|
54 |
---------------- |
|
55 |
|
|
56 |
Pithos+ is based on the OpenStack Object Storage API (Pithos |
|
57 |
used a home-grown API). We decided to adopt an open standard |
|
58 |
API in order to leverage existing clients that implement the |
|
59 |
API. In this way, a user can access Pithos+ with a standard |
|
60 |
OpenStack client - although users will want to use a Pithos+ client to |
|
61 |
use features going beyond those offered by the OpenStack API. |
|
62 |
The strategy paid off during Pithos+ development itself, as we were |
|
63 |
able to access and test the service with existing clients, while also |
|
64 |
developing new clients based on open source OpenStack clients. |
|
65 |
|
|
66 |
The major extensions on the OpenStack API are: |
|
67 |
|
|
68 |
* The use of block-based storage in lieu of an object-based one. |
|
69 |
OpenStack stores objects, which may be files, but this is not |
|
70 |
necessary - large files (larger than 5 GBytes), for instance, must be |
|
71 |
stored as a series of distinct objects accompanied by a manifest. |
|
72 |
Pithos+ stores blocks, so objects can be of unlimited size. |
|
73 |
* Permissions on individual files and folders. Note that folders |
|
74 |
do not exist in the OpenStack API, but are simulated by |
|
75 |
appropriate conventions, an approach we have kept in Pithos+ to |
|
76 |
avoid incompatibility. |
|
77 |
* Fully-versioned objects. |
|
78 |
* Metadata-based queries. Users are free to set metadata on their |
|
79 |
objects, and they can list objects meeting metadata criteria. |
|
80 |
* Policies, such as whether to enable object versioning and to |
|
81 |
enforce quotas. This is particularly important for sharing object |
|
82 |
containers, since the user may want to avoid running out of space |
|
83 |
because of collaborators writing in the shared storage. |
|
84 |
* Partial upload and download based on HTTP request |
|
85 |
headers and parameters. |
|
86 |
* Object updates, where data may even come from other objects |
|
87 |
already stored in Pithos+. This allows users to compose objects from |
|
88 |
other objects without uploading data. |
|
89 |
* All objects are assigned UUIDs on creation, which can be |
|
90 |
used to reference them regardless of their path location. |
|
91 |
|
|
92 |
Pithos+ Design |
|
93 |
-------------- |
|
94 |
|
|
95 |
Pithos+ is built on a layered architecture (see Figure). |
|
96 |
The Pithos+ server speaks HTTP with the outside world. The HTTP |
|
97 |
operations implement an extended OpenStack Object Storage API. |
|
98 |
The back end is a library meant to be used by internal code and |
|
99 |
other front ends. For instance, the back end library, apart from being |
|
100 |
used in Pithos+ for implementing the OpenStack Object Storage API, |
|
101 |
is also used in our implementation of the OpenStack Image |
|
102 |
Service API. Moreover, the back end library allows specification |
|
103 |
of different namespaces for metadata, so that the same object can be |
|
104 |
viewed by different front end APIs with different sets of |
|
105 |
metadata. Hence the same object can be viewed as a file in Pithos+, |
|
106 |
with one set of metadata, or as an image with a different set of |
|
107 |
metadata, in our implementation of the OpenStack Image Service. |
|
108 |
|
|
109 |
The data component provides storage of blocks and the information |
|
110 |
needed to retrieve them, while the metadata component is a database of |
|
111 |
nodes and permissions. In the current implementation, data is saved to |
|
112 |
the filesystem and metadata in an SQL database. In the future, |
|
113 |
data will be saved to some distributed block storage (we are currently |
|
114 |
evaluating RADOS - http://ceph.newdream.net/category/rados), and metadata to a NoSQL database. |
|
115 |
|
|
116 |
.. image:: images/pithos-layers.png |
|
117 |
|
|
118 |
Block-based Storage for the Client |
|
119 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
120 |
|
|
121 |
Since an object is saved as a set of blocks in Pithos+, object |
|
122 |
operations are no longer required to refer to the whole object. We can |
|
123 |
handle parts of objects as needed when uploading, downloading, or |
|
124 |
copying and moving data. |
|
125 |
|
|
126 |
In particular, a client, provided it has access permissions, can |
|
127 |
download data from Pithos+ by issuing a ``GET`` request on an |
|
128 |
object. If the request includes the ``hashmap`` parameter, then the |
|
129 |
request refers to a hashmap, that is, an ordered list containing the |
|
130 |
object's block hashes. The reply is of the form:: |
|
131 |
|
|
132 |
{"block_hash": "sha1", |
|
133 |
"hashes": ["7295c41da03d7f916440b98e32c4a2a39351546c", ...], |
|
134 |
"block_size":131072, |
|
135 |
"bytes": 242} |
|
136 |
|
|
137 |
The client can then compare the hashmap with the hashmap computed from |
|
138 |
the local file. Any missing parts can be downloaded with ``GET`` |
|
139 |
requests with an additional ``Range`` header containing the hashes |
|
140 |
of the blocks to be retrieved. The integrity of the file can be |
|
141 |
checked against the ``X-Object-Hash`` header, returned by the |
|
142 |
server and containing the root Merkle hash of the object's |
|
143 |
hashmap. |
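
The comparison described above can be sketched as follows. This is a minimal illustration, not the Pithos+ client code: the function names ``compute_hashmap`` and ``missing_block_ranges`` are hypothetical, and it assumes each block hash is a plain SHA-1 of the block's raw bytes (the real service may normalize blocks differently before hashing). The result lists the byte range of every block the client would still need to fetch.

```python
import hashlib


def compute_hashmap(data: bytes, block_size: int = 131072) -> dict:
    """Split data into fixed-size blocks and hash each block with SHA-1.

    Assumption: the hash covers the raw block bytes, with no normalization.
    """
    hashes = [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]
    return {
        "block_hash": "sha1",
        "hashes": hashes,
        "block_size": block_size,
        "bytes": len(data),
    }


def missing_block_ranges(remote: dict, local: dict) -> list:
    """Return (index, first_byte, last_byte) for each remote block that the
    local copy lacks or holds with different content."""
    size = remote["block_size"]
    local_hashes = local["hashes"]
    ranges = []
    for i, remote_hash in enumerate(remote["hashes"]):
        if i >= len(local_hashes) or local_hashes[i] != remote_hash:
            start = i * size
            # The last block may be shorter than block_size.
            end = min(start + size, remote["bytes"]) - 1
            ranges.append((i, start, end))
    return ranges
```

Each returned tuple identifies one block to download; blocks whose hashes already match are skipped entirely.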
|
144 |
|
|
145 |
When uploading a file to Pithos+, only the missing blocks will be |
|
146 |
submitted to the server, with the following algorithm: |
|
147 |
|
|
148 |
* Calculate the hash value for each block of the object to be |
|
149 |
uploaded. |
|
150 |
* Send a hashmap ``PUT`` request for the object. This is a |
|
151 |
``PUT`` request with a ``hashmap`` request parameter appended |
|
152 |
to it. If the parameter is not present, the object's data (or part |
|
153 |
of it) is provided with the request. If the parameter is present, |
|
154 |
the object hashmap is provided with the request. |
|
155 |
* If the server responds with status 201 (Created), the blocks are |
|
156 |
already on the server and we do not need to do anything more. |
|
157 |
* If the server responds with status 409 (Conflict), the server’s |
|
158 |
response body contains the hashes of the blocks that do not exist on |
|
159 |
the server. Then, for each hash value in the server’s response (or all |
|
160 |
hashes together) send a ``POST`` request to the server with the |
|
161 |
block's data. |
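
The upload steps above can be sketched against an in-memory stand-in for the server. Everything here is an assumption made for illustration: ``FakeServer``, ``put_hashmap``, ``post_block`` and ``upload`` are hypothetical names, not part of the Pithos+ API or its clients, and hashing is assumed to be plain SHA-1 over raw block bytes. The sketch also retries the hashmap ``PUT`` after sending the missing blocks.

```python
import hashlib


class FakeServer:
    """Hypothetical in-memory stand-in for the Pithos+ API (illustration only)."""

    def __init__(self):
        self.blocks = {}   # block hash -> block data
        self.objects = {}  # object path -> list of block hashes

    def put_hashmap(self, path, hashes):
        """Mimic a hashmap PUT: 201 if all blocks exist, else 409 plus missing hashes."""
        missing = [h for h in hashes if h not in self.blocks]
        if missing:
            return 409, missing
        self.objects[path] = list(hashes)
        return 201, []

    def post_block(self, data):
        """Mimic a block POST: store the block under its hash."""
        digest = hashlib.sha1(data).hexdigest()
        self.blocks[digest] = data
        return digest


def upload(server, path, data, block_size=131072):
    """Upload data, sending only the blocks the server does not already hold.

    Returns (final_status, number_of_blocks_sent).
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    hashes = [hashlib.sha1(b).hexdigest() for b in blocks]
    by_hash = dict(zip(hashes, blocks))

    status, missing = server.put_hashmap(path, hashes)
    sent = 0
    if status == 409:
        # A repeated block appears once per occurrence; send it only once.
        for digest in dict.fromkeys(missing):
            server.post_block(by_hash[digest])
            sent += 1
        status, _ = server.put_hashmap(path, hashes)
    return status, sent
```

Uploading a second object that shares blocks with the first sends only the blocks that differ, which is the deduplication effect described in the text.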
|
162 |
|
|
163 |
In effect, we are deduplicating data based on their block hashes, |
|
164 |
transparently to the users. This results in perceived instantaneous |
|
165 |
uploads when material is already present in Pithos+ storage. |
|
166 |
|
|
167 |
Block-based Storage Processing |
|
4 | 168 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
5 | 169 |
|
6 |
Pithos+ is the synnefo File Storage Service and implements the OpenStack Object |
|
7 |
Storage API + synnefo extensions. |
|
170 |
Hashmaps themselves are saved in blocks. All blocks are persisted to |
|
171 |
storage using content-based addressing. It follows that to read a |
|
172 |
file, Pithos+ performs the following operations: |
|
8 | 173 |
|
174 |
* The client issues a request to get a file, via HTTP ``GET``. |
|
175 |
* The API front end asks the back end for the metadata |
|
176 |
of the object. |
|
177 |
* The back end checks the permissions of the object and, if they |
|
178 |
allow access to it, returns the object's metadata. |
|
179 |
* The front end evaluates any HTTP headers (such as |
|
180 |
``If-Modified-Since``, ``If-Match``, etc.). |
|
181 |
* If the preconditions are met, the API front end requests |
|
182 |
from the back end the object's hashmap (hashmaps are indexed by the |
|
183 |
full path). |
|
184 |
* The back end will read and return to the API front end the |
|
185 |
object's hashmap from the underlying storage. |
|
186 |
* Depending on the HTTP ``Range`` header, the |
|
187 |
API front end asks the back end for the required blocks, giving |
|
188 |
their corresponding hashes. |
|
189 |
* The back end fetches the blocks from the underlying storage, |
|
190 |
passes them to the API front end, which returns them to the client. |
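
The last two steps, mapping a requested range to blocks and assembling the reply, can be sketched as below. This is an illustrative sketch under stated assumptions: the range is taken to be an inclusive byte range, the hashmap is a list of block hashes in order, and ``store`` is a hypothetical dictionary standing in for the underlying block storage.

```python
def blocks_for_range(first_byte, last_byte, block_size):
    """Return the indices of the blocks that cover an inclusive byte range."""
    return list(range(first_byte // block_size, last_byte // block_size + 1))


def read_range(hashmap, store, first_byte, last_byte, block_size):
    """Fetch only the blocks covering the range and trim them to the exact bytes.

    hashmap: ordered list of block hashes for the object.
    store:   mapping from block hash to block data (stand-in for block storage).
    """
    indices = blocks_for_range(first_byte, last_byte, block_size)
    data = b"".join(store[hashmap[i]] for i in indices)
    # Offset of the first requested byte within the first fetched block.
    offset = first_byte - indices[0] * block_size
    length = last_byte - first_byte + 1
    return data[offset:offset + length]
```

Only the blocks that overlap the requested range are read from storage, regardless of the object's total size.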
|
9 | 191 |
|
10 |
Introduction |
|
11 |
============ |
|
12 |
|
|
13 |
Pithos is a storage service implemented by GRNET (http://www.grnet.gr). Data is |
|
14 |
stored as objects, organized in containers, belonging to an account. This |
|
15 |
hierarchy of storage layers has been inspired by the OpenStack Object Storage |
|
16 |
(OOS) API and similar CloudFiles API by Rackspace. The Pithos API follows the |
|
17 |
OOS API as closely as possible. One of the design requirements has been to be |
|
18 |
able to use Pithos with clients built for the OOS, without changes. |
|
19 |
|
|
20 |
However, to be able to take full advantage of the Pithos infrastructure, client |
|
21 |
software should be aware of the extensions that differentiate Pithos from OOS. |
|
22 |
Pithos objects can be updated, or appended to. Pithos will store sharing |
|
23 |
permissions per object and enforce corresponding authorization policies. |
|
24 |
Automatic version management allows taking account and container listings back |
|
25 |
in time, as well as reading previous instances of objects. |
|
26 |
|
|
27 |
The storage backend of Pithos is block oriented, permitting efficient, |
|
28 |
deduplicated data placement. The block structure of objects is exposed at the |
|
29 |
API layer, in order to encourage external software to implement advanced data |
|
30 |
management operations. |
|
31 |
|
|
32 |
|
|
33 |
Pithos Users and Authentication |
|
34 |
=============================== |
|
35 |
|
|
36 |
In Pithos, each user is uniquely identified by a token. All API requests |
|
37 |
require a token and each token is internally resolved to an account string. The |
|
38 |
API uses the account string to identify the user's own files, thus whether a |
|
39 |
request is local or cross-account. |
|
40 |
|
|
41 |
Pithos does not keep a user database. For development and testing purposes, |
|
42 |
user identifiers and their corresponding tokens can be defined in the settings |
|
43 |
file. However, Pithos is designed with an external authentication service in |
|
44 |
mind. This service must handle the details of validating user credentials and |
|
45 |
communicate with Pithos via a middleware software component that, given a |
|
46 |
token, fills in the internal request account variable. |
|
47 |
|
|
48 |
Client software using Pithos, if not already knowing a user's identifier and |
|
49 |
token, should forward to the ``/login`` URI. The Pithos server, depending on |
|
50 |
its configuration, will redirect to the appropriate login page. |
|
51 |
|
|
52 |
The login URI accepts the following parameters: |
|
53 |
|
|
54 |
====================== ========================= |
|
55 |
Request Parameter Name Value |
|
56 |
====================== ========================= |
|
57 |
next The URI to redirect to when the process is finished |
|
58 |
renew Force token renewal (no value parameter) |
|
59 |
force Force logout current user (no value parameter) |
|
60 |
====================== ========================= |
|
61 |
|
|
62 |
When done with logging in, the service's login URI should redirect to the URI |
|
63 |
provided with ``next``, adding ``user`` and ``token`` parameters, which contain |
|
64 |
the account and token fields respectively. |
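
A client could build the login request and read the result of the redirect roughly as follows. This is a sketch only: the helper names and the example host names are hypothetical, and it assumes the valueless ``renew`` and ``force`` parameters are sent as bare query keys.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def login_url(base, next_uri, renew=False, force=False):
    """Build the /login URI with the parameters listed in the table above."""
    parts = [urlencode({"next": next_uri})]
    # Assumption: "no value" parameters are sent as bare keys.
    if renew:
        parts.append("renew")
    if force:
        parts.append("force")
    return base.rstrip("/") + "/login?" + "&".join(parts)


def credentials_from_redirect(url):
    """Extract the user and token parameters appended to the `next` URI."""
    params = parse_qs(urlparse(url).query)
    return params["user"][0], params["token"][0]
```

The client would open the URL returned by ``login_url`` and later pass the URI it was redirected to into ``credentials_from_redirect``.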
|
65 |
|
|
66 |
A user management service that implements a login URI according to these |
|
67 |
conventions is Astakos. |
|
68 |
|
|
69 |
|
|
70 |
Pithos+ Architecture |
|
71 |
==================== |
|
192 |
Saving data from the client to the server is done in several different |
|
193 |
ways. |
|
194 |
|
|
195 |
First, a regular HTTP ``PUT`` is the reverse of the HTTP ``GET``. |
|
196 |
The client sends the full object to the API front end. |
|
197 |
The API front end splits the object into blocks. It sends each |
|
198 |
block to the back end, which calculates its hash and saves it to |
|
199 |
storage. When the hashmap is complete, the API front end commands |
|
200 |
the back end to create a new object with the created hashmap and any |
|
201 |
associated metadata. |
|
202 |
|
|
203 |
Secondly, the client may send to the API front end a hashmap and |
|
204 |
any associated metadata, with a specially formatted HTTP ``PUT``, |
|
205 |
using an appropriate URL parameter. In this case, if the |
|
206 |
back end can find the requested blocks, the object will be created as |
|
207 |
previously, otherwise it will report back the list of missing blocks, |
|
208 |
which will be passed back to the client. The client then may send the |
|
209 |
missing blocks by issuing an HTTP ``POST`` and then retry the |
|
210 |
HTTP ``PUT`` for the hashmap. This allows for very fast uploads, |
|
211 |
since it may happen that no real data uploading takes place, if the |
|
212 |
blocks are already in data storage. |
|
213 |
|
|
214 |
Copying objects does not involve data copying, but is performed by |
|
215 |
associating the object's hashmap with the new path. Moving objects, as |
|
216 |
in OpenStack, is a copy followed by a delete, again with no real data |
|
217 |
being moved. |
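
Copying by hashmap association can be illustrated with a small in-memory model. The ``Backend`` class below is a hypothetical stand-in, not the actual Pithos+ back end library; it assumes SHA-1 content addressing and a tiny block size chosen only to keep the example readable.

```python
import hashlib


class Backend:
    """Hypothetical in-memory back end with content-addressed blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}    # block hash -> block data
        self.hashmaps = {}  # object path -> ordered list of block hashes

    def put(self, path, data):
        """Store data as blocks and record the object's hashmap."""
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha1(block).hexdigest()
            self.blocks[digest] = block
            hashes.append(digest)
        self.hashmaps[path] = hashes

    def copy(self, src, dst):
        """Copy without touching block data: associate the hashmap with dst."""
        self.hashmaps[dst] = list(self.hashmaps[src])

    def move(self, src, dst):
        """Move is a copy followed by a delete of the source path."""
        self.copy(src, dst)
        del self.hashmaps[src]

    def read(self, path):
        return b"".join(self.blocks[h] for h in self.hashmaps[path])
```

After a copy or a move the number of stored blocks is unchanged; only the path-to-hashmap association differs.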
|
218 |
|
|
219 |
Updates to an existing object, which are not offered by OpenStack, are |
|
220 |
implemented by issuing an HTTP ``POST`` request including the |
|
221 |
offset and the length of the data. The API front end requests |
|
222 |
from the back end the hashmap of the existing object. Depending on the |
|
223 |
offset of the update (whether it falls within block boundaries or not) |
|
224 |
the front end will ask the back end to update or create new blocks. At |
|
225 |
the end, the front end will save the updated hashmap. It is also |
|
226 |
possible to pass a parameter to HTTP ``POST`` to specify that the |
|
227 |
data will come from another object, instead of being uploaded by the |
|
228 |
client. |
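
A block-level update at a given offset might look like the sketch below. It is an assumption-laden illustration rather than the Pithos+ implementation: ``put_block`` and ``update_object`` are hypothetical names, the store is a plain dictionary, and updates that would leave a gap past the end of the object are rejected for simplicity.

```python
import hashlib


def put_block(store, data):
    """Save a block under its SHA-1 hash and return the hash."""
    digest = hashlib.sha1(data).hexdigest()
    store[digest] = data
    return digest


def update_object(store, hashmap, size, offset, new_data, block_size):
    """Write new_data at offset and return (new_hashmap, new_size).

    Only the blocks overlapping the written range are rewritten; the
    remaining hashes are reused as-is.
    """
    if offset > size:
        raise ValueError("offset beyond end of object is not handled in this sketch")
    end = offset + len(new_data)
    new_hashmap = list(hashmap)
    first = offset // block_size
    last = (end - 1) // block_size
    for i in range(first, last + 1):
        block_start = i * block_size
        old = store[hashmap[i]] if i < len(hashmap) else b""
        # Portion of the written range that falls inside this block.
        lo = max(offset, block_start)
        hi = min(end, block_start + block_size)
        merged = (old[:lo - block_start]
                  + new_data[lo - offset:hi - offset]
                  + old[hi - block_start:])
        digest = put_block(store, merged)
        if i < len(new_hashmap):
            new_hashmap[i] = digest
        else:
            new_hashmap.append(digest)
    return new_hashmap, max(size, end)
```

An update that stays within one block rewrites a single block, while one that crosses a boundary rewrites each block it touches and may append new ones.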
|
229 |
|
|
230 |
Pithos+ Back End Nodes |
|
231 |
^^^^^^^^^^^^^^^^^^^^^^ |
|
232 |
|
|
233 |
Pithos+ organizes entities in a tree hierarchy, with one tree node per |
|
234 |
path entry (see Figure). Nodes can be accounts, |
|
235 |
containers, and objects. A user may have multiple |
|
236 |
accounts, each account may have multiple containers, and each |
|
237 |
container may have multiple objects. An object may have multiple |
|
238 |
versions, and each version of an object has properties (a set of fixed |
|
239 |
metadata, like size and mtime) and arbitrary metadata. |
|
240 |
|
|
241 |
.. image:: images/pithos-backend-nodes.png |
|
242 |
|
|
243 |
The tree hierarchy has up to three levels, since, following the |
|
244 |
OpenStack API, everything is stored as an object in a container. |
|
245 |
The notion of folders or directories is supported through conventions that |
|
246 |
simulate pseudo-hierarchical folders. In particular, object names that |
|
247 |
contain the forward slash character and have an accompanying marker |
|
248 |
object with a ``Content-Type: application/directory`` as part of |
|
249 |
their metadata can be treated as directories by Pithos+ clients. Each |
|
250 |
node corresponds to a unique path, and we keep its parent in the |
|
251 |
account/container/object hierarchy (that is, all objects have a |
|
252 |
container as their parent). |
|
253 |
|
|
254 |
Pithos+ Back End Versions |
|
255 |
^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
256 |
|
|
257 |
For each object version we keep the root Merkle hash of the object it |
|
258 |
refers to, the size of the object, the last modification time and the |
|
259 |
user that modified the file, and its cluster. A version belongs |
|
260 |
to one of the following three clusters (see Figure): |
|
261 |
|
|
262 |
* normal, which are the current versions |
|
263 |
* history, which contain the previous versions of an object |
|
264 |
* deleted, which contain objects that have been deleted |
|
265 |
|
|
266 |
.. image:: images/pithos-backend-versions.png |
|
267 |
|
|
268 |
This versioning allows Pithos+ to offer its users time-based |
|
269 |
contents listing of their accounts. In effect, this also allows them |
|
270 |
to take their containers back in time. This is implemented |
|
271 |
conceptually by taking a vertical line in the Figure and |
|
272 |
presenting to the user the state on the left side of the line. |
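
The "vertical line" idea can be sketched as a filter over version records. The record layout here is an assumption made for illustration, not the Pithos+ schema: each version is modeled as a ``(path, mtime, deleted)`` tuple, where ``deleted`` marks a version that records the object's removal.

```python
def listing_at(versions, when):
    """Return the paths that existed at time `when`.

    versions: iterable of (path, mtime, deleted) tuples (hypothetical layout).
    For each path, the most recent version with mtime <= when decides
    whether the object is listed.
    """
    latest = {}
    for path, mtime, deleted in versions:
        if mtime <= when and (path not in latest or mtime > latest[path][0]):
            latest[path] = (mtime, deleted)
    return sorted(path for path, (_, deleted) in latest.items() if not deleted)
```

Moving ``when`` backwards corresponds to sliding the vertical line to the left: later versions and later deletions are simply ignored.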
|
273 |
|
|
274 |
Pithos+ Back End Permissions |
|
275 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
276 |
|
|
277 |
Pithos+ recognizes read and write permissions, which can be granted to |
|
278 |
individual users or groups of users. Groups are collections of users |
|
279 |
created at the account level by users themselves, and are flat - a |
|
280 |
group cannot contain or reference another group. Ownership of a file |
|
281 |
cannot be delegated. |
|
282 |
|
|
283 |
Pithos+ also recognizes a "public" permission, which means that the |
|
284 |
object is readable by all. When an object is made public, it is |
|
285 |
assigned a URL that can be used to access the object from |
|
286 |
outside Pithos+ even by non-Pithos+ users. |
|
287 |
|
|
288 |
Permissions can be assigned to objects, which may be actual files, or |
|
289 |
directories. When listing objects, the back end uses the permissions as |
|
290 |
filters for what to display, so that users will see only objects to |
|
291 |
which they have access. Depending on the type of the object, the |
|
292 |
filter may be exact (plain object), or a prefix (like ``path/*`` for |
|
293 |
a directory). When accessing objects, the same rules are used to |
|
294 |
decide whether to allow the user to read or modify the object or |
|
295 |
directory. If no permissions apply to a specific object, the back end |
|
296 |
searches for permissions on the closest directory sharing a common |
|
297 |
prefix with the object. |
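
The lookup rule described above (exact match first, then the closest enclosing directory) can be sketched as follows. The data layout is hypothetical: ``permissions`` maps an object or directory path to the set of users allowed to read it, which is a simplification of whatever the back end actually stores.

```python
def can_read(user, path, permissions):
    """Decide read access using the object's own permissions if present,
    otherwise those of the closest enclosing directory.

    permissions: dict mapping a path to the set of users with read access
    (hypothetical layout for illustration).
    """
    if path in permissions:
        return user in permissions[path]
    prefix = path
    while "/" in prefix:
        # Step up one level: "docs/a/b.txt" -> "docs/a" -> "docs".
        prefix = prefix.rsplit("/", 1)[0]
        if prefix in permissions:
            return user in permissions[prefix]
    return False
```

In this model, permissions set directly on an object take precedence over those inherited from a directory prefix.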
|
298 |
|
|
299 |
Related Work |
|
300 |
------------ |
|
301 |
|
|
302 |
Commercial cloud providers have been offering online storage for quite |
|
303 |
some time, but the code is not published and we do not know the |
|
304 |
details of their implementation. Rackspace has used the OpenStack |
|
305 |
Object Storage in its Cloud Files product. Swift is an open source |
|
306 |
implementation of the OpenStack Object Storage API. As we have |
|
307 |
pointed out, our implementation maintains compatibility with |
|
308 |
OpenStack, while offering additional capabilities. |
|
309 |
|
|
310 |
Discussion |
|
311 |
---------- |
|
312 |
|
|
313 |
Pithos+ is implemented in Python as a Django application. We use SQLAlchemy |
|
314 |
as a database abstraction layer. It is currently about |
|
315 |
17,000 lines of code, and it has taken about 50 person months of |
|
316 |
development effort. This development was done from scratch, with no |
|
317 |
reuse of the existing Pithos code. That service was written in the |
|
318 |
J2EE framework. We decided to move from J2EE to Python for |
|
319 |
two reasons: first, J2EE proved to be overkill for the original |
|
320 |
Pithos service in its years of operation. Secondly, Python was |
|
321 |
strongly favored by the GRNET operations team, who are the people |
|
322 |
taking responsibility for running the service - so their voice is |
|
323 |
heard. |
|
324 |
|
|
325 |
Apart from the service implementation, which we have been describing |
|
326 |
here, we have parallel development lines for native client tools on |
|
327 |
different operating systems (MS-Windows, Mac OS X, Android, and iOS). |
|
328 |
The desktop clients allow synchronization with local directories, a |
|
329 |
feature that existing users of Pithos have been asking for, probably |
|
330 |
influenced by services like DropBox. These clients are offered in |
|
331 |
parallel to the standard Pithos+ interface, which is a web application |
|
332 |
built on top of the API front end - we treat our own web |
|
333 |
application as just another client that has to go through the API |
|
334 |
front end, without granting it access to the back end directly. |
|
335 |
|
|
336 |
We are carrying the idea of our own services being clients to Pithos+ |
|
337 |
a step further, with new projects we have in our pipeline, in which a |
|
338 |
digital repository service will be built on top of Pithos+. It will |
|
339 |
again use the API front end, so that repository users will have |
|
340 |
all Pithos+ capabilities, and on top of them we will build additional |
|
341 |
functionality such as full text search, Dublin Core metadata storage |
|
342 |
and querying, streaming, and so on. |
|
343 |
|
|
344 |
At the time of this writing (March 2012) Pithos+ is in alpha, |
|
345 |
available to users by invitation. We will extend our user base as we |
|
346 |
move to beta in the coming months, and to our full set of users in the |
|
347 |
second half of 2012. We are eager to see how our ideas fare as we |
|
348 |
scale up, and we welcome any comments and suggestions. |
|
349 |
|
|
350 |
Acknowledgments |
|
351 |
--------------- |
|
352 |
|
|
353 |
Pithos+ is financially supported by Grant 296114, "Advanced Computing |
|
354 |
Services for the Research and Academic Community", of the Greek |
|
355 |
National Strategic Reference Framework. |
|
356 |
|
|
357 |
Availability |
|
358 |
------------ |
|
359 |
|
|
360 |
The Pithos+ code is available under a BSD 2-clause license from: |
|
361 |
https://code.grnet.gr/projects/pithos/repository |
|
362 |
|
|
363 |
The code can also be accessed from its source repository: |
|
364 |
https://code.grnet.gr/git/pithos/ |
|
365 |
|
|
366 |
More information and documentation is available at: |
|
367 |
http://docs.dev.grnet.gr/pithos/latest/index.html |