.. _pithos:

File/Object Storage Service (Pithos)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Overview
========

Pithos is the Object/File Storage component of Synnefo. Users upload files to
Pithos using either the Web UI, the command-line client, or native syncing
clients. It is a thin layer mapping user files to content-addressable blocks
which are then stored on a storage backend. Files are split into blocks of
fixed size, which are hashed independently to create a unique identifier for
each block, so each file is represented by a sequence of block names (a
hashmap). This way, Pithos provides deduplication of file data; blocks shared
among files are stored only once.

The current implementation uses 4MB blocks hashed with SHA256. Content-based
addressing also enables efficient two-way file syncing that can be used by all
Pithos clients (e.g. the ``kamaki`` command-line client or the native
Windows/Mac OS clients). Whenever someone wishes to upload an updated version
of a file, the client hashes all blocks of the file and then requests the
server to create a new version for this block sequence. The server will return
an error reply with a list of the missing blocks. The client may then upload
each block one by one, and retry file creation. Similarly, whenever a file has
been changed on the server, the client can ask for its list of blocks and only
download the modified ones.

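For illustration, a minimal client-side sketch of this hashing step in Python
(``compute_hashmap`` is a hypothetical helper; real clients also honor the
block size and hash algorithm reported by the server)::

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # 4MB blocks, per the current implementation

    def compute_hashmap(path, block_size=BLOCK_SIZE):
        """Split a local file into fixed-size blocks and hash each one,
        producing the sequence of block names (the file's hashmap)."""
        hashes = []
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                hashes.append(hashlib.sha256(block).hexdigest())
        return hashes
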
Pithos runs at the cloud layer and exposes the OpenStack Object Storage API to
the outside world, with custom extensions for syncing. Any client speaking to
OpenStack Swift can also be used to store objects in a Pithos deployment. The
process of mapping user files to hashed objects is independent of the actual
storage backend, which is selectable by the administrator using pluggable
drivers. Currently, Pithos has drivers for two storage backends:

 * files on a shared filesystem, e.g., NFS, Lustre, GPFS or GlusterFS
 * objects on a Ceph/RADOS cluster.

Whatever the storage backend, it is responsible for storing objects reliably,
without any connection to the cloud APIs or to the hashing operations.

OpenStack extensions
====================

The major extensions to the OpenStack API are:

* The use of block-based storage in lieu of an object-based one.
  OpenStack stores objects, which may be files, but need not be -
  large files (longer than 5 GB), for instance, must be stored as a
  series of distinct objects accompanied by a manifest. Pithos stores
  blocks, so objects can be of unlimited size.
* Permissions on individual files and folders. Note that folders
  do not exist in the OpenStack API, but are simulated by
  appropriate conventions, an approach we have kept in Pithos to
  avoid incompatibility.
* Fully-versioned objects.
* Metadata-based queries. Users are free to set metadata on their
  objects, and they can list objects meeting metadata criteria.
* Policies, such as whether to enable object versioning and whether
  to enforce quotas. This is particularly important when sharing
  object containers, since the owner may want to avoid running out
  of space because of collaborators writing to the shared storage.
* Partial upload and download based on HTTP request
  headers and parameters.
* Object updates, where data may even come from other objects
  already stored in Pithos. This allows users to compose objects
  from other objects without uploading data.
* UUIDs assigned to all objects on creation, which can be used to
  reference them regardless of their path location.

Pithos Design
=============

Pithos is built on a layered architecture. The Pithos server speaks HTTP with
the outside world. The HTTP operations implement an extended OpenStack Object
Storage API. The back end is a library meant to be used by internal code and
other front ends. For instance, the back end library, apart from being used in
Pithos for implementing the OpenStack Object Storage API, is also used in our
implementation of the OpenStack Image Service API. Moreover, the back end
library allows specification of different namespaces for metadata, so that the
same object can be viewed by different front end APIs with different sets of
metadata. Hence the same object can be viewed as a file in Pithos, with one set
of metadata, or as an image with a different set of metadata, in our
implementation of the OpenStack Image Service.

The data component provides storage of blocks and the information needed to
retrieve them, while the metadata component is a database of nodes and
permissions. In the current implementation, data is saved to the filesystem and
metadata in an SQL database.

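As a toy illustration of these metadata namespaces (the domain names and
layout below are hypothetical, not the actual back end library API)::

    # One object, two metadata namespaces: each front end sees its own set.
    object_meta = {
        "storage": {"Content-Type": "application/octet-stream"},
        "image": {"name": "Debian Base", "disk_format": "diskdump"},
    }

    def view(meta, domain):
        """Return the metadata visible to the front end of `domain`."""
        return meta.get(domain, {})

    print(view(object_meta, "storage"))  # the file view of the object
    print(view(object_meta, "image"))    # the image view of the same object
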
Block-based Storage for the Client
----------------------------------

Since an object is saved as a set of blocks in Pithos, object
operations are no longer required to refer to the whole object. We can
handle parts of objects as needed when uploading, downloading, or
copying and moving data.

In particular, a client, provided it has access permissions, can
download data from Pithos by issuing a ``GET`` request on an
object. If the request includes the ``hashmap`` parameter, then the
request refers to a hashmap, that is, the list of the
object's block hashes. The reply is of the form::

    {"block_hash": "sha1",
     "hashes": ["7295c41da03d7f916440b98e32c4a2a39351546c", ...],
     "block_size": 131072,
     "bytes": 242}

The client can then compare the hashmap with the hashmap computed from the
local file. Any missing parts can be downloaded with ``GET`` requests with an
additional ``Range`` header containing the hashes of the blocks to be
retrieved. The integrity of the file can be checked against the
``X-Object-Hash`` header, returned by the server and containing the root Merkle
hash of the object's hashmap.

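A hedged sketch of this comparison, reusing ``compute_hashmap`` from the
sketch above (the endpoint URL and the ``format=json`` parameter are
assumptions for the example)::

    import requests

    PITHOS_URL = "https://pithos.example.com/v1"  # hypothetical endpoint

    def get_remote_hashmap(token, account, container, obj):
        """Ask the server for an object's hashmap instead of its data."""
        url = "%s/%s/%s/%s" % (PITHOS_URL, account, container, obj)
        resp = requests.get(url, params={"format": "json", "hashmap": ""},
                            headers={"X-Auth-Token": token})
        resp.raise_for_status()
        return resp.json()

    def blocks_to_download(local_hashes, remote_hashes):
        """Keep only the blocks that are not already present locally."""
        local = set(local_hashes)
        return [h for h in remote_hashes if h not in local]
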
When uploading a file to Pithos, only the missing blocks will be submitted to
the server, with the following algorithm:

* Calculate the hash value for each block of the object to be
  uploaded.
* Send a hashmap ``PUT`` request for the object. This is a
  ``PUT`` request with a ``hashmap`` request parameter appended
  to it. If the parameter is not present, the object's data (or part
  of it) is provided with the request. If the parameter is present,
  the object hashmap is provided with the request.
* If the server responds with status 201 (Created), the blocks are
  already on the server and nothing more needs to be done.
* If the server responds with status 409 (Conflict), the server's
  response body contains the hashes of the blocks that do not exist on
  the server. Then, for each hash value in the server's response (or
  for all of them together), send a ``POST`` request to the server
  with the block's data.

In effect, we are deduplicating data based on their block hashes, transparently
to the users. This results in uploads that appear instantaneous when the
material is already present in Pithos storage.

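A sketch of this algorithm, following the status-code logic above (the
request body mirrors the hashmap reply shown earlier; the exact endpoint and
headers for the block ``POST`` are deployment details and are simplified
here)::

    import json
    import requests

    PITHOS_URL = "https://pithos.example.com/v1"  # hypothetical endpoint

    def upload_with_dedup(token, account, container, obj, hashes, blocks):
        """Try a hashmap PUT, upload only the blocks the server reports
        missing, then retry object creation.

        `hashes` is the ordered hashmap; `blocks` maps hash -> block data.
        """
        url = "%s/%s/%s/%s" % (PITHOS_URL, account, container, obj)
        headers = {"X-Auth-Token": token}
        body = json.dumps({"block_hash": "sha256", "hashes": hashes,
                           "block_size": 4 * 1024 * 1024,
                           "bytes": sum(len(blocks[h]) for h in hashes)})
        resp = requests.put(url, params={"hashmap": "", "format": "json"},
                            headers=headers, data=body)
        if resp.status_code == 201:  # all blocks already on the server
            return
        if resp.status_code == 409:  # body lists the missing block hashes
            for missing in resp.json():
                requests.post(url, headers=headers,  # simplified block
                              data=blocks[missing])  # upload request
            resp = requests.put(url, params={"hashmap": "", "format": "json"},
                                headers=headers, data=body)
        resp.raise_for_status()
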
Block-based Storage Processing
------------------------------

Hashmaps themselves are saved in blocks. All blocks are persisted to storage
using content-based addressing. It follows that to read a file, Pithos performs
the following operations:

* The client issues a request to get a file, via HTTP ``GET``.
* The API front end asks the back end for the metadata
  of the object.
* The back end checks the permissions of the object and, if they
  allow access to it, returns the object's metadata.
* The front end evaluates any HTTP headers (such as
  ``If-Modified-Since``, ``If-Match``, etc.).
* If the preconditions are met, the API front end requests
  the object's hashmap from the back end (hashmaps are indexed by the
  full path).
* The back end will read and return to the API front end the
  object's hashmap from the underlying storage.
* Depending on the HTTP ``Range`` header, the
  API front end asks the back end for the required blocks, giving
  their corresponding hashes.
* The back end fetches the blocks from the underlying storage and
  passes them to the API front end, which returns them to the client.

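The flow can be made concrete with a toy, in-memory stand-in for the back end
(the method names are illustrative, not the real back end library API;
permission checks and HTTP preconditions are elided)::

    import hashlib

    class ToyBackend:
        """In-memory model: a content-addressed block store plus
        hashmaps indexed by the full object path."""
        def __init__(self):
            self.blocks = {}    # block hash -> block data
            self.hashmaps = {}  # "account/container/object" -> [hashes]

        def put_block(self, data):
            h = hashlib.sha256(data).hexdigest()
            self.blocks[h] = data  # deduplicated by construction
            return h

        def get_object_hashmap(self, path):
            return self.hashmaps[path]

        def get_block(self, h):
            return self.blocks[h]

    def read_object(backend, path):
        """Read path: look up the hashmap by path, then fetch each
        block by its content hash."""
        for h in backend.get_object_hashmap(path):
            yield backend.get_block(h)
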
Saving data from the client to the server is done in several different ways.

First, a regular HTTP ``PUT`` is the reverse of the HTTP ``GET``. The client
sends the full object to the API front end. The API front end splits the
object into blocks. It sends each block to the back end, which calculates its
hash and saves it to storage. When the hashmap is complete, the API front end
commands the back end to create a new object with the created hashmap and any
associated metadata.

Second, the client may send to the API front end a hashmap and any associated
metadata, with a specially formatted HTTP ``PUT``, using an appropriate URL
parameter. In this case, if the back end can find the requested blocks, the
object will be created as previously; otherwise it will report back the list of
missing blocks, which will be passed back to the client. The client may then
send the missing blocks by issuing an HTTP ``POST`` and then retry the HTTP
``PUT`` for the hashmap. This allows for very fast uploads, since it may happen
that no real data uploading takes place, if the blocks are already in data
storage.

Copying objects does not involve data copying, but is performed by associating
the object's hashmap with the new path. Moving objects, as in OpenStack, is a
copy followed by a delete, again with no real data being moved.

Updates to an existing object, which are not offered by OpenStack, are
implemented by issuing an HTTP ``POST`` request including the offset and the
length of the data. The API front end requests the existing object's hashmap
from the back end. Depending on the offset of the update (whether it falls
within block boundaries or not) the front end will ask the back end to update
or create new blocks. At the end, the front end will save the updated hashmap.
It is also possible to pass a parameter to HTTP ``POST`` to specify that the
data will come from another object, instead of being uploaded by the client.

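The block arithmetic involved can be sketched with a hypothetical helper
(not actual back end code)::

    def affected_blocks(offset, length, block_size):
        """Which blocks does an update of [offset, offset + length) touch?

        Blocks fully covered by the range can simply be replaced with
        new ones; a partially covered first or last block must be read,
        patched, and re-hashed. Assumes length > 0.
        """
        first = offset // block_size
        last = (offset + length - 1) // block_size
        partial_head = offset % block_size != 0
        partial_tail = (offset + length) % block_size != 0
        return first, last, partial_head, partial_tail
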
Pithos Back End Nodes
---------------------

Pithos organizes entities in a tree hierarchy, with one tree node per path
entry (see Figure). Nodes can be accounts, containers, and objects. A user may
have multiple accounts, each account may have multiple containers, and each
container may have multiple objects. An object may have multiple versions, and
each version of an object has properties (a set of fixed metadata, like size
and mtime) and arbitrary metadata.

.. image:: images/pithos-backend-nodes.png

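A data-model sketch of this hierarchy (the field names are illustrative, not
the actual database schema)::

    from dataclasses import dataclass, field

    @dataclass
    class Version:
        size: int        # fixed properties, like size...
        mtime: float     # ...and modification time
        meta: dict = field(default_factory=dict)  # arbitrary metadata

    @dataclass
    class Node:
        path: str                  # one node per unique path entry
        parent: "Node" = None      # e.g. the container, for an object
        children: list = field(default_factory=list)
        versions: list = field(default_factory=list)
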
The tree hierarchy has up to three levels, since, following the OpenStack API,
everything is stored as an object in a container. Folders or directories exist
only through conventions that simulate a pseudo-hierarchical structure. In
particular, object names that contain the forward slash character and have an
accompanying marker object with a ``Content-Type: application/directory`` as
part of their metadata can be treated as directories by Pithos clients. Each
node corresponds to a unique path, and we keep its parent in the
account/container/object hierarchy (that is, all objects have a container as
their parent).

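For example, a client could create such a marker object with a plain ``PUT``
(a sketch; the endpoint is hypothetical)::

    import requests

    PITHOS_URL = "https://pithos.example.com/v1"  # hypothetical endpoint

    def create_folder(token, account, container, folder):
        """Create an empty marker object whose Content-Type lets
        clients treat `folder` as a directory."""
        url = "%s/%s/%s/%s" % (PITHOS_URL, account, container, folder)
        resp = requests.put(url, data=b"", headers={
            "X-Auth-Token": token,
            "Content-Type": "application/directory",
        })
        resp.raise_for_status()
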
Pithos Back End Versions
------------------------

For each object version we keep the root Merkle hash of the object it refers
to, the size of the object, the last modification time, the user that
modified the file, and its cluster. A version belongs to one of the following
three clusters (see Figure):

  * normal, which contains the current versions
  * history, which contains the previous versions of objects
  * deleted, which contains objects that have been deleted

.. image:: images/pithos-backend-versions.png

This versioning allows Pithos to offer its users time-based listings of the
contents of their accounts. In effect, this also allows them to take their
containers back in time. This is implemented conceptually by taking a vertical
line in the Figure and presenting to the user the state on the left side of
the line.

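Conceptually, picking the state at a given moment looks like this (a toy
sketch; `versions` is a hypothetical flat list of version records)::

    def container_at(versions, timestamp):
        """State of a container at `timestamp`: for every path, the
        latest version created at or before that moment, skipping
        paths whose selected version is in the deleted cluster."""
        latest = {}
        for path, mtime, cluster in sorted(versions, key=lambda v: v[1]):
            if mtime <= timestamp:
                latest[path] = (mtime, cluster)
        return {p: v for p, v in latest.items() if v[1] != "deleted"}
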
Pithos Back End Permissions
---------------------------

Pithos recognizes read and write permissions, which can be granted to
individual users or groups of users. Groups are collections of users created
at the account level by users themselves, and are flat - a group cannot
contain or reference another group. Ownership of a file cannot be delegated.

Pithos also recognizes a "public" permission, which means that the object is
readable by all. When an object is made public, it is assigned a URL that can
be used to access the object from outside Pithos even by non-Pithos users.

Permissions can be assigned to objects, which may be actual files, or
directories. When listing objects, the back end uses the permissions as filters
for what to display, so that users will see only objects to which they have
access. Depending on the type of the object, the filter may be exact (plain
object), or a prefix (like ``path/*`` for a directory). When accessing objects,
the same rules are used to decide whether to allow the user to read or modify
the object or directory. If no permissions apply to a specific object, the back
end searches for permissions on the closest directory sharing a common prefix
with the object.

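A sketch of that lookup rule (the layout of the permissions mapping is
hypothetical)::

    def effective_permissions(path, permissions):
        """Exact object entries win; otherwise the closest (longest)
        directory entry of the form 'dir/*' sharing a prefix with the
        path applies."""
        if path in permissions:  # exact filter: a plain object
            return permissions[path]
        matches = [p for p in permissions
                   if p.endswith("/*") and path.startswith(p[:-1])]
        if not matches:
            return None
        return permissions[max(matches, key=len)]  # closest directory wins

    perms = {"docs/*": {"read": ["alice"]},
             "docs/plan.txt": {"write": ["bob"]}}
    effective_permissions("docs/plan.txt", perms)   # exact entry
    effective_permissions("docs/notes.txt", perms)  # falls back to docs/*
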