********************
Pithos Server Design
********************

Up-to-date: 2011-06-14

Key Target Features
===================

OpenStack Object Storage API
----------------------------

Originally derived from Rackspace's Cloudfiles API,
it features, like Amazon S3, a two-level real hierarchy
with containers and objects per account, while objects
within each container are hosted in a flat namespace,
albeit with options for 'virtual' hierarchy listings.
Accounts, containers and objects may host user-defined metadata.

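For illustration, a 'virtual' hierarchy listing could be requested along
these lines; the 'prefix' and 'delimiter' parameters follow the usual
CloudFiles/OpenStack listing conventions, while the endpoint, token and
response fields shown here are placeholders rather than anything fixed
by this document::

    import requests

    # List the contents of the 'photos/' virtual directory in the
    # container 'pithos'; endpoint and token are placeholders.
    resp = requests.get(
        "https://example.net/v1/user@example.com/pithos",
        params={"prefix": "photos/", "delimiter": "/", "format": "json"},
        headers={"X-Auth-Token": "<token>"},
    )
    for entry in resp.json():
        # objects are listed with a 'name'; virtual sub-directories,
        # if reported separately, appear as 'subdir' entries
        print(entry.get("name") or entry.get("subdir"))
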
The use of a well-known API will hopefully provide immediate
compatibility with existing clients and familiarity
to developers. Adoption and familiarity will also make
custom extensions more accessible to developers.

However, Pithos differs as a service, mainly because it (also)
targets end users. This inevitably forces extensions to the API,
but the plan is to keep OOS compatibility anyway.

Partial Transfers
-----------------

An important feature, especially for home users, is the ability
to continue transfers after interruption by choice or failure,
without significant loss of the partial transfer completed
up to the point of interruption.

The Manifest mechanism in the CloudFiles API and
the Multipart Upload mechanism in Amazon S3
both provide a means for partial transfers.
Manifests are not (yet?) in the OpenStack specification
but are considered for support in Pithos.

The Pithos Server approach is similar, allowing appending
(actually, any modification) to existing objects via
HTTP 1.1 chunked transfers. Chunks reach stable storage
before the whole transfer is complete, and restarting
a transfer from the point it was interrupted is possible
by querying the status of the target object (e.g. its size).

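A rough sketch of the resume flow under this scheme; the verb, headers
and endpoint used for the append below are illustrative placeholders,
since this document does not fix the exact API::

    import requests

    ACCOUNT = "https://example.net/v1/user@example.com"   # placeholder
    TOKEN = {"X-Auth-Token": "<token>"}

    def resume_upload(local_path, container, name, chunk_size=4 * 1024 * 1024):
        url = f"{ACCOUNT}/{container}/{name}"

        # Ask the server how much of the object already reached stable storage.
        head = requests.head(url, headers=TOKEN)
        offset = int(head.headers.get("Content-Length", "0")) if head.ok else 0

        def remaining():
            with open(local_path, "rb") as f:
                f.seek(offset)
                while chunk := f.read(chunk_size):
                    yield chunk

        # Passing a generator makes the body go out with HTTP 1.1 chunked
        # transfer encoding, so each chunk can be committed by the server
        # as it arrives; the 'append' verb itself is only a placeholder.
        requests.post(url, headers=TOKEN, data=remaining())
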
Atomic Updates
--------------

However, when updating existing objects, transfers do not
happen instantaneously. To support partial transfers,
the content must be committed to storage before the transfer
is complete. This creates a hazard:
the partially committed content may temporarily leave the object
in a visible, inconsistent state after the transfer has started
and before the transfer has completed.
Furthermore, failed transfers that are never retried result
in permanent corruption and data loss.

The Manifest and Multipart Upload mechanisms both provide atomicity,
but they only specify the creation of a new object and not
the updating of an existing one.

The Pithos Server approach is similar, but more flexible.
There are two types of updates to an object:
those with an HTTP body as the source, which are not atomic,
and those with another object as the source, which are atomic.
This way, clients may choose to perform atomic updates in
two stages, as with Manifests and Multipart Uploads,
or choose to make a hazardous update directly.

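Under that model, a client wanting atomicity could stage its transfer
roughly as follows; the temporary object name and the copy-source header
are illustrative, not part of any fixed API::

    import requests

    ACCOUNT = "https://example.net/v1/user@example.com"   # placeholder
    TOKEN = {"X-Auth-Token": "<token>"}

    def atomic_replace(container, name, data):
        target = f"{ACCOUNT}/{container}/{name}"
        staging = f"{ACCOUNT}/{container}/{name}.upload"   # hypothetical name

        # Stage 1: non-atomic upload to a temporary object; an interrupted
        # transfer only ever leaves a partial temporary object behind.
        requests.put(staging, headers=TOKEN, data=data)

        # Stage 2: atomic update of the target, using the temporary object
        # as the source of the update (server-side copy), then clean up.
        requests.put(
            target,
            headers={**TOKEN, "X-Copy-From": f"/{container}/{name}.upload"},
        )
        requests.delete(staging, headers=TOKEN)
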
Transfer bandwidth efficiency
-----------------------------

Another issue is the volume of data that needs to be transferred
when updating objects.
If multiple clients update the object regularly, none of
them can just send a patch, because the state of the file
on the server is unknown.
A mechanism is needed that can compute delta patches between
two versions. The standard candidate is librsync.
However, content-hashing provides another alternative,
discussed below.

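As a sketch of the content-hashing alternative (the block size and hash
function here are assumptions; the mechanism itself is described in the
next section), the client hashes its file block by block and only uploads
the blocks whose digests differ from those the server already holds::

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024          # assumed block size

    def block_hashes(path):
        """Hash a local file block by block into an ordered list of digests."""
        digests = []
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                digests.append(hashlib.sha256(block).hexdigest())
        return digests

    def blocks_to_upload(local_digests, server_digests):
        """Only blocks that are new or whose digest changed need to be sent."""
        return [
            i for i, digest in enumerate(local_digests)
            if i >= len(server_digests) or server_digests[i] != digest
        ]
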
Cheap versions, snapshots and clones
------------------------------------

The Pithos Server supports cheap versions, snapshots and clones
for its hosted objects. First, let us clarify the terms.

Versions
    are different objects with the same path,
    identified by an extra version identifier.
    Version identifiers are ordered by the time of version creation.

Snapshots
    are immutable copies of objects at a specific point in time,
    archived for future reference.

Clones
    are mutable copies of snapshots or other objects,
    and have their own private history after their creation.

Note that snapshots and clones may have a different path than
their source object, while versions always have the same path.

It is not yet decided whether snapshots will be exposed explicitly in the API,
by providing a read-only object type.
It is also not yet decided whether clones will be exposed explicitly in the API,
as objects that keep a 'source' reference back to their source object.

Effectively, the only prerequisite for cheap versions, snapshots and
clones is cheap copying of objects. This Pithos Server feature is
designed in concert with content-hashing, explained below.

Pluggable Storage Backend
-------------------------

The Pithos service is destined to run on distributed block storage servers.
Therefore, the Pithos Server design assumes a pluggable storage backend
interfacing to a reliable, redundant object storage service.

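A minimal sketch of what such a pluggable backend interface might reduce
to, assuming the content-addressed block model described in the next
section (names and signatures here are illustrative)::

    from abc import ABC, abstractmethod

    class BlockBackend(ABC):
        """Hypothetical backend interface: a simple put/get store of
        immutable, content-addressed blocks."""

        @abstractmethod
        def put_block(self, data: bytes) -> str:
            """Store a block and return its content hash."""

        @abstractmethod
        def get_block(self, digest: str) -> bytes:
            """Return the block identified by its content hash."""
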
Content-Hashing: The Case for Virtualizing Files
================================================

All of the key target features (except the OOS API) can be well serviced
by a content-hashing/blocking approach to storage.

Raw storage is becoming a basic commodity over the internet,
in many forms, and especially as a 'cloud' service.
It makes sense to virtualize files and separate
the layer that contains and serves storage
from the semantically rich, application-defined file-serving layer.

Content hashing with a consistent blocking (chunking) scheme
inherently provides both universal content identification
and logical separation of file- and content-serving.
Additionally, it offers many benefits and
its costs have minimal impact.

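To make the idea concrete, a file is reduced to a small map of its size
and block digests; the layout and digest values below are only
illustrative::

    # A file 'virtualized' as a map: its size plus the ordered digests
    # of its blocks (digest values shortened for readability).
    report_v1 = {
        "bytes": 9 * 1024 * 1024,
        "hashes": ["ab12...", "cd34...", "ef56..."],
    }

    # A copy or snapshot duplicates only the map, never the blocks,
    # which are immutable and shared through their content hashes.
    report_copy = {"bytes": report_v1["bytes"],
                   "hashes": list(report_v1["hashes"])}

    # A small in-place change touches one block: only that digest changes
    # in the map and only that block needs to be stored again.
    report_v2 = {"bytes": report_v1["bytes"],
                 "hashes": ["ab12...", "99ff...", "ef56..."]}
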
We go through the benefits and comment on each:

* Universal identification
    With the reservation of hash collisions, every file and every block-aligned
    part of a file can be uniquely identified by its hash digest.

* Get data from anyone who might serve it, easy sharing
    Because of universal identification,
    it does not matter where you get the actual contents from.
    Just like it (almost) does not matter where you get
    your internet feed from.

    Pithos will seek to exploit that by deploying a
    separate layer for serving blocks.

* Universal storage provisioning, efficient content serving
    Because content-hashing separates the semantically and functionally rich
    file-serving layer from the content-serving layer,
    the actual storage service has a simple get/put interface
    to read-only objects.

    This enables easy deployment of diverse systems as storage backends.
    It also enables the content-serving system to be separate, simpler
    and more performant than the file-serving system.

* Cheap file copies
    Files become maps of hashes, and are "virtualized", in the sense
    that they only contain pointers to the actual content.
    The maps are much smaller (depending on the block size)
    and copying them incurs little overhead compared to copying
    the data.

* Cheap updates
    A small change in a large file will result only in small changes in
    the map of hashes, therefore only the relevant blocks need to be uploaded
    and updated within the map.

* Data checksums
    Data reliability is no longer an ignorable issue and we need to
    checksum our data anyway. Content hashes are precisely that.

The only drawback we have registered is the overhead of managing
blocks that are no longer used.
However, our (somewhat) educated guess is that the unused blocks will
be only a small percentage of the total.
In any case, our current consensus is that we will clean them up
in maintenance operations.


Specific Issues
===============

Pseudo- vs real hierarchies
---------------------------

The main difference between real and pseudo-hierarchies is in the
way the namespace is built.
In real hierarchies, like most disk filesystems,
the namespace is built by recursively nesting directories
inside a root directory.
In pseudo-hierarchies, the namespace is flat, and hierarchy is defined
by the lexicographical ordering of the path of each file.

There are two important practical consequences:

- Pseudo-hierarchies can have less overhead and perform faster lookups
  (see the sketch below), but cannot move files efficiently.

- Real hierarchies can move files instantly,
  but every lookup must iterate through all parents of the target file.

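The following toy sketch illustrates the pseudo-hierarchy lookup; a
sorted Python list stands in for whatever index the server actually
uses::

    import bisect

    # A flat, lexicographically sorted namespace of object paths.
    paths = sorted([
        "photos/2011/june/a.jpg",
        "photos/2011/june/b.jpg",
        "photos/2011/readme.txt",
        "videos/intro.mp4",
    ])

    def list_dir(prefix):
        """List one pseudo-directory level by scanning the prefix range."""
        start = bisect.bisect_left(paths, prefix)
        entries = set()
        for path in paths[start:]:
            if not path.startswith(prefix):
                break
            rest = path[len(prefix):]
            head, sep, _ = rest.partition("/")
            entries.add(head + sep)       # 'june/...' collapses to 'june/'
        return sorted(entries)

    print(list_dir("photos/2011/"))       # ['june/', 'readme.txt']

Moving a whole sub-tree, on the other hand, means rewriting every path
under it, which is why pseudo-hierarchies cannot move files efficiently.
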
Since the Pithos Server, through the OpenStack Object Storage API,
has adopted a pseudo-hierarchy for the containers,
it is important not to contradict this design choice
with other incompatible choices.

File Properties, Attributes, Features
-------------------------------------

For the purpose of designing the Pithos Server internals,
we define three types of metadata for files.

Properties
    are intrinsic qualities of files, such as their size or content-type.
    Properties are interpreted and handled specially by the Server code,
    and changing them means fundamentally altering the object.

Attributes
    are key-value pairs attached to each file by the user,
    and are not intrinsic qualities of files like properties.
    The system may interpret some special keys as user input,
    but never considers attributes to be a fundamental
    and trusted quality of a file.
    In the current design,
    file versions do not share one attribute set, but each has its own.

X-Features
    are like attributes, but with one key difference and one key limitation.
    Unlike attributes, features are attached to paths and not to versions.
    Therefore, each file "inherits" the features that are defined
    by its path, or some prefix of its path.
    The X stands for *exclusive*, which is the limitation of x-features:
    there can never be two X-Feature sets on two overlapping paths.
    Therefore, in order to set a feature set on a path,
    it is necessary to purge features from overlapping paths,
    or the operation fails.
    This limitation greatly reduces the practical overhead for the Server
    to query features and feature inheritance for arbitrary paths.
    However, it does not cripple their use nearly as much.
    One may even argue that it simplifies things for the users.

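A minimal sketch of the exclusivity rule, with a plain in-memory dict
standing in for the Server's feature store (names are illustrative)::

    features = {}          # path prefix -> feature set (x-features)

    def overlapping(a, b):
        """Two paths overlap when one is a prefix of the other."""
        return a.startswith(b) or b.startswith(a)

    def set_features(path, feature_set):
        """Attach a feature set to a path, failing on any overlapping path."""
        for existing in features:
            if existing != path and overlapping(existing, path):
                raise ValueError("feature set already present on " + existing)
        features[path] = feature_set

    def lookup_features(path):
        """A file inherits the single feature set defined on a prefix of
        its path; exclusivity guarantees there is at most one."""
        for prefix, feature_set in features.items():
            if path.startswith(prefix):
                return feature_set
        return {}
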
Sharing
-------

The basic idea is that sharing is expressed as one read-list and one write-list,
attached as x-features of a path, and that users and groups of users may
be specified in each list, granting the corresponding access rights.
Currently, the 'write' permission implies the 'read' one.
More permission types are possible with the addition of relevant lists.

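A sharing grant and its check might then look roughly as follows; the key
names and the 'group:' convention are placeholders, and in the design above
such a dict would be attached to a path through the x-features mechanism::

    # Illustrative only: the read and write lists of a shared path.
    sharing = {
        "read":  ["alice@example.com", "group:designers"],
        "write": ["bob@example.com"],      # 'write' also implies 'read'
    }

    def can_read(user, groups, grants=sharing):
        allowed = set(grants.get("read", [])) | set(grants.get("write", []))
        return user in allowed or any("group:" + g in allowed for g in groups)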