********************
Pithos Server Design
********************

Up-to-date: 2011-06-14

Key Target Features
===================

OpenStack Object Storage API
----------------------------

Originally derived from Rackspace's CloudFiles API,
it features, like Amazon S3, a two-level real hierarchy
with containers and objects per account, while objects
within each container are hosted in a flat namespace,
albeit with options for 'virtual' hierarchy listings.
Accounts, containers and objects may host user-defined metadata.

The use of a well-known API will, hopefully, provide immediate
compatibility with existing clients and familiarity
to developers. Adoption and familiarity will also make
custom extensions more accessible to developers.

However, Pithos differs as a service, mainly because it (also)
targets end users. This inevitably forces extensions to the API,
but the plan is to maintain OOS compatibility nonetheless.

Partial Transfers
-----------------

An important feature, especially for home users, is the ability
to continue transfers after interruption by choice or failure,
without significant loss of the partial transfer completed
up to the point of interruption.

The Manifest mechanism in the CloudFiles API and
the Multipart Upload mechanism in Amazon S3
both provide a means for partial transfers.
Manifests are not (yet?) in the OpenStack specification,
but are considered for support in Pithos.

The Pithos Server approach is similar, allowing appends
(in fact, any modification) to existing objects via
HTTP 1.1 chunked transfers. Chunks reach stable storage
before the whole transfer is complete, and a transfer
can be restarted from the point of interruption
by querying the status of the target object (e.g. its size).

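As a minimal client-side sketch of such a restart (the verb, the URL
layout and the ``X-Auth-Token`` header are illustrative assumptions,
not the API described here):

.. code-block:: python

    import os
    import requests  # a common HTTP client; any chunked-capable client works

    def resume_upload(object_url, local_path, token, chunk_size=1 << 20):
        """Resume an interrupted upload from where the server left off."""
        headers = {'X-Auth-Token': token}

        # Ask the server how much of the object reached stable storage.
        head = requests.head(object_url, headers=headers)
        offset = int(head.headers.get('Content-Length', '0'))

        if offset >= os.path.getsize(local_path):
            return  # nothing left to send

        def remaining():
            # A generator body makes requests issue a chunked transfer.
            with open(local_path, 'rb') as f:
                f.seek(offset)
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    yield chunk

        # Hypothetical append-style update of the target object.
        requests.post(object_url, headers=headers, data=remaining())
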
Atomic Updates
--------------

However, when updating existing objects, transfers do not
happen instantaneously. To support partial transfers,
the content must be committed to storage before the transfer
is complete. This creates a hazard:
the partially committed content may temporarily leave the object
in a visible, inconsistent state after the transfer has started
and before it has completed.
Furthermore, failed transfers that are never retried result
in permanent corruption and data loss.

The Manifest and Multipart Upload mechanisms both provide atomicity,
but they only specify the creation of a new object, not
the updating of an existing one.

The Pithos Server approach is similar, but more flexible.
There are two types of updates to an object:
those with an HTTP body as a source, which are not atomic,
and those with another object as a source, which are atomic.
This way, clients may choose to perform atomic updates in
two stages, as with Manifests and Multipart Uploads,
or to make a hazardous update directly.

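A sketch of the two-stage pattern follows. The staging path and the
``X-Source-Object`` header are hypothetical names for the
"object as source" mechanism above, not a specified API:

.. code-block:: python

    import requests  # illustration only

    def atomic_update(container_url, name, data, token):
        """Two-stage atomic update of an existing object."""
        headers = {'X-Auth-Token': token}
        staging = '%s/.staging-%s' % (container_url, name)
        target = '%s/%s' % (container_url, name)

        # Stage 1 (not atomic): push the new content to a staging object.
        # An interrupted transfer corrupts only the staging object.
        requests.put(staging, headers=headers, data=data)

        # Stage 2 (atomic): update the target with another object as source.
        requests.put(target,
                     headers=dict(headers, **{'X-Source-Object': staging}))
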
Transfer bandwidth efficiency
-----------------------------

Another issue is the volume of data that needs to be transferred
when updating objects.
If multiple clients update an object regularly, none of
them can simply send a patch, because the state of the file
on the server is unknown to them.
A mechanism is needed that can compute delta patches between
two versions. The standard candidate is librsync.
However, content-hashing provides another alternative,
discussed below.

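To make the content-hashing alternative concrete, here is a sketch of
how a client could decide which blocks to send, assuming the server
exposes the object's current hash map (the block size and hash
algorithm are illustrative assumptions):

.. code-block:: python

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size

    def local_hashmap(path):
        """Hash each fixed-size block of a local file."""
        hashes = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    return hashes
                hashes.append(hashlib.sha256(block).hexdigest())

    def delta_blocks(path, server_hashmap):
        """Offsets of the blocks the server does not already store."""
        known = set(server_hashmap)
        return [i * BLOCK_SIZE
                for i, digest in enumerate(local_hashmap(path))
                if digest not in known]
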
Cheap versions, snapshots and clones
------------------------------------

The Pithos Server supports cheap versions, snapshots and clones
for its hosted objects. First, to clarify the terms:

Versions
  are different objects with the same path,
  identified by an extra version identifier.
  Version identifiers are ordered by the time of the version's creation.

Snapshots
  are immutable copies of objects at a specific point in time,
  archived for future reference.

Clones
  are mutable copies of snapshots or other objects,
  and have their own private history after their creation.

Note that snapshots and clones may have a different path than
their source object, while versions always have the same path.

It is not yet decided whether snapshots will be exposed explicitly in
the API, via a read-only object type.
It is also not yet decided whether clones will be exposed explicitly in
the API, as objects that keep a 'source' reference back to their source
object.

Effectively, the only prerequisite for cheap versions, snapshots and
clones is cheap copying of objects. This Pithos Server feature is
designed in concert with content-hashing, explained below.

Pluggable Storage Backend
-------------------------

The Pithos service is destined to run on distributed block storage servers.
Therefore the Pithos Server design assumes a pluggable storage backend
providing a reliable, redundant object storage service.

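The interface such a backend must expose can be kept very small. A
hypothetical sketch (the class and method names are ours, not a
specified API):

.. code-block:: python

    from abc import ABC, abstractmethod

    class StorageBackend(ABC):
        """A simple get/put interface to immutable,
        content-addressed blocks."""

        @abstractmethod
        def put_block(self, data: bytes) -> str:
            """Store an immutable block; return its content-hash digest."""

        @abstractmethod
        def get_block(self, digest: str) -> bytes:
            """Return the block previously stored under `digest`."""
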
Content-Hashing: The Case for Virtualizing Files
================================================

All of the key target features (except the OOS API) can be well serviced
by a content-hashing/blocking approach to storage.

Raw storage is becoming a basic commodity over the internet,
in many forms, and especially as a 'cloud' service.
It makes sense to virtualize files and separate
the layer that stores and serves content
from the semantically rich and application-defined file-serving layer.

Content hashing with a consistent blocking (chunking) scheme
inherently provides both universal content identification
and logical separation of file- and content-serving.
Additionally, it offers many benefits, and
its costs have minimal impact.

We enumerate the benefits and comment on each:

* Universal identification.
  With the reservation of hash collisions, every file and every block-aligned
  part of a file can be uniquely identified by its hash digest.

* Get data from anyone who might serve it; easy sharing.
  Because of universal identification,
  it does not matter where you get the actual contents from,
  just as it (almost) does not matter where you get
  your internet feed from.

  Pithos will seek to exploit this by deploying a
  separate layer for serving blocks.

* Universal storage provisioning, efficient content serving.
  Because content-hashing separates the semantically and functionally rich
  file-serving layer from the content-serving layer,
  the actual storage service has a simple get/put interface
  to read-only objects.

  This enables easy deployment of diverse systems as storage backends.
  It also enables the content-serving system to be separate, simpler
  and more performant than the file-serving system.

* Cheap file copies.
  Files become maps of hashes, and are "virtualized", in the sense
  that they only contain pointers to the actual content.
  The maps are much smaller (depending on the block size),
  and copying them incurs little overhead compared to copying
  the data. A sketch of this follows the list.

* Cheap updates.
  A small change in a large file will result only in small changes to
  the hash map, therefore only the relevant blocks need to be uploaded
  and updated within the map.

* Data checksums.
  Data reliability is no longer an ignorable issue and we need to
  checksum our data anyway. A content hash is precisely that.

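A simplified sketch of files as hash maps and of the resulting cheap
copies (the block size, hash algorithm and in-memory block store are
illustrative assumptions):

.. code-block:: python

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size

    def virtualize(path, store):
        """Store a file's blocks; the file becomes a map of hashes."""
        hashmap = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    return hashmap
                digest = hashlib.sha256(block).hexdigest()
                store[digest] = block  # content-addressed block store
                hashmap.append(digest)

    def cheap_copy(hashmap):
        """Copying a file copies the small map; no data blocks move."""
        return list(hashmap)
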
The only drawback we have registered is the overhead of managing
blocks that are no longer used.
However, our (somewhat) educated guess is that the unused blocks will
be only a small percentage of the total.
In any case, our current consensus is that we clean them up
in maintenance operations.

Specific Issues
===============

Pseudo- vs real hierarchies
---------------------------

The main difference between real and pseudo-hierarchies is in the
way the namespace is built.
In real hierarchies, as in most disk filesystems,
the namespace is built by recursively containing directories
inside a root directory.
In pseudo-hierarchies, the namespace is flat, and hierarchy is defined
by the lexicographical ordering of the path of each file.

There are two important practical consequences
(a listing sketch follows the list):

- Pseudo-hierarchies can have less overhead and perform faster lookups,
  but cannot move files efficiently.

- Real hierarchies can move files instantly,
  but every lookup must iterate through all parents of the target file.

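A minimal sketch of how a flat, lexicographically ordered namespace
yields 'virtual' directory listings (the delimiter and the in-memory
path set are illustrative assumptions):

.. code-block:: python

    def list_dir(paths, prefix, delimiter='/'):
        """List the direct children of `prefix` in a flat namespace.

        Files are returned as-is; deeper paths are collapsed into
        their first path segment, marked with a trailing delimiter.
        """
        entries = []
        for path in sorted(paths):
            if not path.startswith(prefix):
                continue
            rest = path[len(prefix):]
            head, sep, _ = rest.partition(delimiter)
            entry = prefix + head + sep
            if not entries or entries[-1] != entry:
                entries.append(entry)
        return entries

    # list_dir({'a/b/c', 'a/b/d', 'a/e'}, 'a/') -> ['a/b/', 'a/e']
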
Since Pithos Server, through the OpenStack Object Storage API,
has adopted a pseudo-hierarchy for the containers,
it is important not to contradict this design choice
with other, incompatible choices.

File Properties, Attributes, Features
-------------------------------------

For the purpose of the design of Pithos Server internals,
we define three types of metadata for files.

Properties
  are intrinsic qualities of files, such as their size or content-type.
  Properties are interpreted and handled specially by the Server code,
  and changing them means fundamentally altering the object.

Attributes
  are key-value pairs attached to each file by the user,
  and are not intrinsic qualities of files like properties.
  The System may interpret some special keys as user input,
  but never considers attributes to be a fundamental
  and trusted quality of a file.
  In the current design,
  file versions do not share one attribute set; each has its own.

X-Features
  are like attributes, but with one key difference and one key limitation.
  Unlike attributes, features are attached to paths, not versions.
  Therefore, each file "inherits" the features that are defined
  by its path, or by some prefix of its path.
  The X stands for *exclusive*, which is the limitation of x-features:
  there can never be two X-Feature sets on two overlapping paths.
  Therefore, in order to set a feature set on a path,
  features must first be purged from overlapping paths,
  or the operation fails (see the sketch below).
  This limitation greatly reduces the practical overhead for the Server
  when querying features and feature inheritance for arbitrary paths,
  without restricting their usefulness nearly as much.
  One may even argue that it simplifies things for the users.

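A minimal sketch of the exclusivity rule, reading "overlapping" as one
path being a path-prefix of the other, with an in-memory mapping
standing in for the Server's feature storage:

.. code-block:: python

    def overlaps(a, b, delimiter='/'):
        """Two paths overlap if one is a path-prefix of the other."""
        a, b = a.rstrip(delimiter), b.rstrip(delimiter)
        shorter, longer = sorted((a, b), key=len)
        return longer == shorter or longer.startswith(shorter + delimiter)

    def set_features(feature_sets, path, features):
        """Set a feature set on `path`; fail if exclusivity is violated."""
        conflicts = [p for p in feature_sets
                     if p != path and overlaps(p, path)]
        if conflicts:
            raise ValueError('must first purge features at: %r' % conflicts)
        feature_sets[path] = features
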
Sharing
-------

The basic idea is that sharing consists of one read-list and one
write-list, stored as x-features of a path; users and groups of users
may be specified in each list, granting the corresponding access rights.
Currently, the 'write' permission implies the 'read' one.
More permission types are possible with the addition of relevant lists.

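A sketch of the corresponding permission check, under the same
assumptions as the x-features sketch above (the feature-key names
'read' and 'write' are illustrative):

.. code-block:: python

    def allowed(feature_sets, path, user, groups, access):
        """Check `access` ('read' or 'write') for `user` on `path`."""
        for prefix, features in feature_sets.items():
            # Exclusivity guarantees at most one governing prefix.
            if not (path == prefix or path.startswith(prefix + '/')):
                continue
            grants = set(features.get('write', ()))
            if access == 'read':
                grants |= set(features.get('read', ()))  # write implies read
            return user in grants or bool(grants & set(groups))
        return False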