********************
Pithos Server Design
********************

Up-to-date: 2011-06-14

Key Target Features
===================

OpenStack Object Storage API
----------------------------

Originally derived from Rackspace's CloudFiles API,
it features, like Amazon S3, a two-level real hierarchy
with containers and objects per account, while objects
within each container are hosted in a flat namespace,
albeit with options for 'virtual' hierarchy listings.
Accounts, containers and objects may host user-defined metadata.

The use of a well-known API will, hopefully, provide immediate
compatibility with existing clients and familiarity
to developers. Adoption and familiarity will also make
custom extensions more accessible to developers.

However, Pithos differs as a service, mainly because it (also)
targets end users. This inevitably forces extensions to the API,
but the plan is to maintain OOS compatibility nonetheless.

Partial Transfers
-----------------

An important feature, especially for home users, is the ability
to continue transfers after interruption by choice or failure,
without significant loss of the partial transfer completed
up to the point of interruption.

The Manifest mechanism in the CloudFiles API and
the Multipart Upload mechanism in Amazon S3
both provide a means for partial transfers.
Manifests are not (yet?) in the OpenStack specification,
but are considered for support in Pithos.

The Pithos Server approach is similar, allowing appends
(in fact, any modification) to existing objects via
HTTP 1.1 chunked transfers. Chunks reach stable storage
before the whole transfer is complete, and a transfer
can be restarted from the point of interruption
by querying the status of the target object (e.g. its size).

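As a minimal client-side sketch of such a restart (the verb, the URL
layout and the ``X-Auth-Token`` header are illustrative assumptions,
not the API described here):

.. code-block:: python

    import os
    import requests  # a common HTTP client; any chunked-capable client works

    def resume_upload(object_url, local_path, token, chunk_size=1 << 20):
        """Resume an interrupted upload from where the server left off."""
        headers = {'X-Auth-Token': token}

        # Ask the server how much of the object reached stable storage.
        head = requests.head(object_url, headers=headers)
        offset = int(head.headers.get('Content-Length', '0'))

        if offset >= os.path.getsize(local_path):
            return  # nothing left to send

        def remaining():
            # A generator body makes requests issue a chunked transfer.
            with open(local_path, 'rb') as f:
                f.seek(offset)
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    yield chunk

        # Hypothetical append-style update of the target object.
        requests.post(object_url, headers=headers, data=remaining())
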
Atomic Updates
--------------

However, when updating existing objects, transfers do not
happen instantaneously. To support partial transfers,
the content must be committed to storage before the transfer
is complete. This creates a hazard:
the partially committed content may temporarily leave the object
in a visible, inconsistent state after the transfer has started
and before it has completed.
Furthermore, failed transfers that are never retried result
in permanent corruption and data loss.

The Manifest and Multipart Upload mechanisms both provide atomicity,
but they only specify the creation of a new object, not
the updating of an existing one.

The Pithos Server approach is similar, but more flexible.
There are two types of updates to an object:
those with an HTTP body as a source, which are not atomic,
and those with another object as a source, which are atomic.
This way, clients may choose to perform atomic updates in
two stages, as with Manifests and Multipart Uploads,
or to make a hazardous update directly.

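A sketch of the two-stage pattern follows. The staging path and the
``X-Source-Object`` header are hypothetical names for the
"object as source" mechanism above, not a specified API:

.. code-block:: python

    import requests  # illustration only

    def atomic_update(container_url, name, data, token):
        """Two-stage atomic update of an existing object."""
        headers = {'X-Auth-Token': token}
        staging = '%s/.staging-%s' % (container_url, name)
        target = '%s/%s' % (container_url, name)

        # Stage 1 (not atomic): push the new content to a staging object.
        # An interrupted transfer corrupts only the staging object.
        requests.put(staging, headers=headers, data=data)

        # Stage 2 (atomic): update the target with another object as source.
        requests.put(target,
                     headers=dict(headers, **{'X-Source-Object': staging}))
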
Transfer bandwidth efficiency
-----------------------------

Another issue is the volume of data that needs to be transferred
when updating objects.
If multiple clients update an object regularly, none of
them can simply send a patch, because the state of the file
on the server is unknown to them.
A mechanism is needed that can compute delta patches between
two versions. The standard candidate is librsync.
However, content-hashing provides another alternative,
discussed below.

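To make the content-hashing alternative concrete, here is a sketch of
how a client could decide which blocks to send, assuming the server
exposes the object's current hash map (the block size and hash
algorithm are illustrative assumptions):

.. code-block:: python

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size

    def local_hashmap(path):
        """Hash each fixed-size block of a local file."""
        hashes = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    return hashes
                hashes.append(hashlib.sha256(block).hexdigest())

    def delta_blocks(path, server_hashmap):
        """Offsets of the blocks the server does not already store."""
        known = set(server_hashmap)
        return [i * BLOCK_SIZE
                for i, digest in enumerate(local_hashmap(path))
                if digest not in known]
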
Cheap versions, snapshots and clones
------------------------------------

The Pithos Server supports cheap versions, snapshots and clones
for its hosted objects. First, to clarify the terms:

Versions
  are different objects with the same path,
  identified by an extra version identifier.
  Version identifiers are ordered by the time of the version's creation.

Snapshots
  are immutable copies of objects at a specific point in time,
  archived for future reference.

Clones
  are mutable copies of snapshots or other objects,
  and have their own private history after their creation.

Note that snapshots and clones may have a different path than
their source object, while versions always have the same path.

It is not yet decided whether snapshots will be exposed explicitly in
the API, via a read-only object type.
It is also not yet decided whether clones will be exposed explicitly in
the API, as objects that keep a 'source' reference back to their source
object.

Effectively, the only prerequisite for cheap versions, snapshots and
clones is cheap copying of objects. This Pithos Server feature is
designed in concert with content-hashing, explained below.

Pluggable Storage Backend
-------------------------

The Pithos service is destined to run on distributed block storage servers.
Therefore the Pithos Server design assumes a pluggable storage backend
providing a reliable, redundant object storage service.

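The interface such a backend must expose can be kept very small. A
hypothetical sketch (the class and method names are ours, not a
specified API):

.. code-block:: python

    from abc import ABC, abstractmethod

    class StorageBackend(ABC):
        """A simple get/put interface to immutable,
        content-addressed blocks."""

        @abstractmethod
        def put_block(self, data: bytes) -> str:
            """Store an immutable block; return its content-hash digest."""

        @abstractmethod
        def get_block(self, digest: str) -> bytes:
            """Return the block previously stored under `digest`."""
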
Content-Hashing: The Case for Virtualizing Files
================================================

All of the key target features (except the OOS API) can be well serviced
by a content-hashing/blocking approach to storage.

Raw storage is becoming a basic commodity over the internet,
in many forms, and especially as a 'cloud' service.
It makes sense to virtualize files and separate
the layer that stores and serves content
from the semantically rich and application-defined file-serving layer.

Content hashing with a consistent blocking (chunking) scheme
inherently provides both universal content identification
and logical separation of file- and content-serving.
Additionally, it offers many benefits, and
its costs have minimal impact.

We enumerate the benefits and comment on each:

* Universal identification.
  With the reservation of hash collisions, every file and every block-aligned
  part of a file can be uniquely identified by its hash digest.

* Get data from anyone who might serve it; easy sharing.
  Because of universal identification,
  it does not matter where you get the actual contents from,
  just as it (almost) does not matter where you get
  your internet feed from.

  Pithos will seek to exploit this by deploying a
  separate layer for serving blocks.

* Universal storage provisioning, efficient content serving.
  Because content-hashing separates the semantically and functionally rich
  file-serving layer from the content-serving layer,
  the actual storage service has a simple get/put interface
  to read-only objects.

  This enables easy deployment of diverse systems as storage backends.
  It also enables the content-serving system to be separate, simpler
  and more performant than the file-serving system.

* Cheap file copies.
  Files become maps of hashes, and are "virtualized", in the sense
  that they only contain pointers to the actual content.
  The maps are much smaller (depending on the block size),
  and copying them incurs little overhead compared to copying
  the data. A sketch of this follows the list.

* Cheap updates.
  A small change in a large file will result only in small changes to
  the hash map, therefore only the relevant blocks need to be uploaded
  and updated within the map.

* Data checksums.
  Data reliability is no longer an ignorable issue and we need to
  checksum our data anyway. A content hash is precisely that.

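A simplified sketch of files as hash maps and of the resulting cheap
copies (the block size, hash algorithm and in-memory block store are
illustrative assumptions):

.. code-block:: python

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size

    def virtualize(path, store):
        """Store a file's blocks; the file becomes a map of hashes."""
        hashmap = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    return hashmap
                digest = hashlib.sha256(block).hexdigest()
                store[digest] = block  # content-addressed block store
                hashmap.append(digest)

    def cheap_copy(hashmap):
        """Copying a file copies the small map; no data blocks move."""
        return list(hashmap)
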
The only drawback we have registered is the overhead of managing
blocks that are no longer used.
However, our (somewhat) educated guess is that the unused blocks will
be only a small percentage of the total.
In any case, our current consensus is that we clean them up
in maintenance operations.

Specific Issues
===============

Pseudo- vs real hierarchies
---------------------------

The main difference between real and pseudo-hierarchies is in the
way the namespace is built.
In real hierarchies, as in most disk filesystems,
the namespace is built by recursively containing directories
inside a root directory.
In pseudo-hierarchies, the namespace is flat, and hierarchy is defined
by the lexicographical ordering of the path of each file.

There are two important practical consequences
(a listing sketch follows the list):

- Pseudo-hierarchies can have less overhead and perform faster lookups,
  but cannot move files efficiently.

- Real hierarchies can move files instantly,
  but every lookup must iterate through all parents of the target file.

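A minimal sketch of how a flat, lexicographically ordered namespace
yields 'virtual' directory listings (the delimiter and the in-memory
path set are illustrative assumptions):

.. code-block:: python

    def list_dir(paths, prefix, delimiter='/'):
        """List the direct children of `prefix` in a flat namespace.

        Files are returned as-is; deeper paths are collapsed into
        their first path segment, marked with a trailing delimiter.
        """
        entries = []
        for path in sorted(paths):
            if not path.startswith(prefix):
                continue
            rest = path[len(prefix):]
            head, sep, _ = rest.partition(delimiter)
            entry = prefix + head + sep
            if not entries or entries[-1] != entry:
                entries.append(entry)
        return entries

    # list_dir({'a/b/c', 'a/b/d', 'a/e'}, 'a/') -> ['a/b/', 'a/e']
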
Since Pithos Server, through the OpenStack Object Storage API,
has adopted a pseudo-hierarchy for the containers,
it is important not to contradict this design choice
with other, incompatible choices.

File Properties, Attributes, Features
-------------------------------------

For the purpose of the design of Pithos Server internals,
we define three types of metadata for files.

Properties
  are intrinsic qualities of files, such as their size or content-type.
  Properties are interpreted and handled specially by the Server code,
  and changing them means fundamentally altering the object.

Attributes
  are key-value pairs attached to each file by the user,
  and are not intrinsic qualities of files like properties.
  The System may interpret some special keys as user input,
  but never considers attributes to be a fundamental
  and trusted quality of a file.
  In the current design,
  file versions do not share one attribute set; each has its own.

X-Features
  are like attributes, but with one key difference and one key limitation.
  Unlike attributes, features are attached to paths, not versions.
  Therefore, each file "inherits" the features that are defined
  by its path, or by some prefix of its path.
  The X stands for *exclusive*, which is the limitation of x-features:
  there can never be two X-Feature sets on two overlapping paths.
  Therefore, in order to set a feature set on a path,
  features must first be purged from overlapping paths,
  or the operation fails (see the sketch below).
  This limitation greatly reduces the practical overhead for the Server
  when querying features and feature inheritance for arbitrary paths,
  without restricting their usefulness nearly as much.
  One may even argue that it simplifies things for the users.

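A minimal sketch of the exclusivity rule, reading "overlapping" as one
path being a path-prefix of the other, with an in-memory mapping
standing in for the Server's feature storage:

.. code-block:: python

    def overlaps(a, b, delimiter='/'):
        """Two paths overlap if one is a path-prefix of the other."""
        a, b = a.rstrip(delimiter), b.rstrip(delimiter)
        shorter, longer = sorted((a, b), key=len)
        return longer == shorter or longer.startswith(shorter + delimiter)

    def set_features(feature_sets, path, features):
        """Set a feature set on `path`; fail if exclusivity is violated."""
        conflicts = [p for p in feature_sets
                     if p != path and overlaps(p, path)]
        if conflicts:
            raise ValueError('must first purge features at: %r' % conflicts)
        feature_sets[path] = features
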
Sharing
-------

The basic idea is that sharing consists of one read-list and one
write-list, stored as x-features of a path; users and groups of users
may be specified in each list, granting the corresponding access rights.
Currently, the 'write' permission implies the 'read' one.
More permission types are possible with the addition of relevant lists.

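A sketch of the corresponding permission check, under the same
assumptions as the x-features sketch above (the feature-key names
'read' and 'write' are illustrative):

.. code-block:: python

    def allowed(feature_sets, path, user, groups, access):
        """Check `access` ('read' or 'write') for `user` on `path`."""
        for prefix, features in feature_sets.items():
            # Exclusivity guarantees at most one governing prefix.
            if not (path == prefix or path.startswith(prefix + '/')):
                continue
            grants = set(features.get('write', ()))
            if access == 'read':
                grants |= set(features.get('read', ()))  # write implies read
            return user in grants or bool(grants & set(groups))
        return False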