Pairtrees for object storage

  • 900 views
Uploaded on

 

More in: Technology , Education
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
900
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
3
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Pairtrees for object storage John Kunze and Stephen Abrams, California Digital Library (CDL) SummaryThe deadly embrace Objects in a pairtree Pairtree is the thinnest smear we can add to our very well-• Digital repositories tend to require a surrender of storage A pairtree is especially useful if, for each contained object, understood filesystems and their universal tools (the transparency that creates unhealthy system dependency all of the object’s parts, and nothing but its parts, are universal “API”) to create a very well-understood,• Internally objects are often broken up so that they can be enclosed in the object’s directory platform-independent object storage substrate difficult to piece together in case of trouble Import such a pairtree and, knowing nothing about the Pairtree is not a complete repository system, but it is objects’ structure and semantics, you can reliably complete for object storage and makes it easier to buildFig. 1. Object storage should not systems and to share objects between institutionsneed a fearful entanglement with • Enumerate all objects and their identifierssoftware. Since objects have to • Produce any object by requested idbe parked in a filesystem beforerepository software upgrade, what • Maintain and back it up with ordinary OS tools Why pairs of characters? • Rebuild the collection in case of database corruption Taking two chars at a time balances path depth andif we left them in there and built simply by walking the pairtree fanout (number of possible entries in any directory)our repositories around them? To walk a pairtree requires knowing path termination rules • Example: ab2def3 ⇒ ab/2d/ef/3/ Jim B L • A pairpath terminates when you reach a file or reach a • Each pair, letters+digits, has 36x36 possibilities directory name with 1 char or more than 2 chars Compared to taking one char at a time ab/ • Only 36 possibilities, but path depth grows rapidlyA pairtree maps ids to paths, --- cd/ • Example: ab2def3 ⇒ a/b/2/d/e/f/3/ At another extreme, taking seven characters at a time two characters at a time |--- foo/ | | README.txt • Short paths, but 78 billion (367) possible itemsA pairtree is a filesystem hierarchy that uses an identifier | | thumbnail.gif • Example: ab2def3 ⇒ ab2def3/ string to derive an object directory (or folder) location | |--- master_images/• The derivation takes successive pairs of characters and | | | ... creates a succession of directories, called a pairpath | | Pairtree credits and details | --- gh/ Pairtree specification: ab2def3 ⇒ ab/2d/ef/3/ --- e/ www.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt• A pairpath ends at directory containing an object’s files; www.cdlib.org/inside/diglib/pairtree/pairtreespec.html --- bar/ most systems do variation of this (is variation needed?) Authors from CDL and University of Michigan (UM): | metadata• Reverse the mapping to find all ids/objects in a pairtree; Martin Haye, Erik Hetzner, John Kunze, Mark Reyes, | 54321.wav pairpath termination rules permit variable length ids and Cory Snavely; many thanks to Stephen Abrams, | index.html Sebastien Korner, Brian Tingle, et alPre-converting problematic characters Fig. 2. Example pairtree containing two objects: Pairtree origins includeSome identifier characters are inconvenient or illegal in abcd and abcde. The first object is enclosed in • Prototype: UCSF tobacco control filenames and must be hex-encoded (e.g., *→^2a) directory foo/, the second in bar/. While foo/ documents and CDL digitized books id: what-the-*@?#! does not subsume e/ at the same level, by • Early production: digitized books → what-the-^2a@^3f#! enclosure, it does subsume the gh/ underneath it. for UM and Hathi Trust ⇒ wh/at/-t/he/-^/2a/@^/3f/#! cyocumBut to keep paths short, 3 common chars are converted to 3 rare chars (at cost of complexity): /→= :→+ .→, Sample software implementation For further information id: ark:/13030/xt12t3 http://search.cpan.org/~jak/Pairtree-0.2/lib/File/Pairtree.pm Please contact jak@ucop.edu or stephen.abrams@ucop.edu → ark+=13030=xt12t3 A Perl module that implements two mappings: id2ppath() takes an For information on CDL’s Preservation Program, see ⇒ ar/k+/=1/30/30/=x/t1/2t/3/ id into a pairpath and ppath2id() performs the inverse mapping. http://www.cdlib.org/programs/digital_preservation.html