The BagIt file package format
John Kunze and Stephen Abrams, California Digital Library (CDL)


Summary
BagIt is a file hierarchy, suitable for disk- or network-based transfer, and maybe bag return, with just enough structure to safely enclose a manifest, checksums, tag info, and an arbitrary payload.

Replicas for a rainy day
  • Need a data package format that can carry any data
  • Suitable for disk-based and network-based transfer
  • Main purpose is to move data from one digital library or archive to another for safe-keeping
  • Not important whether the receiver provide access to or even understand meaning of the sender’s content

Fig. 1. Sleeping better at night is more likely for the archivist who has replicas safely tucked away at other memory organizations.

Bag&Tag
  1. start: put your files…
  2. in a directory, data
  3. create a tag (manifest-md5.txt, bagit.txt)
  4. put tag next to data

Bags in action
Library of Congress (LC) grantees send bags
  • For example, bagged web crawls funded by LC are sent by CDL to LC, with replica sent to San Diego

A BagIt bag directory listing
  • A BagIt bag is a special directory (or Windows folder)
  • The directory name is your choice, but inside the top level BagIt reserves names for required files:
      • data : a subdirectory where you can put anything
      • bagit.txt : a 2-line file declaring this is a bag
      • manifest-md5.txt : a list of files present

    MyFirstBag/
    |   manifest-md5.txt
    |   bagit.txt
    |   bag-info.txt
    --- data/
        |   my.pdf
        |   my.xsl
        |   my.mpg

Fig. 2. Payload directory, data, holds anything you want. “Tag” files (purple) describe the bag itself, including list of files and checksums in the manifest (algorithms other than md5 possible) and an optional “bag-info.txt” metadata file.

Holey bags! for efficiency
  • A common optional tag file lists “holes” to be filled
  • Payload is incomplete until these files are fetched
  • File fetch.txt lists URLs for receiver to grab
  • Benefits include (network transfers only):
      • No need to stage extra data copy at either end
      • Cheap parallelism from fetching URLs in batches

Fig. 3. On the left is a bag received with missing files, or “holes”. The “fetch.txt” file lists URLs and corresponding payload filenames that the receiver must fetch before declaring the bag on the right complete.

BagIt credits and details
Authors from LC and CDL: Andy Boyko, John Kunze, Justin Littman, Liz Madden, Brian Vargas
Many thanks to: Stephen Abrams, Mike Ashenfelder, Scott Fisher, Erik Hetzner, Keith Johnson, David Loy, Tracy Seneca, Mark Phillips, Adam Turoff, Jim Tuttle
BagIt specification:
BagIt was informed by
  • Enclose-and-Deposit method, Tabata & Sugimoto,
  • LC’s eDeposit Pilot and NDIIPP AIHT
  • ARC/WARC aggregate file format

For further information
Please contact or stephen.abrams@ucop.edu
For information on CDL’s Preservation Program, see
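
The bag layout and the "holes" idea described above can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the function names (make_bag, parse_fetch, check_bag) are hypothetical, the default version string "0.97" is an assumption, and the two bagit.txt lines and the "URL length path" layout of fetch.txt are taken from the BagIt specification rather than from this poster.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, files: dict[str, bytes], version: str = "0.97") -> None:
    """Put payload files in data/, then write the tag files next to it."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (data_dir / name).write_bytes(content)
    # bagit.txt: the 2-line declaration (version value is an assumption).
    (bag_dir / "bagit.txt").write_text(
        f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # manifest-md5.txt: one "checksum path" line per payload file.
    lines = [
        f"{md5_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    (bag_dir / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


def parse_fetch(bag_dir: Path) -> dict[str, str]:
    """Map payload path -> URL from fetch.txt; empty if the file is absent."""
    holes: dict[str, str] = {}
    fetch = bag_dir / "fetch.txt"
    if fetch.exists():
        for line in fetch.read_text(encoding="utf-8").splitlines():
            if line.strip():
                url, _length, path = line.split(None, 2)
                holes[path] = url
    return holes


def check_bag(bag_dir: Path) -> dict[str, list[str]]:
    """Sort manifest entries into ok, corrupt, hole (in fetch.txt), or missing."""
    holes = parse_fetch(bag_dir)
    report: dict[str, list[str]] = {"ok": [], "corrupt": [], "hole": [], "missing": []}
    manifest = (bag_dir / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        target = bag_dir / rel_path
        if not target.exists():
            report["hole" if rel_path in holes else "missing"].append(rel_path)
        elif md5_of(target) == expected:
            report["ok"].append(rel_path)
        else:
            report["corrupt"].append(rel_path)
    return report


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "MyFirstBag"
        make_bag(bag, {"my.pdf": b"example pdf", "my.mpg": b"example mpg"})
        print(check_bag(bag))
```

A receiver would treat the bag as complete only when check_bag reports nothing under "corrupt", "hole", or "missing"; entries under "hole" are the ones still to be fetched from the URLs in fetch.txt.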