Leslie Johnston: Challenges of Preserving Every Digital Format, 2012
The Challenges of Preserving Every Digital Format on the Face of the Planet
Leslie Johnston
March 26, 2012
Well, not every format. But we often have little or no control over what comes into the Library of Congress Digital Collections, and we manage and preserve a wide variety of formats.
What are examples of some of the collecting and preservation challenges?
NATIONAL DIGITAL NEWSPAPER PROGRAM
chroniclingamerica.loc.gov/
A partnership between the National Endowment for the Humanities and the Library of Congress:
• Enhance access to American newspapers
• Sustainable digital collection
• Scalable, phased, cost-effective management
The program has:
• Multiple producers (25 now, ultimately 54)
• Digitization standards (http://loc.gov/ndnp/)
• Free and open public access
• APIs for machine access and automated processes
Files: TIFFs, JPEGs, JPEG 2000s, and XML.
• Over 4 million newspaper pages ingested to date
• Over 250 TB of data
WEB ARCHIVING
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices
• Collections around an area of study, such as Legal “Blawgs”
The file formats include every format possible on the web. The collection comprises approximately 5 billion files in 300 TB.
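To put the two collections side by side, the figures quoted on the newspaper and web archiving slides can be turned into rough average sizes per item. This is back-of-the-envelope arithmetic only: the slides give "over" and "approximately" figures, and the decimal definition of a terabyte used here is an assumption.

```python
# Rough average item sizes from the figures quoted on the slides.
# Assumption: 1 TB = 10**12 bytes (decimal), and the quoted totals are exact.
TB = 10**12


def average_size(total_tb: float, item_count: int) -> float:
    """Return the mean size in bytes of one item in a collection."""
    return total_tb * TB / item_count


newspaper_page = average_size(250, 4_000_000)   # NDNP: 250 TB, 4 million pages
web_file = average_size(300, 5_000_000_000)     # Web archive: 300 TB, 5 billion files

print(f"{newspaper_page / 10**6:.1f} MB per newspaper page")  # 62.5 MB
print(f"{web_file / 10**3:.1f} kB per web archive file")      # 60.0 kB
```

The three orders of magnitude between the two averages hint at why the collections raise different problems: a small number of large, standardized image files in one case, and a very large number of small files in arbitrary formats in the other.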
NATIONAL DIGITAL INFORMATION INFRASTRUCTURE & PRESERVATION PROGRAM
digitalpreservation.gov
CONTENT TYPES: Images and Text, Audio Visual, Geospatial, Web Sites
PACKARD CAMPUS NATIONAL AUDIO-VISUAL CENTER
Preserving Film, Broadcast Television, and Audio
The Packard Campus supports a variety of preservation workflows, including those for obsolete physical formats such as wire recordings, wax cylinders, and 2" videotape. The Campus is fully equipped to play back and preserve all antique film, video and sound formats, and to maintain that capability far into the future.
The facility also handles born-digital video and audio received directly from producers.
The formats include MPEG-4, MP3, BWF, AVI, and a wide variety of specialized commercial formats.
eDEPOSIT FOR eSERIALS
eDeposit for eSerials is a collaborative effort between the U.S. Copyright Office and the Library of Congress. Copyright Mandatory Deposit represents the largest acquisitions channel for the Library. In general, all U.S. publishers are legally required to submit for deposit two copies of each of their publications to the Copyright Office. This mechanism has allowed the Library to build the collection and to preserve the publications.
eSerials became subject to mandatory deposit in January 2010, with the publication of a new interim regulation. Demands began in June 2010 and files began to arrive in October 2010.
The files must come to the Library “as published”, in their original formats, whatever those are. This means a wide variety of XML content and metadata, HTML, and PDFs.
WORLD DIGITAL LIBRARY
www.wdl.org
Deliver historically significant primary materials from cultures around the world to an international multilingual audience
• Over 100 participating partner institutions, and contributions from over 40 institutions so far.
• Representing all 193 UNESCO member countries.
• Maps, prints, photographs, rare books, manuscripts, journals, sound recordings, and motion pictures.
• Metadata in Arabic, Chinese, French, English, Portuguese, Russian, and Spanish.
• JPEG 2000s, PDFs, XML.
THE TWITTER ARCHIVE
Every public tweet since Twitter’s launch in March 2006.
We have a historic 2006-2010 archive and ongoing access to new tweets.
We do not receive personal account information, linked images, or linked web page content.
Tweets will not move into the archive until six months after their initial posting.
The Library’s researcher services will not recreate Twitter, and the archive cannot be openly accessible.
We are testing various technologies, and entering a pilot phase with test researchers. We will announce it when the archive is open to all researchers.
The collection comprises only a few TB, but over 80 billion tweets.
An FAQ is available online at: http://blogs.loc.gov/loc/2010/04/the-library-and-twitter-an-faq/
So how are we making this easier for the Library to manage?
Preservation Infrastructure
• The Library developed the BagIt transfer specification for the movement of files between and within organizations.
  • http://www.digitalpreservation.gov/documents/bagitspec.pdf
• The Library inventories all incoming files, and is inventorying all digital content.
• We maintain multiple copies of files on servers and on tape, in geographically distributed locations.
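To make the BagIt idea concrete, here is a minimal sketch of a bag on disk: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest that doubles as an inventory. The function names (`make_bag`, `verify_bag`), the choice of SHA-256, and the declared version string (0.97, the draft that was current around the time of this talk) are assumptions for illustration. This is not the Library's own tooling, and it skips optional parts of the specification such as `bag-info.txt` and tag manifests.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: Path, bag_dir: Path, version: str = "0.97") -> None:
    """Copy source_dir into bag_dir/data and write the two required tag files."""
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir if needed
    # Bag declaration: specification version and tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Payload manifest: one "<checksum> <path relative to the bag>" line per file.
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> List[str]:
    """Return payload paths that are missing or whose checksum has changed."""
    problems = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "incoming"
        source.mkdir()
        (source / "page.xml").write_text("<page/>", encoding="utf-8")
        bag = Path(tmp) / "bag"
        make_bag(source, bag)
        print(verify_bag(bag))  # an intact bag reports no problems
```

The same `verify_bag` check can be rerun against each stored copy, which is one simple way to confirm that copies held in different locations still match what was originally received.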
Preservation Partnerships
The Library cannot collect everything on its own, so it works as part of:
• The National Digital Stewardship Alliance http://www.digitalpreservation.gov/ndsa/
• The International Internet Preservation Consortium http://netpreserve.org/about/index.php
among others…
What are the Library’s strategies for formats?
• The Library has documented sustainability factors for file formats.
  • http://www.digitalpreservation.gov/formats/
• For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.
  • http://www.copyright.gov/circs/circ07b.pdf
• The Library is developing Format Preservation Action Plans.
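Any format strategy starts with knowing which formats are actually in the collection. As a rough sketch of that first step, the snippet below tallies files by their leading byte signatures ("magic numbers") for a few of the formats named in this talk. The names `SIGNATURES`, `identify`, and `inventory` are hypothetical, and the signature table is deliberately tiny. It illustrates the idea only and does not describe how the Library identifies formats.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable

# Leading byte signatures for a few formats mentioned in the talk.
# Longer signatures come first so they are not shadowed by shorter ones.
SIGNATURES = [
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),  # JP2 signature box
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"%PDF-", "PDF"),
    (b"<?xml", "XML"),      # misses XML that starts with a byte-order mark
]


def identify(header: bytes) -> str:
    """Return a format name for the first bytes of a file, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def inventory(paths: Iterable[Path]) -> Counter:
    """Count files per identified format, reading only the first 16 bytes."""
    counts: Counter = Counter()
    for path in paths:
        with path.open("rb") as handle:
            counts[identify(handle.read(16))] += 1
    return counts


if __name__ == "__main__":
    files = [p for p in Path(".").rglob("*") if p.is_file()]
    for name, count in inventory(files).most_common():
        print(f"{name}: {count}")
```

Anything reported as "unknown" is the interesting part of such a tally, since those are the files for which no sustainability assessment or action plan can yet be chosen.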
DISCUSSION?
Leslie Johnston
Chief of Repository Development
Manager of Technical Architecture Initiatives, NDIIPP
email@example.com