Leslie Johnston: Challenges of Preserving Every Digital Format, 2012
1. The Challenges of Preserving Every
Digital Format on the Face of the Planet
Leslie Johnston
March 26, 2012
2. Well, not every format
But we often have little or
no control over what comes
into the Library of
Congress Digital
Collections, and we
manage and preserve a
wide variety of formats.
4. NATIONAL DIGITAL
NEWSPAPER PROGRAM
chroniclingamerica.loc.gov/
A partnership between the National Endowment for
the Humanities and the Library of Congress:
Enhance access to American newspapers
Sustainable digital collection
Scalable, phased, cost-effective management
The program has:
Multiple producers (25 now, ultimately 54)
Digitization standards (http://loc.gov/ndnp/)
Free and open public access
APIs for machine access and automated processes
Files
TIFFs, JPEGs, JPEG 2000s, and XML.
Over 4 million newspaper pages ingested to date
Over 250 TB of data
6. WEB ARCHIVING
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since
2000. Subject area specialists curate the
collections, and Library catalogers create
collection-level metadata records.
The collections include:
U.S. elections
Web sites created by members of the
House and Senate
Thematic collections around events, such
as elections in the Philippines, the Iraq
war, and the appointment of Supreme
Court Justices.
Collections around an area of study,
such as Legal “Blawgs”
The file formats include every format possible
on the web. The collection comprises
approximately 5 billion files in 300 TB.
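As a rough sense of scale, the figures quoted on these slides imply very different average sizes per item in the two collections. The sketch below is back-of-the-envelope arithmetic only. It assumes decimal terabytes (10^12 bytes) and treats "over" and "approximately" as exact, so the results are estimates, not Library statistics.

```python
# Assumption: 1 TB = 10**12 bytes (decimal terabytes).
TB = 10**12


def average_bytes(total_bytes: int, item_count: int) -> float:
    """Average size in bytes of one item in a collection."""
    return total_bytes / item_count


# Figures taken from the slides: 250 TB for 4 million newspaper pages,
# 300 TB for roughly 5 billion web archive files.
ndnp_avg = average_bytes(250 * TB, 4_000_000)
web_avg = average_bytes(300 * TB, 5_000_000_000)

print(f"NDNP: about {ndnp_avg / 10**6:.1f} MB per newspaper page")
print(f"Web archive: about {web_avg / 10**3:.1f} KB per file")
```

The newspaper figure covers every file kept for a page (TIFF, JPEG, JPEG 2000, and XML together), which is why it is about a thousand times larger than the average web file.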
10. PACKARD CAMPUS
NATIONAL AUDIO-VISUAL
CENTER
Preserving Film, Broadcast Television, and
Audio
The Packard Campus supports a variety of preservation
workflows, including those for obsolete physical
formats such as wire recordings, wax cylinders,
and 2" videotape. The Campus is fully equipped to
play back and preserve all antique film, video and
sound formats, and to maintain that capability far
into the future.
The facility also handles born-digital video and
audio received directly from producers.
The formats include MPEG-4, MP3, BWF, AVI,
and a wide variety of specialized commercial
formats.
11. eDEPOSIT FOR eSERIALS
eDeposit for eSerials is a collaborative effort
between the U.S. Copyright Office and the
Library of Congress.
Copyright Mandatory Deposit represents the
largest acquisitions channel for the Library. In
general, all U.S. publishers are legally required to
submit for deposit two copies of each of their
publications to the Copyright Office. This
mechanism has allowed the Library to build the
collection and to preserve the publications.
eSerials became subject to mandatory deposit in
January 2010, with the publication of a new
interim regulation. Demands began in June 2010
and files began to arrive in October 2010.
The files must come to the Library “as published”
– in whatever their original formats are. This
means a wide variety of XML content and
metadata, HTML, and PDFs.
12. WORLD DIGITAL LIBRARY
www.wdl.org
Deliver historically significant primary
materials from cultures around the world to
an international multilingual audience
Over 100 participating partner institutions, and
contributions from over 40 institutions so far.
Representing all 193 UNESCO member
countries.
Maps, prints, photographs, rare books,
manuscripts, journals, sound recordings, and
motion pictures.
Metadata in Arabic, Chinese, French, English,
Portuguese, Russian, and Spanish.
JPEG 2000s, PDFs, XML.
14. THE TWITTER ARCHIVE
Every public tweet since Twitter’s launch in March
2006.
We have a historic 2006-2010 archive and ongoing
access to new tweets.
We do not receive personal account information,
linked images, or linked web page content.
Tweets will not move into the archive until six
months after their initial posting.
The Library’s researcher services will not recreate
Twitter, and the archive cannot be openly accessible.
We are testing various technologies, and entering a
pilot phase with test researchers. We will
announce it when the archive is open to all
researchers.
The collection comprises only a few TB, but over 80
billion tweets.
An FAQ is available online at:
http://blogs.loc.gov/loc/2010/04/the-library-and-twitter-an-faq/
15. So how are we
making this easier
for the Library to
manage?
16. Preservation Infrastructure
• The Library developed the BagIt
transfer specification for the movement
of files between and within
organizations.
• http://www.digitalpreservation.gov/documents/bagitspec.pdf
• The Library inventories all incoming
files, and is inventorying all digital
content.
• We maintain multiple copies of files
on servers and on tape, in
geographically distributed locations.
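The idea behind a BagIt transfer can be sketched in a few lines: payload files go under a data/ directory, a bagit.txt file declares the bag, and a manifest lists one checksum per payload file so the receiver can confirm that nothing changed in transit or in storage. The sketch below is a minimal illustration using only the Python standard library, not the Library's own tooling. The function names (make_bag, verify_bag), the declared version 0.97, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, files: dict[str, bytes]) -> None:
    """Write payload files under data/, then the bag declaration and manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        target = data_dir / name
        target.write_bytes(content)
        # Manifest line layout: checksum, whitespace, path relative to the bag.
        manifest_lines.append(f"{sha256_of(target)}  data/{name}\n")
    # Assumption: the version string below is illustrative.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> list[str]:
    """Return payload paths that are missing or fail their checksum."""
    problems = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

Running verify_bag on each stored copy at intervals is one simple way to check that copies on servers and on tape still match what was originally received. An empty list means every payload file is present and unchanged.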
17. Preservation Partnerships
The Library cannot collect everything on
its own, so it works as part of:
The National Digital Stewardship Alliance
http://www.digitalpreservation.gov/ndsa/
The International Internet Preservation
Consortium http://netpreserve.org/about/index.php
among others…
18. What are the Library’s
strategies for formats?
• The Library has documented
sustainability factors for file formats.
• http://www.digitalpreservation.gov/formats/
• For cases where we do have control
over what comes in, we have a “Best
Edition” Preferred Formats statement,
which is currently being updated.
• http://www.copyright.gov/circs/circ07b.pdf
• The Library is developing Format
Preservation Action Plans.
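Any format strategy starts with knowing which formats are actually in a collection. The sketch below tallies files by extension and lists the extensions that fall outside a preferred set. It is an illustration only: PREFERRED_EXTENSIONS is a hypothetical list made up for the example, not the Library's Preferred Formats statement, and a file extension is only a rough proxy for the real format of a file.

```python
from collections import Counter
from pathlib import Path

# Hypothetical list for illustration; NOT the Library's Preferred Formats statement.
PREFERRED_EXTENSIONS = {".tif", ".jp2", ".xml", ".pdf"}


def inventory_formats(root: Path) -> Counter:
    """Count files under root by lowercase extension ('(none)' if absent)."""
    return Counter(
        path.suffix.lower() or "(none)"
        for path in root.rglob("*")
        if path.is_file()
    )


def needs_review(counts: Counter) -> list[str]:
    """Return, sorted, the extensions that are not on the preferred list."""
    return sorted(ext for ext in counts if ext not in PREFERRED_EXTENSIONS)
```

Calling inventory_formats on a directory of received files and passing the result to needs_review gives a short list of extensions that might call for a closer look or a format action plan.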
19. DISCUSSION?
Leslie Johnston
Chief of Repository Development
Manager of Technical Architecture Initiatives, NDIIPP
lesliej@loc.gov