Big Data: New Challenges for
Digital Preservation and Digital
Services




                              Leslie Johnston
            Acting Director, National Digital Information
                 Infrastructure & Preservation Program
                                    Library of Congress
What are the Biggest Insights that we
  have Learned in Fifteen Years of
    Building Digital Collections?

We can never guess every way that our
       collections will be used.

     Researchers do not use digital
collections the same way that they use
           analog collections.
Stewardship organizations
have, until recently, spoken of
“collections” and “content” and
“records” and even “files.”

Now it’s also data.
Data is not just generated by satellites,
identified during experiments, or collected
during surveys.

Datasets are not just scientific and business
tables and spreadsheets.

We have Big Data in our Libraries, Archives
and Museums.
What are examples of some
of the challenges of
collecting and preserving
large scale collections in
many formats, and making
them usable as collections
and as data?
More and more researchers want to use
collections as a whole, mining and organizing
the information in novel ways.

Researchers use algorithms to mine the rich
information and tools to create pictures that
translate that information into knowledge.

Researchers may want to interact with a
collection of artifacts, or they may want to
work with a data corpus.
We still have collections. But what we also
have is Big Data, which requires us to rethink
the infrastructure that is needed to support
Big Data services. Our community used to
expect researchers to come to us, ask us
questions about our collections, and use our
digital collections in our environment.

Now our collections are, more often than not,
self-serve.
What are some use
cases?
National Digital
       Newspaper Program
     chroniclingamerica.loc.gov/
Some researchers want to search for stories in historic
newspapers.

Some researchers want to mine newspaper OCR for trends
across time periods and geographic areas.

Requests have come in to analyze all 5 million pages.

The site gets approximately 4 million hits per day.

The program has:
  Multiple producers (25 now, ultimately 54)
  Free and open public access
  APIs for machine access and automated processes

 Over 5.25 million newspaper pages ingested to date
 Over 250 TB of data
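
As a minimal sketch of what "mining newspaper OCR for trends across time periods" can look like, the example below counts how often a term appears per year in a handful of OCR'd pages. The sample records, their field names, and the function `term_counts_by_year` are hypothetical illustrations, not the program's data model or API.

```python
import re
from collections import Counter

# Hypothetical sample records; real OCR text would come from the
# program's bulk data or machine-access APIs.
PAGES = [
    {"date": "1898-04-22", "state": "Ohio", "ocr": "War declared. The war with Spain begins."},
    {"date": "1898-07-04", "state": "Texas", "ocr": "Celebrations as the war news arrives."},
    {"date": "1905-01-10", "state": "Ohio", "ocr": "Railroad expansion continues statewide."},
]


def term_counts_by_year(pages, term):
    """Count whole-word, case-insensitive occurrences of `term` per year."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for page in pages:
        year = int(page["date"][:4])  # assumes ISO dates (YYYY-MM-DD)
        counts[year] += len(pattern.findall(page["ocr"]))
    return dict(counts)


if __name__ == "__main__":
    print(term_counts_by_year(PAGES, "war"))
```

The same grouping could be keyed on the hypothetical `state` field to compare geographic areas instead of years.
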
Packard Campus National
    Audio-Visual Center
Preserving Film, Broadcast Television, and
Audio

The Packard Campus runs a variety of preservation
workflows, including those for obsolete physical
formats such as wire recordings, wax cylinders,
and 2" videotape. The Campus is fully equipped
to play back and preserve all antique film, video
and sound formats, and to maintain that
capability far into the future.

The facility also handles born-digital video and
audio received directly from producers and
copyright owners.

Over 3 PB of files.
WEB ARCHIVES
                 http://www.loc.gov/webarchiving/
           lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

The Library has been archiving the web since 2000. Subject area
specialists curate the collections, and Library catalogers create
collection-level metadata records. Permission requirements vary by
site category.

The collections include:
    • U.S. elections
    • Web sites created by members of the House and Senate
    • Thematic collections around events, such as elections in the
      Philippines, the Iraq war, and the appointment of Supreme Court
      Justices.
    • Collections around an area of study, such as Legal “Blawgs”

When we began archiving election web sites, we imagined users
browsing through the web pages, studying the graphics or use of
phrases or links. When our first researchers came to the Library,
they did want to know about all those topics, but they used scripts to
query for them and sort them into categories. They were not much
interested in reading the web pages themselves.

Approximately 6 billion files
Over 300 TB
THE TWITTER ARCHIVE
Every public tweet since Twitter’s launch in March 2006.

Research requests have included users looking for their own Twitter
history, the study of the geographic spread of news, the study of the
spread of epidemics, and the study of the transmission of new uses of
language.

The collection comprises only a few TB, but tens of billions of tweets.

A White Paper is available online at:
    http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/
[Word cloud: social, science, visualization, social media, status, events, personal, privacy, commercial]
eSerials
Copyright Mandatory Deposit represents a large
  acquisitions channel for the Library. In general, all
  U.S. publishers are legally required to submit for
  deposit two copies of each of their publications to
  the Copyright Office. This mechanism has allowed
  the Library to build the collection and to preserve
  the publications.

eSerials became subject to mandatory deposit in
  January 2010, with the publication of a new interim
  regulation. Demands began in June 2010 and files
  began to arrive in October 2010.

The files must come to the Library “as published” – in
  whatever their original formats are.

Articles may be accompanied by their associated
  datasets.
RESEARCH DATASETS
The datasets generated/used in the research
  process.

Datasets can be:

Small, such as surveys of a small sample
 population

Medium, such as a corpus of images

Big Data, such as years of observational
  astronomical data.
It is not enough to be collecting
publications.

We have to collect and preserve
research data, in addition to
recognizing that the collections
we already have are Big Data to
be mined.
Are our institutions ready?

We are building large digital
collections and must consider
new ways in which they should
be managed and used.
I will mention infrastructure only in passing.

There are scale issues related to:

     Storage
     Backup and tape archiving
     Bandwidth
     Software development
     Staffing for processing
Library of Congress Preservation Infrastructure
The Library developed the BagIt transfer specification for the
movement of files between and within organizations.
    http://www.digitalpreservation.gov/documents/bagitspec.pdf

The Library inventories incoming files, and is gradually inventorying all
digital content.

The Library maintains multiple copies of files on servers and on tape,
in geographically distributed locations.

The Library has documented sustainability factors for file formats.
    http://www.digitalpreservation.gov/formats/

For cases where we do have control over what comes in, we have a
“Best Edition” Preferred Formats statement, which is currently being
updated.
    http://www.copyright.gov/circs/circ07b.pdf
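
To make the transfer and inventory steps above concrete, here is a minimal sketch of a BagIt-style package: payload files under `data/`, a `bagit.txt` declaration, and a checksum manifest that can be re-verified after transfer or on any stored copy. It uses only the Python standard library and is not the Library's own tooling; the function names `make_bag` and `verify_bag`, the choice of SHA-256, and the version string are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, files):
    """Write payload files under data/ plus bagit.txt and a SHA-256 manifest.

    `files` maps a file name to its bytes content (illustrative input shape).
    """
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration; the version number here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return payload paths that are missing or whose checksum has changed."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            failures.append(rel_path)
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            failures.append(rel_path)
    return failures
```

Running `verify_bag` on each geographically distributed copy is one simple way to confirm that the copies still match what was originally inventoried.
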
There are many new
activities to be planned for
with new researcher uses
and expectations.
How much ingest processing should be done with
data collections, or collections that can be treated as
data?

Should collections be processed to create a variety of
derivatives that might be used in various forms of
analysis before ingesting them?

Do libraries have sufficient infrastructure to create
full-text indexes for millions or billions of files to support full
discovery?

Do libraries support analysis? Analytical tools are still
in their early days at the scale of large datasets.
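
For a sense of what a full-text index involves, the sketch below builds a toy in-memory inverted index and answers "all terms must match" queries. It is an illustration only, with hypothetical document ids and function names; an index over millions or billions of files would need a dedicated search engine and distributed storage rather than a Python dictionary.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        "p1": "Election results in Ohio",
        "p2": "Ohio railroad news",
        "p3": "Election day",
    }
    idx = build_index(docs)
    print(sorted(search(idx, "ohio election")))
```
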
And what are the service models?
If libraries decide that they will simply provide access to
data, do they limit it to the native format or provide
pre-processed or on-the-fly format transformation services for
downloads?

Can libraries handle the download traffic?

Can staff develop the expertise to provide guidance to
researchers in using analytical tools? Or is the expectation
that researchers will fend for themselves?
Libraries are increasingly looking towards self-service
– researchers need not ask to download or tell us that
they have. We may never know.

BUT, libraries do have collections that are limited to
on-site-only access due to licenses or gift agreements.
In that case, libraries may have to consider providing
high-powered workstations with analytical tools for
researchers to work with these collections and take
analysis outputs away with them.

Both models have policy implications and implications for
public service staffing.
But the benefits outweigh
the challenges.
Libraries are managing and preserving
the datasets and big data necessary for
re-use and replicability.

This is an important new role for libraries
in enabling new research.

And libraries need to make the deposit
and management of such data easier to
accomplish.
Discussion…




              Leslie Johnston
              lesliej@loc.gov

Leslie Johnston: Big Data at Libraries, Georgetown University Law School Symposium on Big Data, January 2013
