Cultural Heritage Institutions
and Big Data Collections
Leslie Johnston
Chief of Repository Development
Library of Congress
Cultural Heritage organizations
have, until recently, spoken of
“collections” and “content” and
“records” and even “files.”
Now it’s also data.
Data is not just generated by satellites,
identified during experiments, or collected
during surveys.
Datasets are not just scientific and business
tables and spreadsheets.
We have Big Data in our Libraries, Archives
and Museums.
Like other cultural heritage
organizations, the Library of
Congress has as one of its
mandates that it make its
collections freely available,
whether that is in person or on
the web.
What are some Library of
Congress examples of
collecting and preserving large
scale collections in many
formats, and making them
usable as collections and as
data?
National Digital
Newspaper Program
chroniclingamerica.loc.gov/
This collection was transformative for the Library of Congress:
it was the first to be made available as a bulk download
and exposed as a text and image dataset.
Some researchers want to search for stories in historic
newspapers. Some researchers want to mine newspaper OCR
for trends across time periods and geographic areas.
Requests have come in to analyze the full collection.
The program has:
• Multiple producers (36 now, ultimately 54)
• Free and open public access
• APIs for machine access and automated processes,
including access to RDF linked data.
Over 6.7 million newspaper pages ingested to date
Over 250 TB of data
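The trend mining described above can be illustrated with a minimal sketch. It assumes the newspaper OCR text has already been downloaded and reduced to (date, text) pairs; that layout, the function name `term_counts_by_year`, and the sample records are invented for illustration and do not reflect the format of any Library download or API.

```python
import re
from collections import Counter


def term_counts_by_year(pages, term):
    """Count whole-word occurrences of `term` per publication year.

    `pages` is an iterable of (iso_date, ocr_text) pairs. This layout is
    an assumption made for the sketch, not an actual download format.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for iso_date, text in pages:
        year = int(iso_date[:4])  # ISO dates start with the four-digit year
        counts[year] += len(pattern.findall(text))
    return dict(sorted(counts.items()))


# Invented sample records, standing in for OCR text from newspaper pages.
sample_pages = [
    ("1898-02-16", "The Maine was destroyed in Havana harbor. War talk grows."),
    ("1898-04-26", "WAR declared. The war department calls for volunteers."),
    ("1905-06-01", "Peace conference planned; no war news today."),
]

print(term_counts_by_year(sample_pages, "war"))
```

Real OCR is noisy, so an actual study would likely need fuzzy matching or cleanup before counts like these could be trusted.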
Web Archives
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject
area specialists curate the collections, and Library catalogers
create collection-level metadata records.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in
the Philippines, the Iraq war, and the appointment of
Supreme Court Justices.
• Collections around an area of study, such as Legal “Blawgs”
We frequently receive requests for access to full collections for
full-text data mining.
Every format possible on the web
Almost 8 billion files
Over 425 TB
congress.gov
Congress.gov is still in its beta phase,
transforming congressional
information discovery.
It includes legislation from 1993 to the present,
the Congressional Record from
1995 to the present, Committee
Reports from 1995 to the present,
and Member profiles from 1973 to
the present (with some from 1947 to
1972).
The Twitter Archive
Every public tweet since Twitter’s launch in March
2006.
Research requests have included users looking for
their own Twitter history, the study of the
geographic spread of news, the study of the
spread of epidemics, and the study of the
transmission of new uses of language.
The collection comprises only a few TB, but hundreds of
billions of tweets.
A White Paper is available online at:
http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-arc
[Figure labels: status, privacy, commercial, personal, events, social media, visualization, social science]
Research Datasets
Research datasets are created by
faculty, curators, researchers, and
federal and state agencies.
It is not enough to be collecting
publications; we must collect the
datasets that support the published
work, to allow for replicability and
re-use in research.
The Library is now planning to expand its
collections to preserve research
data, in addition to recognizing that
the collections we already have are
Big Data to be mined.
And the full breadth of the
Library’s Collections
The American Memory collection, one of the oldest
and most used digital collections on the web.
The oral histories of the Veterans History Project.
The audio and video collections of the American
Folklife Center.
More than 1.2 million images from Prints and
Photographs.
Digitized maps and GIS data from Geography and
Maps.
More than 300,000 digitized audio and video files
comprising over 5 PB at the Packard Campus.
And many, many, many more.
id.loc.gov
The Library of Congress is, in part, a
standards agency for the rules used to
create metadata records and for the
controlled vocabularies (authorities)
used to describe items.
The Library is gradually making its
vocabularies available as serialized
RDF datasets (SKOS and JSON).
In the library community, the LC
authorities are among the most
common tools for building linked
data relationships.
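To make the idea of linked data relationships concrete, here is a minimal sketch that follows SKOS "broader" links between vocabulary terms. The SKOS property URIs are the standard W3C ones, but the concept URIs, labels, and the `broader_chain` helper are invented for illustration and are not taken from id.loc.gov.

```python
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# Invented (subject, predicate, object) triples; example.org URIs are
# placeholders, not real authority identifiers.
triples = [
    ("http://example.org/auth/newspapers", SKOS_PREF_LABEL, "Newspapers"),
    ("http://example.org/auth/newspapers", SKOS_BROADER,
     "http://example.org/auth/serials"),
    ("http://example.org/auth/serials", SKOS_PREF_LABEL, "Serial publications"),
    ("http://example.org/auth/serials", SKOS_BROADER,
     "http://example.org/auth/publications"),
    ("http://example.org/auth/publications", SKOS_PREF_LABEL, "Publications"),
]


def broader_chain(triples, start):
    """Return the labels from `start` up through its broader concepts."""
    labels = {s: o for s, p, o in triples if p == SKOS_PREF_LABEL}
    broader = {s: o for s, p, o in triples if p == SKOS_BROADER}
    chain, seen, node = [], set(), start
    # `seen` guards against accidental cycles in the vocabulary.
    while node is not None and node not in seen:
        seen.add(node)
        chain.append(labels.get(node, node))
        node = broader.get(node)
    return chain


print(broader_chain(triples, "http://example.org/auth/newspapers"))
```

The sketch keeps only one broader concept per term for brevity; real vocabularies can have several, which would call for a graph traversal rather than a single chain.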
What are some of the
technological challenges of
managing and preserving
large digital collections in
many formats, and making
them available for use?
Sheer amount.
Huge variation in file formats.
Unclear and undocumented rights.
Security.
Missing metadata.
Data citation and identifier issues.
Discovery expectations: discovery across collections and
institutions together.
Cost.
I will mention infrastructure only in passing.
There are scale issues related to:
• Storage
• Archiving
• Bandwidth
• Software development
• Staffing for processing
This Requires a Preservation Infrastructure
The Library developed the BagIt transfer specification for the
movement of files between and within organizations.
 http://www.digitalpreservation.gov/documents/bagitspec.pdf
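As a rough illustration of what a BagIt package looks like, the sketch below writes a payload under `data/`, a `bagit.txt` declaration, and a checksum manifest, then re-checks the payload against that manifest. It covers only this basic layout as an assumption-laden simplification; the specification has further requirements and optional tag files, and the helper names `make_bag` and `verify_bag` are hypothetical. A maintained BagIt library is the better choice for real transfers.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir, files):
    """Create a minimal bag. `files` maps relative payload paths to bytes."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    for rel, content in files.items():
        target = data / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
    # Bag declaration: version and tag-file encoding.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Payload manifest: one "checksum  path" line per payload file.
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag).as_posix()}\n"
        for p in sorted(data.rglob("*"))
        if p.is_file()
    ]
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag


def verify_bag(bag_dir):
    """Return payload paths that are missing or fail their checksum."""
    bag = Path(bag_dir)
    problems = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        checksum, rel = line.split(None, 1)
        path = bag / rel
        if not path.is_file() or sha256_of(path) != checksum:
            problems.append(rel)
    return problems
```

A receiving organization would run something like `verify_bag` after transfer; an empty list means every payload file arrived intact.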
The Library inventories incoming files, and is gradually inventorying all
digital content.
The Library maintains multiple copies of files on servers and on tape,
in geographically distributed locations.
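Inventorying files and keeping multiple copies only helps if the copies can be compared. The following is a minimal sketch of that comparison, assuming each copy is a directory tree reachable from one machine; the function names and the report keys are invented for illustration and do not describe the Library's actual tooling.

```python
import hashlib
from pathlib import Path


def inventory(root):
    """Map each file's path (relative to `root`) to (size, SHA-256)."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file at once; fine for a sketch, but large
            # files would be hashed in chunks in practice.
            content = path.read_bytes()
            key = path.relative_to(root).as_posix()
            result[key] = (len(content), hashlib.sha256(content).hexdigest())
    return result


def compare_copies(primary, replica):
    """Report files that differ between two copies of a collection."""
    a, b = inventory(primary), inventory(replica)
    return {
        "missing_from_replica": sorted(a.keys() - b.keys()),
        "extra_in_replica": sorted(b.keys() - a.keys()),
        "mismatched": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

For copies on tape or in remote locations, the inventories would more plausibly be generated where the data lives and compared as stored records, rather than by walking both trees at once.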
The Library has documented sustainability factors for file formats.
 http://www.digitalpreservation.gov/formats/
For cases where we do have control over content we receive, we have
a “Best Edition” Preferred Formats statement, which is currently being
updated.
• http://www.copyright.gov/circs/circ07b.pdf
New researcher uses and
expectations mean there are
many new activities to plan
for.
We still have collections. But what we also have is Big Data,
which requires us to rethink the infrastructure that is needed
to support Big Data services. Our community used to
expect researchers to come to us, ask us questions about
our collections, and use our digital collections in our
environment.
Now our collections are, more often than not, self-serve.
Researchers are taking collections as data away to work
with in their own computational environments. This is a shift
away from recent service models where libraries built out
and housed lab spaces for specialized activities such as text
mining and geospatial modeling and provided staff to assist
in acquiring and manipulating data.
More and more researchers want to use one
or more collections as a whole, mining and
organizing the information in novel ways.
Researchers use once-unimaginable
computing power on a desktop to mine
this rich information, and use tools to
create pictures that translate that information
into knowledge.
Before ingesting collections, should we pre-process
them to create a variety of derivatives suited to
various forms of analysis? Or do we limit access to
the native format? Or put on-the-fly format
transformation services in place for
downloads?
We are beginning to put into place the infrastructure
needed to create full-text indexes for millions/billions
of items to support full discovery for researchers.
We are only just starting the process of generating
linked data representations of billions of items.
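The full-text indexes mentioned above are, at their core, inverted indexes that map each word to the items containing it. Here is a minimal in-memory sketch, assuming items are plain text keyed by an identifier; the identifiers, sample texts, and function names are invented, and an index for millions or billions of items would rely on a dedicated search platform rather than a Python dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of item identifiers that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return identifiers of items containing every token in `query`."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Invented sample items standing in for digitized collection text.
sample_docs = {
    "item-1": "Map of the Potomac River, surveyed 1862",
    "item-2": "Oral history interview: river pilots of the Mississippi",
    "item-3": "Photograph of a survey crew",
}
sample_index = build_index(sample_docs)
print(sorted(search(sample_index, "river")))
```

Because `search` intersects the sets for each query word, adding words narrows the results; ranking, stemming, and phrase matching are left out to keep the sketch short.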
Cultural heritage institutions are increasingly looking
towards self-service – researchers need not ask to
download or tell us that they have. We may never
know.
BUT … we do have collections that are limited to
on-site access only due to licenses or gift agreements. In
that case, libraries may have to consider providing
high-powered workstations with analytical tools for
researchers to work with these collections and take
analysis outputs away with them.
Both have policy implications and implications for
public service staffing.
But the benefits outweigh
the challenges.
Cultural heritage institutions are managing
and preserving the datasets and big data
necessary for re-use and replicability.
We are working to make the deposit and
management of such data easier to
accomplish.
This is an important new role for our
organizations in enabling new research.
Discussion…
Leslie Johnston
lesliej@loc.gov

Cultural Heritage Institutions and Big Data Collections

  • 1. Cultural Heritage Institutions and Big Data Collections Leslie Johnston Chief of Repository Development Library of Congress
  • 2. Cultural Heritage organizations have, until recently, spoken of “collections” and “content” and “records” and even “files.” Now it’s also data.
  • 3. Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. We have Big Data in our Libraries, Archives and Museums.
  • 4. Like other cultural heritage organizations, the Library of Congress has as one of its mandates that it make its collections freely available, whether that is in person or on the web.
  • 5. What are some Library of Congress examples of collecting and preserving large scale collections in many formats, and making them usable as collections and as data?
  • 6. National Digital Newspaper Program chroniclingamerica.loc.gov/ This collection was transformative for the Library of Congress: it was the first to be made available as a bulk download and exposed as a text and image dataset. Some researchers want to search for stories in historic newspapers. Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze the full collection. The program has: • Multiple producers (36 now, ultimately 54) • Free and open public access • APIs for machine access and automated processes, including access to RDF linked data. Over 6.7 million newspaper pages ingested to date. Over 250 TB of data.
  • 7. Web Archives http://www.loc.gov/webarchiving/ lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records. The collections include: • U.S. elections • Web sites created by members of the House and Senate • Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices. • Collections around an area of study, such as Legal “Blawgs”. We frequently receive requests for access to full collections for full-text data mining. Every format possible on the web. Almost 8 billion files. Over 425 TB.
  • 8. congress.gov Congress.gov is still in its beta phase, transforming congressional information discovery. It covers legislation from 1993 to the present, the Congressional Record from 1995 to the present, Committee Reports from 1995 to the present, and Member profiles from 1973 to the present (with some from 1947 to 1972).
  • 9. The Twitter Archive Every public tweet since Twitter’s launch in March 2006. Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language. The collection comprises only a few TB, but 100s of billions of tweets. A White Paper is available online at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-arc status privacy commercial personal events social media visualization social science
  • 10. Research Datasets Research datasets are created by faculty, curators, researchers, and federal and state agencies. It is not enough to be collecting publications; we must collect the datasets that support the published work, to allow for replicability and re-use in research. The Library is now planning to expand its collections to preserve research data, in addition to recognizing that the collections we already have are Big Data to be mined.
  • 11. And the full breadth of the Library’s Collections The American Memory collection, one of the oldest and most used digital collections on the web. The oral histories of the Veterans History Project. The audio and video collections of the American Folklife Center. More than 1.2 million images from Prints and Photographs. Digitized maps and GIS data from Geography and Maps. More than 300,000 digitized audio and video files comprising over 5 PB at the Packard Campus. And many, many, many more.
  • 12. id.loc.gov The Library of Congress is, in part, a standards agency for the rules used to create metadata records and for the controlled vocabularies (authorities) used to describe items. The Library is gradually making its vocabularies available as serialized RDF datasets (SKOS and JSON). In the library community, the LC authorities are one of the most common tools for building linked data relationships.
  • 13. What are some of the technological challenges of managing and preserving large digital collections in many formats, and making them available for use?
  • 14. Sheer amount. Huge variation in file formats. Unclear and undocumented rights. Security. Missing metadata. Data citation and identifier issues. Discovery expectations: discovery across collections and institutions together. Cost.
  • 15. I will mention infrastructure only in passing. There are scale issues related to: • Storage • Archiving • Bandwidth • Software development • Staffing for processing
  • 16. This Requires a Preservation Infrastructure The Library developed the BagIt transfer specification for the movement of files between and within organizations. • http://www.digitalpreservation.gov/documents/bagitspec.pdf The Library inventories incoming files, and is gradually inventorying all digital content. The Library maintains multiple copies of files on servers and on tape, in geographically distributed locations. The Library has documented sustainability factors for file formats. • http://www.digitalpreservation.gov/formats/ For cases where we do have control over content we receive, we have a “Best Edition” Preferred Formats statement, which is currently being updated. • http://www.copyright.gov/circs/circ07b.pdf
  • 17. There are many new activities to be planned for with new researcher uses and expectations.
  • 18. We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve. Researchers are taking collections as data away to work with in their own computational environments. This is a shift away from recent service models where libraries built out and housed lab spaces for specialized activities such as text mining and geospatial modeling and provided staff to assist in acquiring and manipulating data.
  • 19. More and more researchers want to use one or more collections as a whole, mining and organizing the information in novel ways. Researchers use what used to be unimaginable computing power on a desktop to mine this rich information, and use tools to create pictures that translate that information into knowledge.
  • 20. Should collections be pre-processed to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Or do we limit access to the native format? Or put on-the-fly format transformation services for downloads in place? We are beginning to put into place the infrastructure needed to create full-text indexes for millions/billions of items to support full discovery for researchers. We are only just starting the process of generating linked data representations of billions of items.
  • 21. Cultural heritage institutions are increasingly looking towards self-service: researchers need not ask to download, or tell us that they have. We may never know. BUT … we do have collections that are limited to on-site-only access due to licenses or gift agreements. In that case, libraries may have to consider providing high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them. Both have policy implications and implications for public service staffing.
  • 22. But the benefits outweigh the challenges.
  • 23. Cultural heritage institutions are managing and preserving the datasets and big data necessary for re-use and replicability. We are working to make the deposit and management of such data easier to accomplish. This is an important new role for our organizations in enabling new research.
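
Slide 16 names the BagIt transfer specification and the practice of inventorying files and keeping multiple copies. As a rough illustration of the idea, the sketch below packages files into a bag-like directory and later re-checks them against the checksum manifest. It is a minimal sketch, not the Library's tooling: the layout (a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-sha256.txt` file) follows the published BagIt 1.0 layout as an assumption, and the function names `sha256_of`, `make_bag`, and `verify_bag` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write payload files under data/ plus a declaration and a checksum manifest.

    Hypothetical helper: `payload` maps relative file names to file contents.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, content in payload.items():
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)

    # Bag declaration (assumed BagIt 1.0 wording).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # One manifest line per payload file: "<checksum>  <path relative to the bag>".
    payload_files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}\n"
        for p in payload_files
    ]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir


def verify_bag(bag_dir: Path) -> list[str]:
    """Return the payload paths that are missing or whose checksum has changed."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        candidate = bag_dir / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(rel_path)
    return failures
```

A receiving institution could run `verify_bag` on arrival and again on each stored copy: an empty list means every payload file still matches the manifest, while any returned path points at a file that should be restored from another copy.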