Cultural Heritage Institutions
and Big Data Collections
Leslie Johnston
Chief of Repository Development
Library of Congress
Cultural Heritage organizations
have, until recently, spoken of
“collections” and “content” and
“records” and even “files.”
Now it’s also data.
Data is not just generated by satellites,
identified during experiments, or collected
during surveys.
Datasets are not just scientific and business
tables and spreadsheets.
We have Big Data in our Libraries, Archives
and Museums.
Like other cultural heritage
organizations, the Library of
Congress has as one of its
mandates that it make its
collections freely available,
whether that is in person or on
the web.
What are some Library of
Congress examples of
collecting and preserving large
scale collections in many
formats, and making them
usable as collections and as
data?
National Digital
Newspaper Program
chroniclingamerica.loc.gov/
This collection was transformative for the Library of Congress:
it was the first to be made available as a bulk download
and exposed as a text and image dataset.
Some researchers want to search for stories in historic
newspapers. Some researchers want to mine newspaper OCR
for trends across time periods and geographic areas.
Requests have come in to analyze the full collection.
The program has:
• Multiple producers (36 now, ultimately 54)
• Free and open public access
• APIs for machine access and automated processes,
including access to RDF linked data.
Over 6.7 million newspaper pages ingested to date
Over 250 TB of data
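As a rough illustration of the machine access described above, the sketch below builds a search request for newspaper pages and tallies results by year, the kind of trend mining some researchers ask for. The endpoint path, the query parameters (`andtext`, `format`, `page`, `rows`), and the `date` field layout are assumptions drawn from the public Chronicling America search interface and should be checked against the current API documentation. The sample records are made up.

```python
from collections import Counter
from urllib.parse import urlencode

# Assumed search endpoint; verify against the current API documentation.
SEARCH_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, page=1, rows=20):
    """Build a JSON search URL for newspaper pages containing all `terms`."""
    params = {"andtext": terms, "format": "json", "page": page, "rows": rows}
    return SEARCH_URL + "?" + urlencode(params)


def pages_per_year(items):
    """Count result records per year, assuming each has a YYYYMMDD `date` string."""
    return dict(Counter(item["date"][:4] for item in items))


# Hypothetical records standing in for the result list of a JSON response.
sample_items = [{"date": "19181015"}, {"date": "19181101"}, {"date": "19190102"}]
print(build_search_url("influenza"))
print(pages_per_year(sample_items))
```

The URL builder is kept separate from the counting step so that the same tally can be run over a bulk download instead of paged API results.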
Web Archives
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject
area specialists curate the collections, and Library catalogers
create collection-level metadata records.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in
the Philippines, the Iraq war, and the appointment of
Supreme Court Justices.
• Collections around an area of study, such as Legal “Blawgs”
We frequently receive requests for access to full collections for
full-text data mining.
Every format possible on the web
Almost 8 billion files
Over 425 TB
congress.gov
Congress.gov is still in its beta phase,
transforming congressional
information discovery.
Legislation from 1993 to the present,
The Congressional Record from
1995 to the present, Committee
Reports from 1995 to the present,
and Member profiles from 1973 to
the present (with some from 1947 to
1972).
The Twitter Archive
Every public tweet since Twitter’s launch in March
2006.
Research requests have included users looking for
their own Twitter history, the study of the
geographic spread of news, the study of the
spread of epidemics, and the study of the
transmission of new uses of language.
The collection comprises only a few TB, but 100s of
billions of tweets.
A White Paper is available online at:
http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-arc
[Word cloud of research topics: status, privacy, commercial, personal, events, social media, visualization, social science]
Research Datasets
Research datasets are created by
faculty, curators, researchers, and
federal and state agencies.
It is not enough to be collecting
publications; we must collect the
datasets that support the published
work, to allow for replicability and
re-use in research.
The Library is now planning to expand its
collections to preserve research
data, in addition to recognizing that
the collections we already have are
Big Data to be mined.
And the full breadth of the
Library’s Collections
The American Memory collection, one of the oldest
and most used digital collections on the web.
The oral histories of the Veterans History Project.
The audio and video collections of the American
Folklife Center.
More than 1.2 million images from Prints and
Photographs.
Digitized maps and GIS data from Geography and
Maps.
More than 300,000 digitized audio and video files
comprising over 5 PB at the Packard Campus.
And many, many, many more.
id.loc.gov
The Library of Congress is, in part, a
standards agency for rules used to
create metadata records and for the
controlled vocabularies (authorities)
used to describe items.
The Library is gradually making its
vocabularies available as serialized
RDF datasets (SKOS and JSON).
In the library community, the LC
authorities are one of the most
common tools for building linked
data relationships.
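To make the idea of vocabularies as linked data concrete, here is a minimal sketch that reads preferred labels and broader-term links from a SKOS record serialized as expanded JSON-LD. The record below is invented for illustration, its identifiers are placeholders rather than real authority URIs, and the exact JSON shape served by id.loc.gov is an assumption that should be checked against a real download.

```python
import json

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# Made-up record in expanded JSON-LD form; the identifiers are placeholders.
sample = json.loads("""
[
  {
    "@id": "http://id.loc.gov/authorities/subjects/EXAMPLE-1",
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {"@language": "en", "@value": "Newspapers"}
    ],
    "http://www.w3.org/2004/02/skos/core#broader": [
      {"@id": "http://id.loc.gov/authorities/subjects/EXAMPLE-2"}
    ]
  },
  {
    "@id": "http://id.loc.gov/authorities/subjects/EXAMPLE-2",
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {"@language": "en", "@value": "Periodicals"}
    ]
  }
]
""")


def pref_labels(graph):
    """Map each concept URI to its first preferred label."""
    labels = {}
    for node in graph:
        values = node.get(SKOS_PREF_LABEL, [])
        if values:
            labels[node["@id"]] = values[0]["@value"]
    return labels


def broader_links(graph):
    """Yield (narrower URI, broader URI) pairs found in the graph."""
    for node in graph:
        for target in node.get(SKOS_BROADER, []):
            yield node["@id"], target["@id"]


labels = pref_labels(sample)
for narrower, broader in broader_links(sample):
    print(f"{labels[narrower]} -> broader -> {labels[broader]}")
```

Because every concept is identified by a URI, a local catalog record can point at the same identifier and inherit these relationships without copying the labels.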
What are some of the
technological challenges of
managing and preserving
large digital collections in
many formats, and making
them available for use?
Sheer amount.
Huge variation in file formats.
Unclear and undocumented rights.
Security.
Missing metadata.
Data citation and identifier issues.
Discovery expectations: discovery across collections and
institutions together.
Cost.
I will mention infrastructure only in passing.
There are scale issues related to:
• Storage
• Archiving
• Bandwidth
• Software development
• Staffing for processing
This Requires a Preservation Infrastructure
The Library developed the BagIt transfer specification for the
movement of files between and within organizations.
• http://www.digitalpreservation.gov/documents/bagitspec.pdf
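For readers unfamiliar with BagIt, the following is a minimal sketch of the idea: payload files go under a `data/` directory, a short declaration file identifies the package as a bag, and a manifest lists a checksum for every payload file so the receiver can verify the transfer. It uses only the Python standard library and is not the Library's own tooling. The version string, the choice of SHA-256, and the helper names `make_bag` and `verify_bag` are assumptions made for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write `payload` (relative name -> bytes) as a minimal BagIt-style bag."""
    bag_dir = Path(bag_dir)
    (bag_dir / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = bag_dir / "data" / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")
    # Bag declaration; the version string is an assumption for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text("".join(manifest_lines), encoding="utf-8")
    return bag_dir


def verify_bag(bag_dir):
    """Return the payload paths whose file is missing or whose checksum differs."""
    bag_dir = Path(bag_dir)
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        path = bag_dir / rel_path
        if not path.is_file() or hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            failures.append(rel_path)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "example_bag",
                   {"page1.txt": b"OCR text", "img/page1.tif": b"\x00\x01"})
    print(verify_bag(bag))  # checked immediately after creation
    (bag / "data" / "page1.txt").write_bytes(b"corrupted")
    print(verify_bag(bag))  # checked again after altering one payload file
```

The same verification step can be rerun on stored copies over time, which is one way a manifest supports ongoing fixity checking as well as transfer.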
The Library inventories incoming files, and is gradually inventorying all
digital content.
The Library maintains multiple copies of files on servers and on tape,
in geographically distributed locations.
The Library has documented sustainability factors for file formats.
• http://www.digitalpreservation.gov/formats/
For cases where we do have control over content we receive, we have
a “Best Edition” Preferred Formats statement, which is currently being
updated.
• http://www.copyright.gov/circs/circ07b.pdf
There are many new
activities to be planned for
with new researcher uses
and expectations.
We still have collections. But what we also have is Big Data,
which requires us to rethink the infrastructure that is needed
to support Big Data services. Our community used to
expect researchers to come to us, ask us questions about
our collections, and use our digital collections in our
environment.
Now our collections are, more often than not, self-serve.
Researchers are taking collections as data away to work
with in their own computational environments. This is a shift
away from recent service models where libraries built out
and housed lab spaces for specialized activities such as text
mining and geospatial modeling and provided staff to assist
in acquiring and manipulating data.
More and more researchers want to use one
or more collections as a whole, mining and
organizing the information in novel ways.
Researchers use what used to be
unimaginable computing power on a desktop
to mine the rich information and tools to
create pictures that translate that information
into knowledge.
Should collections be pre-processed to create a
variety of derivatives that might be used in various
forms of analysis before ingesting them? Or do we
limit access to the native format? Or put on-the-fly
format transformation services for downloads in
place?
We are beginning to put into place the infrastructure
needed to create full-text indexes for millions/billions
of items to support full discovery for researchers.
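At its core, a full-text index is an inverted index: a map from each term to the items that contain it. The toy sketch below shows the principle on three invented snippets. A production system for millions or billions of items would use a dedicated search platform with sharding, stemming, and ranking, none of which is attempted here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented snippets standing in for OCR text or transcripts.
documents = {
    "page-1": "Influenza closes schools across the county",
    "page-2": "County fair opens to record crowds",
    "page-3": "Schools reopen as influenza cases fall",
}
index = build_index(documents)
print(sorted(search(index, "influenza schools")))
print(sorted(search(index, "county")))
```

Lookups touch only the posting sets for the query terms, which is why the approach scales far better than scanning every item for each search.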
We are only just starting the process of generating
linked data representations of billions of items.
Cultural heritage institutions are increasingly looking
towards self-service – researchers need not ask to
download or tell us that they have. We may never
know.
BUT … we do have collections that are limited to
on-site-only access due to licenses or gift agreements. In
that case, libraries may have to consider providing
high-powered workstations with analytical tools for
researchers to work with these collections and take
analysis outputs away with them.
Both have policy implications and implications for
public service staffing.
But the benefits outweigh
the challenges.
Cultural heritage institutions are managing
and preserving the datasets and big data
necessary for re-use and replicability.
We are working to make the deposit and
management of such data easier to
accomplish.
This is an important new role for our
organizations in enabling new research.
Discussion…
Leslie Johnston
lesliej@loc.gov

Log Analysis using OSSEC sasoasasasas.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 

Cultural Heritage Institutions and Big Data Collections

  • 1. Cultural Heritage Institutions and Big Data Collections Leslie Johnston Chief of Repository Development Library of Congress
  • 2. Cultural Heritage organizations have, until recently, spoken of “collections” and “content” and “records” and even “files.” Now it’s also data.
  • 3. Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. We have Big Data in our Libraries, Archives and Museums.
  • 4. Like other cultural heritage organizations, the Library of Congress has as one of its mandates that it make its collections freely available, whether that is in person or on the web.
  • 5. What are some Library of Congress examples of collecting and preserving large scale collections in many formats, and making them usable as collections and as data?
  • 6. National Digital Newspaper Program chroniclingamerica.loc.gov/ This collection was transformative for the Library of Congress: it was the first to be made available as a bulk download and exposed as a text and image dataset. Some researchers want to search for stories in historic newspapers. Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze the full collection. The program has: • Multiple producers (36 now, ultimately 54) • Free and open public access • APIs for machine access and automated processes, including access to RDF linked data. Over 6.7 million newspaper pages ingested to date. Over 250 TB of data.
  • 7. Web Archives http://www.loc.gov/webarchiving/ lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records. The collections include: • U.S. elections • Web sites created by members of the House and Senate • Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices. • Collections around an area of study, such as Legal “Blawgs” We frequently receive requests for access to full collections for full-text data mining. Every format possible on the web Almost 8 billion files Over 425 TB
  • 8. congress.gov Congress.gov is still in its beta phase, transforming congressional information discovery. Legislation from 1993 to the present, The Congressional Record from 1995 to the present, Committee Reports from 1995 to the present, and Member profiles from 1973 to the present (with some from 1947 to 1972).
  • 9. The Twitter Archive Every public tweet since Twitter’s launch in March 2006. Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language. The collection comprises only a few TB, but 100s of billions of tweets. A White Paper is available online at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-arc status privacy commercial personal events social media visualization social science
  • 10. Research Datasets Research datasets are created by faculty, curators, researchers, and federal and state agencies. It is not enough to be collecting publications; we must collect the datasets that support the published work, to allow for replicability and re-use in research. The Library is now planning to expand its collections to preserve research data, in addition to recognizing that the collections we already have are Big Data to be mined.
  • 11. And the full breadth of the Library’s Collections The American Memory collection, one of the oldest and most used digital collections on the web. The oral histories of the Veterans History Project. The audio and video collections of the American Folklife Center. More than 1.2 million images from Prints and Photographs. Digitized maps and GIS data from Geography and Maps. More than 300,000 digitized audio and video files comprising over 5 PB at the Packard Campus. And many, many, many more.
  • 12. id.loc.gov The Library of Congress is, in part, a standards agency for the rules used to create metadata records and for the controlled vocabularies (authorities) used to describe items. The Library is gradually making its vocabularies available as serialized RDF datasets (SKOS and JSON). In the library community, the LC authorities are one of the most common tools for building linked data relationships.
  • 13. What are some of the technological challenges of managing and preserving large digital collections in many formats, and making them available for use?
  • 14. Sheer amount. Huge variation in file formats. Unclear and undocumented rights. Security. Missing metadata. Data citation and identifier issues. Discovery expectations: discovery across collections and institutions together. Cost.
  • 15. I will mention infrastructure only in passing. There are scale issues related to: • Storage • Archiving • Bandwidth • Software development • Staffing for processing
  • 16. This Requires a Preservation Infrastructure The Library developed the BagIt transfer specification for the movement of files between and within organizations. • http://www.digitalpreservation.gov/documents/bagitspec.pdf The Library inventories incoming files, and is gradually inventorying all digital content. The Library maintains multiple copies of files on servers and on tape, in geographically distributed locations. The Library has documented sustainability factors for file formats. • http://www.digitalpreservation.gov/formats/ For cases where we do have control over content we receive, we have a “Best Edition” Preferred Formats statement, which is currently being updated. • http://www.copyright.gov/circs/circ07b.pdf
  • 17. There are many new activities to be planned for with new researcher uses and expectations.
  • 18. We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve. Researchers are taking collections as data away to work with in their own computational environments. This is a shift away from recent service models where libraries built out and housed lab spaces for specialized activities such as text mining and geospatial modeling and provided staff to assist in acquiring and manipulating data.
  • 19. More and more researchers want to use one or more collections as a whole, mining and organizing the information in novel ways. Researchers use what used to be unimaginable computing power on a desktop to mine the rich information and tools to create pictures that translate that information into knowledge.
  • 20. Should collections be pre-processed to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Or do we limit access to the native format? Or put on-the-fly format transformation services for downloads in place? We are beginning to put into place the infrastructure needed to create full-text indexes for millions/billions of items to support full discovery for researchers. We are only just starting the process of generating linked data representations of billions of items.
  • 21. Cultural heritage institutions are increasingly looking towards self-service – researchers need not ask to download or tell us that they have. We may never know. BUT … we do have collections that are limited to on-site only access due to licenses or gift agreements. In that case, libraries may have to consider providing high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them. Both have policy implications and implications for public service staffing.
  • 22. But the benefits outweigh the challenges.
  • 23. Cultural heritage institutions are managing and preserving the datasets and big data necessary for re-use and replicability. We are working to make the deposit and management of such data easier to accomplish. This is an important new role for our organizations in enabling new research.
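
Slide 16 mentions the BagIt transfer specification for moving files between and within organizations. As a rough illustration only, the following minimal sketch shows the general shape of a bag: a data/ payload directory, a bagit.txt declaration, and a checksum manifest that lets the receiver confirm the files arrived intact. The function names make_bag and verify_bag, the sample file, and the choice of SHA-256 are assumptions made for this example; the version string 1.0 follows the later published RFC 8493 rather than the draft linked from the slide, and this is not the Library's own tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write a minimal BagIt-style bag (hypothetical helper for illustration).

    Creates data/ with the payload files, a bagit.txt declaration,
    and a manifest-sha256.txt listing one checksum per payload file.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line is "<checksum> <path relative to the bag root>".
        manifest_lines.append(f"{digest}  data/{name}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical payload: one OCR text file for a newspaper page.
        bag = make_bag(Path(tmp) / "example_bag", {"page1.txt": b"sample OCR text"})
        print("bag verifies:", verify_bag(bag))
```

The receiving side only needs the manifest and the payload to detect a corrupted or incomplete transfer, which is why a fixity check of this kind suits the large transfers described in the slides. A real bag would usually carry additional tag files, which this sketch omits.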