Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Big Data Challenges in Repository Development
Leslie Johnston, Library of Congress
Open Repositories 2012
Why is the Library of Congress at conferences on Repositories? LC doesn’t run an IR or have faculty or collect Research Data (yet). But LC does have digital collections and runs numerous repository services.
What are the Biggest Insights that we have Learned in Fifteen Years of Building Digital Collections?
We can never guess every way that our collections will be used. Researchers do not use digital collections the same way that they use analog collections.
Stewardship organizations have, until recently, spoken of “collections” or “content” or “records” or even “files,” but not data.
Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. I should not need to convince this audience that we have Big Data in our Libraries, Archives and Museums.
More and more researchers want to use Library, Archive, and Museum (LAM) collections as a whole, mining and organizing the information in novel ways. Researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus. Researchers use algorithms to mine the rich information and tools to create pictures that translate that information into knowledge.
Consider the Digging Into Data Challenge. Repositories available for research include not only scientific information (astronomy, botany, geology, physics, biology, social science surveys) but:
– Art (Canadian Art Database, New York Public Library Gallery)
– Film and Audio (Prelinger Archive, Western Soundscape Archive, JISC Media Hub)
– Newspapers (National Digital Newspaper Program)
– Maps (Sanborn Fire Insurance Maps)
– Music (English Broadside Ballad Archive)
– Archaeology (Archaeology Data Service)
– Architecture (ArtSTOR)
– Journal Articles (Project Muse, JSTOR)
– Government Records (National Archives UK, National Library of Wales, NTIS)
What new, computationally-based research methods might be applied to such collections? And how can we make such collections available for computational use?
http://www.diggingintodata.org/
These Collections are “Big Data”? What Constitutes Big Data?
The definition of Big Data is very fluid, as it is a moving target (what cannot be easily manipulated with common tools) and specific to the organization: what can be managed and stewarded by any one institution in its infrastructure. One researcher or organization’s concept of a large data set is small to another. Not too long ago, an organization would be surprised to need 10 TB of storage for a large digital collection. Now a collection can increase by more than 10 TB in a single week.
We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve.
How do Big Data collection needs translate into Repository Services?
Case Study: Web Archives
• Web Archives, such as the one at the Library of Congress, may be comprised of billions of files.
• When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to know about all those topics, yet they used scripts to query for them and sort them into categories. They were not very much interested in reading web pages.
• The Library is testing tools for full-text indexing of the entire archive and collection subsets.
http://www.loc.gov/webarchiving/
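The kind of script-based querying those researchers ran can be sketched in miniature. The topic keywords and page texts below are invented for illustration; a real job would read text extracted from WARC records rather than an in-memory list:

```python
import re
from collections import defaultdict

# Hypothetical topic keywords; the actual queries researchers ran
# against the election archive are not specified in this talk.
TOPICS = {
    "economy": ["economy", "jobs", "taxes"],
    "healthcare": ["health", "medicare", "insurance"],
}

def categorize(pages):
    """Sort archived pages into topic buckets by keyword match.

    `pages` is an iterable of (url, text) pairs.
    """
    buckets = defaultdict(list)
    for url, text in pages:
        lowered = text.lower()
        for topic, keywords in TOPICS.items():
            if any(re.search(r"\b" + re.escape(k) + r"\b", lowered)
                   for k in keywords):
                buckets[topic].append(url)
    return buckets

# Two mocked-up pages stand in for crawled election sites.
pages = [
    ("http://example.org/a", "Our plan creates jobs and cuts taxes."),
    ("http://example.org/b", "Medicare and insurance reform proposals."),
]
result = categorize(pages)
```

The point of the sketch is the access pattern: the researcher never renders a page, only scans and buckets text at scale.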
Case Study: Web Archives
The challenges for developing repository and researcher services are twofold:
– How are web archives best validated, characterized, and structurally documented as web sites upon ingest?
– How are such large and varied collections described, indexed, and made both discoverable and analyzable?
Case Study: Historic Newspapers
• The Chronicling America collection has 5 million page images from historic newspapers with OCR from organizations in 25 states.
• The site gets approximately 4 million views per day.
• Some researchers want to search for stories in historic newspapers.
• Some researchers want to mine newspaper OCR for trends across time periods and geographic areas.
• Requests have come in to analyze all 5 million page images.
http://chroniclingamerica.loc.gov/
Case Study: Historic Newspapers
• This form of content is well understood in terms of repository ingest, but grows at a very quick rate and requires constant processing.
• To accommodate unmediated research use, all NDNP content – metadata, OCR, and page images – is available via an API. Will we develop collection- or content-specific APIs for repositories?
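As a sketch of unmediated API use, the snippet below builds a Chronicling America page-search request. The `andtext`, `page`, and `format=json` parameters reflect the site's public search endpoint, but should be checked against the current API documentation before relying on them:

```python
from urllib.parse import urlencode

BASE = "http://chroniclingamerica.loc.gov/search/pages/results/"

def build_search_url(terms, page=1):
    """Build a page-search request for JSON results.

    Parameter names assume the publicly documented search API;
    verify them against the live documentation.
    """
    params = {"andtext": terms, "page": page, "format": "json"}
    return BASE + "?" + urlencode(params)

# A researcher mining OCR for epidemic coverage might start here.
url = build_search_url("influenza epidemic", page=2)
```

A mining workflow would page through the JSON results and pull the OCR text of each hit, never touching the human-facing site.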
Case Study: Twitter
• The Twitter archive has tens of billions of tweets in it.
• Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.
Case Study: Twitter
• What infrastructure is required to process, ingest, and index collections at that scale?
• What services should libraries consider supplying to researchers for the analysis and use of such large collections?
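At its core, a study of the geographic spread of news reduces to counting tweets per region per day. A minimal sketch, assuming tweets have already been geocoded into a hypothetical `region` field (real tweet JSON would require deriving this from user location data):

```python
from collections import Counter
from datetime import datetime

def spread_by_region_and_day(tweets):
    """Count tweets per (region, day) pair, a first step toward
    charting how news or new language use spreads over time.

    `tweets` is an iterable of dicts with hypothetical
    `created_at` and `region` fields.
    """
    counts = Counter()
    for t in tweets:
        day = datetime.strptime(t["created_at"],
                                "%Y-%m-%d %H:%M:%S").date()
        counts[(t["region"], day.isoformat())] += 1
    return counts

# Invented sample records standing in for archive content.
tweets = [
    {"created_at": "2012-07-09 08:15:00", "region": "US-East"},
    {"created_at": "2012-07-09 21:40:00", "region": "US-East"},
    {"created_at": "2012-07-10 02:05:00", "region": "UK"},
]
counts = spread_by_region_and_day(tweets)
```

The infrastructure question on the slide is exactly how to run this kind of aggregation over tens of billions of records rather than three.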
Case Study: Viewshare
The Challenges:
• We have different kinds of metadata, and it is messy.
• That data comes from multiple different systems.
• Much of that data is not in the format researchers might wish it was.
Digital cultural heritage collections include temporal, locative, and categorical data that could be tapped to interact with those collections more dynamically and to understand them better.
Viewshare: functionality to review metadata, create visualizations, and expose the raw data for others to create their own views. Data use and re-use.
http://viewshare.org/
Case Study: Viewshare
• Viewshare can be used to analyze and visualize digital collections, and enables users to share not only the end results, but also the raw data for others to create their own views.
• While Viewshare has been launched as a standalone visualization service for digital collections and released as open source software, it is also a candidate for a visualization service on top of a repository datastore.
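The basic operation behind such visualizations is faceting: grouping records by a temporal, locative, or categorical metadata field. A minimal sketch, with invented records and field names (Viewshare itself works on user-uploaded metadata with arbitrary columns):

```python
from collections import defaultdict

def facet(records, field):
    """Group collection records by one metadata field.

    Returns a mapping from each field value to the titles of the
    records carrying it; missing values bucket under "unknown".
    """
    groups = defaultdict(list)
    for record in records:
        groups[record.get(field, "unknown")].append(record["title"])
    return dict(groups)

# Hypothetical map-collection metadata.
records = [
    {"title": "Map of Richmond", "decade": "1860s", "state": "Virginia"},
    {"title": "Harbor View", "decade": "1890s", "state": "Maryland"},
    {"title": "Battle Plan", "decade": "1860s", "state": "Virginia"},
]
by_decade = facet(records, "decade")
```

A timeline view is this grouping keyed by decade; a map view is the same grouping keyed by state.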
There are so many other possible use cases…
One 100 GB delivery of electronic journal articles received via mandatory copyright deposit had over 1 million files in it. The article files and accompanying metadata are highly variable, requiring normalization and description before ingest.
Born-digital broadcast television is coming our way through Copyright registration and deposit.
Petabytes of files have been created through digital conversion at the Culpeper audio and video facility.
What are the questions that all of our institutions are grappling with as we build large digital collections and discover new ways in which they can be managed and used through repository services?
Can each of our organizations support ingest, processing, and (hoped-for) real-time querying of billions of full-text items? Can we support the frequent downloading by researchers of multiple collections that may be over 200 TB each? Should/Can we provide tools for collection analysis and visualization?
The Library of Congress is proceeding on multiple fronts.
The development of a variety of repository services that will be used to ingest and inventory Big Data collections. The ingest and inventory of such collections, other than scale, is basically understood.
How much ingest processing should be done with data collections, or collections that can be treated as data? Do we process collections to create a variety of derivatives that might be used in various forms of analysis before ingesting them? Do we have sufficient infrastructure to create full-text indexes for billions of files to support full discovery? Do we load collections into analytical tools, such as BigInsights or Greenplum? These products are still in their early days at the scale of billions of files.
LC will be benchmarking ingest and indexing processes in three hardware environments: a standard server, a Hadoop cluster, and in the cloud.
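In a Hadoop-style environment, full-text index construction splits into map and reduce phases. A toy in-memory sketch of building an inverted index, with invented document IDs and text; a real cluster job would distribute the same two functions across machines:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs; in a Hadoop job, the mapper."""
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(pairs):
    """Gather postings per term; in a Hadoop job, the reducer."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return index

# Two tiny stand-in documents.
docs = {
    "d1": "historic newspapers OCR",
    "d2": "web archive newspapers",
}
pairs = [pair for doc_id, text in docs.items()
         for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
```

Benchmarking asks how long these phases take on a standard server versus a cluster versus the cloud when `docs` holds billions of files instead of two.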
And what are the service models? If we decide that we will simply provide access to data, do we limit it to the native format or provide pre-processed or on-the-fly format transformation services for downloads? Can we handle the download traffic? Can our staff develop the expertise to provide guidance to researchers in using analytical tools? Or do we leave researchers to fend for themselves?
The Library is increasingly looking towards self-service – researchers need not ask to download or tell us that they have. We may never know. BUT, we do have collections that are limited to on-site-only access due to licenses or gift agreements. In that case, we may have to provide high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them. Both have policy implications and implications for public service staffing.