Leslie Johnston: Big Data at Libraries, Georgetown University Law School Symposium on Big Data, January 2013
Big Data: New Challenges for Digital Preservation and Digital Services. Leslie Johnston, Acting Director, National Digital Information Infrastructure & Preservation Program, Library of Congress
What are the Biggest Insights that we have Learned in Fifteen Years of Building Digital Collections? We can never guess every way that our collections will be used. Researchers do not use digital collections the same way that they use analog collections.
Stewardship organizations have, until recently, spoken of “collections” and “content” and “records” and even “files.” Now it’s also data.
Data is not just generated by satellites, identified during experiments, or collected during surveys. Datasets are not just scientific and business tables and spreadsheets. We have Big Data in our Libraries, Archives and Museums.
What are examples of some of the challenges of collecting and preserving large scale collections in many formats, and making them usable as collections and as data?
More and more researchers want to use collections as a whole, mining and organizing the information in novel ways. Researchers use algorithms to mine the rich information and tools to create pictures that translate that information into knowledge. Researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus.
We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment. Now our collections are, more often than not, self-serve.
National Digital Newspaper Program
chroniclingamerica.loc.gov/
Some researchers want to search for stories in historic newspapers.
Some researchers want to mine newspaper OCR for trends across time periods and geographic areas.
Requests have come in to analyze all 5 million pages.
The site gets approximately 4 million hits per day.
The program has:
• Multiple producers (25 now, ultimately 54)
• Free and open public access
• APIs for machine access and automated processes
• Over 5.25 million newspaper pages ingested to date
• Over 250 TB of data
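The kind of trend mining researchers ask for can be sketched in a few lines. This is only an illustration: the `pages` records and the `term_counts_by_year` helper are hypothetical, and real OCR text would have to be obtained from the program's own access channels rather than hard-coded.

```python
import re
from collections import defaultdict

# Hypothetical sample records: (issue date, OCR text of one page).
pages = [
    ("1905-03-01", "The influenza epidemic spreads. Influenza cases rise."),
    ("1905-07-15", "Harvest reports from the county fair."),
    ("1918-10-20", "Influenza closes schools; influenza wards are full."),
]

def term_counts_by_year(pages, term):
    """Count whole-word, case-insensitive occurrences of `term` per year."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)
    for date, text in pages:
        year = int(date[:4])  # assumes ISO-style YYYY-MM-DD dates
        counts[year] += len(pattern.findall(text))
    return dict(counts)

trend = term_counts_by_year(pages, "influenza")
print(trend)
```

At the scale of millions of pages the same counting logic would need to run as a batch or distributed job, which is exactly the infrastructure question the talk raises.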
Packard Campus National Audio-Visual Center
Preserving Film, Broadcast Television, and Audio
The Packard Campus supports a variety of preservation workflows, including those for obsolete physical formats such as wire recordings, wax cylinders, and 2" videotape. The Campus is fully equipped to play back and preserve all antique film, video and sound formats, and to maintain that capability far into the future.
The facility also handles born-digital video and audio received directly from producers and copyright owners.
Over 3 PB of files.
WEB ARCHIVES
http://www.loc.gov/webarchiving/
lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records. Permission requirements vary by site category.
The collections include:
• U.S. elections
• Web sites created by members of the House and Senate
• Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices
• Collections around an area of study, such as Legal “Blawgs”
When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to know about all those topics, but they used scripts to query for them and sort them into categories. They were not much interested in reading web pages.
Approximately 6 billion files
Over 300 TB
THE TWITTER ARCHIVE
Every public tweet since Twitter’s launch in March 2006.
Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.
The collection comprises only a few TB, but tens of billions of tweets.
A White Paper is available online at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/
eSerials
Copyright Mandatory Deposit represents a large acquisitions channel for the Library. In general, all U.S. publishers are legally required to submit for deposit two copies of each of their publications to the Copyright Office. This mechanism has allowed the Library to build the collection and to preserve the publications.
eSerials became subject to mandatory deposit in January 2010, with the publication of a new interim regulation. Demands began in June 2010 and files began to arrive in October 2010.
The files must come to the Library “as published,” in whatever their original formats are.
Articles may be accompanied by their associated datasets.
RESEARCH DATASETS
The datasets generated/used in the research process.
Datasets can be:
• Small, such as surveys of a small sample population
• Medium, such as a corpus of images
• Big Data, such as years of observational astronomical data
It is not enough to be collecting publications. We have to collect and preserve research data, in addition to recognizing that the collections we already have are Big Data to be mined.
Are our institutions ready? We are building large digital collections and must consider new ways in which they should be managed and used.
I will mention infrastructure only in passing. There are scale issues related to:
• Storage
• Backup and tape archiving
• Bandwidth
• Software development
• Staffing for processing
Library of Congress Preservation Infrastructure
The Library developed the BagIt transfer specification for the movement of files between and within organizations.
• http://www.digitalpreservation.gov/documents/bagitspec.pdf
The Library inventories incoming files, and is gradually inventorying all digital content.
The Library maintains multiple copies of files on servers and on tape, in geographically distributed locations.
The Library has documented sustainability factors for file formats.
• http://www.digitalpreservation.gov/formats/
For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.
• http://www.copyright.gov/circs/circ07b.pdf
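To make the idea of a transfer package concrete, here is a minimal sketch of a BagIt-style bag: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest used to verify the files after transfer. The helper names `make_bag` and `verify_bag` are hypothetical, the choice of MD5 and the version string are assumptions, and the layout should be checked against the specification itself before being relied on.

```python
import hashlib
import tempfile
from pathlib import Path

def make_bag(bag_dir, payload):
    """Write payload files under data/ plus bagit.txt and an MD5 manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration; the version string here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag

def verify_bag(bag_dir):
    """Return the payload paths whose checksum no longer matches the manifest."""
    bag = Path(bag_dir)
    mismatched = []
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(None, 1)
        actual = hashlib.md5((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            mismatched.append(rel_path)
    return mismatched

with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "example-bag", {"a.txt": b"hello", "b.txt": b"world"})
    clean_result = verify_bag(bag)                 # nothing changed yet
    (bag / "data" / "a.txt").write_bytes(b"corrupted in transit")
    damaged_result = verify_bag(bag)               # a.txt no longer matches
    print(clean_result, damaged_result)
```

The receiving organization only needs the manifest and the payload to confirm that what arrived is what was sent, which is what makes the approach practical for large transfers between institutions.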
There are many new activities to be planned for with new researcher uses and expectations.
How much ingest processing should be done with data collections, or collections that can be treated as data?
Should collections be processed to create a variety of derivatives that might be used in various forms of analysis before ingesting them?
Do libraries have sufficient infrastructure to create full-text indexes for millions/billions of files to support full discovery?
Do libraries support analysis? Analytical tools are still in early days for the scale of large datasets.
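For readers unfamiliar with what a full-text index involves, the core structure is an inverted index mapping each term to the documents that contain it. The sketch below is a toy, in-memory version with hypothetical names (`build_index`, `search`) and sample documents; it is not a description of any library's actual system, and indexing billions of files requires distributed tooling well beyond this.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample documents.
docs = {
    "page-1": "Election results announced in the Senate race",
    "page-2": "Senate committee debates the budget",
    "page-3": "Local election draws record turnout",
}
index = build_index(docs)
print(search(index, "senate election"))
```

Even in this toy form the index is roughly as large as the text it covers, which hints at why the storage and processing questions above are not trivial at collection scale.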
And what are the service models?
If libraries decide that they will simply provide access to data, do they limit it to the native format or provide pre-processed or on-the-fly format transformation services for downloads?
Can libraries handle the download traffic?
Can staff develop the expertise to provide guidance to researchers in using analytical tools? Or is the expectation that researchers will fend for themselves?
Libraries are increasingly looking towards self-service: researchers need not ask to download or tell us that they have. We may never know.
BUT, libraries do have collections that are limited to on-site only access due to licenses or gift agreements. In that case, libraries may have to consider providing high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them.
Both have policy implications and implications for public service staffing.
Libraries are managing and preserving the datasets and big data necessary for re-use and replicability. This is an important new role for libraries in enabling new research. And libraries need to make the deposit and management of such data easier to accomplish.