Leslie Johnston Keynote, Best Practices Exchange 2011
From Records to Data:It’s Not Just AboutCollections Any More Leslie Johnston, Library of Congress Best Practices Exchange 2011
What are the Biggest Insights that we have Learned in Fifteen Years ofBuilding Digital Collections?
Researchers do not use digitalcollections the same way that they use analog collections
We Can Never Guess EveryWay that Our Collections Will Be Used
Stewardship organizationshave, until recently, spoken of“collections” or “content” or“records” or even “files,” butnot data.
We Have Data in our Libraries, Archives and Museums? Yes.Data is not just generated by satellites, identified during experiments, or collected during surveys.
Datasets are not just scientific and businesstables and spreadsheets: our collections arenow considered data.They are the building blocks for interpretationand discovery that transform and combinethem into entities that we may not recognize.
More and more researchers want to usecollections as a whole, mining and organizingthe information in novel ways.Researchers use algorithms to mine the richinformation and tools to create pictures thattranslate that information into knowledge.Researchers may want to interact with acollection of artifacts, or they may want towork with a data corpus.
Consider the Digging Into DataChallengeThe repositories available for research include not onlyscientific information—astronomy, geology, physics, biology,social science surveys—but images, film, sound,newspapers, maps, art, archaeology, architecture andgovernment records. http://www.diggingintodata.org/
What Constitutes “Big Data?”The definition of Big Data is very fluid, as it is a movingtarget — what cannot be easily manipulated with commontools — and specific to the organization: what can bemanaged and stewarded by any one institution in itsinfrastructure. One researcher or organization’s concept ofa large data set is small to another.Not too long ago, an organization would be surprised toneed 10 TB of storage for a large digital collection. Now acollection can increase by 10 TB in a single week.
We still have collections. But what we alsohave is Big Data, which requires us to rethinkthe infrastructure that is needed to supportBig Data services. Our community used toexpect researchers to come to us, ask usquestions about our collections, and use ourdigital collections in our environment.Now our collections are, more often than not,self-serve.
Case Study: Web Archives • Web Archives, such as the one at the Library of Congress, may be comprised of billions of files. • When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to know about all those topics, but they used scripts to query for them and sort them into categories. They were not very much interested in reading web pages. http://www.loc.gov/webarchiving/
Case Study: Historic Newspapers • The Chronicling America collection has over 4 million page images from historic newspapers with OCR from organizations in 25 states. • The site gets approximately 4 million views per day. • Some researchers want to search for stories in historic newspapers. • Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. • Requests have come in to analyze all 4 million page images. http://chroniclingamerica.loc.gov/
Case Study: Twitter • The Twitter archive has 10s of billions of tweets in it. • Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language. social science visualization social media status events personal privacy commercial
Can each of our organizations support real-time querying of billions of full-textitems? Can we provide tools for collectionanalysis and visualization? Can we supportthe frequent downloading by researchers ofcollections that may be over 200 TB each?These are among the questions that all of ourinstitutions are grappling with as we buildlarge digital collections and discover newways in which they can be used.
So what are ourinstitutions doingabout preservationand access to ourBig Collections andBig Data?
Collaboration www.digitalpreservation.gov/ndsaThe National Digital Stewardship Alliance is aninitiative of the National Digital InformationInfrastructure and Preservation Program at theLibrary of Congress, with almost 100 memberorganizations that share a sense of dedication todigital preservation, and want to workcollaboratively across the community.The NDSA operates through five working groups:Content; Standards and Practices; Infrastructure;Innovation; and Outreach.
Tool DevelopmentAll stewardship organizations can and shouldparticipate in the development and use of openaccess tools for use across the community.NDIIPP is revising its Tools and ServicesDirectory to include a broader range of projects,some of which are always looking for otherorganizations to contribute to!http://www.digitalpreservation.gov/partners/resources/tools
As an Example…Seeing and Sharing Digital CulturalHeritage Collections Differentlywith ViewShare/Recollection
bigish ideas› heterogeneous data› one big distributed collection› open distributed infrastructure› mindset: records -> data
share data and viewsshare not only the end results, butalso the raw data for other others tocreate their own views.data use and re-use
recent work› support for public/private views and data› beta support for OAI and ContentDM dataloading› full open source release on SourceForge:http://sourceforge.net/projects/loc-recollect/
what’s next?› viewshare.org public launch onNovember 1, 2011› big data sets: in a while› remix across data sets: long view
contact us› Let us know if you are interested inparticipation in the NDSA through the website› Let us know if there is a tool or service thatis missing from our directory› visit http://recollection.zepheira.com/ to geta sneak peek at ViewShare› email NDIIPPaccess@loc.gov if you areinterested in a ViewShare account