Chuck Henry and Christa Williford, DLF Forum, November 2011

Lessons from the Digging Into Data Challenge: What Information Professionals Should Know about Computationally Intensive Research in the Humanities and Social Sciences

For the past two years, the Council on Library and Information Resources (CLIR) has partnered with the National Endowment for the Humanities Office of Digital Humanities (NEH‐ODH) in an intensive assessment of the inaugural year of the Digging Into Data grant program. Launched in 2009, this unprecedented international initiative involved four funding agencies in three countries and supported eight international collaborative research projects in the social sciences and humanities, all of which bring innovative applications of computer technology to bear on the collection, mining, and interpretation of large data corpora. Here is a sampling of what CLIR has learned:

Lesson 1: Computationally intensive research requires open sharing of resources among participants.

Essential resources include hardware, software, data corpora, and communication tools. Information professionals can facilitate open sharing by helping researchers forge partnership agreements based upon trust and transparency.

Example: To support the project “Digging Into Data to Answer Authorship Related Questions,” participants drafted a Memorandum of Understanding that made clear how shared resources would be funded and established a plan for project communication and credit sharing. See: Michael Simeone, Jennifer Guiliano, Rob Kooper, and Peter Bajcsy, "Digging into Data Using New Collaborative Infrastructures Supporting Humanities‐based Computer Science Research."
First Monday 16.5 (2 May 2011): http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/viewArticle/3372/2950

Lesson 2: Computationally intensive research projects rely upon diverse kinds of expertise: domain (or subject) expertise, analytical expertise, data management expertise, and project management expertise.

Information professionals can offer and/or develop skills and knowledge in each of these areas, enabling them to participate actively as research partners.

Example: For their project, “Digging Into the Enlightenment: Mapping the Republic of Letters,” Stanford University provided resources and project management support to its international partners through “embedded” information professional Nicole Coleman, who is based at the Stanford Humanities Center. As Academic Technology Specialist, Coleman focuses on finding new research opportunities and supporting the production of new knowledge, and she has developed expertise in the kinds of infrastructure and management practices that contribute to successful research collaborations. For more information about this project, see: http://enlightenment.humanitiesnetwork.org/

Lesson 3: When it comes to analytical tools, one size does not fit all.

As their questions evolve throughout their projects, researchers want the flexibility to alternate between looking closely at select data and performing “distant” readings of entire corpora. Information professionals can educate researchers to help them refine their questions, select appropriate tools, and use their tools effectively.

Example: While both close and distant readings of evidence characterized most of the Digging Into Data project methodologies, Richard Healey, co‐principal investigator of “Railroads and the Making of Modern America,” has an interesting take on why humanities and social science data requires the continual adaptation and evolution of analytical tools.
He hypothesizes a number of “different levels of data‐related operations,” each of which determines the research outcomes that are possible. He writes:

The levels relate to the degree of scholarly input involved and I see them…as a data ‘hierarchy’:

• Level 0 ‐ Data so riddled with error it should come with a serious intellectual health warning! (We have much more of this than most people seem willing to admit, and much of the Google data from scanned railroad reports admirably fits into this category.)

• Level 1 ‐ Raw datasets…corrected for obvious errors.
• Level 2 ‐ Value‐added datasets: those that have been standardised/coded etc. in a consistent fashion according to some recognised scheme or procedure, which may require significant domain expertise [to produce]….

• Level 3 ‐ Integrated data resources: These will contain value‐added datasets but…explicit linkages have been made between multiple related datasets (or they have been coded/tagged in such a way that the linkages can be made by software). Hence, these are not just data, because so much additional research time has been invested in them, which is why I prefer the word ‘resource’…. Many GIS resources are of this kind, because they require linkage of spatial and non‐spatial data.

• Level 4 ‐ Digging Enabler or Digging Key data/classificatory resources: These require extensive domain expertise, and use of/analysis of multiple sources/relevant literature to create. They enable extensive additional types of digging activity to be undertaken on substantive projects beyond those of the investigators who created them, i.e. they become authority files for the wider research community. Gazetteers, structured occupational coding systems, data cross‐classifiers etc. fit into this category.

Lesson 4: Big data isn’t just for scientists anymore.

Not only do humanists and social scientists work with big data, their research can also produce large data corpora. Some scholars engaged in computationally intensive research see the new data they create as their most significant research outcomes. Researchers risk losing their valuable data unless they take steps to protect and sustain them. As practices for publishing research data evolve, information professionals can curate these data, working with scholars to appraise, normalize, validate, provide access to, and, ultimately, preserve research data for the long term.
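One small, concrete form the “validate” and “preserve” steps can take is fixity checking: recording a cryptographic checksum for every file in a corpus, then re-checking later to detect silent corruption or loss. The sketch below is illustrative only; no Digging Into Data project is implied to use this code, and the directory layout is hypothetical, but checksum manifests of this kind are a standard element of curation workflows.

```python
import hashlib
from pathlib import Path


def fixity_manifest(corpus_dir: str) -> dict:
    """Record a SHA-256 checksum for every file under a corpus directory."""
    manifest = {}
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(corpus_dir))] = digest
    return manifest


def verify(corpus_dir: str, manifest: dict) -> list:
    """Return the files whose current checksums no longer match the manifest."""
    current = fixity_manifest(corpus_dir)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

A curator might build the manifest when a corpus is deposited and re-run `verify` on a schedule; any file the second call returns has changed or disappeared since deposit.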
Example: In the final white paper for “Mining a Year of Speech,” John Coleman draws a compelling comparison between the sizes of data sets with which current major science and humanities projects are engaged (see below). This paper is available at: http://www.phon.ox.ac.uk/files/pdfs/MiningaYearofSpeechWhitePaper.pdf
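To make the scale of such corpora concrete, a back-of-envelope estimate helps. The recording parameters below are illustrative assumptions (a common speech-corpus format), not figures taken from Coleman's white paper:

```python
# Rough storage footprint of a year of continuously recorded speech.
# Assumed parameters: 16 kHz, 16-bit mono, uncompressed PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
HOURS_PER_YEAR = 24 * 365

bytes_per_hour = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 3600   # 115,200,000
bytes_per_year = bytes_per_hour * HOURS_PER_YEAR

print(f"{bytes_per_hour / 1e6:.0f} MB per hour")   # 115 MB
print(f"{bytes_per_year / 1e12:.2f} TB per year")  # 1.01 TB
```

Even under these modest assumptions, a single year of audio runs to roughly a terabyte before transcripts, annotations, and derived data are added, which is why curation and preservation planning matter from the start of such projects.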