This presentation was provided by John Walsh of the HathiTrust Research Center, during the NISO Hot Topic Virtual Conference "Text and Data Mining." The event was held on May 25, 2022.
1. John A. Walsh, Indiana University
Text Data Mining with HTRC
NISO Virtual Workshop on Text and Data Mining
25 May 2022
2. Text Data Mining with HTRC and SCWAReD
Abstract
The mission of the HathiTrust Research Center (HTRC) is to provide tools,
environments, and services for computational research on the content of the
17-million-volume HathiTrust Digital Library. In this talk, I will provide an overview of
the Text Data Mining (TDM) activities and services provided by HTRC, with
additional detail on two current initiatives: Scholar-Curated Worksets for
Analysis, Re-use, and Dissemination (SCWAReD), supported by the Andrew W.
Mellon Foundation, and Tools for Open Research and Computation with
HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE), supported by
the National Endowment for the Humanities.
3. Text Data Mining with HTRC
Outline
• The HathiTrust Digital Library
• The HathiTrust Research Center
• Organization
• Research tools
• Data Capsule
• Web Algorithms
• Data Sets
• Outreach and education services
• SCWAReD: Scholar-Curated Worksets for Analysis, Re-use, and Dissemination
• TORCHLITE: Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
4. The HathiTrust Digital Library (HTDL)
• Non-profit academic partnership
• 160 member libraries
• Mission to support teaching,
learning, scholarship
5. The HathiTrust Digital Library
• A unique organization
• Demonstration of library
cooperation
• Collaborative with members
• Balances preservation with
access
6. The HathiTrust Digital Library
Mass digitization
• Minimizes curation
• Collection gathered in swathes of
digitization
• 3+ billion page turns by
thousands of scanning staff
Art of Google Books (http://theartofgooglebooks.tumblr.com/) Accessed October
25, 2018.
7. The HathiTrust Digital Library
The collection
• 17+ million volumes
• Grows every day
• Composed of many sub-collections
• U.S. government documents (> 1 million items)
• Scripps Institution of Oceanography (> 100 thousand items)
9. The HathiTrust Digital Library
The collection
• HathiTrust collection mirrors an academic library collection
• Library acquisition and publishing patterns impact content
• Dearth of romance novels (Bode, 2019)
• Follows many library conventions
10. The HathiTrust Digital Library
The collection: examples
Moby Dick; or, The White Whale,
illustrated by Mead Schaeffer, 1923
https://hdl.handle.net/2027/mdp.49015002400035
United States Department of Agriculture yearbook, 1991
https://hdl.handle.net/2027/mdp.39015022545605
11. The HathiTrust Digital Library
The collection: examples
Darwin, Erasmus. The Temple of Nature, 1803.
https://hdl.handle.net/2027/uc2.ark:/13960/t27941p4f
Darwin, Charles. On the Origin of Species.
https://hdl.handle.net/2027/nyp.33433007401833
12. HTRC Mission
Enable and support the computational analysis of the HathiTrust corpus by
developing and implementing:
• secure computational environments,
• tools for text data mining (TDM) and non-consumptive research,
• data sets derived from the HT corpus,
• outreach and education services.
13. HTRC Audience
Audience
The target audience for
HTRC is the worldwide
research community,
particularly the subset of that
community represented by
the HathiTrust membership.
15. HTRC Services
Outreach and Education
• General communication, promotion, outreach
• In-person and virtual workshops
• Testing and documentation of new and existing tools
• Virtual office hours
• HTRC User Group Meetings
• Advanced Collaborative Support program
• Data capsule export reviews
16. HTRC Services
CyberInfrastructure
• Implementation and maintenance of HTRC production IT environment and
Web presence
• Security
• Authentication and user management
• Data Capsules
• Web Analytics
• Implementation of new tools and services
17. HTRC Services
Research Support
• Text Data Mining research
• “Non-consumptive” (a.k.a. “non-expressive”) focus
• Tool development and maintenance
• Dataset development and maintenance
• Extracted Features
• Researcher-created, e.g., “Geographic Locations in English-Language Literature”
• Technical support (with OES) for HTRC users and Advanced Collaborative Support
projects
27. Advanced Collaborative Support (ACS) program
https://cutt.ly/htrc-acs
• 5 rounds of awards
since 2015
• 26 projects
• 41 researchers
• 24 institutions
28. Advanced Collaborative Support (ACS) program
Example projects
• Detecting and Transcribing Arabographic Texts
• David Smith (Northeastern University), Matthew Thomas Miller (University of Maryland), Maxim Romanov
(University of Vienna), and Sarah Bowen Savant (Aga Khan University, London)
• Tracing the Shifting Rhetoric of Ethnoracial Difference in Federal Responses
to Education, 1958-2018
• Andrés Castro Samayoa (Boston College)
• Building Large-Scale Collections of Genre Fiction
• Laure Thompson and David Mimno (Cornell University)
• Deriving Basic Illustration Metadata
• Stephen Krewson (Yale University)
• A Computational History of the U.S. Novel, 1950-2000
• Richard Jean So (McGill University)
https://cutt.ly/htrc-acs
29. Scholar-Curated Worksets for Analysis, Re-use, and Dissemination
(SCWAReD)
• The SCWAReD project, funded by the Andrew W. Mellon Foundation,
will advance the mission of the HTRC by providing scholar-curated
worksets and illustrative, reusable models of HTRC-enabled research,
with an emphasis on content related to historically under-resourced and
marginalized textual communities.
• Flagship workset will be based on the Project on the History of Black
Writing in collaboration with Co-PI Dr. Maryemma Graham (University
of Kansas).
• Four additional researchers/teams have been recruited through our
ACS program to develop additional worksets and research models.
30. SCWAReD Projects
https://www.hathitrust.org/htrc_scwaredACS_awards
• History of Black Writing
• Maryemma Graham (University of Kansas)
• Mining the Native American Authored Works in HathiTrust for Insights
• Kun Lu, Raina Heaton, and Raymond Orr (University of Oklahoma)
• The Black Fantastic: Curated Vocabularies, Artifact Analysis and Identification
• Clarissa West-White (Bethune-Cookman University) and Seretha Williams (Augusta University)
• Creating Period-Specific Worksets for Latin American Fiction
• José Eduardo González (University of Nebraska, Lincoln)
• The National Negro Health Digital Project: Recovering and Restoring a Black Public Health Corpus
• Kim Gallon (Purdue University)
31. What are SCWAReD models?
Illustrative, re-usable research models…
• Scholar-curated worksets focused on a specific topic, discipline, or theme.
• Scholarly introductions to worksets that will address topics such as historical
and cultural context, scope, significance, potential research questions that may
be addressed by the workset, and potential audiences for the workset.
• Analysis of the content in the curated workset.
• Documented derived datasets (e.g., geospatial data, temporal data, named
entities, specialized vocabularies)
• Project white paper that summarizes the overall project, methods, processes,
and findings.
32. TORCHLITE
Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
The TORCHLITE project, supported by the National Endowment for the
Humanities’ Office of Digital Humanities, will:
• Develop an API for access to HTRC Extracted Features data;
• Building on the API, develop an interactive data dashboard with widgets for
analyzing and visualizing volumes and worksets;
• Building on the API, develop a suite of Jupyter notebooks for working with
Extracted Features data;
• Hold public events for engaging with research and development communities to
promote use of the TORCHLITE API, data dashboard, and Jupyter notebooks
and facilitate community development of additional resources.
33. TORCHLITE
Tools for Open Research and Computation with
HathiTrust: Leveraging Intelligent Text Extraction
HathiTrust is a non-profit membership organization founded in 2008. It’s based at the University of Michigan, and there are 160 members serving 220 campuses. The mission of HathiTrust is to support teaching, learning, and scholarship by operating a large-scale digital library.
HathiTrust is unlike other similar organizations in that the library it operates is not a subscription database. Its mission is also different from that of the Internet Archive or Project Gutenberg. It has roots in the Google Books project, but is an example of libraries banding together to build infrastructure for a shared cause. It is non-commercial and non-governmental, and welcomes input from the member community. HathiTrust balances preservation with access by operating a preservation repository of digitized content and providing access, via the digital library website, to everything it is legally able to share.
HathiTrust was built primarily through mass digitization, which is an approach to scanning at scale. Books and book-like objects are scanned shelf by shelf, with little to no decision making about whether to scan individual items. All told, amassing the HathiTrust collection took more than 3 billion page turns by thousands of scanning staff over the last 12 years. You can sometimes find artifacts of that labor in the scans.
The HathiTrust collection contains more than 17 million items digitized at member libraries, which are academic and research libraries, and content is added to it daily. Within such a large body of materials, there are discrete sub-collections, such as U.S. federal government documents and all of the items from the Scripps Institution of Oceanography at UC San Diego.
The content in HathiTrust ranges from the very old to the contemporary, with items dating to the 1500s and a concentration in the 20th century. The predominant language in HathiTrust is English, but the chart on the right shows how the share of English in the collection has gone down over time. About 40% of the collection is in the public domain and 60% is in copyright. See https://www.hathitrust.org/statistics_visualizations.
The overall HathiTrust collection mirrors the print collections from which it was scanned, which belong primarily to large, well-resourced academic libraries. This means that library acquisition and publishing patterns have an impact on the kinds of materials you will find in HathiTrust. For example, while you are likely to find many editions of Pride and Prejudice, you are less likely to find popular fiction, such as romance novels. HathiTrust follows many library conventions in how it describes and displays the material in the repository. This means that certain aspects of HathiTrust infrastructure and practices are very library-like, and occasionally make it challenging to approach the repository as a data source.
Really a mix of services and other things the group does. Not exhaustive, but highlights.
“Inside the Creativity Boom,” Samuel Franklin, Brown University: This project will map the increasing use and shifting meanings of the words “creative” and “creativity,” with a particular focus on the twentieth century. A custom “creativity corpus” will be assembled and processed to identify linguistic patterns via a number of text analysis and natural language processing techniques. Brown’s project will make use of the functionality developed for HathiTrust + Bookworm.
There is one flagship project, the Project on the History of Black Writing, led by Maryemma Graham at the University of Kansas. Then through our ACS program we will be recruiting three additional scholars and teams, resulting in at least four of these research models.
They will be completely realized use cases. The worksets will be documented and reusable by other researchers or in the classroom. The workset may not be relevant to a particular researcher or teacher, but the process and methodology could be used as a model for a workset of different content.
The Digital Collections Strategy Working Group is charged with looking at a more directed and intentional approach to building the HathiTrust collection, with an eye to addressing gaps and increasing the diversity of voices in the collection. I think the group would be quite interested in a similar overview of HTRC's work and a conversation about where HTRC researchers are finding gaps in the collection and where some analytics support from HTRC might help the working group move its charge forward.