Walsh "Text Data Mining with HTRC"

John A. Walsh, Indiana University
Text Data Mining with HTRC
NISO Virtual Workshop on Text and Data Mining
25 May 2022

Text Data Mining with HTRC and SCWAReD
Abstract
The mission of the HathiTrust Research Center (HTRC) is to provide tools,
environments, and services for computational research on the content of the 17-
million-volume HathiTrust Digital Library. In this talk, I will provide an overview of
the Text Data Mining (TDM) activities and services provided by HTRC, with
additional detail on two current initiatives, Scholar Curated Worksets for
Analysis, Re-use, and Dissemination (SCWAReD), supported by the Andrew W.
Mellon Foundation, and Tools for Open Research and Computation with
HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE), supported by
the National Endowment for the Humanities.

Text Data Mining with HTRC
Outline
• The HathiTrust Digital Library
• The HathiTrust Research Center
• Organization
• Research tools
• Data Capsule
• Web Algorithms
• Data Sets
• Outreach and education services
• SCWAReD: Scholar-Curated Worksets for Analysis, Re-use, and Dissemination
• TORCHLITE: Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction

The HathiTrust Digital Library (HTDL)
• Non-profit academic partnership
• 160 member libraries
• Mission to support teaching,
learning, scholarship

The HathiTrust Digital Library
• A unique organization
• Demonstration of library
cooperation
• Collaborative with members
• Balances preservation with
access

Mass digitization
• Minimizes curation
• Collection gathered in swathes of
digitization
• 3+ billion page turns by
thousands of scanning staff
Art of Google Books (http://theartofgooglebooks.tumblr.com/) Accessed October
25, 2018.

The collection
• 17+ million volumes
• Grows every day
• Composed of many sub-collections
• U.S. government documents (> 1 million items)
• Scripps Institution of Oceanography (> 100 thousand items)

[Charts: Publication dates for items in HTDL; Languages of items in HTDL. Languages shown: Arabic, Portuguese, Italian, Russian, Japanese, Chinese, Spanish, French, German, English]

The collection
• HathiTrust collection mirrors an academic library collection
• Library acquisition and publishing patterns impact content
• Dearth of romance novels (Bode, 2019)
• Follows many library conventions

The collection: examples
Moby Dick; or, The White Whale,
illustrated by Mead Schaeffer, 1923
https://hdl.handle.net/2027/mdp.49015002400035
United States Department of Agriculture yearbook, 1991
https://hdl.handle.net/2027/mdp.39015022545605

The collection: examples
Darwin, Erasmus. The Temple of Nature, 1803.
https://hdl.handle.net/2027/uc2.ark:/13960/t27941p4f
Darwin, Charles. On the Origin of Species.
https://hdl.handle.net/2027/nyp.33433007401833
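The handle URLs in these examples share a common shape: the handle prefix 2027, followed by a volume identifier made of a short library code, a dot, and a local identifier. A minimal Python sketch of splitting such a URL into its parts is given here. The function name and the returned field names are hypothetical, chosen for illustration, and the sketch assumes every URL follows the pattern seen in the examples above.

```python
from urllib.parse import urlparse

# Handle prefix observed in the example URLs on these slides.
HANDLE_PREFIX = "2027"


def parse_hathitrust_handle(url):
    """Split a handle URL into volume id, library code, and local id.

    Hypothetical helper: assumes the path looks like
    /2027/<library code>.<local id>, as in the slide examples.
    """
    path = urlparse(url).path.lstrip("/")
    prefix, _, volume_id = path.partition("/")
    if prefix != HANDLE_PREFIX or "." not in volume_id:
        raise ValueError(f"Unexpected handle URL shape: {url}")
    # Split on the first dot only; local ids may contain further
    # punctuation (e.g. "ark:/13960/t27941p4f").
    namespace, _, local_id = volume_id.partition(".")
    return {"volume_id": volume_id, "namespace": namespace, "local_id": local_id}


if __name__ == "__main__":
    for example in (
        "https://hdl.handle.net/2027/mdp.49015002400035",
        "https://hdl.handle.net/2027/uc2.ark:/13960/t27941p4f",
    ):
        print(parse_hathitrust_handle(example))
```

Splitting on the first dot only keeps ARK-style local identifiers intact, which matters for the Erasmus Darwin example.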

HTRC Mission
Enable and support the computational analysis of the HathiTrust corpus by
developing and implementing:
• secure computational environments,
• tools for text data mining (TDM) and non-consumptive research,
• data sets derived from the HT corpus,
• outreach and education services.

HTRC Audience
The target audience for
HTRC is the worldwide
research community,
particularly the subset of that
community represented by
the HathiTrust membership.

HTRC Units
• Outreach and Education
• CyberInfrastructure Operations
• Research Support Services

HTRC Services
Outreach and Education
• General communication, promotion, outreach
• In-person and virtual workshops
• Testing and documentation of new and existing tools
• Virtual office hours
• HTRC User Group Meetings
• Advanced Collaborative Support program
• Data capsule export reviews

HTRC Services
CyberInfrastructure
• Implementation and maintenance of HTRC production IT environment and
Web presence
• Security
• Authentication and user management
• Data Capsules
• Web Analytics
• Implementation of new tools and services

HTRC Services
Research Support
• Text Data Mining research
• “Non-consumptive” (a.k.a. “non-expressive”) focus
• Tool development and maintenance
• Dataset development and maintenance
• Extracted Features
• Researcher-created, e.g., “Geographic Locations in English-Language Literature”
• Technical support (with OES) for HTRC users and Advanced Collaborative Support
projects

Accessing HTRC Tools and Services
• Text Analysis Algorithms
• Extracted Features
• Data Capsules
https://analytics.hathitrust.org

Web-based tools
• Topic modeling
• Word frequency statistics and visualizations
• Named-entity recognition

Bookworm in practice
Samuel Franklin. “Inside the Creativity Boom.”
https://cutt.ly/htrc-bookworm

InPhO Topic Explorer
Scott, Walter. Waverley Novels. 48 vols. Edinburgh: R. Cadell, 1829-1833.

Token Count and Tag Cloud Creator
Scott, Walter. Waverley Novels. 48 vols. Edinburgh: R. Cadell, 1829-1833.

Named Entity Recognizer
Scott, Walter. Waverley Novels. 48 vols. Edinburgh: R. Cadell, 1829-1833.

Extracted Features Dataset
https://cutt.ly/htrc-ef-docs
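To make the idea of non-consumptive, page-level features concrete, here is a minimal Python sketch that reduces each page of text to counts only: number of lines, number of tokens, and unordered token frequencies. It is an illustration of the general approach, not the actual Extracted Features format. The field names ("seq", "line_count", "token_count", "token_frequencies"), the tokenizer, and the sample pages are all assumptions made for this example; the real schema is described in the documentation linked above.

```python
import re
from collections import Counter

# Simplistic tokenizer (an assumption for this sketch): runs of ASCII
# letters, optionally with one internal apostrophe.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def page_features(page_text):
    """Reduce one page to counts; word order is not retained."""
    tokens = [t.lower() for t in TOKEN_RE.findall(page_text)]
    return {
        "line_count": len(page_text.splitlines()),
        "token_count": len(tokens),
        "token_frequencies": dict(Counter(tokens)),
    }


def volume_features(pages):
    """Build a list of per-page feature records, numbered from 1."""
    return [
        {"seq": seq, **page_features(text)}
        for seq, text in enumerate(pages, start=1)
    ]


if __name__ == "__main__":
    # Made-up sample pages, not taken from any HathiTrust volume.
    sample_pages = ["the whale and the sea\nthe ship", "a quiet sea"]
    for record in volume_features(sample_pages):
        print(record)
```

Because each record holds only counts, the original running text cannot be read back from it, while word-frequency analysis across pages and volumes remains possible.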

Data Capsule
Secure computing environment

Advanced Collaborative Support (ACS) program
https://cutt.ly/htrc-acs
• 5 rounds of awards
since 2015
• 26 projects
• 41 researchers
• 24 institutions

Advanced Collaborative Support (ACS) program
Example projects
• Detecting and Transcribing Arabographic Texts
• David Smith (Northeastern University), Matthew Thomas Miller (University of Maryland), Maxim Romanov
(University of Vienna), and Sarah Bowen Savant (Aga Khan University, London)
• Tracing the Shifting Rhetoric of Ethnoracial Difference in Federal Responses
to Education, 1958-2018
• Andrés Castro Samayoa (Boston College)
• Building Large-Scale Collections of Genre Fiction
• Laure Thompson and David Mimno (Cornell University)
• Deriving Basic Illustration Metadata
• Stephen Krewson (Yale University)
• A Computational History of the U.S. Novel, 1950-2000
• Richard Jean So (McGill University)
https://cutt.ly/htrc-acs

Scholar-Curated Worksets for Analysis, Re-use, and Dissemination
(SCWAReD)
• The SCWAReD project, funded by the Andrew W. Mellon Foundation,
will advance the mission of the HTRC by providing scholar-curated
worksets and illustrative, reusable models of HTRC-enabled research,
with an emphasis on content related to historically under-resourced and
marginalized textual communities.
• Flagship workset will be based on the Project on the History of Black
Writing in collaboration with Co-PI Dr. Maryemma Graham (University
of Kansas).
• Four additional researchers/teams have been recruited through our
ACS program to develop additional worksets and research models.

SCWAReD Projects
https://www.hathitrust.org/htrc_scwaredACS_awards
• History of Black Writing
• Maryemma Graham (University of Kansas)
• Mining the Native American Authored Works in HathiTrust for Insights
• Kun Lu, Raina Heaton, and Raymond Orr (University of Oklahoma)
• The Black Fantastic: Curated Vocabularies, Artifact Analysis and Identification
• Clarissa West-White (Bethune-Cookman University) and Seretha Williams (Augusta University)
• Creating Period-Specific Worksets for Latin American Fiction
• José Eduardo González (University of Nebraska, Lincoln)
• The National Negro Health Digital Project: Recovering and Restoring a Black Public Health Corpus
• Kim Gallon (Purdue University)

What are SCWAReD models?
Illustrative, re-usable research models…
• Scholar-curated worksets focused on a specific topic, discipline, or theme.
• Scholarly introductions to worksets that will address topics such as historical
and cultural context, scope, significance, potential research questions that may
be addressed by the workset, and potential audiences for the workset.
• Analysis of the content in the curated workset.
• Documented derived datasets (e.g., geospatial data, temporal data, named
entities, specialized vocabularies)
• Project white paper that summarizes the overall project, methods, processes,
and findings.

TORCHLITE
Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
The TORCHLITE project, supported by the National Endowment for the
Humanities’ Office of Digital Humanities, will:
• Develop an API for access to HTRC Extracted Features data;
• Building on the API, develop an interactive data dashboard with widgets for
analyzing and visualizing volumes and worksets;
• Building on the API, develop a suite of Jupyter notebooks for working with
Extracted Features data;
• Host public events to engage research and development communities,
promote use of the TORCHLITE API, data dashboard, and Jupyter notebooks,
and facilitate community development of additional resources.


Thank you, questions, discussion…
jawalsh@indiana.edu
htrc-help@hathitrust.org
