John A. Walsh, Indiana University
Text Data Mining with HTRC
NISO Virtual Workshop on Text and Data Mining
25 May 2022
Text Data Mining with HTRC and SCWAReD
Abstract
The mission of the HathiTrust Research Center (HTRC) is to provide tools,
environments, and services for computational research on the content of the
17-million-volume HathiTrust Digital Library. In this talk, I will provide an overview of
the Text Data Mining (TDM) activities and services provided by HTRC, with
additional detail on two current initiatives: Scholar-Curated Worksets for
Analysis, Re-use, and Dissemination (SCWAReD), supported by the Andrew W.
Mellon Foundation, and Tools for Open Research and Computation with
HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE), supported by
the National Endowment for the Humanities.
Text Data Mining with HTRC
Outline
• The HathiTrust Digital Library
• The HathiTrust Research Center
• Organization
• Research tools
• Data Capsule
• Web Algorithms
• Data Sets
• Outreach and education services
• SCWAReD: Scholar-Curated Worksets for Analysis, Re-use, and Dissemination
• TORCHLITE: Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
The HathiTrust Digital Library (HTDL)
• Non-profit academic partnership
• 160 member libraries
• Mission to support teaching,
learning, scholarship
The HathiTrust Digital Library
• A unique organization
• Demonstration of library
cooperation
• Collaborative with members
• Balances preservation with
access
The HathiTrust Digital Library
Mass digitization
• Minimizes curation
• Collection gathered in swathes of
digitization
• 3+ billion page turns by
thousands of scanning staff
Art of Google Books (http://theartofgooglebooks.tumblr.com/) Accessed October
25, 2018.
The HathiTrust Digital Library
The collection
• 17+ million volumes
• Grows every day
• Composed of many sub-collections
• U.S. government documents (> 1 million items)
• Scripps Institution of Oceanography (> 100 thousand items)
Figures: Publication dates for items in HTDL; Languages of items in HTDL (Arabic, Portuguese, Italian, Russian, Japanese, Chinese, Spanish, French, German, English)
The HathiTrust Digital Library
The collection
• HathiTrust collection mirrors an academic library collection
• Library acquisition and publishing patterns impact content
• Dearth of romance novels (Bode, 2019)
• Follows many library conventions
The HathiTrust Digital Library
The collection: examples
Moby Dick; or, The White Whale,
illustrated by Mead Schaeffer, 1923
https://hdl.handle.net/2027/mdp.49015002400035
United States Department of Agriculture yearbook, 1991
https://hdl.handle.net/2027/mdp.39015022545605
The HathiTrust Digital Library
The collection: examples
Darwin, Erasmus. The Temple of Nature, 1803.
https://hdl.handle.net/2027/uc2.ark:/13960/t27941p4f
Darwin, Charles. On the Origin of Species.
https://hdl.handle.net/2027/nyp.33433007401833
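The example links above share one pattern: a fixed handle prefix followed by a volume identifier. A minimal Python sketch of that pattern follows. The function names are hypothetical, and reading the text before the first period as a source-library namespace is an assumption drawn from the examples on these slides, not something the slides state.

```python
# Prefix shared by every example link on the slides.
HANDLE_PREFIX = "https://hdl.handle.net/2027/"


def handle_url(volume_id: str) -> str:
    """Build a handle URL from a volume identifier such as 'mdp.39015022545605'."""
    return HANDLE_PREFIX + volume_id


def volume_id_from_handle(url: str) -> str:
    """Recover the volume identifier from a handle URL."""
    if not url.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a HathiTrust handle URL: {url}")
    return url[len(HANDLE_PREFIX):]


def split_volume_id(volume_id: str) -> tuple[str, str]:
    """Split at the FIRST period only (assumed: namespace, then local id).

    Local ids may themselves contain punctuation, as in 'uc2.ark:/13960/t27941p4f'.
    """
    namespace, _, local_id = volume_id.partition(".")
    return namespace, local_id


if __name__ == "__main__":
    print(handle_url("mdp.39015022545605"))
    print(split_volume_id("uc2.ark:/13960/t27941p4f"))
```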
HTRC Mission
Enable and support the computational analysis of the HathiTrust corpus by
developing and implementing:
• secure computational environments,
• tools for text data mining (TDM) and non-consumptive research,
• data sets derived from the HT corpus,
• outreach and education services.
HTRC Audience
The target audience for
HTRC is the worldwide
research community,
particularly the subset of that
community represented by
the HathiTrust membership.
HTRC Units
Outreach
and
Education
CyberInfrastructure
Operations
Research
Support
Services
HTRC Services
Outreach and Education
• General communication, promotion, outreach
• In-person and virtual workshops
• Testing and documentation of new and existing tools
• Virtual office hours
• HTRC User Group Meetings
• Advanced Collaborative Support program
• Data capsule export reviews
HTRC Services
CyberInfrastructure
• Implementation and maintenance of HTRC production IT environment and
Web presence
• Security
• Authentication and user management
• Data Capsules
• Web Analytics
• Implementation of new tools and services
HTRC Services
Research Support
• Text Data Mining research
• “Non-consumptive” (a.k.a. “non-expressive”) focus
• Tool development and maintenance
• Dataset development and maintenance
• Extracted Features
• Researcher-created, e.g., “Geographic Locations in English-Language Literature”
• Technical support (with OES) for HTRC users and Advanced Collaborative Support
projects
Accessing HTRC Tools and Services
• Text Analysis Algorithms
• Extracted Features
• Data Capsules
https://analytics.hathitrust.org
Web-based tools
• Topic modeling
• Word frequency statistics and visualizations
• Named-entity recognition
https://analytics.hathitrust.org
Bookworm in practice
Samuel Franklin. “Inside the Creativity Boom.”
https://cutt.ly/htrc-bookworm
InPhO Topic Explorer
Scott, Walter. Waverley Novels. 48 vols.
Edinburgh: R. Cadell, 1829-1833.
Token Count and Tag Cloud Creator
Scott, Walter. Waverley Novels.
48 vols. Edinburgh: R. Cadell,
1829-1833.
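To make the idea behind a token count and tag cloud concrete, here is a minimal Python sketch. It is not the HTRC algorithm: the tokenization rule, the function names, and the weighting scheme are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase alphabetic words, allowing one internal apostrophe.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_tokens(text: str) -> Counter:
    """Count how often each token occurs in the text."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


def top_tokens(counts: Counter, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent tokens with their counts."""
    return counts.most_common(n)


def cloud_weights(counts: Counter, n: int = 10) -> dict[str, float]:
    """Scale the top-n counts to the range (0, 1] for sizing words in a tag cloud."""
    top = counts.most_common(n)
    if not top:
        return {}
    largest = top[0][1]
    return {token: count / largest for token, count in top}


if __name__ == "__main__":
    sample = "The whale, the WHALE and the sea."
    counts = count_tokens(sample)
    print(top_tokens(counts, 3))
    print(cloud_weights(counts, 3))
```

A tag cloud renderer would map each weight to a font size; the weights carry word frequencies only, not the running text.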
Named Entity Recognizer
Scott, Walter. Waverley Novels.
48 vols. Edinburgh: R. Cadell,
1829-1833.
Extracted Features Dataset
https://cutt.ly/htrc-ef-docs
Data Capsule
Secure computing environment
Advanced Collaborative Support (ACS) program
https://cutt.ly/htrc-acs
• 5 rounds of awards since 2015
• 26 projects
• 41 researchers
• 24 institutions
Advanced Collaborative Support (ACS) program
Example projects
• Detecting and Transcribing Arabographic Texts
• David Smith (Northeastern University), Matthew Thomas Miller (University of Maryland), Maxim Romanov
(University of Vienna), and Sarah Bowen Savant (Aga Khan University, London)
• Tracing the Shifting Rhetoric of Ethnoracial Difference in Federal Responses
to Education, 1958-2018
• Andrés Castro Samayoa (Boston College)
• Building Large-Scale Collections of Genre Fiction
• Laure Thompson and David Mimno (Cornell University)
• Deriving Basic Illustration Metadata
• Stephen Krewson (Yale University)
• A Computational History of the U.S. Novel, 1950-2000
• Richard Jean So (McGill University)
https://cutt.ly/htrc-acs
Scholar-Curated Worksets for Analysis, Re-use, and Dissemination
(SCWAReD)
• The SCWAReD project, funded by the Andrew W. Mellon Foundation,
will advance the mission of the HTRC by providing scholar-curated
worksets and illustrative, reusable models of HTRC-enabled research,
with an emphasis on content related to historically under-resourced and
marginalized textual communities.
• Flagship workset will be based on the Project on the History of Black
Writing in collaboration with Co-PI Dr. Maryemma Graham (University
of Kansas).
• Four additional researchers/teams have been recruited through our
ACS program to develop additional worksets and research models.
SCWAReD Projects
https://www.hathitrust.org/htrc_scwaredACS_awards
• History of Black Writing
• Maryemma Graham (University of Kansas)
• Mining the Native American Authored Works in HathiTrust for Insights
• Kun Lu, Raina Heaton, and Raymond Orr (University of Oklahoma)
• The Black Fantastic: Curated Vocabularies, Artifact Analysis and Identification
• Clarissa West-White (Bethune-Cookman University) and Seretha Williams (Augusta University)
• Creating Period-Specific Worksets for Latin American Fiction
• José Eduardo González (University of Nebraska, Lincoln)
• The National Negro Health Digital Project: Recovering and Restoring a Black Public Health Corpus
• Kim Gallon (Purdue University)
What are SCWAReD models?
Illustrative, re-usable research models…
• Scholar-curated worksets focused on a specific topic, discipline, or theme.
• Scholarly introductions to worksets that will address topics such as historical
and cultural context, scope, significance, potential research questions that may
be addressed by the workset, and potential audiences for the workset.
• Analysis of the content in the curated workset.
• Documented derived datasets (e.g., geospatial data, temporal data, named
entities, specialized vocabularies)
• Project white paper that summarizes the overall project, methods, processes,
and findings.
TORCHLITE
Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
The TORCHLITE project, supported by the National Endowment for the
Humanities’ Office of Digital Humanities, will:
• Develop an API for access to HTRC Extracted Features data;
• Building on the API, develop an interactive data dashboard with widgets for
analyzing and visualizing volumes and worksets;
• Building on the API, develop a suite of Jupyter notebooks for working with
Extracted Features data;
• Hold public events that engage research and development communities to
promote use of the TORCHLITE API, data dashboard, and Jupyter notebooks
and facilitate community development of additional resources.
Thank you, questions, discussion…
jawalsh@indiana.edu
htrc-help@hathitrust.org
https://analytics.hathitrust.org

Walsh "Text Data Mining with HTRC"

  • 1. John A. Walsh, Indiana University Text Data Mining with HTRC NISO Virtual Workshop on Text and Data Mining 25 May 2022
  • 2. Text Data Mining with HTRC and SCWAReD Abstract The mission of the HathiTrust Research Center (HTRC) is to provide tools, environments, and services for computational research on the content of the 17- million-volume HathiTrust Digital Library. In this talk, I will provide an overview of the Text Data Mining (TDM) activities and services provided by HTRC, with additional detail on two current initiatives, Scholar Curated Worksets for Analysis, Re-use, and Dissemination (SCWAReD), supported by the Andrew W. Mellon Foundation, and Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE), supported by the National Endowment for the Humanities.
  • 3. Text Data Mining with HTRC Outline • The HathiTrust Digital Library • The HathiTrust Research Center • Organization • Research tools • Data Capsule • Web Algorithms • Data Sets • Outreach and education services • SCWAReD: Scholar-Curated Worksets for Analysis Re-Use and Dissemination • TORCHLITE: Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
  • 4. The HathiTrust Digital Library (HTDL) • Non-profit academic partnership • 160 member libraries • Mission to support teaching, learning, scholarship
  • 5. The HathiTrust Digital Library • A unique organization • Demonstration of library cooperation • Collaborative with members • Balances preservation with access
  • 6. The HathiTrust Digital Library Mass digitization • Minimizes curation • Collection gathered in swathes of digitization • 3+ billion page turns by thousands of scanning staff Art of Google Books (http://theartofgooglebooks.tumblr.com/) Accessed October 25, 2018.
  • 7. The HathiTrust Digital Library The collection • 17+ million volumes • Grows every day • Composed of many sub-collections • U.S. government documents (> 1 million items) • Scripps Institution of Oceanography (> 100 thousand items)
  • 9. The HathiTrust Digital Library The collection • HathiTrust collection mirrors an academic library collection • Library acquisition and publishing patterns impact content • Dearth of romance novels (Bode, 2019) • Follows many library conventions
  • 10. The HathiTrust Digital Library The collection: examples Moby Dick; or, The White Whale, illustrated by Mead Schaeffer, 1923 https://hdl.handle.net/2027/mdp.49015002400035 United States Department of Agriculture yearbook, 1991 https://hdl.handle.net/2027/mdp.39015022545605
  • 11. The HathiTrust Digital Library The collection: examples Darwin, Erasmus. The Temple of Nature, 1803. https://hdl.handle.net/2027/uc2.ark:/13960/t27941p4f Darwin, Charles. On the Origin of Species. https://hdl.handle.net/2027/nyp.33433007401833
  • 12. HTRC Mission Enable and support the computational analysis of the HathiTrust corpus by developing and implementing: • secure computational environments, • tools for text data mining (TDM) and non-consumptive research, • data sets derived from the HT corpus, • outreach and education services.
  • 13. HTRC Audience Audience The target audience for HTRC is the worldwide research community, particularly the subset of that community represented by the HathiTrust membership.
  • 15. HTRC Services Outreach and Education • General communication, promotion, outreach • In-person and virtual workshops • Testing and documentation of new and existing tools • Virtual office hours • HTRC User Group Meetings • Advanced Collaborative Support program • Data capsule export reviews
  • 16. HTRC Services CyberInfrastructure • Implementation and maintenance of HTRC production IT environment and Web presence • Security • Authentication and user management • Data Capsules • Web Analytics • Implementation of new tools and services
  • 17. HTRC Services Research Support • Text Data Mining research • “Non-consumptive” (a.k.a. “non-expressive”) focus • Tool development and maintenance • Dataset development and maintenance • Extracted Features • Researcher-created, e.g., “Geographic Locations in English-Language Literature” • Technical support (with OES) for HTRC users and Advanced Collaborative Support projects
  • 18. Accessing HTRC Tools and Services • Text Analysis Algorithms • Extracted Features • Data Capsules https://analytics.hathitrust.org
  • 19. Web-based tools • Topic modeling • Word frequency statistics and visualizations • Named-entity recognition https://analytics.hathitrust.org
  • 20. Bookworm in practice Samuel Franklin. “Inside the Creativity Boom.” https://cutt.ly/htrc-bookworm
  • 21. InPhO Topic Explorer Scott, Walter. Waverley Novels. 48 vols. Edinburgh: R. Cadell, 1829-1833.
  • 22. Token Count and Tag Cloud Creator Scott, Walter. Waverley Novels. 48 vols. Edinburgh: R. Cadell, 1829-1833.
  • 23. Named Entity Recognizer Scott, Walter. Waverley Novels. 48 vols. Edinburgh: R. Cadell, 1829-1833.
  • 27. Advanced Collaborative Support (ACS) program https://cutt.ly/htrc-acs • 5 rounds of awards since 2015 • 26 projects • 41 researchers • 24 institutions
  • 28. Advanced Collaborative Support (ACS) program Example projects • Detecting and Transcribing Arabographic Texts • David Smith (Northeastern University), Matthew Thomas Miller (University of Maryland), Maxim Romanov (University of Vienna), and Sarah Bowen Savant (Aga Khan University, London) • Tracing the Shifting Rhetoric of Ethnoracial Difference in Federal Responses to Education, 1958-2018 • Andrés Castro Samayoa (Boston College) • Building Large-Scale Collections of Genre Fiction • Laure Thompson and David Mimno (Cornell University) • Deriving Basic Illustration Metadata • Stephen Krewson (Yale University) • A Computational History of the U.S. Novel, 1950-2000 • Richard Jean So (McGill University) https://cutt.ly/htrc-acs
  • 29. Scholar-Curated Worksets for Analysis, Re-use, and Dissemination (SCWAReD) • The SCWAReD project, funded by the Andrew W. Mellon Foundation, will advance the mission of the HTRC by providing scholar-curated worksets and illustrative, reusable models of HTRC-enabled research, with an emphasis on content related to historically under-resourced and marginalized textual communities. • Flagship workset will be based on the Project on the History of Black Writing in collaboration with Co-PI Dr. Maryemma Graham (University of Kansas). • Four additional researchers/teams have been recruited through our ACS program to develop additional worksets and research models.
  • 30. SCWAReD Projects https://www.hathitrust.org/htrc_scwaredACS_awards • History of Black Writing • Maryemma Graham (University of Kansas) • Mining the Native American Authored Works in HathiTrust for Insights • Kun Lu, Raina Heaton, and Raymond Orr (University of Oklahoma) • The Black Fantastic: Curated Vocabularies, Artifact Analysis and Identification • Clarissa West-White (Bethune-Cookman University) and Seretha Williams (Augusta University) • Creating Period-Specific Worksets for Latin American Fiction • José Eduardo González (University of Nebraska, Lincoln) • The National Negro Health Digital Project: Recovering and Restoring a Black Public Health Corpus • Kim Gallon (Purdue University)
  • 31. What are SCWAReD models? Illustrative, re-usable research models… • Scholar-curated worksets focused on a specific topic, discipline, or theme. • Scholarly introductions to worksets that will address topics such as historical and cultural context, scope, significance, potential research questions that may be addressed by the workset, and potential audiences for the workset. • Analysis of the content in the curated workset. • Documented derived datasets (e.g., geospatial data, temporal data, named entities, specialized vocabularies) • Project white paper that summarizes the overall project, methods, processes, and findings.
  • 32. TORCHLITE Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction The TORCHLITE project, supported by the National Endowment for the Humanities’ Office of Digital Humanities, will: • Develop an API for access to HTRC Extracted Features data; • Building on the API, develop an interactive data dashboard with widgets for analyzing and visualizing volumes and worksets; • Building on the API, develop a suite of Jupyter notebooks for working with Extracted Features data; • Hold public events to engage research and development communities, promote use of the TORCHLITE API, data dashboard, and Jupyter notebooks, and facilitate community development of additional resources.
  • 33. TORCHLITE Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
  • 34. TORCHLITE Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction
  • 35. Thank you, questions, discussion… jawalsh@indiana.edu htrc-help@hathitrust.org https://analytics.hathitrust.org
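
The “non-consumptive” focus named on slide 17 can be illustrated with a small sketch: rather than exposing readable text, a derived dataset records only per-page token counts, which still support word-frequency statistics of the kind listed on slide 19. The sample pages, the function names, and the simple regular-expression tokenizer below are all assumptions made for illustration; this is not the HTRC Extracted Features format or any HTRC tool.

```python
import re
from collections import Counter

# Hypothetical sample pages, invented for this sketch.
PAGES = [
    "The whale, the whale! cried the sailor.",
    "The sea was calm, and the whale was gone.",
]

# Assumed tokenizer: lowercase words, with an optional internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def page_token_counts(page_text):
    """Return a bag-of-words Counter for one page (word order is discarded)."""
    return Counter(TOKEN_RE.findall(page_text.lower()))


def volume_token_counts(pages):
    """Sum the per-page counts into one volume-level Counter."""
    total = Counter()
    for page in pages:
        total.update(page_token_counts(page))
    return total


def top_tokens(counts, n):
    """Return the n most frequent tokens, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    counts = volume_token_counts(PAGES)
    for token, freq in top_tokens(counts, 3):
        print(token, freq)
```

Because only counts are kept, the original sentences cannot be read back from the output, yet frequency tables and tag clouds can still be built from it.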

Editor's Notes

  1. HathiTrust is a non-profit membership organization founded in 2008. It’s based at the University of Michigan, and there are 160 members serving 220 campuses. The mission of HathiTrust is to support teaching, learning, and scholarship by operating a large-scale digital library.
  2. HathiTrust is unlike other similar organizations in that the library it operates is not a subscription database. Its mission also differs from that of the Internet Archive or Project Gutenberg. It has roots in the Google Books project, but it is an example of libraries banding together to build infrastructure for a shared cause. It is non-commercial and non-governmental, and welcomes input from the member community. HathiTrust balances preservation with access by operating a preservation repository of digitized content and providing access to all that it legally can via the digital library website.
  3. HathiTrust was built primarily through mass digitization, an approach to scanning at scale. Books and book-like objects are scanned shelf by shelf, with little to no decision making about whether to scan individual items. All told, amassing the HathiTrust collection took more than 3 billion page turns by thousands of scanning staff over the last 12 years. You can sometimes find artifacts of that labor in the scans.
  4. The HathiTrust collection contains more than 17 million items digitized at member libraries, which are academic and research libraries, and content is added to it daily. Within such a large body of materials, there are discrete sub-collections, such as US Federal Government documents and all of the items from the Scripps Institution of Oceanography at UC San Diego.
  5. The content in HathiTrust ranges from the very old to the contemporary, with items dating to the 1500s and a concentration in the 20th century. The predominant language in HathiTrust is English, but the chart on the right shows how the share of English in the collection has gone down over time. About 40% of the collection is in the public domain and 60% is in copyright. See https://www.hathitrust.org/statistics_visualizations.
  6. The overall HathiTrust collection mirrors the print collections from which it was scanned – which are primarily large, well-resourced academic libraries. This means that library acquisition and publishing patterns have an impact on the kinds of materials you will find in HathiTrust. For example, while you are likely to find many editions of Pride and Prejudice, you are less likely to find popular fiction, such as romance novels. HathiTrust follows many library conventions in how it describes and displays the material in the repository. This means that there are certain aspects of HathiTrust infrastructure and practices that are very library-like, and occasionally make it challenging to approach the repository as a data source.
  7. Really a mix of services and other things the group does. Not exhaustive, but highlights.
  8. Really a mix of services and other things the group does. Not exhaustive, but highlights.
  9. Really a mix of services and other things the group does. Not exhaustive, but highlights.
  10. “Inside the Creativity Boom,” Samuel Franklin, Brown University: This project will map the increasing use and shifting meanings of the words “creative” and “creativity,” with a particular focus on the twentieth century. A custom “creativity corpus” will be assembled and processed to identify linguistic patterns via a number of text analysis and natural language processing techniques. The project will make use of the functionality developed for HathiTrust + Bookworm.
  11. One flagship project, The Project on the History of Black Writing, led by Maryemma Graham at the University of Kansas. Then through our ACS program we will be recruiting three additional scholars and teams, resulting in at least four of these research models.
  12. They will be completely realized use cases. The worksets will be documented and reusable by other researchers or in the classroom. The workset may not be relevant to a particular researcher or teacher, but the process and methodology could be used as a model for a workset of different content.
  13. https://observablehq.com/d/ef3168fae1f1b358
  14. https://observablehq.com/@jchristie01/playing-with-ef-torchlite-data
  15. The Digital Collections Strategy Working Group is charged with looking at a more directed and intentional approach to building the HathiTrust collection, with an eye to addressing gaps and increasing the diversity of voices in the collection. I think the group would be quite interested in a similar overview of HTRC's work and a conversation about where HTRC researchers are finding gaps in the collection and where some analytics support from HTRC might help the working group move its charge forward.
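
A reusable workset, as described in note 12, can be represented as a list of volume identifiers, and the handle URLs on slides 10 and 11 each embed one: the part after `https://hdl.handle.net/2027/` is the volume ID, and the prefix before its first dot is a namespace for the contributing source. The sketch below extracts those IDs and collects them into a de-duplicated list. The function names are hypothetical, and the code assumes every input is a handle URL of exactly this form; it does not call any HathiTrust or HTRC service.

```python
from urllib.parse import urlparse

HANDLE_HOST = "hdl.handle.net"
HATHITRUST_PREFIX = "/2027/"  # handle prefix seen in the slide URLs


def volume_id_from_handle(url):
    """Return the volume ID embedded in a HathiTrust handle URL."""
    parsed = urlparse(url)
    if parsed.netloc != HANDLE_HOST or not parsed.path.startswith(HATHITRUST_PREFIX):
        raise ValueError(f"not a HathiTrust handle URL: {url}")
    return parsed.path[len(HATHITRUST_PREFIX):]


def split_volume_id(volume_id):
    """Split a volume ID into (namespace, local ID) at the first dot."""
    namespace, _, local_id = volume_id.partition(".")
    return namespace, local_id


def build_workset(urls):
    """Collect volume IDs from handle URLs, dropping repeats but keeping order."""
    return list(dict.fromkeys(volume_id_from_handle(u) for u in urls))


if __name__ == "__main__":
    # URLs copied from slides 10 and 11.
    slide_urls = [
        "https://hdl.handle.net/2027/mdp.49015002400035",
        "https://hdl.handle.net/2027/uc2.ark:/13960/t27941p4f",
        "https://hdl.handle.net/2027/mdp.49015002400035",
    ]
    for vol_id in build_workset(slide_urls):
        print(split_volume_id(vol_id))
```

Keeping the workset as plain identifiers means the same list can be documented, shared, and reused as a model for a workset with different content.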