Scholarly Requirements for Large Scale Text Analysis

A USER NEEDS ASSESSMENT FOR THE HATHITRUST RESEARCH CENTER
Harriett Green, Eleanor Dickson, and Sayan Bhattacharyya
DH 2016, July 15, 2016

What is the HathiTrust Research Center?
• Jointly led by the University of Illinois at Urbana-Champaign and
Indiana University Bloomington
• Facilitates text analysis of HTDL content
• Focus on large-scale, computational research
• Research & Development
• Finding technical solutions
• Building tools and services
• Conducting user studies
http://www.hathitrust.org/htrc

Scholarly Practices with Digital
Collections and Tools
How humanities scholars use digital collections: Brockman et al., 2001;
Palmer and Neumann, 2002; Babeu 2011; Rutner and Schonfeld, 2011; Green
and Courtney, 2015
How humanities scholars use digital tools: Frischer et al., 2006; Warwick
2008; Toms and O’Brien, 2008; Gibbs and Owen, 2012
Tools and resources for textual analysis: ARTFL and Philologic (Argamon et
al., 2009; Horton et al., 2009), MONK (Unsworth, 2011), Wordseer
(Muralidharan and Hearst, 2013), Voyant and TaPOR (Rockwell et al., 2010),
and Lexos (LeBlanc et al., 2013)

Workset Creation for Scholarly Analysis
GOAL: Find out how researchers collect together digital materials and build
textual corpora for research purposes.
Findings (Green et al., 2014; Fenlon et al., 2014):
• The ability to create and manipulate collections as reusable datasets
and research products
• The ability to work at different units of analysis
• Access to highly enriched metadata

HTRC User Requirements Study:
Research Goals
• Learn how researchers use digitized textual corpora, apply relevant
methods and approaches, and seek needed tools
• Develop illustrative use cases of text analysis research that will help
shape the development and expansion of HTRC research services and
training curricula for scholars
• Obtain information that can inform development of text analysis data
providers and research services

HTRC User Requirements Study:
Methods
• Recruited interviewees from 2015 professional conferences and
meetings on digital libraries and digital humanities
• Semi-structured interviews with 15 scholars
• All interviews coded by hand and in ATLAS.ti by HTRC Scholarly
Commons members
• Currently conducting in-depth qualitative content analysis

Preliminary Findings
What are scholars’ needs and practices for conducting textual analysis
with large text archives?
Data Acquisition and Management
Negotiating Results and Findings
Research Collaborations
Teaching and Training

Analysis: Data Acquisition and
Management
“I think the biggest challenge is data, getting good data to work with. I think people
underestimate the problems and difficulties in doing that.”
“My general like philosophical approach to these things is I like to do things small. I
build my corpora. I like to read them myself. I’m a little wary of like big distant
reading approaches, especially with stuff as far away from the present as my stuff.
So I’m still trying to perfect the stuff that I’m currently doing.”
“The bad thing is that you can get a negative result in a way you can’t get a negative
result in other methods...I might get garbage data. I might get stuff that doesn’t
make sense. I might get no findings at all.”

Analysis: Generating and Negotiating
Findings
“I yearn, I think, for workflows where we can actually—I don’t know what this would
look like in an interface particularly, but so that the scholar could actually set their own
tokenization rules. I think that would be really valuable. It would be a way that we could
create less language specific or actually, I should say control the language specificity of
the algorithm. I think that is the real need.”
“I wish more people were archiving their data and their algorithms from the source
code, as you see CS papers that will benchmark results against a dataset that’s no
longer valid, available. Then how do you try to replicate or beat those results? It
becomes impossible to evaluate your own methods against theirs and really slows
down the pace of research, because if one could surpass state of the art, then that’s an
application and [a] step forward.”

Analysis: Research Collaborations
“I’m not worried about publishing venues, I’m not worried about
reproducibility, I’m not worried about statistics. My own knowledge of
that is pretty good. But the collaborative work style is really hard.”
“I do think that at some point in the not too distant future we need to
form a test bed, which would be a subset of the HathiTrust corpus that
meets certain characteristics. So rather than a random sample I could see
one step would be a corpus around which multiple people could work
and do different kinds of machine tasks.”

Analysis: Teaching and Training
“I once imagined teaching a class in which students learn to script and
actually run analyses against data, but I was told, basically, that that class
isn’t a humanities class anymore— that belongs in computer science.”
“There is however, I will say, a demand amongst faculty to learn this stuff.
I’ve been asked to think about teaching a faculty course, or like a short
course to tell faculty members what is out there.”
“Because the technology moves so quickly, smart people will move with it.
There’s no escape from the fact that this is a self-educational problem. So,
the real challenge is the data itself and getting the data to talk.”

Findings: User Personas
Credits: Alex Kinnaman, Peter Organisciak, Eleanor Dickson
Digital Project Librarian
• Wants flexible, transparent tools
• Role: Research support staff
• Challenges: Inaccessible data, matching tool to researcher
Faculty Member
• Wants computational resources
• Role: Experienced researcher
• Challenges: Collaboration, finding texts
Graduate Student
• Wants examples
• Role: New researcher
• Challenges: Understanding stats, choosing areas of interest

Looking Forward
• IMLS-funded “Digging Deeper, Reaching Further: Libraries
Empowering Users to Mine the HathiTrust Digital Library”:
http://teach.htrc.Illinois.edu
• Data Capsule development (WCSA II Mellon grant)
• Revision to HTRC Portal and Workset Builder
• Release of extracted features from in-copyright works

Interested in working with HTRC?
HTRC Announcements:
htrc-announce-l @ list.indiana.edu
HTRC User Group:
htrc-usergroup-l @ list.indiana.edu
Questions?
htrc-help@hathitrust.org
Advanced Collaborative Support
program:
htrc.acs.awards@gmail.com
http://www.hathitrust.org/htrc

Acknowledgements
University of Illinois:
Beth Sandore Namachchivaya
Stephen Downie
Megan Senseney
Peter Organisciak, UX Specialist
Alex Kinnaman, Graduate Assistant
Indiana University:
Angela Courtney
Nicholae Cline
Leanne Mobley
Robert McDonald

Thank you!
Harriett Green
green19@Illinois.edu | @greenharr
Eleanor Dickson
dicksone@Illinois.edu
Sayan Bhattacharyya
sayan@Illinois.edu
