Scholarly Requirements for Large Scale Text Analysis
A USER NEEDS ASSESSMENT FOR THE HATHITRUST RESEARCH CENTER
Harriett Green, Eleanor Dickson, and Sayan Bhattacharyya
DH 2016, July 15, 2016
What is the HathiTrust Research Center?
• Jointly led by the University of Illinois at Urbana-Champaign and
Indiana University Bloomington
• Facilitates text analysis of HathiTrust Digital Library (HTDL) content, with a
focus on large-scale, computational research
• Research & Development
• Finding technical solutions
• Building tools and services
• Conducting user studies
http://www.hathitrust.org/htrc
Scholarly Practices with Digital Collections and Tools
How humanities scholars use digital collections: Brockman et al., 2001;
Palmer and Neumann, 2002; Babeu, 2011; Rutner and Schonfeld, 2011; Green
and Courtney, 2015
How humanities scholars use digital tools: Frischer et al., 2006; Warwick,
2008; Toms and O’Brien, 2008; Gibbs and Owen, 2012
Tools and resources for textual analysis: ARTFL and Philologic (Argamon et
al., 2009; Horton et al., 2009), MONK (Unsworth, 2011), Wordseer
(Muralidharan and Hearst, 2013), Voyant and TaPOR (Rockwell et al., 2010),
and Lexos (LeBlanc et al., 2013)
Workset Creation for Scholarly Analysis
GOAL: Find out how researchers gather digital materials and build
textual corpora for research purposes.
Findings (Green et al. 2014, Fenlon et al. 2014):
• The ability to create and manipulate collections as reusable datasets
and research products
• The ability to work at different units of analysis
• Access to highly enriched metadata
HTRC User Requirements Study: Research Goals
• Learn how researchers use digitized textual corpora, apply relevant
methods and approaches, and seek needed tools
• Develop illustrative use cases of text analysis research that will help
shape the development and expansion of HTRC research services and
training curricula for scholars
• Obtain information that can inform development of text analysis data
providers and research services
HTRC User Requirements Study: Methods
• Recruited interviewees from 2015 professional conferences and
meetings on digital libraries and digital humanities
• Semi-structured interviews with 15 scholars
• All interviews coded by hand and in ATLAS.ti by HTRC Scholarly
Commons members
• Currently conducting in-depth qualitative content analysis
Preliminary Findings
What are scholars’ needs and practices for conducting textual analysis
with large text archives?
Data Acquisition and Management
Negotiating Results and Findings
Research Collaborations
Teaching and Training
Analysis: Data Acquisition and Management
“I think the biggest challenge is data, getting good data to work with. I think people
underestimate the problems and difficulties in doing that.”
“My general like philosophical approach to these things is I like to do things small. I
build my corpora. I like to read them myself. I’m a little weary of like big distant
reading approaches, especially with stuff as far away from the present as my stuff.
So I’m still trying to perfect the stuff that I’m currently doing.”
“The bad thing is that you can get a negative result in a way you can’t get a negative
result in other methods...I might get garbage data. I might get stuff that doesn’t
make sense. I might get no findings at all.”
Analysis: Generating and Negotiating Findings
“I yearn, I think, for workflows where we can actually—I don’t know what this would
look like in an interface particularly, but so that the scholar could actually set their own
tokenization rules. I think that would be really valuable. It would be a way that we could
create less language specific or actually, I should say control the language specificity of
the algorithm. I think that is the real need.”
“I wish more people were archiving their data and their algorithms from the source
code, as you see CS papers that will benchmark results against a dataset that’s no
longer valid, available. Then how do you try to replicate or beat those results? It
becomes impossible to evaluate your own methods against theirs and really slows
down the pace of research, because if one could surpass state of the art, then that’s an
application and [a] step forward.”
Analysis: Research Collaborations
“I’m not worried about publishing venues, I’m not worried about
reproducibility, I’m not worried about statistics. My own knowledge of
that is pretty good. But the collaborative work style is really hard.”
“I do think that at some point in the not too distant future we need to
form a test bed, which would be a subset of the HathiTrust corpus that
meets certain characteristics. So rather than a random sample I could see
one step would be a corpus around which multiple people could work
and do different kinds of machine tasks.”
Analysis: Teaching and Training
“I once imagined teaching a class in which students learn to script and
actually run analyses against data, but I was told, basically, that that class
isn’t a humanities class anymore— that belongs in computer science.”
“There is however, I will say, a demand amongst faculty to learn this stuff.
I’ve been asked to think about teaching a faculty course, or like a short
course to tell faculty members what is out there.”
“Because the technology moves so quickly, smart people will move with it.
There’s no escape from the fact that this is a self-educational problem. So,
the real challenge is the data itself and getting the data to talk.”
Findings: User Personas
Credits: Alex Kinnaman, Peter Organisciak, Eleanor Dickson
Digital Project Librarian
• Wants flexible, transparent tools
• Role: Research support staff
• Challenges: Inaccessible data, matching tool to researcher
Faculty Member
• Wants computational resources
• Role: Experienced researcher
• Challenges: Collaboration, finding texts
Graduate Student
• Wants examples
• Role: New researcher
• Challenges: Understanding stats, choosing areas of interest
Looking Forward
• IMLS-funded “Digging Deeper, Reaching Further: Libraries
Empowering Users to Mine the HathiTrust Digital Library”:
http://teach.htrc.Illinois.edu
• Data Capsule development (WCSA II Mellon grant)
• Revision to HTRC Portal and Workset Builder
• Release of extracted features from in-copyright works
Interested in working with HTRC?
HTRC Announcements: htrc-announce-l @ list.indiana.edu
HTRC User Group: htrc-usergroup-l @ list.indiana.edu
Questions? htrc-help@hathitrust.org
Advanced Collaborative Support program: htrc.acs.awards@gmail.com
http://www.hathitrust.org/htrc
Acknowledgements
University of Illinois:
Beth Sandore Namachchivaya
Stephen Downie
Megan Senseney
Peter Organisciak, UX Specialist
Alex Kinnaman, Graduate Assistant
Indiana University:
Angela Courtney
Nicholae Cline
Leanne Mobley
Robert McDonald
Thank you!
Harriett Green
green19@Illinois.edu | @greenharr
Eleanor Dickson
dicksone@Illinois.edu
Sayan Bhattacharyya
sayan@Illinois.edu
