HathiTrust and HTRC: the changing Digital Library
El Colegio de Mexico | 20.May.14
Beth Plale (@bplale)
Professor, School of Informatics and Computing
Director, HathiTrust Research Center
Indiana University
Tweet us - @HathiTrust #HTRC
HathiTrust
• HathiTrust is a consortium of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– Founding members: University of Michigan, Indiana University, University of California, and University of Virginia
HathiTrust (http://www.hathitrust.org) is distinguished from HTRC (http://www.hathitrust.org/htrc)
Content
• Books and journals
– Pilots around images, audio, born-digital
• Digitization sources
– Google (96.8%, 10,162,104)
– Internet Archive (2.9%, 301,972)
– Local (0.3%, 31,840)
Content distribution
[Figure: chart of content distribution; surviving labels: "Not public domain", "outside", "available"]
→ HathiTrust repository is a latent goldmine for text mining analysis, analysis of large-scale corpora through computational tools, and time-based analysis
→ Restricted nature of HT content suggests need for new forms of access that preserve the intimate nature of research investigation while honoring restrictions
→ Paradigm: computation moves to the data (not vice versa)
Mission of HT Research Center
• Research arm of HathiTrust
• Goal: enable researchers worldwide to carry out computational investigation of the HT repository by:
– Developing a model for access: the ‘workset’
– Developing tools that facilitate research by digital humanities and informatics communities
– Developing secure cyberinfrastructure that allows computational investigation of the entire copyrighted and public domain HathiTrust repository
• Established: July 2011
• Collaborative effort of Indiana University and University of Illinois
HTRC system
[Diagram labels: Request; Complexity hiding interface; The complexity; Tabular info; Statistical plots; Spatial plots]
Access to copyrighted materials: HTRC Data Capsule
A secure computing framework that:
• Trusts that researcher will not deliberately leak repository data, but
• Prevents malware acting on user's behalf from leaking data.
Enforces:
• Non-consumptive use: framework provides safe handling of large volumes of protected data
• Openness: framework supports user-contributed analysis tools (that is, does not limit uses to a known set of algorithms)
• Efficiency: framework supports user-contributed analysis tools without resorting to code walkthroughs prior to acceptance
• Large-scale and low cost: protections can be extended to utilization of large-scale national (public) supercomputers
HTRC Secure Capsule Architectural Components
[Diagram components: Researcher; SSH; Secure Capsule cluster; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Registry Services, worksets; Research results]
HTRC Secure Capsule Architecture
[Diagram components: Researcher; SSH; VM Image Manager; VM Image Store; VM Image Builder; VM Manager; VM instance; Registry Services, worksets; Research results]
Annotated steps in the diagram:
• Researcher requests new VM of type X
• Image instance is created
• Researcher installs tools onto VM through window on her desktop
• Upon run, Secure Capsule controls I/O behind the scenes
• Final location of results is registry
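The annotated steps on this slide can be read as a small lifecycle. The Python sketch below is a toy model of that sequence, not HTRC's implementation: the class `CapsuleSketch`, its method names, the `word_count` tool, and the in-memory `registry` dict are all hypothetical, and "controlling I/O" is reduced to writing results only to the registry instead of returning them to the caller.

```python
class CapsuleSketch:
    """Toy model of the slide's steps; every name here is hypothetical."""

    def __init__(self, registry):
        self.registry = registry   # stands in for "Registry Services"
        self.vm = None
        self.tools = []

    def request_vm(self, vm_type):
        # Researcher requests a new VM of type X; an image instance is created.
        self.vm = {"type": vm_type}
        return self.vm

    def install_tool(self, tool):
        # Researcher installs her own analysis tools onto the VM.
        self.tools.append(tool)

    def run(self, workset):
        # Upon run, results are not handed back directly; their final
        # location is the registry.
        if self.vm is None:
            raise RuntimeError("request a VM first")
        results = {tool.__name__: tool(workset) for tool in self.tools}
        self.registry["results"] = results


def word_count(workset):
    """Example user-contributed tool: total tokens across the workset."""
    return sum(len(text.split()) for text in workset)


registry = {}
capsule = CapsuleSketch(registry)
capsule.request_vm("X")
capsule.install_tool(word_count)
capsule.run(["on the origin of species", "animal intelligence"])
```

After the run, the researcher reads `registry["results"]` rather than receiving output from `run` itself, which mirrors the slide's point that the final location of results is the registry.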
EXAMPLES OF RESEARCH CARRIED OUT THROUGH HATHI TRUST RESEARCH CENTER
• Author Gender Identification
• Using Topic Modeling to Locate (down to sentence level) Philosophical Arguments in Science Texts
GENDER IDENTIFICATION OF HTRC AUTHORS BY NAMES
Stacy Kowalczyk, Asst. Professor, Dominican University
Zong Peng, HTRC, Indiana University
Ref talk by Stacy Kowalczyk, http://www.hathitrust.org/htrc_uncamp2013
Gender Identification of Text
• Question investigated: Can we use author names in bibliographic records to identify gender?
• 2.6 million bibliographic records
– Extracted personal author data
– MARC 100 abcd and 700 abcd (personal name fields, subfields a to d)
• 606,437 unique personal author strings
• Bibliographic data is not fielded like patent names
• Relying on standard cataloging practice
– Last name, first name middle name, titles/honorifics, dates
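Because the author strings are not fielded, any gender lookup first has to pull the given name and honorifics out of a single string. The sketch below shows one way this could be done under the cataloging order named on the slide (last name, given names, titles/honorifics, dates). The function name `parse_author_string`, the digit-based test for dates, and the example string are assumptions for illustration; the slides do not describe the actual parsing rules.

```python
import re

def parse_author_string(name):
    """Split a 'Last, First Middle, title, dates' string into parts.

    Heuristic only: a trailing piece containing a digit is treated as
    dates, and any other trailing piece as a title or honorific.
    """
    pieces = [p.strip() for p in name.rstrip(". ").split(",") if p.strip()]
    parsed = {"last": None, "given": [], "titles": [], "dates": None}
    if pieces:
        parsed["last"] = pieces[0]
    if len(pieces) > 1:
        parsed["given"] = pieces[1].split()
    for piece in pieces[2:]:
        if re.search(r"\d", piece):
            parsed["dates"] = piece
        else:
            parsed["titles"].append(piece)
    return parsed

# Made-up heading, not taken from the HathiTrust records.
example = parse_author_string("Example, Alex Jordan, Sir, 1800-1870.")
```

Real headings deviate from this order often enough (initials only, missing dates, corporate names in personal fields) that a production parser would need more cases than this.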
Why interesting to HTRC?
Introduces a new source of metadata, drawn from sources with varying authority
Raises questions:
1) How should community-contributed metadata be distinguished from more authoritative sources?
2) How should variability of quality, even within a single contribution, be conveyed to the community?
Sources of Data
• The Virtual International Authority File
– Hosted by OCLC
• Harvested names from multiple data sources
– Census bureau
– Baby name sites
• EU Patent Research names list (Frietsch et al., 2009; Naldi et al., 2005)
– Developed an extensive list of European names
• Titles and honorifics
– Multiple web resources
– Sir, Baron, Count, Duke, Father, Cardinal, etc.
– Lady, Mrs., Miss, Countess, Duchess, Sister, etc.
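A minimal sketch of how title lists and name lists like these might be combined into a single label follows. The honorifics are the ones listed on the slide; the first-name sets are tiny placeholders, and the precedence (honorific first, then first given name) and the function name `identify_gender` are assumptions, since the slides do not state how the sources were prioritized.

```python
MALE_TITLES = {"sir", "baron", "count", "duke", "father", "cardinal"}
FEMALE_TITLES = {"lady", "mrs", "miss", "countess", "duchess", "sister"}

# Placeholder first-name lists; a real run would load the harvested
# sources. The entries here are illustrative only.
MALE_NAMES = {"john", "charles"}
FEMALE_NAMES = {"mary", "elizabeth"}
AMBIGUOUS_NAMES = {"leslie"}  # hypothetical: appears on both lists

def identify_gender(given_names, titles):
    """Return 'female', 'male', 'ambiguous', or 'unknown'."""
    normalized = {t.lower().rstrip(".") for t in titles}
    has_male = bool(normalized & MALE_TITLES)
    has_female = bool(normalized & FEMALE_TITLES)
    if has_male and has_female:
        return "ambiguous"
    if has_male:
        return "male"
    if has_female:
        return "female"
    # No usable honorific: fall back to the first given name.
    first = given_names[0].lower() if given_names else ""
    if first in AMBIGUOUS_NAMES:
        return "ambiguous"
    if first in MALE_NAMES:
        return "male"
    if first in FEMALE_NAMES:
        return "female"
    return "unknown"
```

For example, `identify_gender(["Leslie"], ["Mrs."])` resolves to a label through the honorific even though the given name alone would be ambiguous.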
Initial Gender Results
• Approximately 80% of name strings have initial gender identification
– Female: 59,365 (10%)
– Male: 425,994 (70%)
– Unknown: 114,204 (19%)
– Ambiguous: 5,965 (less than 1%)
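The percentages can be checked against the counts with a few lines of arithmetic. The sketch assumes the denominator is the 606,437 unique personal author strings reported earlier in the talk; the slide itself does not name the denominator.

```python
counts = {
    "female": 59_365,
    "male": 425_994,
    "unknown": 114_204,
    "ambiguous": 5_965,
}
unique_strings = 606_437  # assumed denominator (unique author strings)

# Share of each label, in percent of all unique strings.
shares = {label: 100 * n / unique_strings for label, n in counts.items()}

# "Approximately 80%" corresponds to strings labeled female or male.
identified = shares["female"] + shares["male"]

# The four counts as printed sum to slightly less than unique_strings;
# the slides do not explain the difference.
total_labeled = sum(counts.values())
```

Under that assumption the rounded shares match the slide's figures, and the female plus male share comes to roughly 80%.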
Results by Data Source
Against the whole set of name strings
• VIAF: 19% hit rate
• Web Names: 54% hit rate
• Patents Names: 8%
Colin Allen, Jamie Murdock
Cognitive Science, Indiana University
Ref talk by Jamie Murdock, http://www.hathitrust.org/htrc_uncamp2013
The InPhO project is instructive because it demonstrates an interaction sequence between a researcher and his/her corpus that is nuanced, multistep, and multi-modal. The HTRC cyberinfrastructure must be able to handle such a nuanced form of interaction between a researcher and their texts.
Digging into philosophy of science
• Establish points of contact between philosophy and science: where philosophical arguments on anthropomorphism appear in science texts
• Use topic modeling to identify the volumes and pages within these volumes that are “rich” in a chosen topic
• Use semi-formal discourse analysis technique to identify key arguments in selected pages to incrementally expose and represent argument structures
The How
• 1315 volumes from HTRC selected using keyword search for ‘darwin’, ‘romanes’, ‘anthropomorphism’, and ‘comparative psychology’
• Set contains lots of uninteresting books: e.g., college course catalogs
• Apply LDA on 86-volume subset
• Using IPython Notebook
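The first selection step, keyword search, can be sketched in a few lines. This is an illustration of plain substring matching, not the HTRC search service: the function `select_volumes` and the toy corpus (ids and texts) are invented, and only the four keywords come from the slide.

```python
KEYWORDS = ("darwin", "romanes", "anthropomorphism", "comparative psychology")

def select_volumes(volumes, keywords=KEYWORDS):
    """Return ids of volumes whose text mentions any keyword (case-insensitive)."""
    selected = []
    for volume_id, text in volumes.items():
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            selected.append(volume_id)
    return selected

# Hypothetical toy corpus; ids and texts are invented for illustration.
toy_volumes = {
    "vol-1": "Romanes on animal intelligence and mental evolution",
    "vol-2": "A treatise on bridge construction",
    "vol-3": "Course catalog: Comparative Psychology 101, Darwin seminar",
}
hits = select_volumes(toy_volumes)
```

The course catalog in `vol-3` matches just as well as the monograph in `vol-1`, which is the kind of uninteresting hit the slide mentions and the reason a further modeling step is applied to a smaller subset.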
LDA topic modeling
• LDA (Latent Dirichlet Allocation) uses a Bayesian updating method to generate a set of “topics”: probability distributions over the set of terms in a corpus
• Number of topics is a parameter in the modeling technique
• Method finds the set of topics that is best able to reproduce the term distributions in documents belonging to the corpus
• Documents may be whole volumes, chapters, articles, single pages, even individual sentences, at the modeler’s choice
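To make the bullets above concrete, here is a small pure-Python LDA fitted by collapsed Gibbs sampling, one common inference method. The slides do not say which inference algorithm or library the project used, so this is a stand-in; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (doc_topic, topic_word): per-document topic proportions and
    per-topic probability distributions over the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    n_dk = [[0] * num_topics for _ in docs]        # document x topic counts
    n_kw = [Counter() for _ in range(num_topics)]  # topic x word counts
    n_k = [0] * num_topics                         # tokens per topic
    z = []                                         # topic of each token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:                              # random initial assignment
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                        # remove current assignment
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k                        # record resampled topic
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        {w: (n_kw[t][w] + beta) / (n_k[t] + V * beta) for w in vocab}
        for t in range(num_topics)
    ]
    return doc_topic, topic_word

# Toy "documents"; in the talk these could be volumes, pages, or sentences.
docs = [
    "animal mind instinct animal reason".split(),
    "instinct reason mind animal".split(),
    "course catalog credit hours course".split(),
    "catalog credit course hours".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
```

`num_topics` is the parameter the slide refers to, each entry of `topic_word` is a probability distribution over terms, and each row of `doc_topic` says how strongly a document draws on each topic. Changing what counts as a document only changes how `docs` is built.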
Drop to sentence level
• Select three books with highest aggregate of 20-40 topic-relevant pages for more precise analysis
• Manually augment argument analysis
– Remodeling of three volumes at sentence level
– Training other methods using human analysis plus sentence similarity
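One possible reading of the selection step is: for each volume, add up the topic weight of its most relevant pages, then keep the three volumes with the largest totals. The sketch below follows that reading. The slides do not spell out the aggregation, so the function `top_volumes`, the `pages_per_volume` cutoff, and the toy weights are assumptions.

```python
import heapq

def top_volumes(page_weights, pages_per_volume=20, num_volumes=3):
    """Rank volumes by the summed topic weight of their most relevant pages.

    page_weights maps volume id -> list of per-page weights for one topic.
    """
    aggregate = {
        volume: sum(heapq.nlargest(pages_per_volume, weights))
        for volume, weights in page_weights.items()
    }
    return heapq.nlargest(num_volumes, aggregate, key=aggregate.get)

# Invented per-page weights for four volumes.
toy = {
    "A": [0.9, 0.8, 0.1],
    "B": [0.2, 0.2, 0.2],
    "C": [0.5, 0.5, 0.5],
    "D": [0.7, 0.0, 0.0],
}
best = top_volumes(toy, pages_per_volume=2)
```

With a cutoff of two pages per volume, the toy data ranks A, then C, then D, and B is dropped. The selected volumes would then be remodeled with sentences as the documents.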
Scholarly Commons User Support Service
• Develop training materials
• Educational workshops
• Tool and workset creation
• Collaborate with librarians and DH centers at HT institutions
• Assist researchers in HTRC text data mining research projects
• Based at University of Illinois Library
Scholarly Commons User Support
• Gives HT institutions exclusive access to training and learning materials that help them establish programs that integrate HTRC tools and services into their scholarly commons programs in libraries and digital humanities centers.
• Physically located in the University of Illinois Library’s Scholarly Commons.
• Supported by several Library staff and faculty. Key among these is the Digital Humanities Research Specialist, who will assist with the development of training and outreach initiatives in support of researchers working with the HathiTrust Research Center and of HathiTrust digital library affiliates who seek to start their own HTRC research services.
• Effort involves planning, implementation, and continuous development of training materials, educational workshops, potential tools, and outreach activities in support of the usage of HTRC tools and datasets.