Introduces the HTRC Secure Commons, expanded secure infrastructure and services for text mining of HathiTrust digital data. Presents results comparing n-gram discovery using a Solr full-text index against a single-pass MapReduce-style framework; compute time over 1 million digital volumes is about 1 day on 1,024 cores. Weaknesses of Solr for n-gram identification are explored.
1. HathiTrust Research Center Secure Commons
Beth Plale, Co-Director, HathiTrust Research Center; Professor of Informatics; Director, Data To Insight Center, Indiana University. @bplale
University of Toronto, 25 June 2015
2. HathiTrust is...
• A trusted digital preservation service enabling the broadest possible access worldwide.
• An organization with over 100 research libraries making up its membership.
• A distributed set of services operated by different members (California Digital Library, Illinois, Indiana, Michigan).
• A range of programs enabled by the large-scale collection of digitized materials.
3. Mission
To contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
…building comprehensive collections and infrastructure co-owned and managed by partners.
…infrastructure for digital content of value to scholars and researchers.
…enabling access by users with print disabilities.
…supporting research with the collections.
…stimulating shared collection storage strategies.
5. Preservation with Access
• Preservation
– TRAC-certified
• Discovery
– Bibliographic and full-text search of all materials
• Access and Use
– Full-text search (all users)
– Public domain and open access works (all users)
– Collections and APIs (all users)
– Lawful uses of in-copyright works (members)
6. HathiTrust in April 2015
• 13.3 million total items
– 6.8 million book titles
– 355,000 serial titles
– 612,000 US federal government documents
– 5.03 million items open (public domain & CC licenses)
The collection primarily includes published materials in bound form, digitized from library collections.
10. Mission of the HT Research Center
• Research arm of HathiTrust
• Established: July 2011
• Collaborative center: Indiana University & University of Illinois
• Mission: enable researchers worldwide to accomplish tera-scale text data mining and analysis
• Major efforts to date:
– Build a secure and trusted environment surrounding the sensitive text and image data: the Trust Ring
– Make the data more usable and accessible to researchers
11. Secure Commons "Trust Ring"
• A logical ring within which exist trusted services and computers that protect and provide access to the sensitive (copyrighted) data
• Computation moves to the data, not vice versa
• Computation is carried out in the Trust Ring (spanning IU and UIUC)
12. Secure Commons Services Stack
Raw copyrighted data is held 1) on the file system and in the archive in pairtree form, 2) in chunked form for parallel processing, and 3) in a full-text Solr index.
Layered above the data: knowledge product services; Data Capsule VMs; services and tools for data discovery, extraction, cleaning, mining/analysis, and visualization; public knowledge products (worksets, ontologies, feature sets); private knowledge products (personal worksets); an external data cache; and data management services. The HTRC Portal handles authentication; user communities include DH, CS, NLP, R, …
13. The Trust Ring gains the core of its trustworthiness from the highly secure and heavily managed storage and compute environment at Indiana University.
14. Researcher Interaction
Interaction with HTRC is through one of three options:
1. Services and tools for data extraction, data cleaning, data analysis, and results visualization. Self-service, browser-based.
2. Check out a Data Capsule VM. The researcher checks it out and configures it for their own use (currently for the technology-savvy).
3. Direct engagement with HTRC staff.
HTRC Portal: https://sharc.hathitrust.org/
16. Four-stage analysis framework: Data Extraction → Data Cleaning → Data Analysis → Visualization
Data comes from the HT digital library; the result is stored to a workset. Each task takes input parameters (JSON) and produces task output (JSON); the overall result can be graphs, raw data, structured data, etc. Tasks can be programs written in any language (Python, R, Java, C#, …).
The current solution of SEASR workflows is being deprecated. The new solution is a four-stage framework that lets researchers plug together the tasks they want; new tools in each stage come from the HTRC community, open source, etc. A task sketch follows.
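The slides do not show task code; below is a minimal sketch, assuming a JSON-on-stdin / JSON-on-stdout convention so that stages written in any language can be chained. All names and keys here are hypothetical, not HTRC's actual API.

```python
# Hypothetical sketch of one pluggable stage in a JSON-in/JSON-out
# task framework (names and keys are illustrative assumptions).
import json
import sys

def run_task(params: dict) -> dict:
    """Toy 'data cleaning' stage: normalize a list of text records."""
    records = params.get("records", [])
    cleaned = [r.strip().lower() for r in records]
    return {"records": cleaned, "count": len(cleaned)}

if __name__ == "__main__":
    params = json.load(sys.stdin)             # input parameters (JSON)
    json.dump(run_task(params), sys.stdout)   # task output (JSON)
```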
17. Data Capsule
Foundations of the HT Data Capsule: K. Borders, E. V. Weele, B. Lau, and A. Prakash. Protecting confidential data on personal computers with storage capsules. 18th USENIX Security Symposium, pp. 367–382. USENIX Association, 2009.
18. HathiTrust Data Capsule concept
• Researcher "checks out" a virtual machine (VM)
• The VM runs in the Trust Ring
• The researcher owns their VM through weeks/months of analysis
• Getting stuff into the VM is easy, but there is a controlled and audited process for getting results out of the VM
20. Mode switch protection: maintenance mode
• User traffic from the desktop: allowed
• Arbitrary network download: allowed
• Arbitrary network upload: allowed
During maintenance mode, the researcher installs new software and loads data into the capsule. (Diagram: the Data Capsule alongside the HTRC raw data sources.)
21. Mode switch protection: secure mode
• User traffic from the desktop: allowed
• Arbitrary network download: not allowed
• Arbitrary network upload: not allowed
The researcher switches the capsule to secure mode when ready to run her tools against the HTRC raw data sources.
Results: researcher tools must write results to a special directory; these are reviewed before release. A policy sketch follows.
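A minimal sketch of the mode-switch semantics of slides 20–21, under the assumption that they can be modeled as an allow/deny matrix plus a reviewed results directory; the class, rule, and path names are hypothetical.

```python
# Hypothetical model of the Data Capsule mode-switch policy; the
# allow/deny matrix follows slides 20-21, everything else is assumed.
from enum import Enum

class Mode(Enum):
    MAINTENANCE = "maintenance"  # install software, load data
    SECURE = "secure"            # run tools against HTRC data

POLICY = {
    #                      desktop traffic   arbitrary download  arbitrary upload
    Mode.MAINTENANCE: dict(desktop=True,     download=True,      upload=True),
    Mode.SECURE:      dict(desktop=True,     download=False,     upload=False),
}

RESULTS_DIR = "/home/capsule/results/"  # hypothetical reviewed-release directory

def may_write_out(path: str, mode: Mode) -> bool:
    """In secure mode, results leave only via the reviewed results directory."""
    return mode is Mode.SECURE and path.startswith(RESULTS_DIR)
```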
22. Threat Model
• The user is trustworthy.
• The virtual machine (VM) manager and the host it runs on are also trusted.
• The VM is NOT trusted. We assume the possibility of malware being installed, as well as other remotely initiated attacks on the VM, which are undetectable to the user.
23. HTRC Data Capsules
See the Data Capsule Tutorial for step-by-step instructions. Go to the wiki, https://wiki.htrc.illinois.edu, and navigate to Community > HTRC Data Capsule > HTRC Data Capsule Tutorial.
25. HTRC Advanced Collaborative Support
Awards of HTRC developer time. First-round awards:
• Detecting Literary Plagiarisms: The Case of Oliver Goldsmith
• Taxonomizing the Texts: Towards Cultural-Scale Models of Full Text
• The Trace of Theory
• Tracking technology diffusion through time using the HT corpus
Coming: a call for second-round proposals. See http://hathitrust.org/htrc for details, or contact Dr. Miao Chen, miaochen@indiana.edu.
26. Advanced Collaborative Support
• Pairs HT institution researchers with expert staff for an extended period, during which they work together to address a particularly vexing issue (e.g., efficient parallelization and optimization of a machine learning algorithm)
• 20 hours/week available. Example: at any one time, 4 active projects, each receiving 5 hours a week for up to 2 months.
• Resourced at 1.25 FTE
• Staffed by HTRC staff who have signed the staff agreement
[Staffing chart (partially legible): HTRC Advisory Board; Advanced Collaborative Support (coordinated by M. Chen); Research Programmer (.5 FTE); Computational Research Liaison (.5 FTE); Asst Dir Outreach & Education (M. Chen, 1 year at .25 FTE); Scholarly Commons Digital Humanities Specialist (1.0 FTE); CLIR Postdoctoral Research Associate (2 years at 1.0 FTE); Digital Research Librarian support (.2 FTE); Scholars Commons Support (.5 FTE); LIS MS students; UI Managing Director (.11 FTE); systems administrator and graduate students.]
27. Scholarly Commons User Support Services
• Develop training materials
• Educational workshops
• Tool and workset support
• Collaborate with librarians and DH centers at HT institutions
• Assist researchers in HTRC text data mining research projects
• Collaboration: University Libraries, Illinois and Indiana
29. Worksets
The ability to slice through a massive corpus constructed from many different library collections, and out of that to construct the precise workset required for a particular scholarly investigation, is an example of the "game changing" potential of the HathiTrust...
30. Dimensions of Workset Creation (Illustrative)
My workset should contain (inspired by the 2012 UnCamp):
• Volumes pertaining to Japan / in Japanese
• All volumes relevant to the study of Francis Bacon
• Music scores or notation extracted from HT volumes
• Images of Victorian England extracted from HT volumes
• Volumes in HT similar to TCP-ECCO novels
• 19th-c. English-language novels by female authors
• A representative sample (by pub date & genre) of French-language items in HT
31. What is a Workset? #1
• A workset is an aggregation of materials brought together for the purpose of discovery and analysis.
32. What is a Workset? #2
• Worksets are conceptual and must be expressible in a variety of ways
• Need to facilitate the inclusion of resources beyond HathiTrust
• Need to facilitate the inclusion of resources at many different levels of granularity beyond the book
33. What is a Workset? #3
• Worksets encapsulate the specific materials that underwent analysis
• Need to capture provenance information
• Possibly also record analysis parameters
34. What is a Workset? #4
• Worksets should be able to spawn descendants but be otherwise immutable
36. Draft Workset Data Model v0.2
The slide diagrams an RDF graph along these lines:
• :_workset1 — rdf:type htrc:Collection; dc:title "Agrippa"^^xsd:string; dcterms:extent 9^^xsd:integer; dcterms:created "2013-11-11T15:55:48-5:00Z"^^xsd:dateTime; dc:creator :_curator1; dcterms:abstract :_desc1
• :_desc1 — rdf:type cnt:ContentAsText; cnt:content "Agrippa and Mexia"^^xsd:string
• :_curator1 — rdf:type foaf:Agent; foaf:accountName "rkfritz"^^xsd:string
• The volume dul1.ark:/13960/t77s8cw40 — rdf:type htrc:BibliographicResource; htrc:isGatheredInto :_workset1; described by the htrc:BibliographicRecord at http://catalog.hathitrust.org/Record/010944168
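Rendered below as a hedged rdflib sketch. The htrc: namespace URI, the volume URI form, and the direction of some edges are assumptions, and the slide's ill-formed timestamp is normalized to a valid xsd:dateTime.

```python
# Hypothetical rdflib rendering of the slide's workset graph.
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, DCTERMS, FOAF, RDF, XSD

HTRC = Namespace("http://example.org/htrc#")        # assumed namespace URI
CNT = Namespace("http://www.w3.org/2011/content#")

g = Graph()
workset, desc, curator = BNode("workset1"), BNode("desc1"), BNode("curator1")
volume = URIRef("urn:htrc:volume:dul1.ark%3A%2F13960%2Ft77s8cw40")  # assumed form
record = URIRef("http://catalog.hathitrust.org/Record/010944168")

g.add((workset, RDF.type, HTRC.Collection))
g.add((workset, DC.title, Literal("Agrippa", datatype=XSD.string)))
g.add((workset, DCTERMS.extent, Literal(9)))
g.add((workset, DCTERMS.created,                    # normalized timestamp
       Literal("2013-11-11T15:55:48-05:00", datatype=XSD.dateTime)))
g.add((workset, DC.creator, curator))
g.add((workset, DCTERMS.abstract, desc))
g.add((desc, RDF.type, CNT.ContentAsText))
g.add((desc, CNT.content, Literal("Agrippa and Mexia", datatype=XSD.string)))
g.add((curator, RDF.type, FOAF.Agent))
g.add((curator, FOAF.accountName, Literal("rkfritz", datatype=XSD.string)))
g.add((volume, RDF.type, HTRC.BibliographicResource))
g.add((volume, HTRC.isGatheredInto, workset))
g.add((record, RDF.type, HTRC.BibliographicRecord))

print(g.serialize(format="turtle"))
```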
37. Page-Level Statistics Extraction over the HathiTrust Corpus for Tech Terms
Acknowledgements: collaboration with Michelle Alexopoulos, University of Toronto. Extraction and analysis by Guangchen Ruan, CS PhD student at Indiana University.
University of Toronto, 25 June 2015
38. Motivation and Problem
• Given a list of terms (n-grams), extract page-level statistics for each term. For instance, seek the frequency of appearance of the term "diesel engine" at the volume level and the page level: in which volumes, and on which pages, with a frequency count per page.
• We undertook to compare the accuracy of two approaches: one that extracts terms from a Solr index, and another that extracts terms using a single-pass processing framework we developed to work directly on the raw data.
40. Approach one: page-level index using Solr
§ Build a page-level index from the raw text; obtain page-level stats through Solr queries.
§ The computation and time cost of building a page-level index for each tech term is high, so a single page-level index is built for each group of words with similar semantics,
§ e.g., "diesel engine", "diesel motor", "diesel powered engine".
A query sketch follows.
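For concreteness, a minimal sketch of a phrase query against a page-level Solr index through the standard /select API. The core name and field names (ocr_text, volume_id, page_seq) are assumptions, not HTRC's actual schema.

```python
# Hypothetical phrase query against a page-level Solr core.
import requests

SOLR_SELECT = "http://localhost:8983/solr/pages/select"  # assumed core

def pages_matching(phrase: str, rows: int = 100) -> list[dict]:
    params = {
        "q": f'ocr_text:"{phrase}"',   # phrase query over Lucene-analyzed text
        "fl": "volume_id,page_seq",
        "rows": rows,
        "wt": "json",
    }
    r = requests.get(SOLR_SELECT, params=params, timeout=30)
    r.raise_for_status()
    return r.json()["response"]["docs"]

# e.g. pages_matching("diesel engine") -> [{"volume_id": ..., "page_seq": ...}, ...]
```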
41. Approach two: single-pass processing on a distributed computing framework
§ For each volume, directly scan the content of each page and check for matches using regular expressions.
§ Divide the volumes and computation across multiple machines for speed.
§ Computation cost is not sensitive to the number of tech terms being searched, so the approach can provide page-level stats for each individual tech term rather than one per group.
A per-volume sketch follows.
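A minimal single-machine sketch of the per-volume scan; the distribution of volumes across machines is elided, and the data layout and names are hypothetical.

```python
# Hypothetical per-volume scan: regex-match each term on every page
# and emit per-page counts (term -> {page number: count}).
import re
from collections import Counter

def volume_page_stats(pages: dict[int, str],
                      terms: list[str]) -> dict[str, Counter]:
    """pages maps page number -> raw OCR text of one volume."""
    patterns = {
        t: re.compile(r"\b" + r"\s+".join(map(re.escape, t.split())) + r"\b",
                      re.IGNORECASE)
        for t in terms
    }
    stats = {t: Counter() for t in terms}
    for pageno, text in pages.items():
        for term, pat in patterns.items():
            n = len(pat.findall(text))
            if n:
                stats[term][pageno] = n
    return stats
```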
42. Quality evaluation: Solr-based vs. single-pass processing
• We compare the results of the two approaches over 8 tech term groups, or equivalently 57 tech terms.
• Overall, the results of the two approaches are over 95% consistent.
• For the inconsistent portion, we manually inspect the raw text content to verify the ground truth.
• The evaluation shows that the single-pass processing approach is more accurate (fewer false positives and false negatives) than the Solr approach.
The tables on slides 45–46 report the disagreement ratios sketched below.
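The tables' four comparison columns are plain set differences between the two approaches' result sets (volume IDs, or (volume, page) records), in code:

```python
# Disagreement ratios as reported on slides 45-46.
# s1 = single-pass results, s2 = Solr results.
def disagreement(s1: set, s2: set) -> tuple[float, float]:
    in_s1_not_s2 = len(s1 - s2) / len(s1)  # found by single-pass, missed by Solr
    in_s2_not_s1 = len(s2 - s1) / len(s2)  # found by Solr, missed by single-pass
    return in_s1_not_s2, in_s2_not_s1
```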
45. Volume-level and page-level comparison
Columns: volume-level disagreement, (vols in s1 but not s2)/(total vols in s1) and (vols in s2 but not s1)/(total vols in s2); then page-level disagreement within the common volume set, (page records in s1 but not s2)/(total in s1) and (page records in s2 but not s1)/(total in s2). s1 = single-pass processing; s2 = Solr-based approach.

Tech term group | s1 not s2 (vols) | s2 not s1 (vols) | s1 not s2 (pages) | s2 not s1 (pages)
diesel engine (6 terms) | 472/19,869 (2.38%) | 98/19,495 (0.50%) | 2,503/84,142 (2.97%) | 1,986/83,625 (2.37%)
gas engine (20 terms) | 1,132/45,321 (2.50%) | 846/45,035 (1.88%) | 7,065/187,735 (3.76%) | 5,831/186,501 (3.12%)
internal-combustion engine (4 terms) | 2,418/21,122 (11.4%) | 80/18,784 (0.43%) | 8,480/75,567 (11.2%) | 3,994/71,081 (5.61%)
steam boat (2 terms) | 4,209/176,652 (2.38%) | 1,166/173,609 (0.64%) | 25,808/794,832 (3.25%) | 33,194/802,218 (4.14%)
46. Volume-level and page-level comparison (continued; same columns as slide 45)

Tech term group | s1 not s2 (vols) | s2 not s1 (vols) | s1 not s2 (pages) | s2 not s1 (pages)
steam engine (12 terms) | 9,259/127,385 (7.27%) | 828/118,954 (0.70%) | 31,475/476,992 (6.60%) | 19,145/464,662 (4.12%)
steam locomotive (4 terms) | 707/13,539 (5.22%) | 169/13,001 (1.30%) | 2,880/36,294 (7.93%) | 1,752/35,166 (4.98%)
steam ship (2 terms) | 1,876/134,220 (1.39%) | 920/133,264 (0.69%) | 12,024/573,453 (2.09%) | 15,141/576,570 (2.63%)
telegraph (7 terms) | 67,293/361,483 (18.6%) | 375/294,565 (0.13%) | 787,481/2,140,284 (36.7%) | 164,601/1,517,404 (10.8%)
47. Analysis of Solr false positives/negatives
• False positives
§ Example one, a false positive match for "diesel engine": "17 Engines and Turbines (Excludes aircraft and rocket engines; automotive engines, except diesel; engine generator sets; and locomotives.)"
§ Example two, a false positive match for "diesel motor": "Fossil fuel consumption (gasoline, diesel, motor oil) would decrease as a result of this alternative."
§ Cause analysis: Solr builds its page-level index with Lucene tokenization, which removes non-word characters, so "diesel; engine" and "diesel, motor" index as if they were the phrases. Single-pass processing handles such cases correctly because it matches regular expressions against the raw text.
48. Cause analysis of Solr's false positives/negatives (cont.)
• False negatives
§ Example one, a false negative match for "diesel engine": "Steam boilers and equipment, steam and gas turbines, nuclear reactors, steam engines, diesel en-gines, and other prime movers." Lucene tokenization splits the end-of-line hyphenation "en-gines" into "en" and "gines", which leads to the false negative. Single-pass processing handles the word-continuation case by concatenating "en-gines" into "engines" before matching; a sketch follows.
§ Example two, a false negative match for "diesel engine": "The introduction of commercial-model diesel engines, in a relatively small quantity of trucks." Solr failed to detect this straightforward case; we do not know the reason, and it needs further investigation.
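A minimal sketch of the word-continuation repair described above. The exact rule HTRC applies is not given in the slides, so this regex is an illustrative assumption.

```python
# Hypothetical end-of-line de-hyphenation applied before matching,
# so "en-\ngines" becomes "engines".
import re

DEHYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")

def dehyphenate(text: str) -> str:
    return DEHYPHEN.sub(r"\1\2", text)

assert dehyphenate("diesel en-\ngines, and other prime movers") \
       == "diesel engines, and other prime movers"
```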
49. False negatives caused by OCR errors
• Example one: "Burdick, R. H. Performance of diesel.engine plants in Texas."
• Example two: "from gasoline-powered to fuel-efficient diesel-_powered engines"
• The one-pass processing approach failed to detect the terms in such cases, because the tokens were mangled by OCR errors.
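One possible mitigation, not described in the slides: tolerate a short run of OCR-damaged separators between the words of a term, so "diesel.engine" and "diesel-_powered" still match. An illustrative assumption only.

```python
# Hypothetical OCR-tolerant term pattern: allow ., -, _, spaces, or
# line breaks (up to 3 characters) between the words of a term.
import re

def fuzzy_pattern(term: str) -> re.Pattern:
    sep = r"[\s.\-_]{1,3}"
    return re.compile(sep.join(map(re.escape, term.split())), re.IGNORECASE)

assert fuzzy_pattern("diesel engine").search("Performance of diesel.engine plants")
```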
50. Secure Commons Services Stack (revisited)
The services-stack diagram from slide 12: raw copyrighted data held 1) on the file system and in the archive in pairtree form, 2) in chunked form for parallel processing, and 3) in a full-text Solr index; knowledge product services; Data Capsule VMs; services and tools for data extraction, data cleaning, data analysis, and results visualization; public knowledge products (worksets, ontologies, feature sets); private knowledge products (personal worksets); an external data cache; data management services; and the Portal (for authentication). User communities: DH, CS, NLP, R, …