The session will provide an overview of the HathiTrust Research Center including its mission and current status. It will also include a demonstration of current HTRC phase one technology and services. Additionally, the speakers will address the HTRC's role in supporting humanities research at scale.
The HathiTrust Research Center (HTRC): An Overview and Demo
1. The HathiTrust Research Center (HTRC): An Overview and Demo
IU Librarians’ Day | IUPUI Libraries | 06.07.13
Robert H. McDonald - @mcdonald - IU Libraries
Yiming Sun – IU Data to Insight Center
Miao Chen – IU Data to Insight Center
Tweet US - @HathiTrust #HTRC
2. Speaker Deck Slides
06.07.13 IU Librarians’ Day #HTRC @HathiTrust
http://bit.ly/13gWD7C
3. HTRC Mission
• Public research arm of the HathiTrust
• Help researchers world-wide to accomplish tera-scale text data-mining and analysis
  – Develop cutting-edge software tools for processing, analyzing text
  – Develop cyberinfrastructure to enable HPC access to the HathiTrust Digital Library
• Established: July 2011
• Collaborative center: Indiana University & University of Illinois
4. HTRC Governance
• Reports to the HathiTrust Board of Governors
• HTRC Executive Committee
  – J. Stephen Downie (Co-director), Professor and Associate Dean for Research, University of Illinois GSLIS
  – Beth Plale (Co-director and Chair), Director, Data To Insight Center, and professor in the School of Informatics and Computing at Indiana University
  – Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana University
  – Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the University of Illinois
  – John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis University
• HTRC Advisory Board (See members next slide)
• Google Public Domain agreement – in place for IU and UIUC
5. HTRC Advisory Board
• Cathy Blake, University of Illinois, Urbana-Champaign
• Beth Cate, Indiana University
• Greg Crane, Tufts University
• Laine Farley, California Digital Library
• Brian Geiger, University of California at Riverside
• David Greenbaum, University of California at Berkeley
• Fotis Jannidis, University of Würzburg, Germany
• Matthew Jockers, Stanford University
• Jim Neal, Columbia University
• Bill Newman, Indiana University
• Bethany Nowviskie, University of Virginia
• Andrey Rzhetsky, University of Chicago
• Pat Steele, University of Maryland
• Craig Stewart, Indiana University
• David Theo Goldberg, University of California at Irvine
• John Towns, National Center for Supercomputing Applications
• Madelyn Wessel, University of Virginia
6. HTRC Timeline
• Phase I: 18-month development cycle
  – Began 01 July 2011
  – Demo of capability September 2012 (14 mo mark) at HTRC UnCamp I
• Phase II: broad availability of resource, begins 31 March 2013
  – New HTRC Asst. Director for Education and Outreach (Miao Chen)
  – New listserv to drive user input: htrc-usergroup-l@list.indiana.edu
7. HTRC Next Steps
• Phase 2 availability of resource 31 March 2013
• Thanks to:
Photos from HTRC UnCamp 9.10.12 at Indiana University
8. HTRC Phase 2: Current Thrusts
• Grow HTRC User-base
  – Outreach and Engagement
    • Input from HTRC Advisory Board
    • Input from HT BOG
  – Town Hall Groups at DH, JCDL, JADH
  – Online Town Hall Groups
• Develop New Specifications from User-Based Agile Development Methodology
• Develop and Integrate Sloan Cloud Components into the HTRC Infrastructure
9. HTRC Architecture Overview
[Architecture diagram; component labels as they appear on the slide]
• Data API access interface
• Portal
• Security (OAuth2, WSO2 IS)
• Algorithms and Worksets Registry (WSO2 GR)
• Application submission
• Audit
• Cassandra cluster volume store
• Solr index
• Entity Extraction
• Topic Modeling
• Sentence Tokenizer
• Word position
• Latent semantic analysis
• High level apps
• Compute resources
• Storage resources
• Blacklight
10. HTRC Non-Consumptive Research Access (Sloan Cloud)
[Diagram; component labels as they appear on the slide]
• VM Request
• VM Image Manager
• VM Image Store
• VM Image Builder
• VM Manager
• VM instance
• Sloan Cloud
• SSH
• Non-consumptive Output Storage
• user
11. HTRC Demonstration
• Yiming Sun – Lead Technical Architect, HTRC
12. Metadata Enhancement
• Current metadata fields are MARC-based
  – E.g. publication date, authors, title, subject
• MARC fields are fundamental
• Needed more fields of users’ interest for granular analytics (Metadata Enhancement)
• Solicit user requirements and prioritize for implementation
  – Mainly digital humanities uses now
13. Top Metadata Enhancement Items
• 1st round user requirement collection, top 3 items were metadata related:
  – Word frequency count and document length for a volume
  – Metadata de-duplication
  – Author Gender Analysis
• These top 3 items are in process for functionality within the current production system and will be available in the next quarter.
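The first requested item, word frequency count and document length for a volume, can be illustrated with a minimal sketch. It is not HTRC code: the function name `volume_stats`, the representation of a volume as a list of page-text strings, and the simple regular-expression tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def volume_stats(pages):
    """Return (word_frequencies, document_length) for one volume.

    `pages` is assumed to be a list of page-text strings; how HTRC
    actually stores volume text is not specified on the slides.
    """
    tokens = []
    for page in pages:
        # Lowercase, then keep runs of letters (with an optional
        # internal apostrophe) as words. Hypothetical tokenizer.
        tokens.extend(re.findall(r"[a-z]+(?:'[a-z]+)?", page.lower()))
    # Document length here means the total number of word tokens.
    return Counter(tokens), len(tokens)


pages = ["The whale, the whale!", "Call me Ishmael."]
freqs, length = volume_stats(pages)
print(freqs.most_common(2), length)
```

Counting tokens rather than characters or pages is one possible definition of document length; a production system might expose several.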
14. Other Metadata Enhancement Items
• Stats analysis: tf-idf
• Readability score
• Language
• Topic modeling (e.g. LDA probability)
• Genre
• Era of compilation
• Book length (e.g. short or long)
• Concordance index (indexing with context)
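The concordance index item (indexing with context) can be pictured as a keyword-in-context listing. The sketch below is a hypothetical illustration only, not the HTRC implementation: the function name `concordance`, the `width` parameter, and the `\w+` tokenizer are assumptions.

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    `width` is the number of words of context kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


sample = "The library holds the book. Every book in the library is scanned."
for left, kw, right in concordance(sample, "book", width=2):
    print(f"{left:>20} [{kw}] {right}")
```

A real index would precompute token positions per volume instead of rescanning the text on every query.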
15. HTRC Upcoming Events
• DH 2013 – July 16-19, 2013
• JCDL 2013 – July 22-26, 2013
• HathiTrust Research Center UnCamp – Sept 8-9, 2013 – University of Illinois
• Catapult Symposium – IUB – Sept 2013
• JADH 2013 – September 19-21, 2013
• Ohio State University – Library Symposium – October 2013
• Educause 2013 – October 15-18, 2013
16. Thank You
• This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Robert H. McDonald, Beth Sandore, Yiming Sun, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others…
• The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation
• IU D2I-PTI is graciously funded by The Lilly Endowment, Inc.
• HTRC - http://www.hathitrust.org/htrc
• IU D2I Center - http://d2i.indiana.edu/
• UIUC GSLIS - http://www.lis.illinois.edu/
17. Contact Information
• General
  – Robert H. McDonald, HTRC Executive Committee
  – rhmcdona@indiana.edu
• Technical
  – Yiming Sun, Chief Architect, yimsun@indiana.edu
• Requests for capability, interest
  – Miao Chen, HTRC Asst. Director of Education and Outreach, miaochen@indiana.edu