The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework

Robert H. McDonald
Robert H. McDonaldAssociate Dean for Library Technologies | Deputy Director Data to Insight Center
The HathiTrust Research Center: 
Big Data Analytics in a Secure 
pti.iu.edu/sc14 
Data Framework 
@hathitrust #SC14 
Beth Plale | @bplale 
Director Data to Insight Center | Indiana University 
Robert H. McDonald | @mcdonald 
Deputy Director Data to Insight Center | Indiana University
pti.iu.edu/sc14 
@hathitrust #SC14 
Outline 
• What is the HTRC? 
• Non-Consumptive Research Paradigm 
• Current Architecture 
• Future Architecture 
• Advanced Collaborative Support (RFP) 
• HTRC Science on a Sphere 
• HTRC @ Events
pti.iu.edu/sc14 
@hathitrust #SC14 
HathiTrust Digital Library 
• HathiTrust is a partnership of 
90+ academic & research 
institutions, offering a collection 
of millions of digitized titles. 
• http://hathitrust.org 
– IU is a founding member of the 
HathiTrust along with University of 
Michigan, University of California, 
and the University of Virginia
@hathitrust #SC14 
HathiTrust Research Center 
Mission 
• Public research arm of HathiTrust 
• Goal: enable researchers world-wide to accomplish tera-scale 
pti.iu.edu/sc14 
text data-mining and analysis 
– Develop cutting-edge software tools for processing, analyzing 
text 
– Develop cyberinfrastructure to enable HPC access to the 
HathiTrust Digital Library 
• Established: July, 2011 
• Collaborative center: Indiana University & University of 
Illinois
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Timeline 
• Phase I: development 01 Jul 2011 – 31 Mar 2013 
– HTRC software and services release v1.0 
https://github.com/htrc 
• Phase II: outreach, 01 Apr 2013 – 30 June 2014 
– 2nd HTRC UnCamp Sep ’13 
• Phase III: operations, 01 July 2014 - present
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Current Users 
Projected Use 2019 
Digital 
Humanities 
(60) 
Education 
(60) 
Informatics 
(60) 
Observers 
(20) 
194 existing user accounts 
Lots of user accounts; good 
starting point. 
Improve : 
• Increase amount of real work 
being accomplished as 
measured by usage on HTRC’s 
compute resources Quarry and 
Big Red II at IU 
• Develop educational uses 
• Develop informatics uses 
• Decrease number of observers 
to 10% 
 Project 200 users at any one time 
of which 90% are doing relevant 
education/scholarship 
6
pti.iu.edu/sc14 
@hathitrust #SC14 
Non-Consumptive Research 
Paradigm 
• No action or set of actions on part of users, 
either acting alone or in cooperation with other 
users over duration of one or multiple sessions 
can result in sufficient information gathered from 
collection of copyrighted works to reassemble 
pages from collection. 
• Definition disallows collusion between users, or 
accumulation of material over time. 
Differentiates human researcher from proxy 
which is not a user. Users are human beings.
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC 
All the complexity 
Complexity hiding interface 
Request 
Spatial plots 
Statistical plots 
Tabular info
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Version 2.0
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Goals 
• Provide a persistent and sustainable structure to 
enable original and cutting edge research. 
– Leverage data storage and computational infrastructure at Indiana 
& Illinois 
– Stimulate community development of new functionality and tools 
– Use tools to enable discoveries that would not be possible without 
the HTRC 
• Enable scholars to fully utilize content of 
HathiTrust Library while preventing intellectual 
property misuse within U.S. copyright law. 
– Provision secure computational and data environment for scholars 
to perform research using HathiTrust Digital Library.
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Organization 
2014-18 
HTRC Executive 
Mgmt 
Administrative 
Support 
Core 
Development 
Advanced 
Research 
Advanced 
Collaborative 
Support 
Scholarly 
Commons
HTRC Data Capsule 
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Data Capsule@IU 
Team 
• Beth Plale (PI) 
• Jiaan Zeng 
• Guangchen Ruan 
HTRC Data Capsule@Michigan Team 
• Atul Prakash (PI) 
• Alexander Crowell 
Jiaan Zeng, Guangchen Ruan, Alexander Crowell, Atul Prakash, and 
Beth Plale. 2014. Cloud computing data capsules for non-consumptiveuse 
of texts. In Proceedings of the 5th ACM workshop 
on Scientific cloud computing (ScienceCloud '14). ACM, New York, 
NY, USA, 9-16. DOI=10.1145/2608029.2608031 
http://doi.acm.org/10.1145/2608029.2608031 
Special Thanks to 
• Samitha Liyanage 
• Milinda Pathirage 
• Zong Peng 
• Earlence Fernandes 
• Ajit Aluri
User Authentication 
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Data Capsule 
VM-1 … 
Host-1 
Web UI 
Web Services 
Hypervisor Scripts 
… 
Database 
Firewall 
Audit 
Image Store 
Volume Store 
VM-k 
VM-1 … VM-k 
Host-N 
Web front end Web service Backend
@hathitrust #SC14 
HTRC Data Capsule Workflow 
pti.iu.edu/sc14
@hathitrust #SC14 
Data Capsule Screenshots 
pti.iu.edu/sc14 
Maintenance Mode 
Secure Mode
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC Science on a Sphere #SC14 
1. Texts published per 
country 
2. HathiTrust Member 
Institutions 
3. HT Google analytics
@hathitrust #SC14 
HTRC Advanced Collaborative Support 
• ACS will be offered on a rolling basis over next 
pti.iu.edu/sc14 
four years 2014-18 
• 1st RFP Call Deadline is Jan 8, 2015 5:00pm 
eastern 
– RFP - http://www.hathitrust.org/htrc/acs-rfp 
• For more info on the Advanced Collaborative 
Support please contact: 
htrc.acs.awards@gmail.com
pti.iu.edu/sc14 
@hathitrust #SC14 
HTRC@Events 
• DHCS 2014, Oct 22, 2014 
Evanston, IL 
• SC14 – IU Booth, Nov 17-19, 
2014, New Orleans, LA 
• CLIR/CNI Workshop on 
Expanded Access to 
Collections, Dec. 7, 2014, 
Washington, DC 
• HTRC UnCamp 2015 – March 
30-31, 2015 Ann Arbor, MI
pti.iu.edu/sc14 
@hathitrust #SC14 
Thank You 
HTRC IU Team 
• Beth Plale (PI) 
• Robert H. McDonald 
• Miao Chen 
• Guangchen Ruan 
• Zong Peng 
• Milinda Pathirage 
• Samitha Liyanage 
• Leena Unnikrishnan 
• Nicholae Cline 
HTRC UIUC Team 
• J. Stephen Downie (PI) 
• Beth Namachchivaya 
• Megan Senseney 
• Sayan Bhattacharyya 
• Colleen Fallaw 
• Loretta Auvil 
• Boris Capitanu 
• Harriet Green
@hathitrust #SC14 
More Information on HTRC 
• For details http://www.hathitrust.org/htrc/faq 
• General contact info 
pti.iu.edu/sc14 
– J. Stephen Downie, Co-Director HTRC, 
jdownie@Illinois.edu 
– Beth Plale, Co-Director HTRC, plale@indiana.edu 
• Requests for capability, interest 
– Miao Chen, Asst. Director for Outreach HTRC 
miaochen@indiana.edu
@hathitrust #SC14 
The HathiTrust Research Center: 
Big Data Analytics in a Secure 
pti.iu.edu/sc14 
Data Framework 
For more on HTRC: http://www.hathitrust.org/htrc 
For these slides go to:
1 of 21

More Related Content

What's hot(20)

Building Capacity for Open ScienceBuilding Capacity for Open Science
Building Capacity for Open Science
Kaitlin Thaney902 views
CST4599 July 2020 CST4599 July 2020
CST4599 July 2020
EISLibrarian93 views
Linked Open Data for Digital HumanitiesLinked Open Data for Digital Humanities
Linked Open Data for Digital Humanities
Christophe Guéret1.5K views
Research Data Management at the University of EdinburghResearch Data Management at the University of Edinburgh
Research Data Management at the University of Edinburgh
EDINA, University of Edinburgh1.2K views
Pampel/Bertelnmann/Hobohm: Data LibrarianshipPampel/Bertelnmann/Hobohm: Data Librarianship
Pampel/Bertelnmann/Hobohm: Data Librarianship
Hans-Christoph Hobohm1.1K views
Research 101 for Mid-Career StudentsResearch 101 for Mid-Career Students
Research 101 for Mid-Career Students
Abby Clobridge494 views
co:op-READ-Convention Marburg - Milena Dobrevaco:op-READ-Convention Marburg - Milena Dobreva
co:op-READ-Convention Marburg - Milena Dobreva
ICARUS - International Centre for Archival Research814 views
Cosi   Usage DataCosi   Usage Data
Cosi Usage Data
daveyp577 views

Similar to The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework(20)

JCDL 2015 Tutorial Opening SlidesJCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening Slides
Robert H. McDonald1.6K views
BLC & Digital Science: Mark Hahnel, FigshareBLC & Digital Science: Mark Hahnel, Figshare
BLC & Digital Science: Mark Hahnel, Figshare
Boston Library Consortium, Inc.475 views
African Open Science Platform: Pilot PhaseAfrican Open Science Platform: Pilot Phase
African Open Science Platform: Pilot Phase
Academy of Science of South Africa (ASSAf)314 views
RDM skillsRDM skills
RDM skills
Sarah Jones362 views

More from Robert H. McDonald(20)

TLT Discussion on "Saving My Stuff" - 06.05.15TLT Discussion on "Saving My Stuff" - 06.05.15
TLT Discussion on "Saving My Stuff" - 06.05.15
Robert H. McDonald2.2K views
ER&L 2015 Closing Keynote SlidesER&L 2015 Closing Keynote Slides
ER&L 2015 Closing Keynote Slides
Robert H. McDonald10.2K views
Kuali OLE: Enabling Choices for LibrariesKuali OLE: Enabling Choices for Libraries
Kuali OLE: Enabling Choices for Libraries
Robert H. McDonald1.6K views
SCONUL Kuali OLE BriefingSCONUL Kuali OLE Briefing
SCONUL Kuali OLE Briefing
Robert H. McDonald1.8K views
SEAD Datanet and Sustainability Science SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science
Robert H. McDonald996 views
Kuali OLE @ LITA Forum 2012Kuali OLE @ LITA Forum 2012
Kuali OLE @ LITA Forum 2012
Robert H. McDonald1.2K views
HathiTrust Research Center: The Fast VersionHathiTrust Research Center: The Fast Version
HathiTrust Research Center: The Fast Version
Robert H. McDonald429 views
HTRC Architecture OverviewHTRC Architecture Overview
HTRC Architecture Overview
Robert H. McDonald1.2K views

Recently uploaded(20)

GSoC 2024GSoC 2024
GSoC 2024
DeveloperStudentClub1049 views
Psychology KS4Psychology KS4
Psychology KS4
WestHatch52 views
Azure DevOps Pipeline setup for Mule APIs #36Azure DevOps Pipeline setup for Mule APIs #36
Azure DevOps Pipeline setup for Mule APIs #36
MysoreMuleSoftMeetup75 views
Nico Baumbach IMR Media ComponentNico Baumbach IMR Media Component
Nico Baumbach IMR Media Component
InMediaRes1186 views
Dance KS5 BreakdownDance KS5 Breakdown
Dance KS5 Breakdown
WestHatch52 views
Gopal Chakraborty Memorial Quiz 2.0 Prelims.pptxGopal Chakraborty Memorial Quiz 2.0 Prelims.pptx
Gopal Chakraborty Memorial Quiz 2.0 Prelims.pptx
Debapriya Chakraborty221 views
Education and Diversity.pptxEducation and Diversity.pptx
Education and Diversity.pptx
DrHafizKosar56 views
Psychology KS5Psychology KS5
Psychology KS5
WestHatch53 views
ICANNICANN
ICANN
RajaulKarim2057 views
231112 (WR) v1  ChatGPT OEB 2023.pdf231112 (WR) v1  ChatGPT OEB 2023.pdf
231112 (WR) v1 ChatGPT OEB 2023.pdf
WilfredRubens.com100 views
STERILITY TEST.pptxSTERILITY TEST.pptx
STERILITY TEST.pptx
Anupkumar Sharma102 views
ACTIVITY BOOK key water sports.pptxACTIVITY BOOK key water sports.pptx
ACTIVITY BOOK key water sports.pptx
Mar Caston Palacio132 views
Narration lesson plan.docxNarration lesson plan.docx
Narration lesson plan.docx
TARIQ KHAN90 views
NS3 Unit 2 Life processes of animals.pptxNS3 Unit 2 Life processes of animals.pptx
NS3 Unit 2 Life processes of animals.pptx
manuelaromero201389 views

The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework

  • 1. The HathiTrust Research Center: Big Data Analytics in a Secure pti.iu.edu/sc14 Data Framework @hathitrust #SC14 Beth Plale | @bplale Director Data to Insight Center | Indiana University Robert H. McDonald | @mcdonald Deputy Director Data to Insight Center | Indiana University
  • 2. pti.iu.edu/sc14 @hathitrust #SC14 Outline • What is the HTRC? • Non-Consumptive Research Paradigm • Current Architecture • Future Architecture • Advanced Collaborative Support (RFP) • HTRC Science on a Sphere • HTRC @ Events
  • 3. pti.iu.edu/sc14 @hathitrust #SC14 HathiTrust Digital Library • HathiTrust is a partnership of 90+ academic & research institutions, offering a collection of millions of digitized titles. • http://hathitrust.org – IU is a founding member of the HathiTrust along with University of Michigan, University of California, and the University of Virginia
  • 4. @hathitrust #SC14 HathiTrust Research Center Mission • Public research arm of HathiTrust • Goal: enable researchers world-wide to accomplish tera-scale pti.iu.edu/sc14 text data-mining and analysis – Develop cutting-edge software tools for processing, analyzing text – Develop cyberinfrastructure to enable HPC access to the HathiTrust Digital Library • Established: July, 2011 • Collaborative center: Indiana University & University of Illinois
  • 5. pti.iu.edu/sc14 @hathitrust #SC14 HTRC Timeline • Phase I: development 01 Jul 2011 – 31 Mar 2013 – HTRC software and services release v1.0 https://github.com/htrc • Phase II: outreach, 01 Apr 2013 – 30 June 2014 – 2nd HTRC UnCamp Sep ’13 • Phase III: operations, 01 July 2014 - present
  • 6. pti.iu.edu/sc14 @hathitrust #SC14 HTRC Current Users Projected Use 2019 Digital Humanities (60) Education (60) Informatics (60) Observers (20) 194 existing user accounts Lots of user accounts; good starting point. Improve : • Increase amount of real work being accomplished as measured by usage on HTRC’s compute resources Quarry and Big Red II at IU • Develop educational uses • Develop informatics uses • Decrease number of observers to 10%  Project 200 users at any one time of which 90% are doing relevant education/scholarship 6
  • 7. pti.iu.edu/sc14 @hathitrust #SC14 Non-Consumptive Research Paradigm • No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection. • Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy which is not a user. Users are human beings.
  • 8. pti.iu.edu/sc14 @hathitrust #SC14 HTRC All the complexity Complexity hiding interface Request Spatial plots Statistical plots Tabular info
  • 10. pti.iu.edu/sc14 @hathitrust #SC14 HTRC Goals • Provide a persistent and sustainable structure to enable original and cutting edge research. – Leverage data storage and computational infrastructure at Indiana & Illinois – Stimulate community development of new functionality and tools – Use tools to enable discoveries that would not be possible without the HTRC • Enable scholars to fully utilize content of HathiTrust Library while preventing intellectual property misuse within U.S. copyright law. – Provision secure computational and data environment for scholars to perform research using HathiTrust Digital Library.
  • 11. pti.iu.edu/sc14 @hathitrust #SC14 HTRC Organization 2014-18 HTRC Executive Mgmt Administrative Support Core Development Advanced Research Advanced Collaborative Support Scholarly Commons
  • 12. HTRC Data Capsule pti.iu.edu/sc14 @hathitrust #SC14 HTRC Data Capsule@IU Team • Beth Plale (PI) • Jiaan Zeng • Guangchen Ruan HTRC Data Capsule@Michigan Team • Atul Prakash (PI) • Alexander Crowell Jiaan Zeng, Guangchen Ruan, Alexander Crowell, Atul Prakash, and Beth Plale. 2014. Cloud computing data capsules for non-consumptiveuse of texts. In Proceedings of the 5th ACM workshop on Scientific cloud computing (ScienceCloud '14). ACM, New York, NY, USA, 9-16. DOI=10.1145/2608029.2608031 http://doi.acm.org/10.1145/2608029.2608031 Special Thanks to • Samitha Liyanage • Milinda Pathirage • Zong Peng • Earlence Fernandes • Ajit Aluri
  • 13. User Authentication pti.iu.edu/sc14 @hathitrust #SC14 HTRC Data Capsule VM-1 … Host-1 Web UI Web Services Hypervisor Scripts … Database Firewall Audit Image Store Volume Store VM-k VM-1 … VM-k Host-N Web front end Web service Backend
  • 14. @hathitrust #SC14 HTRC Data Capsule Workflow pti.iu.edu/sc14
  • 15. @hathitrust #SC14 Data Capsule Screenshots pti.iu.edu/sc14 Maintenance Mode Secure Mode
  • 16. pti.iu.edu/sc14 @hathitrust #SC14 HTRC Science on a Sphere #SC14 1. Texts published per country 2. HathiTrust Member Institutions 3. HT Google analytics
  • 17. @hathitrust #SC14 HTRC Advanced Collaborative Support • ACS will be offered on a rolling basis over next pti.iu.edu/sc14 four years 2014-18 • 1st RFP Call Deadline is Jan 8, 2015 5:00pm eastern – RFP - http://www.hathitrust.org/htrc/acs-rfp • For more info on the Advanced Collaborative Support please contact: htrc.acs.awards@gmail.com
  • 18. pti.iu.edu/sc14 @hathitrust #SC14 HTRC@Events • DHCS 2014, Oct 22, 2014 Evanston, IL • SC14 – IU Booth, Nov 17-19, 2014, New Orleans, LA • CLIR/CNI Workshop on Expanded Access to Collections, Dec. 7, 2014, Washington, DC • HTRC UnCamp 2015 – March 30-31, 2015 Ann Arbor, MI
  • 19. pti.iu.edu/sc14 @hathitrust #SC14 Thank You HTRC IU Team • Beth Plale (PI) • Robert H. McDonald • Miao Chen • Guangchen Ruan • Zong Peng • Milinda Pathirage • Samitha Liyanage • Leena Unnikrishnan • Nicholae Cline HTRC UIUC Team • J. Stephen Downie (PI) • Beth Namachchivaya • Megan Senseney • Sayan Bhattacharyya • Colleen Fallaw • Loretta Auvil • Boris Capitanu • Harriet Green
  • 20. @hathitrust #SC14 More Information on HTRC • For details http://www.hathitrust.org/htrc/faq • General contact info pti.iu.edu/sc14 – J. Stephen Downie, Co-Director HTRC, jdownie@Illinois.edu – Beth Plale, Co-Director HTRC, plale@indiana.edu • Requests for capability, interest – Miao Chen, Asst. Director for Outreach HTRC miaochen@indiana.edu
  • 21. @hathitrust #SC14 The HathiTrust Research Center: Big Data Analytics in a Secure pti.iu.edu/sc14 Data Framework For more on HTRC: http://www.hathitrust.org/htrc For these slides go to:

Editor's Notes

  1. HTRC hides complexity of analytics. In this sense, it is like Google search, which is a simple interface that hides complexity to search billions of pages. The kinds of things returned from HTRC interaction are spatial relationship of words (and their frequency obviously), statistical plots of information or tabular information.
  2. Shifting the complexity hiding interface to the right, we open up the cloud to see what’s inside. HTRC at it simplest has 1) algorithms – these are drawn from SEASR and from other analysis tool suites including Mahout and mapreduce, the 2) HT corpus (and subsets of the corpus that users either have personally as part of a workset, or are publically available, and 3) other data sets that are used. HTRC brokers the bringing together of these pieces so that computation can take place on a resource like Big Red II (or XSEDE). Note that there is an arrow from the compute engine to the complexity hiding interface. This is because researcher interaction with the texts isn’t an automated workflow; it is one requiring levels of interaction with the computation as it is running.
  3. Jiaan Zeng, Guangchen Ruan, Alexander Crowell, Atul Prakash, and Beth Plale. 2014. Cloud computing data capsules for non-consumptiveuse of texts. In Proceedings of the 5th ACM workshop on Scientific cloud computing (ScienceCloud '14). ACM, New York, NY, USA, 9-16. DOI=10.1145/2608029.2608031 http://doi.acm.org/10.1145/2608029.2608031
  4. 1.) Texts published per country Data were from the Gender metadata work. It was used because it has volume authors and country of publication information. The total records have 60K volumes, with some country fields missing 2.) HathiTrust Member institutions It maps geolocations of the UnCamp 2013 participants; The text band shows UnCamp 15’ and 13M books available for non-consumptive use soon 3.) HT Google analytics Shows HT webpage use over the time, by aggregating over the quarter A drop around 2013 summer: could possibly be cause by summer break