Building a Public Research Center for the HathiTrust Digital Library


This is a presentation by Robert H. McDonald from the JCDL 2011 panel "Big Data! Big Deal?", moderated by Stephen Downie.

Slide notes:
  • State core team names. Talk about the partnership between IU and UIUC.
  • Basic history of the HathiTrust Digital Library – Digital Public Library of America – LAC

    1. Building a Public Research Center for the HathiTrust Digital Library
       @hathitresearch | @hathitrust
       Robert H. McDonald
       Associate Dean for Library Technologies and Digital Libraries
       Associate Director, Data to Insight Center, Pervasive Technology Institute
       Indiana University
       June 14, 2011
       JCDL 2011: Big Data! Big Deal? Panel
    2. HathiTrust Research Center (HTRC) Team
       Indiana University
         • Beth Plale – Director
         • Robert McDonald – Executive Committee
       University of Illinois
         • Scott Poole – Co-Director
         • John Unsworth – Executive Committee
    3. HathiTrust Digital Library History
       To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
       Launched in October 2008
         • University of Michigan
         • Indiana University
       Used Google Books Repository at Michigan as model
       Expanded to include content from
         • CIC Member Libraries
         • UC System Libraries
         • University of Virginia
       Now includes more than 50 partner institutions and more than 8 million volumes
    4. Towards a HathiTrust Research Center
       Started in response to the proposed Google Settlement – June 2009
         • Specific funding set aside by Google to build a public research center
         • Worked to identify key stakeholders from HT institutions to collaborate and write the RFP
         • Rejection of the Google Settlement in early 2011 did not stop the center
       Developed specific RFP for HathiTrust to solicit proposals – Summer/Fall 2009
       HTRC RFP Working Group
       RFP released – Winter 2010
    5. Our Collaboration
       HTRC is founded as a joint venture between Indiana University and the University of Illinois at Urbana-Champaign, aimed at solving the difficult challenges of increasing computational access to the public domain and copyrighted material in HathiTrust.
    6. Our Mission
       Phase I: starting April 2011 and running for 18 months
       Phase II: starting Fall 2012 and going for …
       Goal: enable strong computational research and education on a collection that has not been amenable to computational exploration EVER before!
    7. Our Goals
       Maintain a repository of text mining algorithms and retrieval tools available online for human and programmatic discovery. Also register derived data sets, indexes, and versions in a registry repository.
       Be a user-driven resource, with an active advisory board and a community model that allows users to share algorithms and tools.
       Support interoperability across collections and institutions through use of InCommon SAML identity.
    8. Our Future
       Support innovation in cyberinfrastructure to deliver optimal access to and use of the HathiTrust corpus.
       Implement "non-consumptive" research: a technical and intellectual challenge.
       Identify and host existing data analysis, text mining, and retrieval tools that are of interest to the community.
       Stimulate development of new analytical methods and tools. We hope that the scale of the HTRC will promote new levels of collaboration in tool development.
    9. HathiTrust Research Center Today
       HTRC is dedicated to the provision of access to a comprehensive body of published works for scholarship and education for computational research purposes.
       Lightweight Organization
       Executive Committee
         • Beth Plale, Indiana
         • Scott Poole, Illinois
         • Robert H. McDonald, Indiana
         • John Unsworth, Illinois
       Advisory Board
         • TBD
       HathiTrust Executive Committee Liaison
         • Laine Farley, California Digital Library
    10. HathiTrust Research Center Today
        $250K in funding for initial 18-month startup
        Creating themed collections for early use cases
          • Astronomy – Victorian Literature – Influenza
        Ingest and replication mechanisms between HT and HTRC
          • Full-text
          • Solr indexes
        Data Capsule integration
        Karma integration
        Integration with SEASR/Meandre SOA services at NCSA
        Alignment with Bamboo Technology Project
        Alignment with international Google Books Research Centers
        Establishing long-term non-consumptive research methodologies
    11. HTRC Proposed Technical Architecture
        [Architecture diagram] Courtesy IU Data to Insight Center – Beth Plale/Yiming Sun
    12. Current SEASR Integration Demo
        [Diagram] Courtesy IU Data to Insight Center – Felix Terkhorn/Yiming Sun
        Tag Cloud Viewer Data Flow
          1. User enters author name or volume title
          2. Query RIS for author name or volume title
          3. Volume ID
          4. Invoke Tag Cloud service with URL
          5. Use URL to retrieve volume
          6. OCR for volume
          7. Tag Cloud returned to user
        Diagram labels
          • Book Search Interface by Author or Title
          • JS/PHP Auto-completer
          • Sample Collection Bibliography Database
          • Converted from MARC to RIS
          • Public-domain OCR Web Access Servlet
          • A persistent RESTful Web Service
          • Sample Public Domain Collection
          • Organized as pairtree for demo only
          • SEASR Infrastructure
          • Meandre Workbench
          • Administrator creates tag cloud viewer in advance through SEASR
    13. Non-Consumptive Research Track
        No action or set of actions on the part of HathiTrust Research Center users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from the HathiTrust collection to reassemble pages from the collection.
          • Beth Plale (Indiana University)
          • Atul Prakash (University of Michigan)
          • Geoffrey Fox (Indiana University)
          • Robert H. McDonald (Indiana University)
    14. HTRC Managed Data-Intensive Compute Resources
        HathiTrust Digital Library Content
          • Access to HT open content indices
          • Access to HT copyrighted indices
          • Auditable secure mechanisms for legally mandated, MOU-based, and fair-use compliance
        Researcher Driven Applications for Use as Services within the Data Capsule
          • Can HTRC provide a services framework for researcher applications to run within the secure data capsule compute resources?
        Secure Data Capsule
        Researcher Access
        Provision access to copyrighted content for research purposes, giving researchers flexible computing resources in a controlled environment
    15. HathiTrust Research Center Events
        HTRC Kickoff Event at Digital Humanities Conference 2011
          • Stanford University – June 20, 2011
        Working on models for collaborative research
          • AHRC/ESRC/IMLS/JISC/NEH/NSF/NWO/SSHRC Digging into Data Round 2
        Working on early advanced user case studies for the HathiTrust Corpus
    16. Support and Acknowledgements
        IU UITS Research Technologies
        National Center for Supercomputing Applications
        IU Data to Insight Center
        iCHASS
        Illinois Informatics Institute
        Lilly Endowment, Inc.
        The Alfred P. Sloan Foundation
    17. For More on HathiTrust Research Center
        See –
        Follow us @hathitresearch on Twitter
        Robert H. McDonald
        @mcdonald on Twitter
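
The tag cloud data flow on the "Current SEASR Integration Demo" slide (user query, bibliography lookup, OCR retrieval, tag cloud returned) can be illustrated with a minimal Python sketch. This is not the HTRC or SEASR/Meandre implementation. The in-memory dictionaries stand in for the bibliography database and the OCR web service, and every name here (BIBLIOGRAPHY, OCR_TEXT, find_volume_id, fetch_ocr, tag_cloud, the volume ID "vol-001") is hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the sample bibliography database (title -> volume ID).
BIBLIOGRAPHY = {
    "a sample title": "vol-001",
}

# Hypothetical stand-in for the public-domain OCR service (volume ID -> OCR text).
OCR_TEXT = {
    "vol-001": "The stars and the planets. The stars are distant; the planets are near.",
}

# A tiny illustrative stopword list; a real tag cloud would use a fuller one.
STOPWORDS = {"the", "and", "are", "a", "of", "to", "in"}


def find_volume_id(query):
    """Steps 1-3: map a user query to a volume ID, or None if nothing matches."""
    return BIBLIOGRAPHY.get(query.strip().lower())


def fetch_ocr(volume_id):
    """Steps 5-6: return OCR text for a volume (a real client would call a web service)."""
    return OCR_TEXT[volume_id]


def tag_cloud(text, top_n=5):
    """Steps 4 and 7: count non-stopword terms and return the top_n (term, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)


def tag_cloud_for_query(query, top_n=5):
    """Run the whole flow: query -> volume ID -> OCR text -> tag cloud terms."""
    volume_id = find_volume_id(query)
    if volume_id is None:
        return []
    return tag_cloud(fetch_ocr(volume_id), top_n)


if __name__ == "__main__":
    for term, count in tag_cloud_for_query("A Sample Title", top_n=3):
        print(term, count)
```

In this sketch the term counts play the role of tag weights. A viewer would typically scale font size by count, and only such derived counts, not the page text itself, would be returned to the user.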