HTRC Architecture Overview

These are my slides for the architecture overview of the HathiTrust Research Center UnCamp.

Published in: Education, Technology

  • Registry: the agent can deploy any service listed in this diagram and run it with the computational resources. The original plan is to use XSEDE. We are not using this on the IIS machine, but are using ODIN (128-node cluster, each with 4 GB memory and 4 computation cores) and smoketree (D2I server; 24 physical cores, 48 logical cores, 128 GB memory). These are not long term; we are just using them for now.
  • HTRC Architecture Overview

    1. HathiTrust Research Center Architecture Overview
        • Robert H. McDonald | @mcdonald
        • Executive Committee, HathiTrust Research Center (HTRC)
        • Deputy Director, Data to Insight Center
        • Associate Dean, University Libraries, Indiana University
    2. Follow Along: http://slidesha.re/U4z1gW
    3. HTRC Architecture Group
        • Indiana University: Beth Plale (Lead), Yiming Sun, Stacy Kowalczyk, Aaron Todd, Jiaan Zeng, Guangchen Ruan, Zong Peng, Swati Nagde
        • University of Illinois: J. Stephen Downie, Loretta Auvil, Boris Capitanu, Kirk Hess, Harriett Green
    4. Presentation Overview
        • Considerations for Current Architecture
        • Architecture - Use Case Methodology
        • Technical Overview
        • UnCamp Sessions for Further Review
    5. Main Case – Data Near Computation
        • [Diagram labels: HTRC Volume Store and Index (IUB); HT Volume Store (UM); HT Volume Store (IUPUI); XSEDE Compute Allocation; FutureGrid Computation Cloud; UIUC Compute Allocation; IU Compute Allocation]
    6. Non-Consumptive Research Paradigm
        • No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions, can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection.
        • Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy, which is not a user. Users are human beings.
    7. Amicus Brief and NCR
        • Jockers, Sag, Schultz
        • http://tinyurl.com/cy34hhr
    8. Use Cases for Phase 1 Architecture
        • Use Case #1 - Previously registered user-submitted algorithm retrieved and run with results set
        • Use Case #2 - HTRC applications/portal access (SEASR)
        • Use Case #3 - Blacklight Lucene/Solr faceted access
        • Use Case #4 - Direct programmatic access through Secure Data API
    9. HTRC Current Infrastructure
        • Servers
            – 14 production-level quad-core servers
                • 16 – 32 GB of memory
                • 250 – 500 GB of local disk each
            – 6-node Cassandra cluster for volume store
            – Ingest service and secure Data API access point
        • Storage (IU University Infrastructure)
            – 13 TB of 15,000 RPM SAS disk storage
            – Increase up to 17 TB by end of 2012
            – 500 TB available in late year 2 to year 3
    10. Key Components of Architecture
        • Portal Access
        • Blacklight Access
        • Agent
        • Registry
        • Secured Data API Access
    11. HTRC Architecture
        • [Diagram labels: Portal Access; Blacklight; Direct programmatic access (by programs running on HTRC machines); Agent (Application submission, Collection building); Security (OAuth2); Data API access interface; Solr Proxy; Registry (WSO2) (Meandre, Algorithms, Workflows, Result Sets, Collections); Audit; Cassandra cluster volume store; Solr index; Compute resources; Storage resources]
    12. HTRC Architecture: Portal Access
        • [Same architecture diagram as slide 11, with a callout for Portal Access: HTRC Portal, Blacklight, SEASR App, Blacklight App]
    13. HTRC Architecture: Agent
        • [Same architecture diagram as slide 11, with a callout for the HTRC Agent: Application submission, Collection building]
    14. HTRC Architecture: HTRC Registry
        • [Same architecture diagram as slide 11, with a callout for the Registry (WSO2): Meandre, Algorithms, Workflows, Result Sets, Collections]
    15. HTRC Architecture: Secure Data API
        • RESTful Web Service
            – Language agnostic
            – Clients don't have to deal with Cassandra
        • Simple OAuth2 authentication
        • HTTP over SSL
        • Audits client access
        • Protected behind firewall, accessible only to authorized IPs
        • [Shown alongside the same architecture diagram as slide 11]
    16. NoSQL Methodology
        • Currently HT content is stored in a pair-tree file system convention (CDL)
        • Moving these files into a NoSQL store like Cassandra enabled HTRC to aggregate them into larger sets of files for use in retrieval
        • Use of Cassandra enabled HTRC to share content over a commodity-based Cassandra cluster of virtual machines
        • Originally investigated use of MongoDB, CouchDB, HBase and Cassandra
    17. HTRC Solr index
        • The Solr Data API 0.1 test version
            – Preserves all query syntax of original Solr
            – Prevents modification by users
            – Hides the host machine and port number HTRC Solr is actually running on
            – Creates audit log of requests
            – Provides filtered term vector for words starting with user-specified letter
    18. Non-Consumptive Research: Secure Data Capsule
        • [Diagram labels: Scholars; Remote Desktop or VNC; Provide secure VM; Data Capsules VM Cluster; HTRC Volume Store and Index; FutureGrid Computation Cloud]
        • Submit secure capsule map/reduce Data Capsule images to FutureGrid. Receive and review results.
    19. Sessions for Further Review
        • For more on API – Tues Topic I/II (Yiming Sun)
        • For more on Portal/SEASR – Tues Topic II (Loretta Auvil)
        • For more on Portal/Blacklight – Tues Topic III (Stacy Kowalczyk)
    20. Contact Information
        • Robert H. McDonald
            – Email – robert@indiana.edu
            – Chat – rhmcdonald on googletalk | skype
            – Twitter – @mcdonald
            – Blog – http://www.rmcdonald.net
            – Twitter Hashtag: #HTRC12
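
Slide 16 mentions that HT content is stored in the pair-tree file system convention from CDL. As a rough illustration of that convention only, the sketch below maps an identifier to a directory path following the published Pairtree character-cleaning and two-character splitting rules. The function names `clean_id` and `pairtree_path` are made up for this example; this is not HTRC or HathiTrust code, and it does not cover how HathiTrust lays out its namespaces or volume files.

```python
# Sketch of the Pairtree identifier-to-path mapping.
# Assumption: function names are illustrative, not part of any HTRC API.

# Visible ASCII characters that Pairtree hex-encodes as ^hh.
_HEX_ENCODED = set('"*+,<=>?\\^|')


def clean_id(identifier: str) -> str:
    """Apply the two-step Pairtree character cleaning to an identifier."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        # Step 1: hex-encode bytes outside visible ASCII and the reserved set.
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODED:
            out.append("^%02x" % byte)
        else:
            out.append(ch)
    # Step 2: single-character substitutions for common identifier characters.
    return "".join(out).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_id(identifier)
    parts = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(parts) + "/"


print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3/
```

The deep, narrow directory tree this produces is one reason a deck like this one would talk about aggregating files into a store such as Cassandra for retrieval: each volume otherwise sits at the end of a long chain of small directories.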

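
Slide 17 lists what the Solr proxy does: it passes query syntax through unchanged, blocks modification, hides the real host and port, and keeps an audit log. A minimal sketch of that idea follows, under stated assumptions: the backend address, the allowed handler set, the in-memory log, and the function name `proxy_request` are all hypothetical and do not describe the actual HTRC Solr Data API.

```python
# Sketch of a read-only, auditing proxy decision in front of Solr.
# Assumptions: BACKEND, ALLOWED_HANDLERS and the log format are hypothetical.

BACKEND = "http://127.0.0.1:8983/solr"  # never revealed to clients
ALLOWED_HANDLERS = {"/select"}          # read-only request handlers

audit_log = []  # a real service would write to durable storage


def proxy_request(user, handler, query_string):
    """Return the hidden backend URL for a permitted query, else None."""
    allowed = handler in ALLOWED_HANDLERS
    # Every request is recorded, whether or not it is allowed.
    audit_log.append(
        {"user": user, "handler": handler,
         "query": query_string, "allowed": allowed}
    )
    if not allowed:
        return None
    # The query string is forwarded untouched, so Solr syntax is preserved.
    return BACKEND + handler + "?" + query_string


print(proxy_request("alice", "/select", "q=title:war"))
print(proxy_request("alice", "/update", "commit=true"))  # None
```

An allow-list of handlers is used here rather than a block-list so that any handler not explicitly known to be read-only is rejected by default.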