Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) | AWS re:Invent 2013

"In this talk, hear about two high-performant research services developed and operated by the Computation Institute at the University of Chicago running on AWS. Globus.org, a high-performance, reliable, robust file transfer service, has over 10,000 registered users who have moved over 25 petabytes of data using the service. The Globus service is operated entirely on AWS, leveraging Amazon EC2, Amazon EBS, Amazon S3, Amazon SES, Amazon SNS, etc. Globus Genomics is an end-to-end next-gen sequencing analysis service with state-of-art research data management capabilities. Globus Genomics uses Amazon EC2 for scaling out analysis, Amazon EBS for persistent storage, and Amazon S3 for archival storage. Attend this session to learn how to move data quickly at any scale as well as how to use genomic analysis tools and pipelines for next generation sequencers using Globus on AWS.
"



  1. 1. Science as a Service Ian Foster, The University of Chicago and Argonne National Laboratory November 14, 2013
  2. 2. A time of disruptive change
  3. 3. A time of disruptive change
  4. 4. Most labs have limited resources. Heidorn: of NSF grants in 2007, awards under $350,000 made up 80% of all awards and 50% of grant dollars.
  5. 5. Automation is required to apply more sophisticated methods to far more data
  6. 6. Automation is required to apply more sophisticated methods to far more data Outsourcing is needed to achieve economies of scale in the use of automated methods
  7. 7. Building a discovery cloud • Identify time-consuming activities amenable to automation and outsourcing • Implement as high-quality, low-touch SaaS • Leverage IaaS (the software / platform / infrastructure-as-a-service stack) for reliability and economies of scale • Extract common elements as a research automation platform. Bonus question: sustainability
  8. 8. We aspire (initially) to create a great user experience for research data management What would a “dropbox for science” look like?
  9. 9. • Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive BIG DATA
  10. 10. It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA … but in reality it's often very challenging: expired credentials, permission denied, quota exceeded, network failures and retries, spread across staging stores, ingest, registries, community stores, analysis stores, archives, and mirrors
  11. 11. • Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive BIG DATA
  12. 12. • Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive BIG DATA: capabilities delivered using the Software-as-a-Service (SaaS) model
  13. 13. 1) User initiates transfer request; 2) Globus Online moves/syncs files from the data source to the data destination; 3) Globus Online notifies the user
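The three-step flow on this slide can be sketched as the request document a client might assemble before handing it to the transfer service. The field names and endpoint names below are illustrative assumptions, not the actual Globus API schema:

```python
# Illustrative sketch of the slide's flow: a user builds a transfer
# request (step 1), the service moves/syncs the files (step 2), and the
# user is notified on completion (step 3). Field names are hypothetical.

def build_transfer_request(source_endpoint, dest_endpoint, items, sync=False):
    """Assemble a transfer-request document (step 1: user initiates)."""
    return {
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "items": [{"source_path": s, "destination_path": d} for s, d in items],
        "sync_level": "checksum" if sync else None,  # sync only changed files
        "notify_on_succeeded": True,                 # step 3: notify the user
    }

request = build_transfer_request(
    "lab-cluster",      # hypothetical endpoint names
    "campus-archive",
    [("/data/run42.fastq", "/archive/run42.fastq")],
    sync=True,
)
```

The point of the SaaS model is that the user only ever produces a declaration like this; retries, checksumming, and notification are the service's job.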
  14. 14. 1) User A selects file(s) to share, selects a user/group, and sets share permissions; 2) Globus Online tracks the shared files; no need to move files to cloud storage! 3) User B logs in to Globus Online and accesses the shared file from the data source
  15. 15. Extreme ease of use • InCommon, OAuth, OpenID, X.509, … • Credential management • Group definition and management • Transfer management and optimization • Reliability via transfer retries • Web interface, REST API, command line • One-click "Globus Connect" install • 5-minute Globus Connect Multi-User install
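"Reliability via transfer retries" can be sketched as retry-with-exponential-backoff around a failure-prone transfer step. This is illustrative logic under my own assumptions, not Globus's actual retry policy:

```python
import time

# Sketch of retry-with-backoff: attempt a transfer, and on failure wait
# 1s, 2s, 4s, ... before trying again, up to max_attempts. transfer_fn
# is any callable that raises IOError on failure.

def transfer_with_retries(transfer_fn, max_attempts=4, base_delay=1.0,
                          sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return transfer_fn()
        except IOError:
            if attempt == max_attempts:
                raise                       # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))

# Simulated flaky transfer: fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("network failed")
    return "done"

result = transfer_with_retries(flaky, sleep=lambda s: None)
```

Injecting `sleep` as a parameter keeps the backoff testable without real waiting; the same shape works for any fire-and-forget SaaS operation.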
  16. 16. Early adoption is encouraging
  17. 17. Early adoption is encouraging • >12,000 registered users; >150 daily • >27 PB moved; >1B files • 10x (or better) performance vs. scp • 99.9% availability • Entirely hosted on Amazon
  18. 18. Amazon Web Services used • Amazon EC2 for hosting Globus services • Elastic Load Balancing to use multiple Availability Zones for reliability and uptime • Amazon S3 to store historical state • Amazon RDS PostgreSQL for active state
  19. 19. K. Heitmann (Argonne) moves 22 TB of cosmology data, LANL → ANL, at 5 Gb/s
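A quick back-of-envelope check shows why a sustained 5 Gb/s matters at this scale: 22 TB works out to roughly 10 hours of transfer (decimal units assumed throughout):

```python
# Back-of-envelope arithmetic for the slide's numbers: 22 TB moved at a
# sustained 5 Gb/s, using decimal (SI) units.
terabytes = 22
rate_gbps = 5

total_gigabits = terabytes * 1000 * 8   # 22 TB = 176,000 Gb
seconds = total_gigabits / rate_gbps    # 35,200 s
hours = seconds / 3600                  # roughly 9.8 hours
```

At a more typical 100 Mb/s the same dataset would take about 50x longer, i.e. weeks, which is the gap the transfer service is closing.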
  20. 20. B. Winjum (UCLA) moves 900K-file plasma physics datasets, UCLA → NERSC
  21. 21. Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
  22. 22. Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL Credit: Kerstin Kleese-van Dam
  23. 23. • Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive BIG DATA: capabilities delivered using the Software-as-a-Service (SaaS) model
  24. 24. • Collect • Move • Sync • Share • Analyze • Annotate • Publish • Search • Backup • Archive BIG DATA
  25. 25. Globus Online already does a lot: Sharing Service, Transfer Service, Globus Nexus (identity, group, profile), Globus Toolkit, Globus Connect, Globus Online APIs
  26. 26. The identity challenge in science • Research communities often need to – Assign identities to their users – Manage user profiles – Organize users into groups for authorization • Obstacles to high-quality implementations – – – – Complexity of associated security protocols Creation of identity silos Multiple credentials for users Reliability, availability, scalability, security
  27. 27. Nexus provides four key capabilities • Identity provisioning – create and manage Globus identities • Identity hub – link with other identities; use them to authenticate to services • Group hub – user-managed groups; groups can be used for authorization • Profile management – user-managed attributes; can be used in group admission. Key points: 1) outsource identity, group, and profile management; 2) REST API for flexible integration; 3) intuitive, customizable web interfaces
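The "groups can be used for authorization" point can be sketched as a service checking group membership instead of maintaining per-user ACLs. Group and identity names here are made up for illustration; this is not the Nexus API:

```python
# Sketch of Nexus-style group-based authorization: a resource grants
# access if the authenticated identity belongs to the right group.
# All names below are hypothetical.

GROUPS = {
    "cosmology-collab": {"kheitmann", "ifoster"},
    "genomics-lab": {"madduri"},
}

def is_authorized(identity, group, groups=GROUPS):
    """Authorize by group membership rather than per-user ACLs."""
    return identity in groups.get(group, set())
```

The win is operational: admitting a collaborator to a group (once, in one place) grants them access to every resource that trusts that group.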
  28. 28. Branded sites XSEDE Open Science Grid University of Chicago DOE kBase Indiana University University of Exeter Globus Online NERSC NIH BIRN
  29. 29. A platform for integration
  30. 30. A platform for integration
  31. 31. A platform for integration
  32. 32. Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for all biologists globus genomics
  33. 33. We are adding capabilities: Sharing Service, Transfer Service, Globus Nexus (identity, group, profile), Globus Toolkit, Globus Connect, Globus Online APIs
  34. 34. We are adding capabilities: Dataset Services, Sharing Service, Transfer Service, Globus Nexus (identity, group, profile), Globus Toolkit, Globus Connect, Globus Online APIs
  35. 35. We are adding capabilities • Ingest and publication – Imagine a DropBox that not only replicates, but also extracts metadata, catalogs, converts • Cataloging – Virtual views of data based on user-defined and/or automatically extracted metadata • Computation – Associate computational procedures, orchestrate application, catalog results, record provenance
  36. 36. Next Gen Sequencing Analysis for Everyone – No IT Required Ravi K Madduri, The University of Chicago and Argonne National Laboratory November 14, 2013
  37. 37. One slide to get your attention
  38. 38. Outline • Globus Vision • Challenges in Sequencing Analysis – Big Data Management – Analysis at Scale – Reproducibility • Proposed Approach Using Globus Genomics • Example Collaborations • Q&A
  39. 39. Globus Vision. Goal: Accelerate discovery and innovation worldwide by providing research IT as a service. Leverage software-as-a-service to: – provide millions of researchers with unprecedented access to powerful tools for managing Big Data – reduce research IT costs dramatically via economies of scale. "Civilization advances by extending the number of important operations which we can perform without thinking of them" (Alfred North Whitehead, 1911)
  40. 40. Challenges in Sequencing Analysis. Data movement and access challenges: data is distributed across many locations (public repositories, sequencing centers, storage systems, research labs, local clusters/clouds); research labs need access to the data for analysis and must be able to share it with other researchers and collaborators; data movement is inefficient; data needs to be available on local and distributed compute resources (local clusters, cloud, grid). Manual data analysis: shell scripts sequentially execute the tools (BWA, Picard, GATK, filtering scripts, etc.); scripts are manually modified for any change; data is manually moved to the compute node; all required tools must be installed; the process is error-prone, messy, difficult to track, and hard to maintain and transfer knowledge about. How do we analyze this sequence data?
  41. 41. Globus Genomics: a Galaxy-based workflow management system. Data management: Globus provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints (public data, sequencing centers, research labs, storage, local clusters/clouds); Globus is integrated within Galaxy; Galaxy data libraries. Data analysis: web-based UI; drag-and-drop workflow creation; workflows are easily modified with new tools; analytical tools are automatically run on scalable compute resources when possible. Globus Genomics runs on Amazon EC2.
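The Galaxy-style pipeline described here is a directed acyclic graph of tools (alignment feeds deduplication feeds variant calling), and a workflow engine runs the tools in dependency order. A minimal sketch, with the tool names taken from the slides but the execution step stubbed out:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Sketch of a Galaxy-style workflow: each tool maps to the set of tools
# it depends on, and the engine executes them in topological order.
# The pipeline below mirrors the slides (BWA -> Picard -> GATK); a real
# engine would invoke each tool on provisioned compute instead of
# just recording the order.

pipeline = {
    "bwa_align": set(),               # no prerequisites: reads + reference in
    "picard_dedup": {"bwa_align"},    # consumes the alignment
    "gatk_variant_call": {"picard_dedup"},
}

def run_pipeline(dag):
    order = list(TopologicalSorter(dag).static_order())
    # invoke each tool here, staging its inputs from upstream outputs
    return order

order = run_pipeline(pipeline)
```

Because the graph, not a shell script, is the source of truth, adding or swapping a tool means editing one edge rather than rewriting and re-debugging a script, which is exactly the "easily modify workflows" claim above.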
  42. 42. Globus Genomics Architecture
  43. 43. Globus Genomics Usage
  44. 44. Globus Genomics • Computational profiles for various analysis tools • Resources can be provisioned on demand on Amazon Web Services cloud infrastructure • GlusterFS as a shared file system between head nodes and compute nodes • Provisioned IOPS on Amazon EBS
  45. 45. Coming soon! • Integration with Globus Catalog – Better data discovery and metadata management • Integration with Globus Sharing – Easy and secure method to share large datasets with collaborators • Integration with Amazon Glacier for data archiving • Support for high throughput computational modalities through Apache Mesos – MapReduce and MPI clusters • Dynamic Storage Strategies using Amazon S3 or LVM-based shared file system
  46. 46. Our vision for a 21st century discovery infrastructure Provide more capability for more people at lower cost by building a “Discovery Cloud” Delivering “Science as a service”
  47. 47. Thank you to our sponsors
  48. 48. For more information • On Globus Genomics, and to sign up: www.globus.org/genomics • On Globus: www.globusonline.org • Follow us on Twitter: @ianfoster, @madduri, @globusgenomics, @globusonline
  49. 49. Thank you!
  50. 50. Please give us your feedback on this presentation BDT 310 As a thank you, we will select prize winners daily for completed surveys!
