Accelerating Analytics for the Future of Genomics

Healthcare systems around the world are looking to precision medicine (care decisions tailored to the individual patient) as a means to drive better care outcomes at lower cost. Today, the most promising technology making this possible in certain diseases, such as cancer, is sequencing a patient's genome. For infectious diseases, sequencing has revolutionized our understanding of outbreaks and how they spread. Over the past decade, genome sequencing has improved throughput and lowered costs by 100X or more. It is a data- and compute-intensive endeavor that most biomedical research and care delivery networks are not equipped to handle. This session features Dr. Swaine Chen from the Genome Institute of Singapore and the Broad Institute Cromwell team, who discuss the challenges of handling genomic data at scale and how they solved them to deliver results.

  1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Dr. Swaine Chen, Genome Institute of Singapore, National University of Singapore. 223537 Accelerating Analytics for the Future of Genomics
  2. Accelerating Genomics Research with the Cloud. Swaine Chen, Genome Institute of Singapore, National University of Singapore.
  3. What is genomics? Human cells, nucleus, chromosomes, DNA.
  4. What is DNA? G, A, T, C: 4 “bases”; 2 bits. Explicitly digital. Base pairs: A – T, C – G, T – A, G – C.
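Slide 4's "4 bases; 2 bits" remark can be made concrete with a small sketch. The particular bit assignment below is an assumption chosen for illustration (any one-to-one mapping of the four bases onto 0..3 would do), and the function names are hypothetical.

```python
# Illustrative 2-bit encoding of DNA. The bit values are an arbitrary
# assumption; the slide only states that 4 bases fit in 2 bits.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

# Watson-Crick pairing as listed on the slide: A-T, C-G, T-A, G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of the given length from its packed form."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def complement(seq: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in seq)
```

For example, `unpack(pack("GATC"), 4)` round-trips to `"GATC"`, and `complement("GATC")` gives `"CTAG"`.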
  5. DNA sequencing technology. Maxam-Gilbert: chemistry and radiation. Solexa (Illumina): higher density, imaging. Oxford Nanopore: electric current detection. Capillary seq. Miniaturization, parallelization. [Timeline: 1970 to 2010]
  6. DNA sequencing technology. Maxam-Gilbert: chemistry and radiation. Solexa (Illumina): higher density, imaging. Oxford Nanopore: electric current detection. Capillary seq. [Timeline: 1970 to 2010] Miniaturization, Parallelization, Digitization.
  7. Genomics data is exploding (in the usual way). [Chart label: Moore’s law]
  8. Genomics data is exploding (in the usual way). [Chart labels: Genomics; Moore’s law]
  9. Analytics is imploding (in an unusual way). [Chart labels: Genomics; Moore’s law; Hyper-Moore gap]
  10. The evolution of genomics compute: our journey at GIS. Driven by the data scale; enabled by AWS.
  11. GIS, Pre-AWS. [Diagram labels: office area (user workstations); on-site data center (head node; 128 nodes with 40-80 CPUs and 128-512 GB RAM; “SMPs” with 96 CPUs and 1 TB RAM); network links of 1 Gbps and 40-100 Gbps; archival storage (10 PB); office, home storage (3 PB); compute storage (4 PB)]
  12. GIS, Pre-AWS. [Diagram labels: office area (user workstations); on-site data center (head node; ~500 cluster nodes with 4-8 CPUs and 64-128 GB RAM; “SMPs” with 128 CPUs and 1 TB RAM); network links of 1 Gbps and 10-100 Gbps; archival storage (3 PB); office, home storage (1 PB); compute storage (100 TB)] Challenges: first-time command line users; heterogeneous compute, storage, network; no/low experience with job management, optimization, and software config/documentation; spiky workloads; self-inflicted denial of service.
  13. How did we first use AWS? Phase 1: reimplement “SMPs”; users can’t DoS each other; infinite capacity (and potential for waste); full complexity. [Diagram labels: individual user; GIS; AWS; single instance; EBS / compute storage; S3 / object storage]
  14. How did we first use AWS? Phase 1: reimplement “SMPs”; users can’t DoS each other; infinite capacity (and potential for waste); full complexity.
  15. Our current efforts on AWS. Phase 2: Nextflow + AWS Batch; totally new paradigm, enabled by cloud; AWS for elastic provisioning; cluster is abstracted away; leverage this for software. [Diagram labels: individual user; GIS; AWS; S3 / object storage; AWS Batch]
  16. Phase 2: Nextflow + AWS Batch; totally new paradigm, enabled by cloud; AWS for elastic provisioning; cluster is abstracted away; leverage this for software. [Diagram labels: individual user; GIS; AWS; S3 / object storage; job repo; job tasks; Docker repo (ECR); AWS Batch]
  17. Why is this complexity needed? GATK Best Practices: a standard workflow in genomics.
  18. Capacity + Simplicity on AWS
  19. The Impact at GIS
  20. GIS Bacterial Projects: 100× in 4 years. Bacterial genomics: # genomes/paper tracks Moore’s Law. [Chart: number of genomes (log scale, 1 to 10,000) vs. year of publication (1995 to 2020); GIS highlighted]
  21. GIS Bacterial Projects: 100× in 4 years. GIS 2013: 10-100 strains. [Chart: number of genomes (log scale, 1 to 10,000) vs. year of publication (1995 to 2020)]
  22. GIS Bacterial Projects: 100× in 4 years. GIS 2017: 10,000 strains. Higher resolution, more perspective. [Chart: number of genomes (log scale, 1 to 10,000) vs. year of publication (1995 to 2020)]
  23. Capacity + Simplicity = Opportunity? Does AWS fundamentally change our thinking?
  24. Genomics: Approaching IoT Transition. Then / Now.
  25. Singapore’s dengue monitoring infrastructure. Since 2006: all nonresidential buildings checked every 3 months; 1 million inspections per year.
  26. Preparing for 1 Million Genomic Devices. Phase 3: serverless, event-driven model; massive scale; no user intervention; fundamentally cloud-driven transformation of our problem solving; enables continuous monitoring.
  27. Preparing for 1 Million Genomic Devices. Reimplement variant calling: 6 hours to 15 minutes. Auto scatter-gather, high parallelism. 1,000 genomes, 25 million GB-s, no intervention. 12 genomes on Lambda free tier! [Chart: genomes per unit cost (log scale, 1 to 100,000), run own servers vs. GIS + Lambda; 20×]
  28. Many smart ideas, one Smart Nation, enabled by genomics.
  29. Maggie Leong, Vincent Quah, Adrian White, Julian Lau, Liew Jun Xian, Andreas Wilm, Shih Chih Chuan, Ng Huck Hui, Pauline Ng, Anders Skanderup. National Precision Medicine Program.
  30. Ruchi Munshi, The Broad Institute. 223537 Accelerating Analytics for the Future of Genomics
  31. An Introduction to Cromwell: bioinformatics workflows at any scale
  32. The backdrop: data generation set to explode. [Chart: quarterly output (in TBases) of the Genomics Platform; annotation: “Story begins here”]
  33. The players in the trenches. GATK dev team: Medical Population Genetics Platform; tasked with developing tools/BP pipelines; scope creep: run workflow for researchers; workflowing solution: GATK-Queue (scala). Picard / Ops team: Genomics Platform; initial data processing -> Picard toolkit; took over workflows in production; workflowing solution: Zamboni (scala). Cancer Genome Analysis team: Cancer Program; tasked with developing tools/BP workflows for somatic analysis; workflowing solution: Firehose self-service (python?). The drama: low portability, silos, duplication of effort, looming bottlenecks.
  34. Sharing (securely) is caring. Traditional way: bring data to the researchers. Problems: data sharing = data copying; requires big infrastructure at each site; largely fixed compute; individual security implementations. Cloud way: bring researchers to the data. Solutions: true data sharing; cloud provides the infrastructure; elastic compute and storage; centralized security implementation.
  35. Genome analysis pipeline throughput is “spiky”. Solution: move to Cloud! Advantages over on-premises computing: no need to pay for compute power when we aren’t using it; can tolerate spikes without being forced to maintain a backlog of “things to process once everything calms down”. [Chart: genome processing requests per day over a several-month period]
  36. Use containers for portability & reproducibility. A container encapsulates all the software dependencies associated with running a program. Takes the guesswork out of running workflows on different platforms! [Diagram labels: GATK 2.8, Java 7, R 2.5.0; GATK 3.8, Java 8, R 3.0.1; BWA; Picard. Modified from https://www.docker.com/what-container]
  37. Meet Cromwell & WDL. Cromwell: execution engine that can run on any platform (on-prem and on Cloud), seamlessly scale based on workflow needs, and provide maximal flexibility for all use cases; https://github.com/broadinstitute/cromwell. WDL: workflow language that humans can read/write; for methods developers and biomedical scientists at large; https://github.com/openwdl/wdl/
  38. Two main ways to run Cromwell. Server mode: API endpoints; more scalable; some devops needs; appropriate for production environments; call caching. One-off: simple self-contained command; appropriate for independent analysts; java -jar cromwell.jar run hello.wdl hello_inputs.json
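To make the server-mode "API endpoints" bullet on slide 38 more concrete, here is a hedged sketch of preparing a workflow submission. It assumes a Cromwell server listening at `http://localhost:8000` (an assumed address, not from the slides) and uses the `POST /api/workflows/v1` endpoint with `workflowSource` and `workflowInputs` form fields as described in Cromwell's REST API documentation. The helper name `build_submission` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request
import uuid

# Assumption: a Cromwell server started in server mode on this address.
CROMWELL_URL = "http://localhost:8000"


def build_submission(wdl_text, inputs, base_url=CROMWELL_URL):
    """Build (but do not send) a multipart POST that submits a workflow."""
    boundary = uuid.uuid4().hex
    parts = {
        "workflowSource": wdl_text,            # the WDL document itself
        "workflowInputs": json.dumps(inputs),  # inputs as a JSON object
    }
    lines = []
    for name, value in parts.items():
        lines += [
            f"--{boundary}",
            f'Content-Disposition: form-data; name="{name}"; filename="{name}"',
            "",
            value,
        ]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/workflows/v1",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Against a running server, the request could then be sent with `urllib.request.urlopen(request)`; the one-off `run` command on the slide remains the simpler route for a single analyst.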
  39. Use a workflow execution engine that runs anywhere*. Cromwell backends: Local, HPC, Google, TES, Funnel, AWS*, Alicloud, … (*in development). https://github.com/broadinstitute/cromwell
  40. Enable local development of workflows, run on the cloud. [Diagram labels: S3 data buckets; managed compute environment; AWS; persistent Cromwell server; REST API; direct CLI]
  41. Cromwell will submit jobs to AWS Batch Job Queues. Cromwell stages the inputs/outputs for your jobs. Example inputs: GATK = gatk.jar; RefFasta = hg38.fasta; RefIndex = hg38.fai; RefDict = hg38.dict; sampleName = sample.name; inputBAM = sample.bam; bamIndex = sample.bai. [Diagram labels: workflow; Cromwell; inputs; outputs; AWS Batch]
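The inputs listed on slide 41 would normally reach Cromwell as a JSON file whose keys are qualified by the workflow name. A minimal sketch of producing such a file follows. The workflow name `myWorkflow` and the helper `qualify` are hypothetical, and the file names are copied from the slide rather than being real paths.

```python
import json

# Hypothetical workflow name; WDL inputs JSON keys take the form
# "<workflow name>.<input name>".
WORKFLOW_NAME = "myWorkflow"

# Input names and values as shown on the slide (placeholders, not real paths).
slide_inputs = {
    "GATK": "gatk.jar",
    "RefFasta": "hg38.fasta",
    "RefIndex": "hg38.fai",
    "RefDict": "hg38.dict",
    "sampleName": "sample.name",
    "inputBAM": "sample.bam",
    "bamIndex": "sample.bai",
}


def qualify(inputs, workflow_name):
    """Prefix every input name with the workflow name."""
    return {f"{workflow_name}.{key}": value for key, value in inputs.items()}


inputs_json = json.dumps(qualify(slide_inputs, WORKFLOW_NAME), indent=2)
```

Writing `inputs_json` to a file gives something that could play the role of `hello_inputs.json` in the one-off command from slide 38.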
  42. Being able to escalate jobs is nice! URGENT!
  43. Workflow Description Language (WDL)
  44. WDL runtime parameters: resourcing; cost savings!; containers
  45. Basic WDL plumbing options
      LINEAR CHAINING
        call stepA
        call stepB { input: in=stepA.out }
        call stepC { input: in=stepB.out }
      MULTI-IN/OUT
        call stepC { input: in1=stepB.out1, in2=stepB.out2 }
      SCATTER-GATHER
        Array[File] inputFiles
        scatter(oneFile in inputFiles) {
          call stepA { input: in=oneFile }
        }
        call stepB { input: files=stepA.out }
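For readers less familiar with WDL, the scatter-gather pattern on slide 45 can be mirrored in plain Python. This is only an analogy under stated assumptions: `step_a` and `step_b` are placeholder functions standing in for the slide's `stepA` and `stepB`, and a thread pool stands in for whatever backend Cromwell would dispatch the scattered calls to.

```python
from concurrent.futures import ThreadPoolExecutor


def step_a(one_file):
    """Placeholder per-file task (the scattered call)."""
    return one_file.upper()


def step_b(files):
    """Placeholder gather task that consumes every step_a output."""
    return ",".join(files)


def scatter_gather(input_files):
    """Run step_a on each input in parallel, then pass all results to step_b."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, like the array of stepA.out in WDL.
        scattered = list(pool.map(step_a, input_files))
    return step_b(scattered)
```

Calling `scatter_gather(["a.bam", "b.bam"])` runs the per-file step on both inputs and hands the collected list to the gather step, which is the same shape as `stepB { input: files=stepA.out }`.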
  46. OpenWDL: WDL meets open development. [Comic: Randall Munroe, XKCD, https://www.xkcd.com/225/]
  47. But what about CWL? Thanks to our Workflow Object Model (WOM), Cromwell now supports multiple versions of WDL as well as CWL 1.0! [Comic: Randall Munroe, XKCD, https://www.xkcd.com/1739/]
  48. Cromwell has been busy. Cromwell in production at Broad: processed 47.5 million jobs over the last two years. And this is just the tip of the iceberg!
  49. Want to discuss further? My email: rmunshi@broadinstitute.org. More information: Docs: http://cromwell.readthedocs.io/en/develop/; GitHub: https://www.github.com/broadinstitute/cromwell; WDL: http://www.openwdl.org
