A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308) | AWS re:Invent 2013
 

Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.

    A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308) | AWS re:Invent 2013: Presentation Transcript

    • The Problem and Promise of Translational Genetics and a Step to the Clouded Solution of Scalable Clinical Whole Genome Sequencing Jafar Shameem Amazon Web Services November 14, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
    • Agenda • Bio-Informatics and Amazon Web Services • Examples of collaboration • Building Blocks – Compute – Storage – Tools – Pricing Models
    • A rich history of collaboration with Life Sciences organizations
    • AWS Public Data Sets • A centralized repository of public datasets • Seamless integration with cloud based applications • No charge to the community • Some of the datasets available today: – 1000 Genomes Project – Human Microbiome Project – Ensembl – GenBank – Illumina: Jay Flatley Human Genome Dataset – YRI Trio Dataset – The Cannabis Sativa Genome – UniGene – Influenza Virus – PubChem • Tell us what else you’d like for us to host …
    • CHARGE Consortium: aimed at better understanding how human genetics contributes to heart disease and aging • DNAnexus • Baylor College of Medicine
    • EC2 instance types (chart of Memory (GiB) against EC2 Compute Units): Micro: 613 MB, up to 2 ECUs; Small: 1.7 GB, 1 EC2 Compute Unit, 1 virtual core; Medium: 3.7 GB, 2 EC2 Compute Units, 1 virtual core; Large: 7.5 GB, 4 EC2 Compute Units, 2 virtual cores; Extra Large: 15 GB, 8 EC2 Compute Units, 4 virtual cores; High-CPU Med: 1.7 GB, 5 EC2 Compute Units, 2 virtual cores; High-CPU XL: 7 GB, 20 EC2 Compute Units, 8 virtual cores; Hi-Mem XL: 17.1 GB, 6.5 EC2 Compute Units, 2 virtual cores; Hi-Mem 2XL: 34.2 GB, 13 EC2 Compute Units, 4 virtual cores; Hi-Mem 4XL: 68.4 GB, 26 EC2 Compute Units, 8 virtual cores; Cluster Compute 4XL: 23 GB, 33.5 EC2 Compute Units; Cluster GPU 4XL: 22 GB, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi” M2050 GPUs; High I/O 4XL: 60.5 GB, 35 EC2 Compute Units, 2*1024 GB SSD-based local instance storage; Cluster Compute 8XL: 60.5 GB, 88 EC2 Compute Units; High Storage 8XL: 117 GB, 35 EC2 Compute Units, 24 * 2 TB instance store; Cluster Compute High Mem 8XL: 244 GB, 89 EC2 Compute Units, SSD instance storage
    • Storage • Relational Database Service: fully managed database (MySQL, Oracle, MSSQL) • SimpleDB: NoSQL, schemaless, smaller datasets • DynamoDB: NoSQL, schemaless, provisioned throughput database • S3: object datastore, up to 5TB per object, 99.999999999% durability • Redshift: petabyte scale data warehousing service, fully managed
    • Tools of the trade • GATK • NCBI BLAST • Crossbow • CloudBurst • Myrna • Clovr • BioPerl Max • VIPDAC • Superfamily • Cloud-Coffee • BioNimbus • GMOD • CloudAligner • BioConductor • QIIME • SNAP • BWA • Bowtie/TopHat/Cufflinks • STAR, GSNAP, RUM • MIT StarCluster • Galaxy CloudMan • Rocks • Torque • Slurm • Condor • Chef • Puppet • SaltStack • Get links to AMIs at: https://github.com/mndoci/mndoci.github.com/wiki/Life-Science-Apps-on-AWS
    • Many purchase models to support different needs • Free Tier: Get started on AWS with free usage & no commitment. For POCs and getting started • On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs • Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization • Spot: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads • Dedicated: Launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance related workloads
    • How to use Spot? • Ideal Applications: Batch Processing; Time-Delayable; Fault-Tolerant or Restartable; Compute-Intensive; Horizontally Scalable; Stateless Worker Nodes; Region and AZ Independent; Uses Deployment Automation • Less Ideal Applications: Interactive; Strict/Tight SLA for Completion; Expensive to Handle Terminations; Data-Intensive; In-Memory Scaling; Long-Running Worker Nodes (weeks); Requires a Single AZ; Manually Launched and Managed
    • Tractable, scalable, and economical processing of clinical whole genome sequences in AWS • Clinical Genomics for Cancer Diagnosis • Amazon Web Services re:Invent 2013, Nov 14th, 2013, Las Vegas, NV • Peter J. Tonellato, PhD, Harvard Medical School • Dennis P. Wall, PhD, Stanford University
    • Whole Genome Breast Cancer Program • Objective: The objective of the Whole Genome Breast Cancer Program (WGBC) is to demonstrate the clinical utility and value of whole genome analysis for practical breast cancer detection, diagnosis, prognosis and improved outcomes. • WGA in Clinical Turn-Around: Demonstrate the use of Amazon Web Services to establish Clinical Whole Genome Analysis in “clinical turn-around”, from WG NGS sequence to actionable health care information • Clock time: < 3 hours • Cost: < $100
    • Whole Genome Breast Cancer Program 1. Organization and Progress to date 2. Historical BIDMC Breast Cancer cases 3. Clinical Whole Genome Analysis – Laboratory Test 4. COSMOS: Clinical Whole Genome Analysis on AWS
    • N - MDBCTB: Surgery (Mike), Genetics (Nadine), Oncology (Gerburg), Pathology (Stu), Radiation Oncology (Abram), Genetic Counseling (Jill), Social Work (Barbara), Bioinformatics (Peter & Dennis) • LPM: Sheida, Latrice, Jared, Yassine, Val, Michiyo • Program Coordinator: Michiyo; Research Assistant * (Emily Poles?); Technician * • Case management: Case Identification; Case review (N-MDBCTB); Consent (RN); Clinical Data Management; Tissue Collection; Sample Management; Follow-up • Assay/Bioinformatics; Imaging (Tejas) • Assay: Preparation/Storage; DNA extraction/purification; Sample delivery (to outsource); Whole genome sequencing; OncoScan v3 (BI); DNA/RNA/NGS sequencing (outsource) • Bioinformatics: Data Transfer; Genome Data Integration/Management; Annotation; Analysis; Translation; Case Evidence Report • Translation: Data Integration/Management; Case report (N-MDBCTB) • Oversee: External Advisory Board; Clinical Executive Committee; Cancer Center; Regulatory Affairs • * Hiring
    • BIDMC Breast Cancer Patient/Sample Process (flowchart). Stages shown: BIDMC Clinic (Oncology, Surgery, Radiation Oncology); diagnosis work-up (MMG, US, MRI, biopsy with immunohistochemistry and FISH); N-BCMDTB case selection, case presentation and case evaluation; consent to care and consent to research; surgery (including surgery after NAC); blood, biopsy and tissue specimens with storage (FFPE, FF); Clinical Lab, Clinical Research Pathology Lab and Analysis Lab (LPM); DNA and RNA extraction; DNA sequencing, exome sequencing, OncoScan v3™ copy number and somatic mutation; adjuvant therapy (chemotherapy, radiation therapy); OMR clinical data and follow-up outcome; N-BCMDTB result evaluation; identification of targeted therapy and personalized medicine. Workflows: patient flow, case identification workflow, sample workflow, analysis workflow (gene expression pipeline, OncoScan™ pipeline, SNP chip pipeline, integrative pipeline), translation workflow. Legend: X: no further treatment and research; NAC: neoadjuvant chemotherapy; OMR: Online Medical Record; FFPE: formalin-fixed, paraffin-embedded (tissue); FF: fresh frozen (tissue)
    • IRB Approved Protocol (flowchart). Evaluation: treatment decision (Yes -> undergo surgery; No -> excluded) • Decision: case decision (Yes -> eligible; No -> not eligible) • Consent: getting consent (Yes -> agreed; No -> disagreed) • Exclusion reasons shown: no surgery, not eligible, disagreed, poor sample • Clinical Workup: blood test; breast surgery • Tissue Workup: pathology workup; sample collection (extract DNA/RNA from tissue and blood) • Assay: DNA genome sequencing; exome sequencing; copy number and somatic mutations analysis using an array platform (OncoScan) • Analysis: analysis outcome data; clinical outcome; clinicopathological characteristic • Translation: discussion at NBCMDTB; traditional treatment; identification of targeted therapy and personalized medicine
    • Breast Cancer Clinical Use of WGA 1. Family and Individual Risk prediction 2. Breast Cancer Tumor Characterization 3. Breast Cancer Diagnosis 4. Breast Cancer Prognosis 5. Prediction of response to targeted therapies 6. Indications of outcome and assessment for future treatment refinement
    • Breast Cancer Genomic Devices • 35 devices reviewed; 26 used clinically • Categories shown: Risk Prediction, Prognosis, Research • Devices: 23andMe*, deCODEme*, BRACAnalysis*, Ambry Genetics*, CCDG Panel, OncoScan, TargetPrint, BluePrint**, PAM50*, BreastProfile*, Her2Pro*, MammaPrint, Methyl-Profiler, Rotterdam Signature, MammoStrat, BreastGeneDX, Breast Cancer Array, OncotypeDX*, Breast Cancer Index, OncoMap3**, AsuraSeq-1000**, OncoCarta**, SNaPshot, MapQuant DX, TheraPrint**, NexCourse Bca, Wash U Panel, Target Now • *Associated CPT/CMS codes • **Not for clinical use
    • Clinically Actionable Breast Cancer Information • Data Type: # Unique Entries • Gene: 773 • SNP: 1733 (52 SNPs for risk prediction, 1681 SNPs for prognosis) • Small Insertion: 75 • Small Deletion: 205 • Translocation: 3 • Gene Expression: 383 (drug target commonly based on gene expression profile) • Protein Expression: 7 (HER2, Estrogen, Progesterone receptor status) • Amplification: 64 • Deletion: 48 (9 deletions in BRCA1 or BRCA2 detected by BRACAnalysis confer increased breast cancer risk) • Total “Clinically” Actionable: 3291
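The total on the slide above can be checked by summing the per-type counts. This is a minimal sketch that assumes each count belongs to the data type label it follows in the flattened table; the dictionary name is illustrative only.

```python
# Counts of unique "clinically actionable" entries per data type, as read from the slide.
# The pairing of each number with its label is an assumption about the original table.
actionable_entries = {
    "Gene": 773,
    "SNP": 52 + 1681,  # SNPs for risk prediction + SNPs for prognosis
    "Small Insertion": 75,
    "Small Deletion": 205,
    "Translocation": 3,
    "Gene Expression": 383,
    "Protein Expression": 7,
    "Amplification": 64,
    "Deletion": 48,
}

total = sum(actionable_entries.values())
print(total)  # 3291, matching the slide's stated total
```

That the sum reproduces the stated total of 3291 suggests this pairing of labels and counts is the intended one.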
    • Whole Genome Breast Cancer Program 1. Organization and Progress to date 2. Historical BIDMC Breast Cancer cases 3. Clinical Whole Genome Analysis (WGA) – Laboratory Test 4. COSMOS: Clinical Whole Genome Analysis on AWS
    • Clinical WGA Workflow: Patients -> Samples -> Next Generation Sequencers -> Bioinformatics Analysis -> Clinical Genomics Interpretation Service -> Clinical Report, Biomedical Report
    • Analysis pipelines by data type (diagram) • DNA-Seq: BWA, GATK, Picard (SNP/indel); CNV-seq, ReadDepth, Segseq (CNV) • RNA-Seq: Tophat, Cufflinks, BLAST (Gene Exp.) • miRNA: miRNAkey, miRBase (miRNA targets) • Methyl: Bismark (% Gene Methyl) • Downstream: Pre-clinical and clinical variant annotation; Risk Prediction; Classification (Tumor, disease); Pathway Analysis
    • Reduced Cost of Next Generation Sequencing (NGS) • NGS platforms: 5,000 Megabases/day • Drop of the per-base sequencing cost • Data on petabyte scale • NGS analysis involves complex workflows
    • WGA in “Clinical Turn-around” – Future • Sample Collection: 12 hours • Sequencing: 500 hours (< 40 hours) • Analysis: 3 hours, < $100 • Clinical Action: 12 hours
    • Current Costs to Run on Amazon Web Services • Details: • 1 Whole Genome • 60x • Spot and Reserved Instances • Utilizing Amazon Glacier for long term storage • Whole Genome Analysis: Approximately 1 day, Approximately $1500
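To put these current figures next to the program objective stated earlier (< 3 hours, < $100), here is a rough back-of-the-envelope calculation. It assumes that "approximately 1 day" means 24 hours and treats the targets as exact bounds; both are simplifications, and the variable names are illustrative.

```python
# Current figures from the slide; "approximately 1 day" is taken as 24 hours (assumption).
current_hours = 24.0
current_cost_usd = 1500.0

# Program objective: < 3 hours and < $100 per whole genome.
target_hours = 3.0
target_cost_usd = 100.0

speedup_needed = current_hours / target_hours
cost_reduction_needed = current_cost_usd / target_cost_usd

print(f"Speed-up needed: more than {speedup_needed:.0f}x")
print(f"Cost reduction needed: more than {cost_reduction_needed:.0f}x")
```

Under these assumptions the objective implies roughly an 8-fold reduction in wall time and a 15-fold reduction in cost.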
    • Whole Genome Breast Cancer Program 1. Organization and Progress to date 2. Historical BIDMC Breast Cancer cases 3. Clinical Whole Genome Analysis – Laboratory Test 4. COSMOS: Clinical Whole Genome Analysis on AWS • AWS • Applications • Workflow • COSMOS
    • Clinical Whole Genome Analysis • Computational Objective: < 3 hours, < $100 • Four approaches to optimize and achieve our Clinical Turn-Around Objective: • AWS • Refine and Improve WGA Applications • Create a Standardized, Robust CWGA Workflow • Stabilize a new Workflow and Distributive Computing Platform: COSMOS
    • Dynamic cluster with the number and type of instances adapted to data sets, jobs, and applications • Components: EC2 instances; AMIs; S3 storage (BAM files) • On-demand Master(s) • Load-balanced Spot Instance Workers
    • EC2 instances • Optimization: correct type and number of EC2 instances and cluster • Current non-optimized Master: CC2.8xlarge • Current non-optimized Worker: CC2.8xlarge • Requirements for both: High Memory: single job (BWA) ~ 10GB RAM; High IO: access to common data files; Virtualization: HVM for HugePage
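As a rough illustration of why memory drives instance choice: with about 10 GB of RAM per BWA job and the 60.5 GB listed for the Cluster Compute 8XL in the instance chart earlier in the deck, memory alone caps the number of simultaneous BWA jobs per node. The sketch below ignores OS overhead, core counts and I/O, and the function name is hypothetical.

```python
import math


def max_concurrent_jobs(node_memory_gb: float, job_memory_gb: float) -> int:
    """Upper bound on simultaneous jobs per node, considering memory only."""
    if job_memory_gb <= 0:
        raise ValueError("job_memory_gb must be positive")
    return math.floor(node_memory_gb / job_memory_gb)


# 60.5 GB: Cluster Compute 8XL memory from the instance chart; ~10 GB per BWA job.
print(max_concurrent_jobs(60.5, 10.0))  # 6
```

A real sizing exercise would also account for cores per job and shared file system throughput, which the slide lists as the High IO requirement.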
    • AMIs: create stable CWGA AMI(s) • Required applications, libraries and dependencies: • Applications (GATK): Samtools, BWA, … • Human Reference Genome • Annotation Databases
    • Optimize: AMI • Compiler: GCC 4.6+ supports AVX mode; refined GCC parameters • Compressed libraries: zlib and snappy • Refined Java parameters for GATK optimization • Memory: HugePage (2M) configured for every node/application • Disks: Ephemeral disks: RAID 0; Cluster disks: GlusterFS
    • S3 storage: • Storage of BAM files • Transfer of BAM and other files • “Checkpoint” after each successful workflow stage • Backup of intermediate and final results • Storage of all timings and job information
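The idea of a "checkpoint" after each successful workflow stage can be sketched as follows. This is not the authors' implementation: a local directory stands in for the S3 bucket, and the stage names, the marker-file scheme and the `run_pipeline` function are all assumptions made for illustration. After an interruption (for example, a Spot Instance termination), a rerun skips the stages that already have a checkpoint.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Stage names follow the slides; the functions attached to them are placeholders.
STAGES: List[str] = ["preparation_alignment", "variant_calling", "annotation"]


def run_pipeline(checkpoint_dir: Path, stage_funcs: Dict[str, Callable[[], dict]]) -> List[str]:
    """Run stages in order, skipping any stage whose checkpoint marker exists.

    Returns the list of stages actually executed in this invocation.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for stage in STAGES:
        marker = checkpoint_dir / f"{stage}.done.json"
        if marker.exists():
            continue  # finished in an earlier run; nothing to redo
        result = stage_funcs[stage]()  # may raise; no marker is written in that case
        marker.write_text(json.dumps(result))  # checkpoint only after success
        executed.append(stage)
    return executed


if __name__ == "__main__":
    funcs = {name: (lambda n=name: {"stage": n, "status": "ok"}) for name in STAGES}
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(Path(tmp), funcs))  # first run executes every stage
        print(run_pipeline(Path(tmp), funcs))  # second run finds all checkpoints
```

Writing the marker only after a stage returns means a stage killed midway is simply repeated on the next run, which fits the restartable profile listed as ideal for Spot.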
    • WGA Applications • Genome Analysis Toolkit (GATK) “best practice” • Stages: Preparation/Alignment; Variant calling; Annotation • Source: GATK best practices, Broad Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
    • Applications Parallelization: 5 exomes example (chart: 5 exomes across the stages Preparation/Alignment, Variant calling, Annotation; vertical axis from 0 to 600)
    • Alignment: Burrows-Wheeler Aligner
    • GenomeKey • Implements GATK "best practices" for variant calling • Stages: Preparation/Alignment; Variant calling; Annotation • Source: GATK best practices, Broad Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
    • Databases Integrated: CytoBank, The_1000g_Febuary_all, dbSNP135, NHLBI_Exome_Project_euro, TFBS, NHLBI_Exome_Project_aa, Segmental_Duplications, NHLBI_Exome_Project_all, RepeatMasker, HGMD_INDEL, Self Chain, HGMD_SNP, mirBase, COSMIC, TargetScan, GWAS_Catalog, SIFT, ENCODE_DNaseI_Hypersensitivity, PolyPhen2, ENCODE_Transcription_Factor, Mutation_Taster, UCSC_Gene, GERP, Refseq_Gene, PhyloP, Ensembl_Gene, LRT, CCDS_Gene, Mce46way, DrugBank, Complete_Genomics_69 • Plus support for generic database file formats such as .bed and .gff3
    • Workflow Optimization • Speed: • Replace BWA with SNAP (for the same accuracy) • Re-implement some slow algorithms (e.g. BQSR) • Accuracy: • Add additional quality control steps • Replace Unified Genotyper with Haplotype Caller
    • COSMOS Workflow Management System (layered architecture) • GenomeKey: Preparation/Alignment; Variant calling; Annotation • COSMOS: Job splitting; Job tracking; Web Interface • OS & Software: GlusterFS; MySQL DB; Networking; Grid engine • AWS (EC2 and S3): Instances; Storage
    • COSMOS Parallelization (chart: Number of Jobs, axis from 0 to 1200, for the stages Preparation/Alignment, Variant calling, Annotation; series: 1 Exome, 5 Exomes, 10 Exomes, All Runs)
    • COSMOS Job Splitting
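The slide does not say how COSMOS splits work. One generic way to split a genome analysis into independent jobs is by genomic interval, and the following is only a sketch of that idea: the `split_into_intervals` function, the chunk size and the toy contig lengths are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def split_into_intervals(contig_lengths: Dict[str, int], chunk_size: int) -> List[Interval]:
    """Split each contig into consecutive intervals of at most chunk_size bases."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    intervals = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            intervals.append((contig, start, end))
    return intervals


# Toy contig lengths for illustration, not real chromosome sizes.
toy_contigs = {"chrA": 2500, "chrB": 1000}
for interval in split_into_intervals(toy_contigs, chunk_size=1000):
    print(interval)
```

Each interval could then be submitted as its own job, with the per-interval results merged afterwards.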
    • Job Dependency Tracking: PREPARATION / ALIGNMENT -> VARIANT CALLING -> ANNOTATION
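Dependency tracking between the three stages can be illustrated with a small directed acyclic graph. This sketch uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+), not COSMOS itself; the job names are made up, and the per-sample fan-out is an assumption.

```python
from graphlib import TopologicalSorter

samples = ["sample1", "sample2"]  # hypothetical sample names

# Map each job to the set of jobs it depends on.
dependencies = {}
for s in samples:
    dependencies[f"align:{s}"] = set()
    dependencies[f"call_variants:{s}"] = {f"align:{s}"}
    dependencies[f"annotate:{s}"] = {f"call_variants:{s}"}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
    batches.append(ready)
    sorter.done(*ready)  # mark the whole batch finished before asking for more

for i, batch in enumerate(batches, start=1):
    print(f"batch {i}: {batch}")
```

Jobs in the same batch have no dependencies on each other, so they could be dispatched to different workers at the same time.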
    • COSMOS Web Interface (screenshot showing the stages PREPARATION / ALIGNMENT, VARIANT CALLING, ANNOTATION)
    • Whole Exome Analysis Pre- and Post-Optimization (chart: wall time, axis from 0 to 30, before and after optimization for 1 exome, 5 exomes and 10 exomes; cost labels on the bars: ~$90, ~$48, ~$27, ~$47, ~$27, ~$10)
    • Acknowledgments • LPM (Tonellato): Erik Gafni (InVitae), Vince Fusaro (InVitae), Jared B. Hawkins, Ryan Powles, Yassine Souilmi • Wall lab (Harvard & Stanford University): Jae-Yoon Jung, Alex Lancaster, David Tulga • Autism Speaks: 6000 Exomes (current), 10,000 Genomes • Ancient Human Genomes: David Reich