A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308) | AWS re:Invent 2013

Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.

Transcript of "A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308) | AWS re:Invent 2013"

  1. The Problem and Promise of Translational Genetics and a Step to the Clouded Solution of Scalable Clinical Whole Genome Sequencing. Jafar Shameem, Amazon Web Services. November 14, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Agenda
     • Bio-Informatics and Amazon Web Services
     • Examples of collaboration
     • Building Blocks
       – Compute
       – Storage
       – Tools
       – Pricing Models
  3. A rich history of collaboration with Life Sciences organizations
  4. AWS Public Data Sets
     • A centralized repository of public datasets
     • Seamless integration with cloud-based applications
     • No charge to the community
     • Some of the datasets available today:
       – 1000 Genomes Project
       – Human Microbiome Project
       – Ensembl
       – GenBank
       – Illumina – Jay Flatley Human Genome Dataset
       – YRI Trio Dataset
       – The Cannabis Sativa Genome
       – UniGene
       – Influenza Virus
       – PubChem
     • Tell us what else you’d like for us to host …
  5. Understanding how human genetics contributes to heart disease and aging. CHARGE Consortium: aimed at better understanding how human genetics contributes to heart disease and aging. DNAnexus, Baylor College of Medicine.
  6. Compute: EC2 instance types, plotted by memory (GiB) against EC2 Compute Units (ECU)
     • Cluster High Mem 8XL: 244 GB, 89 ECU, SSD instance storage
     • High Storage 8XL: 117 GB, 35 ECU, 24 × 2 TB instance store
     • Hi-Mem 4XL: 68.4 GB, 26 ECU, 8 virtual cores
     • Cluster Compute 8XL: 60.5 GB, 88 ECU
     • High I/O 4XL: 60.5 GB, 35 ECU, 2 × 1024 GB SSD-based local instance storage
     • Hi-Mem 2XL: 34.2 GB, 13 ECU, 4 virtual cores
     • Cluster Compute 4XL: 23 GB, 33.5 ECU
     • Cluster GPU 4XL: 22 GB, 33.5 ECU, 2 × NVIDIA Tesla “Fermi” M2050 GPUs
     • Hi-Mem XL: 17.1 GB, 6.5 ECU, 2 virtual cores
     • Extra Large: 15 GB, 8 ECU, 4 virtual cores
     • Large: 7.5 GB, 4 ECU, 2 virtual cores
     • High-CPU XL: 7 GB, 20 ECU, 8 virtual cores
     • Medium: 3.7 GB, 2 ECU, 1 virtual core
     • Small: 1.7 GB, 1 ECU, 1 virtual core
     • High-CPU Med: 1.7 GB, 5 ECU, 2 virtual cores
     • Micro: 613 MB, up to 2 ECUs
  7. Storage
     • Relational Database Service: fully managed database (MySQL, Oracle, MSSQL)
     • SimpleDB: NoSQL, schemaless; smaller datasets
     • DynamoDB: NoSQL, schemaless; provisioned throughput database
     • S3: object datastore, up to 5 TB per object; 99.999999999% durability
     • Redshift: petabyte-scale data warehousing service; fully managed
  8. Tools of the trade
     • GATK, NCBI BLAST, Crossbow, CloudBurst, Myrna, Clovr, BioPerl Max
     • VIPDAC, Superfamily, Cloud-Coffee, BioNimbus, GMOD, CloudAligner, BioConductor
     • QIIME, SNAP, BWA, Bowtie/TopHat/Cufflinks, STAR, GSNAP, RUM
     • MIT StarCluster, Galaxy CloudMan, Rocks, Torque, Slurm, Condor, Chef, Puppet, SaltStack
     • Get links to AMIs at: https://github.com/mndoci/mndoci.github.com/wiki/Life-Science-Apps-on-AWS
  9. Many purchase models to support different needs
     • Free Tier: get started on AWS with free usage and no commitment. For POCs and getting started.
     • On-Demand: pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
     • Reserved: make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
     • Spot: bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads.
     • Dedicated: launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.
  10. How to use Spot?
     • Ideal applications: batch processing; time-delayable; fault-tolerant or restartable; compute-intensive; horizontally scalable; stateless worker nodes; region and AZ independent; uses deployment automation
     • Less ideal applications: interactive; strict/tight SLA for completion; expensive to handle terminations; data-intensive; in-memory scaling; long-running worker nodes (weeks); requires a single AZ; manually launched and managed
  11. Tractable, scalable, and economical processing of clinical whole genome sequences in AWS. Clinical Genomics for Cancer Diagnosis. Amazon Web Services re:Invent 2013, Nov 14th, 2013, Las Vegas, NV. Peter J. Tonellato, PhD, Harvard Medical School; Dennis P. Wall, PhD, Stanford University.
  12. Whole Genome Breast Cancer Program
     • Objective: The objective of the Whole Genome Breast Cancer Program (WGBC) is to demonstrate the clinical utility and value of whole genome analysis for practical breast cancer detection, diagnosis, prognosis and improved outcomes.
     • WGA in Clinical Turn-Around: Demonstrate the use of Amazon Web Services to establish Clinical Whole Genome Analysis in “clinical turn-around”: WG NGS Sequence to Actionable Health Care Information.
     • Clock time: < 3 hours. Cost: < $100.
  13. Whole Genome Breast Cancer Program: 1. Organization and Progress to date; 2. Historical BIDMC Breast Cancer cases; 3. Clinical Whole Genome Analysis – Laboratory Test; 4. COSMOS: Clinical Whole Genome Analysis on AWS
  14. Whole Genome Breast Cancer Program: 1. Organization and Progress to date; 2. Historical BIDMC Breast Cancer cases; 3. Clinical Whole Genome Analysis – Laboratory Test; 4. COSMOS: Clinical Whole Genome Analysis on AWS
  15. N-MDBCTB [organization chart]
     • Members: Surgery (Mike); Genetics (Nadine); Oncology (Gerburg); Pathology (Stu); Radiation Oncology (Abram); Genetic Counseling (Jill); Social Work (Barbara); Bioinformatics (Peter & Dennis)
     • LPM: Sheida, Latrice, Jared, Yassine, Val, Michiyo
     • Program Coordinator: Michiyo; Research Assistant* (Emily Poles?); Technician*
     • Case management: Case Identification; Case review (N-MDBCTB); Consent (RN); Clinical Data Management; Tissue Collection; Sample Management; Follow-up
     • Imaging: Tejas
     • Assay/Bioinformatics
       – Assay: Preparation/Storage; DNA extraction/purification; Sample delivery (to outsource); Whole genome sequencing; OncoScan v3 (BI); DNA/RNA/NGS sequencing (outsource)
       – Bioinformatics: Data Transfer; Genome Data Integration/Management; Annotation; Analysis; Translation; Case Evidence Report
       – Translation: Data Integration/Management; Case report (N-MDBCTB)
     • Oversee: External Advisory Board; Clinical Executive Committee; Cancer Center; Regulatory Affairs
     • * Hiring
  16. BIDMC Breast Cancer Patient/Sample Process [flowchart]
     • Locations: Clinical Lab; BIDMC Clinic (Oncology, Surgery, Radiation Oncology); Clinical Research Pathology Lab; Analysis Lab (LPM)
     • Patient flow: presentation; diagnosis work-up (MMG, US, MRI, biopsy with immunohistochemistry and FISH); N-BCMDTB case selection and case evaluation; consent to care; surgery (or surgery after NAC); adjuvant therapy (chemotherapy, radiation therapy)
     • Research branch: consent to research; blood sample and tissue specimen; storage (FFPE, FF); DNA and RNA extraction
     • Assays: DNA sequencing; exome sequencing; OncoScan v3™ (copy number, somatic mutation)
     • Analysis workflow: gene expression pipeline; OncoScan™ pipeline; SNP Chip pipeline; integrative pipeline
     • Translation workflow: OMR clinical data and follow-up outcome; N-BCMDTB result evaluation; identification of targeted therapy; personalized medicine
     • Legend: patient flow; clinical evaluation; case identification workflow; sample workflow; analysis workflow; translation workflow
     • Abbreviations: X: no further treatment and research; NAC: neoadjuvant chemotherapy; OMR: Online Medical Record; FFPE: formalin-fixed, paraffin-embedded (tissue); FF: fresh frozen (tissue)
  17. IRB Approved Protocol [flowchart]
     • Evaluation: treatment decision. Yes -> undergo surgery; No -> excluded.
     • Decision: case decision. Yes -> eligible; No -> excluded.
     • Consent: getting consent. Yes -> agreed; No -> excluded.
     • Excluded: no surgery; not eligible; disagreed; poor sample
     • Clinical Workup: blood test; breast surgery
     • Tissue Workup: pathology workup; sample collection (extract DNA/RNA from tissue and blood)
     • Assay: DNA genome sequencing; exome sequencing; copy number and somatic mutations analysis using an array platform (OncoScan)
     • Analysis: analysis outcome data; clinical outcome; clinicopathological characteristic
     • Translation: discussion at NBCMDTB; identification of targeted therapy and personalized medicine (traditional treatment, personalized medicine)
  18. Whole Genome Breast Cancer Program: 1. Organization and Progress to date; 2. Historical BIDMC Breast Cancer cases; 3. Clinical Whole Genome Analysis – Laboratory Test; 4. COSMOS: Clinical Whole Genome Analysis on AWS
  19. Breast Cancer Clinical Use of WGA
     1. Family and Individual Risk prediction
     2. Breast Cancer Tumor Characterization
     3. Breast Cancer Diagnosis
     4. Breast Cancer Prognosis
     5. Prediction of response to targeted therapies
     6. Indications of outcome and assessment for future treatment refinement
  20. Breast Cancer Genomic Devices: 35 devices reviewed; 26 used clinically
     • Categories shown: Prognosis; Risk Prediction; Research
     • Devices: 23andMe*, deCODEme*, BRACAnalysis*, Ambry Genetics*, CCDG Panel, OncoScan, TargetPrint, BluePrint**, PAM50*, BreastProfile*, Her2Pro*, MammaPrint, Methyl-Profiler, Rotterdam Signature, MammoStrat, BreastGeneDX, Breast Cancer Array, OncotypeDX*, Breast Cancer Index, OncoMap3**, AsuraSeq-1000**, OncoCarta**, SNaPshot, MapQuant DX, TheraPrint**, NexCourse Bca, Wash U Panel, Target Now
     • * Associated CPT/CMS codes
     • ** Not for clinical use
  21. Clinically Actionable Breast Cancer Information (Data Type: # Unique Entries)
     • Gene: 773
     • SNP: 1733 (52 SNPs for risk prediction, 1681 SNPs for prognosis)
     • Small Insertion: 75
     • Small Deletion: 205
     • Translocation: 3
     • Gene Expression: 383 (drug target commonly based on gene expression profile)
     • Protein Expression: 7
     • Amplification: 64
     • Deletion: 48 (HER2, Estrogen, Progesterone receptor status)
     • Total “Clinically” Actionable: 3291
     • 9 deletions in BRCA1 or BRCA2 detected by BRACAnalysis confer increased breast cancer risk
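The per-type counts on slide 21 can be checked against the stated total with a few lines of arithmetic. The sketch below only restates the numbers from the slide; the dictionary name and keys are illustrative labels, not identifiers from the talk.

```python
# Unique entries per data type, copied from slide 21.
unique_entries = {
    "Gene": 773,
    "SNP": 52 + 1681,  # risk-prediction SNPs + prognosis SNPs = 1733
    "Small Insertion": 75,
    "Small Deletion": 205,
    "Translocation": 3,
    "Gene Expression": 383,
    "Protein Expression": 7,
    "Amplification": 64,
    "Deletion": 48,
}

total = sum(unique_entries.values())
print(total)  # matches the slide's "Total Clinically Actionable" of 3291
```

The nine per-type counts account for the full total, so the separate figure of 9 BRCA1/BRCA2 deletions is not an additional row in the sum.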
  22. Whole Genome Breast Cancer Program: 1. Organization and Progress to date; 2. Historical BIDMC Breast Cancer cases; 3. Clinical Whole Genome Analysis (WGA) – Laboratory Test; 4. COSMOS: Clinical Whole Genome Analysis on AWS
  23. Clinical WGA Workflow: Patients -> Samples -> Next Generation Sequencers -> Bioinformatics Analysis -> Clinical Genomics Interpretation Service -> Clinical Report, Biomedical Report
  24. DNA-Seq: BWA, GATK, Picard (SNP/indel); CNV-seq, ReadDepth, Segseq (CNV). RNA-Seq: Tophat, Cufflinks, BLAST (Gene Exp.). miRNA: miRNAkey, miRBase, miRNA targets. Methyl: Bismark (% Gene Methyl). Downstream: Risk Prediction; pre-clinical and clinical variant annotation; classification (tumor, disease); pathway analysis.
  25. Reduced Cost of Next Generation Sequencing (NGS)
     • NGS platforms: 5,000 Megabases/day
     • Drop of the per-base sequencing cost
     • Data on petabyte scale
     • NGS analysis involves complex workflows
  26. WGA in “Clinical Turn-around” – Future. 12 hours: Sample Collection; 500 hours: Sequencing; < 40 hours, < $100; 3 hours: Analysis; 12 hours: Clinical Action.
  27. Current Costs to Run on Amazon Web Services
     • Details: 1 Whole Genome; 60x; Spot and Reserved Instances; utilizing Amazon Glacier for long-term storage
     • Whole Genome Analysis: approximately 1 day, approximately $1500
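Comparing the current figures on slide 27 (about 1 day, about $1500 per whole genome) with the objective stated on slide 12 (< 3 hours, < $100) gives a sense of the gap the optimization work has to close. The sketch below is plain arithmetic on those four numbers; treating "approximately 1 day" as 24 hours is an assumption, and the variable names are illustrative.

```python
# Figures from the slides; "approximately 1 day" is taken as 24 hours (assumption).
current_hours = 24.0
current_cost_usd = 1500.0
target_hours = 3.0       # objective: < 3 hours
target_cost_usd = 100.0  # objective: < $100

# Minimum factors by which wall time and cost must shrink to meet the objective.
speedup_needed = current_hours / target_hours
cost_reduction_needed = current_cost_usd / target_cost_usd

print(f"wall time must drop by at least {speedup_needed:.0f}x")
print(f"cost must drop by at least {cost_reduction_needed:.0f}x")
```

Under that assumption the analysis needs to run at least 8 times faster and cost at least 15 times less than the current run.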
  28. Whole Genome Breast Cancer Program: 1. Organization and Progress to date; 2. Historical BIDMC Breast Cancer cases; 3. Clinical Whole Genome Analysis – Laboratory Test; 4. COSMOS: Clinical Whole Genome Analysis on AWS (• AWS • Applications • Workflow • COSMOS)
  29. Clinical Whole Genome Analysis Computational Objective: < 3 hours, < $100. Four approaches to optimize and achieve our Clinical Turn-Around Objective: • AWS • Refine and Improve WGA Applications • Create a Standardized, Robust CWGA Workflow • Stabilize a new Workflow and Distributed Computing Platform: COSMOS
  30. Clinical Whole Genome Analysis Computational Objective: < 3 hours, < $100. Four approaches to optimize and achieve our Clinical Turn-Around Objective: • AWS • Refine and Improve WGA Applications • Create a Standardized, Robust CWGA Workflow • Stabilize a new Workflow and Distributed Computing Platform: COSMOS
  31. Dynamic cluster, with the number and type of instances adapted to data sets, jobs, and applications. Components: EC2 instances; AMIs; S3 storage (BAM files). On-demand master(s); load-balanced Spot Instance workers.
  32. EC2 instances. Optimization: correct type and number of EC2s and cluster.
     • Current non-optimized Master: CC2.8xlarge
       – High Memory: single job (BWA) ~ 10 GB RAM
       – High IO: access to common data files
       – Virtualization: HVM for HugePage
     • Current non-optimized Worker: CC2.8xlarge
       – High Memory: single job (BWA) ~ 10 GB RAM
       – High IO: access to common data files
       – Virtualization: HVM for HugePage
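Slide 32 notes that a single BWA job needs about 10 GB of RAM, and slide 6 lists the Cluster Compute 8XL (CC2.8xlarge) at 60.5 GB. A rough sizing sketch follows from those two numbers. It assumes memory is the only constraint and that jobs are identical, which the talk does not claim; the function names and the `reserved_gb` parameter are hypothetical.

```python
import math


def jobs_per_instance(instance_mem_gb: float, job_mem_gb: float,
                      reserved_gb: float = 0.0) -> int:
    """Number of jobs that fit in memory at once (memory-only model, an assumption)."""
    usable = instance_mem_gb - reserved_gb
    if usable <= 0 or job_mem_gb <= 0:
        return 0
    return math.floor(usable / job_mem_gb)


def workers_needed(n_jobs: int, per_instance: int) -> int:
    """Number of worker instances needed to run n_jobs concurrently."""
    if per_instance <= 0:
        raise ValueError("job does not fit on this instance type")
    return math.ceil(n_jobs / per_instance)


# CC2.8xlarge: 60.5 GB (slide 6); BWA job: ~10 GB (slide 32).
bwa_slots = jobs_per_instance(60.5, 10.0)
print(bwa_slots)                       # concurrent BWA jobs per worker
print(workers_needed(100, bwa_slots))  # workers for 100 concurrent BWA jobs
```

In practice CPU, I/O to the shared data files, and Spot availability would also shape the choice of instance type and count, which is presumably why the slide calls the current setup non-optimized.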
  33. AMIs: Create stable CWGA AMI(s). Required applications, libraries and dependencies:
     • Applications (GATK): Samtools, BWA, …
     • Human Reference Genome
     • Annotation Databases
  34. Optimize: AMI
     • Compiler: GCC 4.6+ supports AVX mode; refined GCC parameters
     • Compressed libraries: zlib and snappy
     • Refined JAVA parameters for GATK optimization
     • Memory: HugePage (2M) configured for every node/application
     • Disks: Ephemeral; Cluster Disks: RAID 0, GlusterFS
  35. S3 storage:
     • Storage of BAM files
     • Transfer of BAM and other files
     • “Checkpoint” after each successful workflow stage
     • Backup of intermediate and final results
     • Storage of all timings and job information
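The "checkpoint after each successful workflow stage" idea on slide 35 can be sketched as follows. This is a minimal illustration, not the authors' implementation: a local directory stands in for the S3 bucket, `run_stage` is a hypothetical placeholder for the real work, and the stage names are taken from the three GATK best-practice stages named in the talk.

```python
import json
import tempfile
from pathlib import Path

# Stage names follow the slides: Preparation/Alignment, Variant calling, Annotation.
STAGES = ["preparation_alignment", "variant_calling", "annotation"]


def run_stage(name: str, sample: str) -> dict:
    """Hypothetical placeholder for the real stage; returns timing/job info to store."""
    return {"stage": name, "sample": sample, "status": "ok"}


def run_with_checkpoints(sample: str, store: Path, stages=STAGES) -> list:
    """Run stages in order, skipping any stage whose checkpoint marker already exists.

    `store` is a local directory standing in for an S3 bucket (assumption).
    Returns the names of the stages that were actually executed.
    """
    executed = []
    for name in stages:
        marker = store / sample / f"{name}.done.json"
        if marker.exists():
            continue  # stage finished in an earlier run; resume after it
        info = run_stage(name, sample)
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.write_text(json.dumps(info))  # checkpoint only after success
        executed.append(name)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        print(run_with_checkpoints("exome1", store))  # runs all three stages
        print(run_with_checkpoints("exome1", store))  # nothing left to run
```

Writing the marker only after a stage succeeds is what makes a restart safe when a Spot worker is terminated mid-stage: the interrupted stage has no marker and is simply run again.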
  36. Clinical Whole Genome Analysis Computational Objective: < 3 hours, < $100. Four approaches to optimize and achieve our Clinical Turn-Around Objective: • AWS • Refine and Improve WGA Applications • Create a Standardized, Robust CWGA Workflow • Stabilize a new Workflow and Distributed Computing Platform: COSMOS
  37. DNA-Seq: BWA, GATK, Picard (SNP/indel); CNV-seq, ReadDepth, Segseq (CNV). RNA-Seq: Tophat, Cufflinks, BLAST (Gene Exp.). miRNA: miRNAkey, miRBase, miRNA targets. Methyl: Bismark (% Gene Methyl). Downstream: Risk Prediction; pre-clinical and clinical variant annotation; classification (tumor, disease); pathway analysis.
  38. WGA Applications: Genome Analysis Toolkit (GATK) “best practice”. Stages: Preparation/Alignment; Variant calling; Annotation. Source: GATK best practices, BROAD Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
  39. Applications Parallelization: 5 exomes example. [Chart for 5 exomes across the stages Preparation/Alignment, Variant calling, and Annotation; axis values 0 to 600.]
  40. Alignment: Burrows-Wheeler Aligner
  41. Clinical Whole Genome Analysis Computational Objective: < 3 hours, < $100. Four approaches to optimize and achieve our Clinical Turn-Around Objective: • AWS • Refine and Improve WGA Applications • Create a Standardized, Robust CWGA Workflow • Stabilize a new Workflow and Distributed Computing Platform: COSMOS
  42. DNA-Seq: BWA, GATK, Picard (SNP/indel); CNV-seq, ReadDepth, Segseq (CNV). RNA-Seq: Tophat, Cufflinks, BLAST (Gene Exp.). miRNA: miRNAkey, miRBase, miRNA targets. Methyl: Bismark (% Gene Methyl). Downstream: Risk Prediction; pre-clinical and clinical variant annotation; classification (tumor, disease); pathway analysis.
  43. DNA-Seq: BWA, GATK, Picard (SNP/indel); CNV-seq, ReadDepth, Segseq (CNV). RNA-Seq: Tophat, Cufflinks, BLAST (Gene Exp.). miRNA: miRNAkey, miRBase, miRNA targets. Methyl: Bismark (% Gene Methyl). Downstream: Risk Prediction; pre-clinical and clinical variant annotation; classification (tumor, disease); pathway analysis.
  44. GenomeKey: implements GATK "best practices" for variant calling. Stages: Preparation/Alignment; Variant calling; Annotation. Source: GATK best practices, BROAD Institute, http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
  45. Databases Integrated
  46. Databases Integrated
     • CytoBank, dbSNP135, TFBS, Segmental_Duplications, RepeatMasker, Self Chain, mirBase, TargetScan, SIFT, PolyPhen2, Mutation_Taster, GERP, PhyloP, LRT, Mce46way, Complete_Genomics_69
     • The_1000g_Febuary_all, NHLBI_Exome_Project_euro, NHLBI_Exome_Project_aa, NHLBI_Exome_Project_all, HGMD_INDEL, HGMD_SNP, COSMIC, GWAS_Catalog, ENCODE_DNaseI_Hypersensitivity, ENCODE_Transcription_Factor, UCSC_Gene, Refseq_Gene, Ensembl_Gene, CCDS_Gene, DrugBank
     • Plus support for generic database file formats such as .bed and .gff3
  47. Workflow Optimization
     • Speed:
       – Replacing BWA with SNAP (for the same accuracy)
       – Re-implementing some slow algorithms (e.g. BQSR)
     • Accuracy:
       – Adding quality control steps
       – Replacing Unified Genotyper with Haplotype Caller
  48. Clinical Whole Genome Analysis Computational Objective: < 3 hours, < $100. Four approaches to optimize and achieve our Clinical Turn-Around Objective: • AWS • Refine and Improve WGA Applications • Create a Standardized, Robust CWGA Workflow • Stabilize a new Workflow and Distributed Computing Platform: COSMOS
  49. COSMOS [architecture diagram]
     • COSMOS workflow management system: job splitting; job tracking
     • GenomeKey: Preparation/Alignment; Variant calling; Annotation
     • Gluster FS; MySQL DB; Networking; Web Interface
     • OS & Software: Grid engine
     • AWS: EC2 and S3 (instances, storage)
  50. COSMOS Parallelization. [Chart: Number of Jobs (axis 0 to 1200) by stage (Preparation/Alignment, Variant calling, Annotation) for 1 Exome, 5 Exomes, 10 Exomes, and All Runs.]
  51. COSMOS Job Splitting
  52. COSMOS Job Splitting
  53. COSMOS Job Splitting
  54. Job Dependency Tracking: PREPARATION / ALIGNMENT -> VARIANT CALLING -> ANNOTATION
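Job splitting and dependency tracking across the three stages can be illustrated with a small dependency graph. The sketch below is not COSMOS code and does not use its API: it assumes, purely for illustration, that alignment is split into one job per chunk (for example per chromosome), that variant calling for a sample waits on all of that sample's alignment jobs, and that annotation waits on variant calling. The job-name format and function names are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def build_jobs(samples, chunks):
    """Map each job name to the set of jobs it depends on (illustrative split)."""
    deps = {}
    for sample in samples:
        # Preparation/Alignment: one independent job per chunk.
        align_jobs = [f"align:{sample}:{chunk}" for chunk in chunks]
        for job in align_jobs:
            deps[job] = set()
        # Variant calling waits for every alignment job of the same sample.
        call_job = f"call:{sample}"
        deps[call_job] = set(align_jobs)
        # Annotation waits for variant calling.
        deps[f"annotate:{sample}"] = {call_job}
    return deps


def schedule(deps):
    """Group jobs into waves; jobs within a wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


if __name__ == "__main__":
    for wave in schedule(build_jobs(["exome1", "exome2"], ["chr1", "chr2"])):
        print(wave)
```

Each wave is a set of jobs with no unmet dependencies, so a scheduler could hand an entire wave to the Spot worker pool at once. A real workflow manager would release jobs as soon as their own dependencies finish rather than waiting for a full wave.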
  55. COSMOS Web Interface: PREPARATION / ALIGNMENT; VARIANT CALLING; ANNOTATION
  56. Clinical Whole Genome Analysis Computational Objective: < 3 hours, < $100. Four approaches to optimize and achieve our Clinical Turn-Around Objective: • AWS • Refine and Improve WGA Applications • Create a Standardized, Robust CWGA Workflow • Stabilize a new Workflow and Distributed Computing Platform: COSMOS
  57. Whole Exome Analysis: Pre and Post-Optimization. [Chart: wall time (axis 0 to 30) before and after optimization for 1 exome, 5 exomes, and 10 exomes; cost labels on the bars: ~$90, ~$48, ~$27, ~$47, ~$27, ~$10.]
  58. Whole Exome Analysis. [Chart: wall time (axis 0 to 30) before and after optimization for 1 exome, 5 exomes, and 10 exomes; cost labels on the bars: ~$90, ~$48, ~$27, ~$47, ~$27, ~$10.]
  59. Whole Exome Analysis. [Chart: wall time (axis 0 to 30) before and after optimization for 1 exome, 5 exomes, and 10 exomes; cost labels on the bars: ~$90, ~$48, ~$27, ~$47, ~$27, ~$10.]
  60. Whole Genome Breast Cancer Program
     • Objective: The objective of the Whole Genome Breast Cancer Program (WGBC) is to demonstrate the clinical utility and value of whole genome analysis for practical breast cancer detection, diagnosis, prognosis and improved outcomes.
     • WGA in Clinical Turn-Around: Demonstrate the use of Amazon Web Services to establish Clinical Whole Genome Analysis in “clinical turn-around”: WG NGS Sequence to Actionable Health Care Information.
     • Clock time: < 3 hours. Cost: < $100.
  61. Acknowledgments
     • LPM (Tonellato): Erik Gafni (InVitae), Vince Fusaro (InVitae), Jared B. Hawkins, Ryan Powles, Yassine Souilmi
     • Wall lab (Harvard & Stanford University): Jae-Yoon Jung, Alex Lancaster, David Tulga
     • Autism Speaks: 6000 Exomes (current); 10,000 Genomes
     • Ancient Human Genomes: David Reich
  62. Tractable, scalable, and economical processing of clinical whole genome sequences in AWS. Clinical Genomics for Cancer Diagnosis. Amazon Web Services re:Invent 2013, Nov 14th, 2013, Las Vegas, NV. Peter J. Tonellato, PhD, Harvard Medical School; Dennis P. Wall, PhD, Stanford University.