Professors Wall and Tonellato of Harvard Medical School, in collaboration with Beth Israel Deaconess Medical Center, discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to build a robust processing solution that meets clinical turn-around times, and examine the barriers to producing clinical-grade whole genome results in the cloud and how those barriers were resolved. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.
2. Agenda
• Bio-Informatics and Amazon Web Services
• Examples of collaboration
• Building Blocks
– Compute
– Storage
– Tools
– Pricing Models
3. A rich history of collaboration with Life Sciences organizations
4. AWS Public Data Sets
• A centralized repository of public datasets
• Seamless integration with cloud based applications
• No charge to the community
• Some of the datasets available today:
– 1000 Genomes Project
– Human Microbiome Project
– Ensembl
– GenBank
– Illumina – Jay Flatley Human Genome Dataset
– YRI Trio Dataset
– The Cannabis Sativa Genome
– UniGene
– Influenza Virus
– PubChem
• Tell us what else you’d like for us to host …
5. Understanding how human genetics contributes to heart disease and aging
CHARGE Consortium
- aimed at better understanding how human genetics contributes to heart disease and aging
DNANexus
Baylor College of Medicine
6. Compute
[Chart: EC2 instance types plotted by Memory (GiB) against EC2 Compute Units]
• Cluster High Mem 8XL: 89 EC2 Compute Units, 244 GB SSD instance storage
• High Storage 8XL: 117 GB, 35 EC2 Compute Units, 24 * 2 TB instance store
• Cluster Compute 8XL: 60.5 GB, 88 EC2 Compute Units
• Cluster Compute 4XL: 23 GB, 33.5 EC2 Compute Units
• Cluster GPU 4XL: 22 GB, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi” M2050 GPUs
• High I/O 4XL: 60.5 GB, 35 EC2 Compute Units, 2*1024 GB SSD-based local instance storage
• Hi-Mem 4XL: 68.4 GB, 26 EC2 Compute Units, 8 virtual cores
• Hi-Mem 2XL: 34.2 GB, 13 EC2 Compute Units, 4 virtual cores
• Hi-Mem XL: 17.1 GB, 6.5 EC2 Compute Units, 2 virtual cores
• High-CPU XL: 7 GB, 20 EC2 Compute Units, 8 virtual cores
• High-CPU Med: 1.7 GB, 5 EC2 Compute Units, 2 virtual cores
• Extra Large: 15 GB, 8 EC2 Compute Units, 4 virtual cores
• Large: 7.5 GB, 4 EC2 Compute Units, 2 virtual cores
• Medium: 3.7 GB, 2 EC2 Compute Units, 1 virtual core
• Small: 1.7 GB, 1 EC2 Compute Unit, 1 virtual core
• Micro: 613 MB, Up to 2 ECUs
7. Storage
• Relational Database Service: Fully managed database (MySQL, Oracle, MSSQL)
• SimpleDB: NoSQL, Schemaless, Smaller datasets
• DynamoDB: NoSQL, Schemaless, Provisioned throughput database
• S3: Object datastore up to 5TB per object, 99.999999999% durability
• Redshift: Petabyte scale data warehousing service, Fully managed
8. Tools of the trade
• GATK
• NCBI BLAST
• Crossbow
• CloudBurst
• Myrna
• CloVR
• BioPerl Max
• VIPDAC
• Superfamily
• Cloud-Coffee
• BioNimbus
• GMOD
• CloudAligner
• BioConductor
• QIIME
• SNAP
• BWA
• Bowtie/TopHat/Cufflinks
• STAR, GSNAP, RUM
MIT StarCluster
Galaxy CloudMan
Rocks
Torque
Slurm
Condor
Chef
Puppet
SaltStack
Get links to AMIs at:
https://github.com/mndoci/mndoci.github.com/wiki/Life-Science-Apps-on-AWS
9. Many purchase models to support different needs
• Free Tier: Get Started on AWS with free usage & no commitment. For POCs and getting started
• On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs
• Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization
• Spot: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads
• Dedicated: Launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance related workloads
10. How to use Spot?
Ideal Applications
Batch Processing
Time-Delayable
Fault-Tolerant or Restartable
Compute-Intensive
Horizontally Scalable
Stateless Worker Nodes
Region and AZ Independent
Uses Deployment Automation
Less Ideal Applications
Interactive
Strict/Tight SLA for Completion
Expensive to Handle Terminations
Data-Intensive
In-Memory Scaling
Long-Running Worker Nodes (weeks)
Requires a Single AZ
Manually Launched and Managed
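The "Fault-Tolerant or Restartable" and "Stateless Worker Nodes" properties can be illustrated with a small simulation. This is only a sketch: the task names, the `run_batch` helper and the idea of marking specific attempts as interrupted are assumptions made for illustration, and no AWS API is called.

```python
from collections import deque


def run_batch(tasks, interrupted_attempts):
    """Process independent tasks from a queue, requeueing any that are interrupted.

    `interrupted_attempts` is a set of 1-based attempt numbers at which the
    (hypothetical) Spot worker is reclaimed before the task finishes.
    """
    queue = deque(tasks)
    done = []
    attempt = 0
    while queue:
        task = queue.popleft()
        attempt += 1
        if attempt in interrupted_attempts:
            # The worker held no state of its own, so the task is simply retried later.
            queue.append(task)
            continue
        done.append(task)
    return done, attempt


# Hypothetical per-chromosome tasks; attempts 2 and 3 are lost to terminations.
chromosomes = ["chr1", "chr2", "chr3", "chr4"]
done, attempts = run_batch(chromosomes, interrupted_attempts={2, 3})
print(sorted(done), attempts)  # every task completes, in 6 attempts instead of 4
```

Because each task is independent and restartable, a termination costs only the repeated attempt, which is why this kind of workload sits in the "ideal" column.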
11. Tractable, scalable, and economical processing of clinical whole genome sequences in AWS
Clinical Genomics for Cancer Diagnosis
Amazon Web Services re:Invent 2013
Nov 14th, 2013 Las Vegas, NV
Peter J. Tonellato, PhD
Harvard Medical School
Dennis P. Wall, PhD*
Stanford University*
Stanford University
12. Whole Genome Breast Cancer Program
Objective: The Whole Genome Breast Cancer Program (WGBC) aims to demonstrate the clinical utility and value of whole genome analysis for practical breast cancer detection, diagnosis, prognosis and improved outcomes.
WGA in Clinical Turn-Around
Demonstrate the use of Amazon Web Services to establish Clinical Whole Genome Analysis in “clinical turn-around”:
WG NGS Sequence to Actionable Health Care Information
Clock time: < 3 hours
Cost: < $100
13. Whole Genome Breast Cancer Program
1. Organization and Progress to date
2. Historical BIDMC Breast Cancer cases
3. Clinical Whole Genome Analysis – Laboratory Test
4. COSMOS: Clinical Whole Genome Analysis on AWS
15. N - MDBCTB
• Surgery: Mike
• Genetics: Nadine
• Oncology: Gerburg
• Pathology: Stu
• Radiation Oncology: Abram
• Genetic Counseling: Jill
• Social Work: Barbara
• Imaging: Tejas
• Bioinformatics: Peter & Dennis
LPM
- Sheida
- Latrice
- Jared
- Yassine
- Val
- Michiyo
Program Coordinator – Michiyo
- Research Assistant * (Emily Poles?)
- Technician *
Case management
- Case Identification
- Case review (N-MDBCTB)
- Consent (RN)
- Clinical Data Management
- Tissue Collection
- Sample Management
- Follow-up Assay/Bioinformatics
Assay
- Preparation/Storage
- DNA extraction/purification
- Sample delivery (to outsource)
- Whole genome sequencing
- OncoScan v3 (BI)
- DNA/RNA/NGS sequencing (outsource)
Bioinformatics
- Data Transfer
- Genome Data Integration/Management
- Annotation
- Analysis
- Translation
- Case Evidence Report
Translation
- Data Integration/Management
- Case report (N-MDBCTB)
Oversee
External Advisory Board
Clinical Executive Committee
Cancer Center
Regulatory Affairs
* Hiring
16. BIDMC Breast Cancer Patient/Sample Process
[Flowchart: patient and sample flow across the BIDMC Clinic (Oncology, Surgery, Radiation Oncology), Clinical Lab, Pathology, Research Pathology Lab and Analysis Lab (LPM)]
• Diagnosis Work-up: MMG, US, MRI, Biopsy (Immunohistochemistry, FISH)
• Decision points: N-BCMDTB Case selection, Case Evaluation, Surgery? (Yes/No), Surgery After NAC, Consent to care, Consent to Research
• Consent to Research: Yes -> * and **, No -> * (* Clinical, ** Research)
• Samples: Surgical specimen, Biopsy specimen, Blood sample, Tissue specimen; Storage as FFPE, FF
• Research samples (Blood sample, Tissue sample): DNA, RNA Extraction
- DNA Sequencing
- Exome sequencing
- OncoScan v3™ (Copy number, Somatic Mutation)
• Analysis workflow
- Gene expression pipeline
- OncoScan™ pipeline
- SNP Chip pipeline
- Integrative pipeline
• OMR
- Clinical Data
- Follow up Outcome
• Adjuvant Therapy: Chemotherapy, Radiation therapy
• N-BCMDTB: Result Evaluation, Identification of Targeted Therapy, Personalized Medicine
Legend: Patient flow, Clinical Evaluation, Case identification workflow, Sample workflow, Analysis workflow, Translation Workflow
X: No further treatment and research
NAC: Neoadjuvant chemotherapy
OMR: Online Medical Record
FFPE: Formalin-Fixed, Paraffin-Embedded (tissue)
FF: fresh frozen (tissue)
17. IRB Approved Protocol
Evaluation
- Treatment Decision: Yes -> Undergo surgery
Decision
- Case Decision: Yes -> Eligible
Consent
- Getting consent: Yes -> Agreed
Excluded (No): No surgery, Not eligible, Disagreed, Poor sample
Clinical Workup
- Blood Test
- Breast Surgery
Tissue Workup
- Pathology Workup
- Sample Collection (Extract DNA/RNA from Tissue and Blood)
Assay
- DNA genome sequencing
- Exome sequencing
- Copy number and somatic mutations analysis using an array platform (OncoScan)
Analysis
- Analysis outcome data
- Clinical Outcome
- Clinicopathological Characteristic
Translation
- Discussion at NBCMDTB
- Identification of Targeted Therapy and Personalized Medicine
Traditional Treatment
Personalized Medicine
19. Breast Cancer Clinical Use of WGA
1. Family and Individual Risk prediction
2. Breast Cancer Tumor Characterization
3. Breast Cancer Diagnosis
4. Breast Cancer Prognosis
5. Prediction of response to targeted therapies
6. Indications of outcome and assessment for future treatment refinement
20. Breast Cancer Genomic Devices
35 devices reviewed; 26 used clinically
Prognosis
Risk Prediction
23andMe*
deCODEme*
BRACAnalysis*
Ambry Genetics*
CCDG Panel
OncoScan
TargetPrint
BluePrint**
PAM50*
BreastProfile*
Her2Pro*
MammaPrint
Methyl-Profiler
Rotterdam Signature
MammoStrat
BreastGeneDX
Breast Cancer Array
OncotypeDX*
Breast Cancer Index
Research
OncoMap3**
AsuraSeq-1000**
OncoCarta**
*Associated CPT/CMS codes
**Not for clinical use
SNaPshot
MapQuant DX
TheraPrint**
NexCourse Bca
Wash U Panel
Target Now
21. Clinically Actionable Breast Cancer Information
Data Type: # Unique Entries
• Gene: 773
• SNP: 1733 (52 SNPs for risk prediction, 1681 SNPs for prognosis)
• Small Insertion: 75
• Small Deletion: 205
• Translocation: 3
• Gene Expression: 383 (Drug target commonly based on gene expression profile)
• Protein Expression: 7 (HER2, Estrogen, Progesterone receptor status)
• Amplification: 64
• Deletion: 48 (9 Deletions in BRCA1 or BRCA2 detected by BRACAnalysis confer increased breast cancer risk)
• Total “Clinically” Actionable: 3291
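As a quick consistency check, the per-type counts on this slide can be summed and compared with the stated total. The dictionary below simply transcribes the slide's numbers; the variable names are illustrative.

```python
# Unique entries per data type, as listed on the slide.
entries = {
    "Gene": 773,
    "SNP": 52 + 1681,  # risk prediction + prognosis
    "Small Insertion": 75,
    "Small Deletion": 205,
    "Translocation": 3,
    "Gene Expression": 383,
    "Protein Expression": 7,
    "Amplification": 64,
    "Deletion": 48,
}

total = sum(entries.values())
print(entries["SNP"])  # 1733
print(total)           # 3291, matching the slide's total
```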
22. Whole Genome Breast Cancer Program
1. Organization and Progress to date
2. Historical BIDMC Breast Cancer cases
3. Clinical Whole Genome Analysis (WGA) – Laboratory Test
4. COSMOS: Clinical Whole Genome Analysis on AWS
23. Clinical WGA Workflow
Patients
Samples
Next Generation Sequencers
Bioinformatics Analysis
Clinical Genomics Interpretation Service
Clinical Report
Biomedical Report
25. Reduced Cost of Next Generation Sequencing (NGS)
• NGS platforms: 5,000 Megabases/day
• Drop of the per-base sequencing cost
• Data on petabyte scale
• NGS analysis involves complex workflows
27. Current Costs to Run on Amazon Web Services
Details:
1 Whole Genome, 60x
Spot and Reserved Instances
Utilizing Amazon Glacier for long term storage
Whole Genome Analysis:
Approximately 1 day
Approximately $1500
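Set against the program objective of < 3 hours and < $100, these figures imply the size of the improvement being sought. The arithmetic below is a rough sketch that assumes "approximately 1 day" means 24 hours of wall-clock time.

```python
# Current figures from this slide (1 day taken as 24 hours, an assumption).
current_hours, current_cost = 24.0, 1500.0
# Targets from the program objective: < 3 hours, < $100.
target_hours, target_cost = 3.0, 100.0

speedup_needed = current_hours / target_hours
cost_reduction_needed = current_cost / target_cost

print(speedup_needed)         # 8.0  -> at least an 8x faster turn-around
print(cost_reduction_needed)  # 15.0 -> at least a 15x lower cost per genome
```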
28. Whole Genome Breast Cancer Program
1. Organization and Progress to date
2. Historical BIDMC Breast Cancer cases
3. Clinical Whole Genome Analysis – Laboratory Test
4. COSMOS: Clinical Whole Genome Analysis on AWS
• AWS
• Applications
• Workflow
• COSMOS
29. Clinical Whole Genome Analysis Computational Objective:
< 3 hours < $100
Four approaches to optimize and achieve our Clinical Turn-Around Objective:
• AWS
• Refine and Improve WGA Applications
• Create a Standardized, Robust CWGA Workflow
• Stabilize a new Workflow and Distributed Computing Platform: COSMOS
31. Dynamic cluster, with the number and type of instances adapted to data sets, jobs, and applications.
[Diagram: EC2 instances launched from AMIs, arranged as On-demand Master(s), Load Balanced, and Spot Instance Workers, exchanging BAM files with S3 storage]
32. EC2 instances
Optimization: Correct type and number of EC2s and cluster
Current non-optimized Master: CC2.8xlarge
• High Memory: Single job (BWA) ~ 10GB RAM
• High IO: Access to common data files
• Virtualization: HVM for HugePage
Current non-optimized Worker: CC2.8xlarge
• High Memory: Single job (BWA) ~ 10GB RAM
• High IO: Access to common data files
• Virtualization: HVM for HugePage
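The memory figure gives a simple way to reason about the "correct type and number" of instances. The sketch below assumes memory is the only constraint, uses the 60.5 GB listed for Cluster Compute 8XL on the Compute slide, and treats the 24-job batch and the helper names as hypothetical.

```python
import math

NODE_MEMORY_GB = 60.5  # Cluster Compute 8XL (CC2.8xlarge), from the Compute slide
JOB_MEMORY_GB = 10.0   # single BWA job, approximate


def jobs_per_node(node_gb, job_gb, reserved_gb=0.0):
    """Concurrent jobs that fit in memory, after reserving some for the OS."""
    return math.floor((node_gb - reserved_gb) / job_gb)


def nodes_needed(n_jobs, per_node):
    """Nodes required to run all jobs at once."""
    return math.ceil(n_jobs / per_node)


per_node = jobs_per_node(NODE_MEMORY_GB, JOB_MEMORY_GB)
print(per_node)                    # 6 concurrent BWA jobs per node
print(nodes_needed(24, per_node))  # 4 nodes for a hypothetical batch of 24 jobs
```

In practice CPU, disk and network limits would also bound the number of concurrent jobs, so this is an upper estimate.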
33. Create stable CWGA AMI(s)
Required Applications, libraries and dependencies:
• Applications (GATK): Samtools, BWA, …
• Human Reference Genome
• Annotation Databases
34. Optimize: AMI
Compiler:
• GCC 4.6+ supports AVX mode
• Refined GCC parameters
Compression libraries: zlib and snappy
Refined JAVA parameters for GATK optimization
Memory: HugePage (2M) configured for every node/application
Disks:
• Ephemeral: RAID 0
• Cluster Disks: GlusterFS
35. S3 storage:
• Storage of BAM files
• Transfer of BAM and other files
• “checkpoint” after each successful workflow stage
• Backup of intermediate and final results
• Storage of all timings and job information
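The “checkpoint” idea can be sketched as follows: a marker is written only after a stage succeeds, so a rerun resumes from the first unfinished stage. This is not the COSMOS implementation; a temporary local directory stands in for an S3 bucket, and the stage names and helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path

STAGES = ["align", "recalibrate", "call_variants"]  # hypothetical stage names


def run_workflow(checkpoint_dir, run_stage):
    """Run stages in order, skipping any stage that already has a checkpoint."""
    executed = []
    for stage in STAGES:
        marker = Path(checkpoint_dir) / f"{stage}.done.json"
        if marker.exists():
            continue  # finished in an earlier run, so resume past it
        result = run_stage(stage)
        # The checkpoint is written only after the stage has succeeded.
        marker.write_text(json.dumps({"stage": stage, "result": result}))
        executed.append(stage)
    return executed


def make_stage_runner(fail_on=None):
    """Return a stage function that fails on one named stage (to mimic a lost worker)."""
    def run_stage(stage):
        if stage == fail_on:
            raise RuntimeError(f"worker lost during {stage}")
        return f"{stage} ok"
    return run_stage


with tempfile.TemporaryDirectory() as store:  # local stand-in for S3
    try:
        run_workflow(store, make_stage_runner(fail_on="recalibrate"))
    except RuntimeError:
        pass  # the first attempt dies after "align" was checkpointed
    resumed = run_workflow(store, make_stage_runner())
    print(resumed)  # ['recalibrate', 'call_variants']
```

With Spot workers that can disappear at any time, this pattern limits the loss from a termination to the stage that was in flight.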
47. Workflow Optimization
• Speed:
– Replace BWA with SNAP (for the same accuracy)
– Re-implement some slow algorithms (e.g. BQSR)
• Accuracy:
– Add quality control steps
– Replace Unified Genotyper with Haplotype Caller
57. Whole Exome Analysis
Pre and Post-Optimization
[Bar chart: wall time (axis 0 to 30), before and after optimization, for 1 exome, 5 exomes and 10 exomes. Cost labels on the bars: ~$90, ~$48, ~$47, ~$27, ~$27, ~$10]
61. Acknowledgments
LPM (Tonellato)
Erik Gafni (InVitae)
Vince Fusaro (InVitae)
Jared B. Hawkins
Ryan Powles
Yassine Souilmi
Autism Speaks
6000 Exomes (current)
10,000 Genomes
Wall lab
(Harvard & Stanford University)
Jae-Yoon Jung
Alex Lancaster
David Tulga
Ancient Human Genomes
David Reich