Scalable WES Processing And Variant Interpretation
With Provenance Recording
Using Workflow On The Cloud
Paolo Missier, Jacek Cała, Yaobo Xu,
Eldarina Wijaya, Ryan Kirby
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
NGS Data Congress
London, June 15th, 2015
The Cloud-e-Genome project at Newcastle
Objectives:
1. NGS data processing:
• Implement a flexible WES/WGS pipeline
• Scalable deployment over a public cloud
2. Traceable variant interpretation:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes
With an aim to:
• Cost control
• Scalability
• Flexibility of design and of maintenance
• Ensure accountability through traceability
• Enable analytics over past patient cases
• 2-year pilot project: 2013–2015
• Funded by the UK’s National Institute for Health Research (NIHR)
• Cloud resources from an Azure for Research Award
Part I: data processing
Objectives:
• Design and implement a flexible WES/WGS pipeline
• Using workflow technology → high-level programming
• Providing scalable deployment over a public cloud
Scripted NGS data processing pipeline
Pipeline steps (figure callouts):
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner.
• Cleaning and duplicate elimination: Picard tools.
• Recalibration: corrects for systematic bias in the quality scores assigned by the sequencer.
• Coverage: GATK computes the coverage of each read.
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks. The haplotype caller detects both SNVs and longer indels.
• Variant recalibration: attempts to reduce the false-positive rate from the caller.
• VCF subsetting: by filtering, e.g. non-exomic variants.
• Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs…), followed by in-house annotations.
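For orientation, here is a minimal command-line sketch of these steps, assuming a 2015-era toolchain (BWA, samtools, Picard, GATK 3.x, ANNOVAR); file names, resource paths and options are illustrative, not the project's actual script:

#!/bin/bash
# Illustrative sketch of the scripted pipeline steps; paths and options are assumptions.
set -e
REF=hg19.fa                      # HG19 reference genome
SAMPLE=sample1

# 1. Alignment to HG19 with BWA; sort to BAM with samtools
bwa mem -R "@RG\tID:rg1\tSM:${SAMPLE}\tPL:illumina" $REF \
  ${SAMPLE}_R1.fastq.gz ${SAMPLE}_R2.fastq.gz | samtools sort -o ${SAMPLE}.sorted.bam -

# 2. Cleaning and duplicate elimination with Picard
java -jar picard.jar MarkDuplicates INPUT=${SAMPLE}.sorted.bam \
  OUTPUT=${SAMPLE}.dedup.bam METRICS_FILE=${SAMPLE}.dup_metrics.txt REMOVE_DUPLICATES=true
samtools index ${SAMPLE}.dedup.bam

# 3. Base quality score recalibration (GATK 3.x)
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R $REF \
  -I ${SAMPLE}.dedup.bam -knownSites dbsnp.vcf -o ${SAMPLE}.recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R $REF \
  -I ${SAMPLE}.dedup.bam -BQSR ${SAMPLE}.recal.table -o ${SAMPLE}.recal.bam

# 4. Coverage
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R $REF \
  -I ${SAMPLE}.recal.bam -o ${SAMPLE}.coverage

# 5. Variant calling (HaplotypeCaller: SNVs and longer indels)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF \
  -I ${SAMPLE}.recal.bam -o ${SAMPLE}.raw.vcf

# 6. Variant recalibration and VCF subsetting would follow, then
#    functional annotation with ANNOVAR, e.g.:
table_annovar.pl ${SAMPLE}.raw.vcf humandb/ -buildver hg19 -out ${SAMPLE} \
  -protocol refGene -operation g -vcfinput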
Scripts to workflow - Design
Design → Cloud Deployment → Execution → Analysis
• Better abstraction
• Easier to understand, share,
maintain
• Better exploit data parallelism
• Extensible by wrapping new tools
Theoretical advantages of using a workflow programming model
Workflow Design
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP
echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED
echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
  METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true
echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
  RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS
“Wrapper” blocks
Utility blocks
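To illustrate how a script step becomes a reusable block, here is a rough shell-level analogue of a wrapper block, with explicit input/output files so that all data movement between blocks is visible. This is a sketch of the idea only, not e-Science Central's block API, and the function name is made up:

# Sketch of the wrapper-block idea (not the actual e-SC block API).
# Each wrapper exposes explicit input/output files, mirroring the
# dataflow connections drawn between blocks in the workflow.
picard_mark_duplicates() {                 # hypothetical wrapper
  local in_bam=$1 out_bam=$2 metrics=$3
  java -jar picard.jar MarkDuplicates \
    INPUT="$in_bam" OUTPUT="$out_bam" METRICS_FILE="$metrics" \
    REMOVE_DUPLICATES=true ASSUME_SORTED=true
}

# Blocks are then chained by passing one block's output file
# to the next block's input:
picard_mark_duplicates sample1.sorted.clean.bam sample1.nodups.bam sample1.dup_metrics.txt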
Workflow design
[Pipeline dataflow diagram]
Conceptual: raw sequences → align → clean → recalibrate alignments → calculate coverage (→ coverage information) → call variants → recalibrate variants → filter variants → annotate → annotated variants.
Actual: the same dataflow split into three stages:
• Stage 1: align, clean, recalibrate alignments, calculate coverage (per sample)
• Stage 2: call variants, recalibrate variants
• Stage 3: filter variants, annotate → annotated variants
Anatomy of a complex parallel dataflow
eScience Central: simple dataflow model…
[Three-stage pipeline dataflow diagram, as above]
Sample-split: parallel processing of samples in a batch
Anatomy of a complex parallel dataflow
… with hierarchical structure
Phase II, top level
[Three-stage pipeline dataflow diagram, as above]
Chromosome-split: parallel processing of each chromosome across all samples
Phase III
[Three-stage pipeline dataflow diagram, as above]
Sample-split: parallel processing of samples
Implicit parallelism in the pipeline
• Stage I: per-sample parallel processing. Each sample (1 … n) runs align-clean-recalibrate-coverage independently.
• Stage II: variant calling and recalibration, parallelised per chromosome (chromosome split).
• Stage III: variant filtering and annotation, again parallel per sample.
How does the workflow design exploit this parallelism?
[Three-stage pipeline dataflow diagram, as above]
Parallel processing over a batch of exomes
[Hierarchical workflow diagram: each stage is implemented by sub-workflows invoked in parallel]
• Stage 1 (per sample): align lane (one sub-workflow per lane) → align sample → clean sample → recalibrate sample → coverage per sample, starting from the raw sequences.
• Stage 2 (variant calling with chromosome split): the haplotype caller runs per chromosome across all samples, followed by variant recalibration.
• Stage 3 (per sample): filter sample → annotate sample → annotated variants.
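To make the execution pattern concrete, the sketch below reproduces the same map-style parallelism with plain shell job control. It is only an analogue for illustration: in the actual system, e-Science Central control blocks submit the sub-workflow invocations and synchronise them automatically, and the stage1_sample / stage2_chromosome / stage3_sample functions here are placeholder stubs standing in for those sub-workflows.

#!/bin/bash
# Illustrative analogue of the pipeline's parallelism; not the e-SC implementation.
# Placeholder stubs for the per-sample / per-chromosome sub-workflows:
stage1_sample()     { echo "align, clean, recalibrate, coverage for $1"; }
stage2_chromosome() { echo "call and recalibrate variants on chromosome $1"; }
stage3_sample()     { echo "filter and annotate variants for $1"; }

SAMPLES=(sample1 sample2 sample3)          # placeholder batch of exomes
CHROMOSOMES="$(seq 1 22) X Y"              # chromosome split used in Stage 2

# Stage I: one parallel job per sample
for s in "${SAMPLES[@]}"; do stage1_sample "$s" & done
wait                                       # synchronise before variant calling

# Stage II: one parallel job per chromosome, across all samples
for c in $CHROMOSOMES; do stage2_chromosome "$c" & done
wait

# Stage III: one parallel job per sample again
for s in "${SAMPLES[@]}"; do stage3_sample "$s" & done
wait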
Cloud Deployment
Design → Cloud Deployment → Execution → Analysis
• Scalability
• Fewer installation/deployment requirements and fewer staff hours required
• Automated dependency management and packaging
• Configurable to make the most efficient use of a cluster
Workflow on Azure Cloud – modular configuration
[Deployment diagram: e-Science Central on Azure]
• <<Azure VM>>: Azure Blob store and e-SC database backend
• <<Azure VM>>: e-Science Central main server, with JMS queue, REST API and Web UI, accessed from a web browser or a rich client app
• <<worker role>> ×3: workflow engines, sharing the e-SC blob store
• Data flows: workflow invocations, e-SC control data, workflow data
• Workflow engines module configuration: 3 nodes, 24 cores
Modular architecture → indefinitely scalable!
Scripts to workflow
Design → Cloud Deployment → Execution → Analysis
3. Execution:
• Runtime monitoring
• Provenance collection
Performance
3 workflow engines perform better than our HPC benchmark on larger sample sizes
Technical configurations for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3 × 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPUs, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and 400 GB SSD, running Ubuntu 14.04
Scalability
There is little incentive to grow the VM pool beyond 6 engines
Cost
[Chart: cost in GBP (0–18) vs. number of samples (0–24) for three configurations: 3 engines (24 cores), 6 engines (48 cores), 12 engines (96 cores)]
Again, a 6 engine configuration achieves near-optimal cost/sample
Lessons learnt
Design → Cloud Deployment → Execution → Analysis
Design:
✓ Better abstraction
• Easier to understand, share, maintain
✓ Better exploit data parallelism
✓ Extensible by wrapping new tools
Cloud deployment:
• Scalability
✓ Fewer installation/deployment requirements, staff hours required
✓ Automated dependency management, packaging
✓ Configurable to make most efficient use of a cluster
Execution:
✓ Runtime monitoring
✓ Provenance collection
Analysis:
✓ Reproducibility
✓ Accountability
Part II: SVI – Simple, traceable variant interpretation
Objectives:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes
• Ensure accountability through traceability
• Enable analytics over past patient cases
[SVI process diagram]
Variant filtering (applied to annotated patient variants from the NGS pipeline):
• MAF threshold
• Non-synonymous, stop gain, frameshift
• Known polymorphisms
• Homo-/heterozygous
• Pathogenicity predictors
Variant scoping (genes in scope):
• User-supplied disease keywords → HPO match → HPO to OMIM → OMIM match → OMIM to gene
• Combined with a user-supplied genes list and user-defined preferred genes via gene union / gene intersect
• Select variants in scope from the candidate variants
Variant classification (ClinVar and OMIM lookup on the variants in scope):
• RED: found, pathogenic
• GREEN: found, benign
• AMBER: not found, or uncertain
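The traffic-light rule can be sketched as a simple decision over the ClinVar lookup result. This is an illustration of the rule shown on the slide, not SVI's implementation; the significance labels are assumed examples:

# Sketch of the RED/AMBER/GREEN classification rule (illustrative only).
# Input: the ClinVar clinical-significance label for an in-scope variant;
# an empty string means the variant was not found in ClinVar.
classify_variant() {
  local significance=$1
  case "$significance" in
    *[Pp]athogenic*)   echo RED   ;;   # found, pathogenic
    *[Bb]enign*)       echo GREEN ;;   # found, benign
    ""|*[Uu]ncertain*) echo AMBER ;;   # not found, or uncertain significance
    *)                 echo AMBER ;;   # anything else is treated as uncertain
  esac
}

classify_variant "Pathogenic"               # -> RED
classify_variant "Likely benign"            # -> GREEN
classify_variant ""                         # -> AMBER (not found in ClinVar)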
A database of patient cases and investigations
Cases:
Investigations
Provenance of variant identification
• A provenance graph is generated for each investigation
• It accounts for the filtering process for each variant listed in the result
• Enables analytics over provenance graphs across many investigations, e.g. “which variants were identified independently on different cases, and how do they correlate with phenotypes?”
Summary
What we are delivering to NIHR:
1. WES/WGS data processing to annotated variants
• Scalable, cloud-based
• High level
• Low cost / sample
2. Variant interpretation:
• Simple
• Targeted at clinicians
• Built-in accountability of genetic diagnosis
• Analytics over a database of past investigations


Editor's Notes

  • #3 Objective 1: Implement a cloud-based, secure, scalable computing infrastructure that is capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health-care professionals. Objective 2: a front-end tool to facilitate clinical diagnosis. Two-year pilot project, funded by the UK’s National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC). Nov. 2013: cloud resources from an Azure for Research Award, one year’s worth of data/network/computing resources.
  • #6 Current local implementation: a scripted pipeline that requires expertise to maintain and evolve; deployed on the local department cluster; difficult to scale; cost per patient unknown; unable to take advantage of the decreasing cost of commodity cloud resources. Coverage information translates into confidence in the variant call. Recalibration: quality-score recalibration. The machine produces colour coding for the four bases, along with a p-value indicating the highest-probability call; these are the Q scores. Different platforms give different systematic bias on the Q scores, and the bias also depends on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias.
  • #9 Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store). These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely HG19 from UCSC).
  • #11 Loops were used in stages (1) and (3) to iterate over the samples that the pipeline was configured to process. Control blocks can start a number of sub-workflow invocations, one for each element on their input list. Using these two features, we were able to implement a pattern similar to “map” (in the functional sense), where the initial block generates a list of data samples to process, and then for each element in the list the following block starts a sub-workflow (the loop body).
  • #18 Sync design: the sub-workflows of each step are executed in parallel but synchronously over a number of samples. It means that the top-level workflow submits N sub-workflow invocations for a particular step and waits until they all complete. The primary advantage of the discussed synchronous design is that the structure of the pipeline is modular and clearly represented by the top-level orchestrating workflow, whilst the parallelisation is managed by e-SC automatically. The top-level workflow mainly includes blocks to run sub-workflows, which are independent parts implementing only the actual work done by a particular step. The control blocks take care of the interaction with the system to submit the sub-workflows and also suspend the parent invocation until all of them complete.
  • #24 The current model is synchronous execution.
  • #26 Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  • #33 A quick overview of the entered phenotype. Shows how many genes found in OMIM match genes found in the patient’s variants. The graph shows a quick summary of any results produced from ClinVar. The phenotypes section in the bottom right shows results from HPO. The report section on the left shows a collection of all the investigations created for the current case (including the one just created).