FGCS Forum · Roma, April 24, 2016 · P. Missier
Scalable Whole-Exome Sequence Data
Processing Using Workflow On A Cloud
Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
FGCS Forum
Roma, April 24, 2016
The challenge
• Port an existing WES/WGS pipeline
• From HPC to a (public) cloud
• While achieving more flexibility and better abstraction
• With better performance than the equivalent HPC deployment
Scripted NGS data processing pipeline
• Alignment: aligns sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning, duplicate elimination: Picard tools
• Recalibration: corrects for system bias on quality scores assigned by the sequencer (GATK)
• Coverage: computes coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks. The haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annotation: Annovar functional annotations (e.g. MAF, synonymity, SNPs…) followed by in-house annotations
The original implementation
# Create the output and scratch directories
echo "Preparing directories $PICARD_OUTDIR and $PICARD_TEMP"
mkdir -p "$PICARD_OUTDIR"
mkdir -p "$PICARD_TEMP"
# The $Picard_* variables hold full command lines, so they stay unquoted
echo "Starting PICARD to clean BAM files..."
$Picard_CleanSam INPUT="$SORTED_BAM_FILE" OUTPUT="$SORTED_BAM_FILE_CLEANED"
echo "Starting PICARD to remove duplicates..."
$Picard_NoDups INPUT="$SORTED_BAM_FILE_CLEANED" OUTPUT="$SORTED_BAM_FILE_NODUPS_NO_RG" \
    METRICS_FILE="$PICARD_LOG" REMOVE_DUPLICATES=true ASSUME_SORTED=true
echo "Adding read group information to bam file..."
$Picard_AddRG INPUT="$SORTED_BAM_FILE_NODUPS_NO_RG" OUTPUT="$SORTED_BAM_FILE_NODUPS" \
    RGID="$READ_GROUP_ID" RGPL=illumina RGSM="$SAMPLE_ID" \
    RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
echo "Indexing bam files..."
samtools index "$SORTED_BAM_FILE_NODUPS"
• Pros
  • simplicity: 50-100 lines of bash code
  • flexibility of the bash language
• Cons
  • embedded dependencies between steps
  • low-level configuration
Problem scale
Data stats per sample:
• 4 files per sample (2 lanes, paired-end reads)
• ≈15 GB of compressed text data (gz)
• ≈40 GB of uncompressed text data (FASTQ)
Usually 30-40 input samples:
• 0.45-0.6 TB of compressed data
• 1.2-1.6 TB uncompressed
Most steps use 8-10 GB of reference data
A small 6-sample run takes about 30 h on the IGM HPC machine (Stage 1+2)
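
The cohort totals on this slide follow directly from the per-sample figures. A minimal bash sketch of that arithmetic, using the 15 GB and 40 GB per-sample sizes quoted above; the helper name `cohort_gb` is made up for illustration:

```shell
#!/usr/bin/env bash
# Per-sample sizes quoted on the slide, in GB.
COMPRESSED_GB=15
UNCOMPRESSED_GB=40

# cohort_gb SAMPLES PER_SAMPLE_GB -> total size in GB (illustrative helper)
cohort_gb() {
    echo $(( $1 * $2 ))
}

# The slide gives 30-40 samples as the usual cohort size.
for n in 30 40; do
    echo "$n samples: $(cohort_gb "$n" "$COMPRESSED_GB") GB compressed," \
         "$(cohort_gb "$n" "$UNCOMPRESSED_GB") GB uncompressed"
done
```

The resulting 450-600 GB and 1200-1600 GB correspond to the 0.45-0.6 TB and 1.2-1.6 TB ranges stated on the slide.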
Scripts to workflow - Design
Design → Cloud Deployment → Execution → Analysis
Theoretical advantages of using a workflow programming model:
• Better abstraction
  • Easier to understand, share, maintain
• Better exploit data parallelism
• Extensible by wrapping new tools
Workflow Design
[Figure: the Picard script from "The original implementation", with its steps marked as either "wrapper" blocks or utility blocks]
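
One way to picture a "wrapper" block is as a function that runs a single tool with explicit input and output paths instead of shared script variables. The sketch below is an assumption about the general idea, not the e-Science Central block API: `run_block`, `clean_block`, the `DRY_RUN` switch and the default value of `Picard_CleanSam` are all hypothetical.

```shell
#!/usr/bin/env bash
# Placeholder command line; the original script sets $Picard_CleanSam elsewhere.
Picard_CleanSam=${Picard_CleanSam:-"java -jar picard.jar CleanSam"}

# run_block NAME CMD...: log the command, then run it unless DRY_RUN=1.
run_block() {
    local name=$1; shift
    echo "[$name] $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Hypothetical wrapper around the CleanSam step: inputs and outputs are
# arguments, so the step no longer depends on variables set by earlier steps.
clean_block() {
    local in_bam=$1 out_bam=$2
    # $Picard_CleanSam is left unquoted on purpose: it holds a command line.
    run_block clean $Picard_CleanSam INPUT="$in_bam" OUTPUT="$out_bam"
}

# Dry run: prints the command that would be executed.
DRY_RUN=1 clean_block sample.sorted.bam sample.cleaned.bam
```

Making the data dependencies explicit in this way is what removes the "embedded dependencies between steps" listed among the cons of the script.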
Workflow design
[Figure: conceptual pipeline vs. actual workflow implementation]
Conceptual:
• Stage 1: raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information
• Stage 2: call variants → recalibrate variants
• Stage 3: filter variants → annotate → annotated variants
Actual:
• 11 workflows
• 101 blocks
• 28 tool blocks
Anatomy of a complex parallel dataflow
e-Science Central: simple dataflow model…
[Figure: the three-stage dataflow (Stage 1: align, clean, recalibrate alignments, calculate coverage; Stage 2: call variants, recalibrate variants; Stage 3: filter variants, annotate), with the Stage 1 and Stage 3 lanes repeated once per sample]
Sample-split: parallel processing of samples in a batch
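
The sample-split idea can be sketched in plain bash with background jobs. This is only a local analogy for what the workflow engine does; `process_sample` and `run_batch` are hypothetical names, and the `echo` stands in for the real per-sample tool chain.

```shell
#!/usr/bin/env bash
# Stand-in for the per-sample Stage 1 chain (align, clean, recalibrate, coverage).
process_sample() {
    local sample=$1
    echo "stage1 done: $sample"
}

# Sample-split: start one background job per sample, then wait for all of them.
run_batch() {
    local sample
    for sample in "$@"; do
        process_sample "$sample" &
    done
    wait
}

# Completion order is not deterministic, so sort the output for display.
run_batch S1 S2 S3 | sort
```

This sketch starts every sample at once; a real deployment would cap concurrency at the number of available engines.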
Anatomy of a complex parallel dataflow
… with hierarchical structure
Cloud Deployment
Design → Cloud Deployment → Execution → Analysis
Scalability
• Exploiting data parallelism
• Fewer installation/deployment requirements, staff hours required
• Automated dependency management, packaging
• Configurable to make most efficient use of a cluster
Parallelism in the pipeline
[Figure: parallelism across the three stages]
• Stage I (align, clean, recalibrate, coverage): per-sample parallel processing, Sample 1 … Sample N
• Chromosome split
• Stage II (variant calling, recalibration): per-chromosome parallel processing, Chr1, Chr2, … ChrM
• Sample split
• Stage III (variant filtering, annotation): per-sample parallel processing, producing annotated variants for each sample
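
The alternation between the two splits can be made concrete by listing the independent tasks each stage produces. A minimal sketch, assuming Stage I and Stage III are split by sample and Stage II by chromosome as the figure labels suggest; the function `plan` and the sample and chromosome names are placeholders.

```shell
#!/usr/bin/env bash
# plan SAMPLES CHROMOSOMES -> one line per independent task, in stage order.
# Both arguments are space-separated lists and are word-split on purpose.
plan() {
    local samples=$1 chroms=$2 s c
    for s in $samples; do echo "stage1 sample=$s"; done   # per-sample
    for c in $chroms;  do echo "stage2 chr=$c";    done   # per-chromosome
    for s in $samples; do echo "stage3 sample=$s"; done   # per-sample
}

plan "S1 S2 S3" "chr1 chr2 chrM"
```

Tasks within a stage are independent of each other, so the available parallelism is bounded by the number of samples in Stages I and III and by the number of chromosomes in Stage II.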
Workflow on Azure Cloud – modular configuration
[Figure: e-Science Central (e-SC) deployment on Azure]
• Clients: web browser and rich client app, connecting through the Web UI and REST API
• e-Science Central main server with JMS queue (Azure VM)
• e-SC db backend (Azure VM)
• e-SC blob store (Azure Blob store)
• Workflow engines (worker roles)
• Traffic between components: workflow invocations, e-SC control data, workflow data
Workflow engines module configuration: 3 nodes, 24 cores
Modular architecture → indefinitely scalable!
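
The queue-plus-engines arrangement can be imitated locally with a fixed-size worker pool. The sketch below is only an analogy and uses neither JMS nor the e-Science Central API: `run_pool` is a hypothetical helper, `xargs -P` plays the role of the engine pool, and the `echo` stands in for executing a workflow invocation.

```shell
#!/usr/bin/env bash
# run_pool ENGINES INVOCATION...: at most ENGINES invocations run at once.
run_pool() {
    local engines=$1; shift
    # Each input line is one queued invocation; xargs hands it to a free worker.
    printf '%s\n' "$@" | xargs -P "$engines" -I {} sh -c 'echo "done: $1"' _ {}
}

# Three workers draining a queue of six invocations.
run_pool 3 inv1 inv2 inv3 inv4 inv5 inv6 | sort
```

Changing the first argument is the local counterpart of adding workflow engine nodes to the module.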
Workflow and sub-workflows execution
[Figure: workflow invocation executing on one engine (fragment), drawn over the Azure architecture diagram. Labels: "Executable Block", "To e-SC queue" (three times)]
Scripts to workflow
Design → Cloud Deployment → Execution → Analysis
3. Execution
• Runtime monitoring
• Provenance collection
Performance
Configurations for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3 × 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD, Ubuntu 14.04
[Figure: response time [hh:mm] (00:00 to 72:00) vs. number of samples (0 to 24). Series: 3 engines (24 cores), 6 engines (48 cores), 12 engines (96 cores)]
Comparison with HPC
[Figure: response time [hours] (0 to 168) vs. number of input samples (0 to 24). Series: HPC (3 compute nodes); Azure (3×D13, SSD), sync; Azure (3×D13, SSD), chained]
[Figure: system throughput [GiB/hr] (0 to 6) vs. size of the sample cohort [GiB] (0 to 400). Same three series]
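
System throughput here is cohort size divided by response time. As a worked illustration, the sketch below applies that ratio to figures from the "Problem scale" slide (6 samples at about 15 GB compressed each, about 30 h on the HPC machine for Stage 1+2). It is not a reading from the plot, which uses GiB and the full experiments; the helper name `throughput` is hypothetical.

```shell
#!/usr/bin/env bash
# throughput SIZE HOURS -> data processed per hour, two decimals.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
throughput() {
    LC_ALL=C awk -v size="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", size / hours }'
}

# 6 samples x 15 GB compressed input, about 30 h (Stage 1+2 only).
throughput $(( 6 * 15 )) 30
```

With these inputs the ratio is 90 GB over 30 h, that is, 3 GB per hour.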
Scalability
There is little incentive to grow the VM pool beyond 6 engines
Cost
Again, a 6 engine configuration achieves near-optimal cost/sample
[Figure: cost per GiB [£] vs. size of the input data [GiB] (0 to 350), and cost per sample [£] vs. number of samples (0 to 24). Series: 3 engines (24 cores), 6 engines (48 cores), 12 engines (96 cores)]
Lessons learnt
Design → Cloud Deployment → Execution → Analysis
✓ Better abstraction
  • Easier to understand, share, maintain
✓ Better exploit data parallelism
✓ Extensible by wrapping new tools
• Scalability
✓ Fewer installation/deployment requirements, staff hours required
✓ Automated dependency management, packaging
✓ Configurable to make most efficient use of a cluster
✓ Runtime monitoring
✓ Provenance collection
✓ Reproducibility
✓ Accountability

More Related Content

What's hot

Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Flink Forward
 

What's hot (20)

Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming Data
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data Streams
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
parallel OLAP
parallel OLAPparallel OLAP
parallel OLAP
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
 
Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...Selective and incremental re-computation in reaction to changes: an exercise ...
Selective and incremental re-computation in reaction to changes: an exercise ...
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
 

Similar to Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud

Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shang
SAIL_QU
 

Similar to Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud (20)

Invited cloud-e-Genome project talk at 2015 NGS Data Congress
Invited cloud-e-Genome project talk at 2015 NGS Data CongressInvited cloud-e-Genome project talk at 2015 NGS Data Congress
Invited cloud-e-Genome project talk at 2015 NGS Data Congress
 
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with Spark
 
Dst
DstDst
Dst
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
Cassandra Day Atlanta 2015: Software Development with Apache Cassandra: A Wal...
Cassandra Day Atlanta 2015: Software Development with Apache Cassandra: A Wal...Cassandra Day Atlanta 2015: Software Development with Apache Cassandra: A Wal...
Cassandra Day Atlanta 2015: Software Development with Apache Cassandra: A Wal...
 
Scientific
Scientific Scientific
Scientific
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation Workbench
 
The Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open SourceThe Next Generation of Data Processing and Open Source
The Next Generation of Data Processing and Open Source
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shang
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)
 

More from Paolo Missier

Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

More from Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud

  • 1. FGCSForum Roma,April24,2016 P..Misiser Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya School of Computing Science and Institute of Genetic Medicine Newcastle University, Newcastle upon Tyne, UK FGCS Forum Roma, April 24, 2016
  • 2. FGCSForum Roma,April24,2016 P..Misiser The challenge • Port an existing WES/WGS pipeline • From HPC to a (public) cloud • While achieving more flexibility and better abstraction • With better performance than the equivalent HPC deployment
  • 3. FGCSForum Roma,April24,2016 P..Misiser Scripted NGS data processing pipeline Recalibration Corrects for system bias on quality scores assigned by sequencer GATK Computes coverage of each read. VCF Subsetting by filtering, eg non-exomic variants Annovar functional annotations (eg MAF, synonimity, SNPs…) followed by in house annotations Aligns sample sequence to HG19 reference genome using BWA aligner Cleaning, duplicate elimination Picard tools Variant calling operates on multiple samples simultaneously Splits samples into chunks. Haplotype caller detects both SNV as well as longer indels Variant recalibration attempts to reduce false positive rate from caller
  • 4. FGCSForum Roma,April24,2016 P..Misiser The original implementation echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP mkdir -p $PICARD_OUTDIR mkdir -p $PICARD_TEMP echo Starting PICARD to clean BAM files... $Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED echo Starting PICARD to remove duplicates... $Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT = $SORTED_BAM_FILE_NODUPS_NO_RG METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true echo Adding read group information to bam file... $Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID RGLB="${SAMPLE_ID}_${READ_GROUP_ID}” RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}” echo Indexing bam files... samtools index $SORTED_BAM_FILE_NODUPS • Pros • simplicity – 50-100 lines of bash code • flexibility of the bash language • Cons • embedded dependencies between steps • low-level configuration
  • 5. FGCSForum Roma,April24,2016 P..Misiser Problem scale Data stats per sample: 4 files per sample (2-lane, pair-end, reads) ≈15 GB of compressed text data (gz) ≈40 GB uncompressed text data (FASTQ) Usually 30-40 input samples 0.45-0.6 TB of compressed data 1.2-1.6 TB uncompressed Most steps use 8-10 GB of reference data Small 6-sample run takes about 30h on the IGM HPC machine (Stage1+2)
  • 6. FGCSForum Roma,April24,2016 P..Misiser Scripts to workflow - Design Design Cloud Deployment Execution Analysis • Better abstraction • Easier to understand, share, maintain • Better exploit data parallelism • Extensible by wrapping new tools Theoretical advantages of using a workflow programming model
  • 7. Workflow design: the Picard script of slide 4, annotated with “wrapper” blocks and utility blocks
  • 8. Workflow design
      • Conceptual (diagram):
          • Stage 1: raw sequences → align → clean → recalibrate alignments → calculate coverage → coverage information
          • Stage 2: call variants → recalibrate variants
          • Stage 3: filter variants → annotate → annotated variants
      • Actual: 11 workflows, 101 blocks, 28 tool blocks
  • 9. Anatomy of a complex parallel dataflow. eScience Central: simple dataflow model… (diagram: the three-stage pipeline of slide 8, replicated per sample.) Sample-split: parallel processing of samples in a batch
  • 10. Anatomy of a complex parallel dataflow … with hierarchical structure
  • 11. Cloud Deployment (phases: Design, Cloud Deployment, Execution, Analysis)
      Scalability:
      • Exploiting data parallelism
      • Fewer installation/deployment requirements, fewer staff hours required
      • Automated dependency management, packaging
      • Configurable to make the most efficient use of a cluster
  • 12. Parallelism in the pipeline (diagram)
      • Stage I (align, clean, recalibrate, coverage): sample split, per-sample parallel processing of Sample 1 … Sample N
      • Stage II (variant calling, recalibration): chromosome split, per-chromosome parallel processing of Chr1, Chr2, …, ChrM
      • Stage III (variant filtering, annotation): annotated variants for Sample 1 … Sample N
  • 13. Workflow on Azure Cloud: modular configuration (architecture diagram)
      • e-Science Central main server (<<Azure VM>>): Web UI, REST API, JMS queue
      • e-SC db backend (<<Azure VM>>)
      • e-SC blob store (Azure Blob store)
      • Workflow engines (<<worker role>>)
      • Clients: web browser, rich client app
      • Flows: workflow invocations, e-SC control data, workflow data
      • Module configuration: 3 nodes, 24 cores
      • Modular architecture → indefinitely scalable!
  • 14. Workflow and sub-workflows execution (diagram: the architecture of slide 13, showing a fragment of a workflow invocation executing on one engine; diagram labels: “Executable Block”, “To e-SC queue”)
  • 15. Scripts to workflow: 3. Execution (phases: Design, Cloud Deployment, Execution, Analysis)
      • Runtime monitoring
      • Provenance collection
  • 16. Performance
      Configurations for the 3-VM experiments:
      • HPC cluster (dedicated nodes): 3 × 8-core compute nodes, Intel Xeon E5640, 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
      • Azure workflow engines: D13 VMs with 8-core CPU, 56 GiB of memory and 400 GB SSD, Ubuntu 14.04
      Plot: response time [hh:mm] against number of samples (0 to 24) for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)
  • 17. Comparison with HPC
      • Plot 1: response time [hours] against number of input samples
      • Plot 2: system throughput [GiB/hr] against size of the sample cohort [GiB]
      • Series in both plots: HPC (3 compute nodes); Azure (3xD13, SSD), sync; Azure (3xD13, SSD), chained
  • 18. Scalability: there is little incentive to grow the VM pool beyond 6 engines
  • 19. Cost: again, a 6-engine configuration achieves near-optimal cost per sample
      Plot: cost per GiB [£] against size of the input data [GiB], and cost per sample [£] against number of samples, for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores)
  • 20. Lessons learnt (phases: Design, Cloud Deployment, Execution, Analysis)
      ✓ Better abstraction
          • Easier to understand, share, maintain
      ✓ Better exploit data parallelism
      ✓ Extensible by wrapping new tools
      • Scalability
          ✓ Fewer installation/deployment requirements, fewer staff hours required
          ✓ Automated dependency management, packaging
          ✓ Configurable to make the most efficient use of a cluster
      ✓ Runtime monitoring
      ✓ Provenance collection
      ✓ Reproducibility
      ✓ Accountability
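
The sample split and chromosome split of slide 12 can be sketched with plain background jobs. This is only an illustration of the fan-out and barrier pattern, not the e-Science Central implementation: `process_sample`, `call_variants` and `annotate_sample` are hypothetical stand-ins that merely log what would run, the chromosome list is an illustrative subset, and Stage 3 is assumed to run per sample, as the per-sample "annotated variants" outputs on the slide suggest.

```shell
#!/bin/sh
# Sketch of the three-stage split from slide 12 (hypothetical helper names).
# Each helper only records its invocation; a real one would run the tools.

LOG=$(mktemp)

process_sample()  { echo "stage1 $1" >> "$LOG"; }  # align, clean, recalibrate, coverage
call_variants()   { echo "stage2 $1" >> "$LOG"; }  # variant calling for one chromosome
annotate_sample() { echo "stage3 $1" >> "$LOG"; }  # filter and annotate one sample

run_pipeline() {
  chromosomes="chr1 chr2 chrM"   # illustrative subset, not the full list

  # Stage 1: sample split, one background job per sample
  for s in "$@"; do process_sample "$s" & done
  wait   # barrier: every sample must finish before variant calling starts

  # Stage 2: chromosome split, one background job per chromosome
  for c in $chromosomes; do call_variants "$c" & done
  wait   # barrier before annotation

  # Stage 3: assumed per-sample again
  for s in "$@"; do annotate_sample "$s" & done
  wait
}

run_pipeline sample1 sample2 sample3
sort "$LOG"
```

A bare `wait` only acts as a barrier and does not report whether a job failed; a real driver would collect each job's exit status before starting the next stage.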

Editor's Notes

  1. Objective 1: implement a cloud-based, secure, scalable computing infrastructure capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals. Objective 2: a front-end tool to facilitate clinical diagnosis. 2-year pilot project, funded by the UK’s National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC). Nov. 2013: cloud resources from an Azure for Research Award, 1 year’s worth of data/network/computing resources.
  2. Current local implementation: a scripted pipeline, which requires expertise to maintain and evolve; deployed on the local department cluster; difficult to scale; cost per patient unknown; unable to take advantage of the decreasing cost of commodity cloud resources. Coverage information translates into confidence in the variant call. Recalibration (quality score recalibration): the machine produces a colour coding for the 4 nucleotide bases, along with a p-value indicating the highest-probability call; these are the Q scores. Different platforms introduce different systematic biases on the Q scores, and the bias also depends on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias.
  3. Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store). These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (specifically, HG19 from UCSC).
  5. Sync design: the subworkflows of each step are executed in parallel but synchronously over a number of samples. This means that the top-level workflow submits N subworkflow invocations for a particular step and waits until all of them complete. The primary advantage of this synchronous design is that the structure of the pipeline is modular and clearly represented by the top-level orchestrating workflow, whilst the parallelisation is managed by e-SC automatically. The top-level workflow mainly includes blocks to run subworkflows, which are independent parts implementing only the actual work done by a particular step. The control blocks take care of the interaction with the system to submit the subworkflows and also suspend the parent invocation until all of them complete.
  6. The current model is synchronous execution.
  7. Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of the compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  8. 3 workflow engines perform better than our HPC benchmark on larger sample sizes
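
The cohort sizes quoted on slide 5 follow directly from the per-sample figures (about 15 GB compressed and 40 GB uncompressed per sample, 30 to 40 samples per run). A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB); the function names `cohort_gb` and `gb_to_tb` are illustrative and not part of the original pipeline:

```shell
#!/bin/sh
# Cohort data volume from the per-sample sizes on slide 5.
# Helper names are hypothetical; decimal units are assumed (1 TB = 1000 GB).

# cohort_gb SAMPLES GB_PER_SAMPLE: total size in GB
cohort_gb() {
  echo $(( $1 * $2 ))
}

# gb_to_tb GB: size in TB, truncated to two decimals
gb_to_tb() {
  printf '%d.%02d\n' $(( $1 / 1000 )) $(( ($1 % 1000) / 10 ))
}

for samples in 30 40; do
  compressed=$(cohort_gb "$samples" 15)     # gz-compressed FASTQ
  uncompressed=$(cohort_gb "$samples" 40)   # plain FASTQ
  echo "$samples samples: $(gb_to_tb "$compressed") TB compressed, $(gb_to_tb "$uncompressed") TB uncompressed"
done
```

For 30 and 40 samples this gives 0.45 to 0.60 TB compressed and 1.20 to 1.60 TB uncompressed, matching the ranges on the slide.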