Another Cloud-e-Genome dissemination opportunity: porting an existing WES/WGS pipeline from HPC to a (public) cloud, while achieving more flexibility and better abstraction, and with better performance than the equivalent HPC deployment.
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
1. Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
FGCS Forum
Roma, April 24, 2016
2. The challenge
• Port an existing WES/WGS pipeline
• From HPC to a (public) cloud
• While achieving more flexibility and better abstraction
• With better performance than the equivalent HPC deployment
3. Scripted NGS data processing pipeline
[Pipeline diagram; the stages, in order, are:]
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner.
• Cleaning and duplicate elimination: Picard tools.
• Recalibration (GATK): corrects for systematic bias on the quality scores assigned by the sequencer.
• Coverage: computes the coverage of each read.
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels.
• Variant recalibration: attempts to reduce the false-positive rate from the caller.
• VCF subsetting by filtering, e.g. removing non-exomic variants.
• Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs), followed by in-house annotations.
(A rough command sketch of these stages follows.)
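A minimal bash sketch of the stage sequence. Tool versions, file names, reference/database paths, and options are illustrative assumptions, not the pipeline's exact commands (GATK v3-era syntax shown).

# Align sample reads to the HG19 reference with BWA, then sort
bwa mem -t 8 hg19.fa sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam
samtools sort -o sample.sorted.bam sample.sam

# Clean and de-duplicate with Picard tools
java -jar picard.jar CleanSam INPUT=sample.sorted.bam OUTPUT=sample.clean.bam
java -jar picard.jar MarkDuplicates INPUT=sample.clean.bam OUTPUT=sample.nodups.bam \
  METRICS_FILE=dedup.metrics REMOVE_DUPLICATES=true

# Call variants with the GATK haplotype caller (quality-score
# recalibration and coverage steps omitted for brevity)
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa \
  -I sample.nodups.bam -o sample.vcf

# Annotate with Annovar (the protocol shown is a minimal example)
table_annovar.pl sample.vcf humandb/ -buildver hg19 -vcfinput \
  -protocol refGene -operation g -out sample.annotated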
4. The original implementation
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP
echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED
echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
  METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true
echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
  RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS
• Pros
  • simplicity: 50-100 lines of bash code
  • flexibility of the bash language
• Cons
  • embedded dependencies between steps
  • low-level configuration
5. Problem scale
Data stats per sample:
• 4 files per sample (2-lane, pair-end reads)
• ≈15 GB of compressed text data (gz)
• ≈40 GB of uncompressed text data (FASTQ)
Usually 30-40 input samples:
• 0.45-0.6 TB of compressed data
• 1.2-1.6 TB uncompressed
Most steps use 8-10 GB of reference data.
A small 6-sample run takes about 30 h on the IGM HPC machine (Stage 1+2).
(The arithmetic is sketched below.)
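The whole-run figures follow directly from the per-sample figures; a back-of-envelope check in bash, with the sample count and sizes taken from the slide:

# Data volume for a full 40-sample run
SAMPLES=40
GB_COMPRESSED_PER_SAMPLE=15      # gzipped FASTQ, 4 files per sample
GB_UNCOMPRESSED_PER_SAMPLE=40    # decompressed FASTQ
echo "compressed:   $(( SAMPLES * GB_COMPRESSED_PER_SAMPLE )) GB"    # 600 GB ≈ 0.6 TB
echo "uncompressed: $(( SAMPLES * GB_UNCOMPRESSED_PER_SAMPLE )) GB"  # 1600 GB ≈ 1.6 TB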
6. Scripts to workflow - Design
Design → Cloud Deployment → Execution → Analysis
Theoretical advantages of using a workflow programming model (sketched below):
• Better abstraction: easier to understand, share, and maintain
• Better exploitation of data parallelism
• Extensible by wrapping new tools
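As a rough analogy in bash (hypothetical; e-Science Central blocks are not shell functions), a wrapper block hides a tool behind explicit inputs and outputs, which is what makes steps easy to understand, recombine, and extend:

# A "wrapper block" analogue: the dataflow is explicit in the signature,
# with no hidden state shared between steps
picard_clean_block() {
  local input_bam=$1 output_bam=$2
  java -jar picard.jar CleanSam INPUT="$input_bam" OUTPUT="$output_bam"
}

# Blocks compose by connecting one block's output to the next block's input
picard_clean_block sample.sorted.bam sample.clean.bam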
7. Workflow Design
The original script, re-drawn as a workflow: each tool invocation becomes a "wrapper" block (e.g. Picard-CleanSAM, Picard-MarkDuplicates) connected to its neighbours by explicit dataflow links, complemented by utility blocks that import and export files to and from the shared data space.
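A hypothetical sketch of what a utility block does, using the modern Azure CLI (which postdates the talk; container and blob names are invented, and storage account credentials are assumed to be configured):

# Import: fetch an input file from the shared Azure blob store to the
# engine's local filesystem
az storage blob download --container-name wes-data --name sample.bam --file sample.bam

# Export: publish a result back to the shared store for downstream steps
az storage blob upload --container-name wes-data --name sample.vcf --file sample.vcf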
13. Workflow on Azure Cloud – modular configuration
[Architecture diagram: an e-Science Central main server on an Azure VM exposes a Web UI and a REST API (used from a web browser or a rich client app) plus a JMS queue, and is backed by an e-SC db backend and an Azure Blob store (the e-SC blob store). Three worker-role workflow engines take workflow invocations from the queue; e-SC control data and workflow data flow between the server, the engines, and the blob store. Workflow engines module configuration: 3 nodes, 24 cores.]
Modular architecture: indefinitely scalable!
14. Workflow and sub-workflows execution
[Diagram: fragment of a workflow invocation executing on one engine. Executable blocks in the top-level workflow submit sub-workflow invocations to the e-SC queue, which are picked up by the worker-role workflow engines of the same architecture as above (main server, e-SC db, blob store).]
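The synchronous fan-out/fan-in pattern this implements (see the editor's notes below) can be pictured in bash; run_subworkflow is a hypothetical stand-in for submitting a sub-workflow invocation:

# Fan-out: submit one sub-workflow invocation per sample, in parallel
for sample in inputs/*.fastq.gz; do
  run_subworkflow "$sample" &
done
# Fan-in: the parent suspends until every sub-workflow completes,
# then the top-level workflow moves on to the next step
wait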
19. Cost
Again, a 6-engine configuration achieves near-optimal cost per sample.
[Charts: cost per GiB (£) against size of the input data (GiB, 0-350), and cost per sample (£) against number of samples (0-24), for three configurations: 3 engines (24 cores), 6 engines (48 cores), 12 engines (96 cores).]
20. Lessons learnt
Design → Cloud Deployment → Execution → Analysis
• Design: better abstraction (easier to understand, share, maintain); better exploitation of data parallelism; extensible by wrapping new tools; scalability
• Cloud deployment: fewer installation/deployment requirements and staff hours required; automated dependency management and packaging
• Execution: configurable to make the most efficient use of a cluster; runtime monitoring
• Analysis: provenance collection; reproducibility; accountability
Editor's Notes
Objective 1: implement a cloud-based, secure, scalable computing infrastructure that is capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals.
Objective 2: a front-end tool to facilitate clinical diagnosis.
2-year pilot project
Funded by the UK’s National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC)
Nov. 2013: Cloud resources from Azure for Research Award
1 year’s worth of data/network/computing resources
Current local implementation:
- Scripted pipeline requires expertise to maintain and evolve
- Deployed on the local department cluster
- Difficult to scale
- Cost per patient unknown
- Unable to take advantage of the decreasing cost of commodity cloud resources
Coverage information translates into confidence on the variant call.
Recalibration: quality score recalibration. The machine produces a colour coding for the four nucleotide bases, along with a p-value indicating the highest-probability call; these are the Q scores. Different platforms give different systematic biases on the Q scores, and the bias also depends on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias.
Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store).
These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely HG19 from UCSC).
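The install-once-and-cache behaviour can be pictured with a small bash sketch (the paths and download URL are hypothetical; the real shared library bundled UCSC HG19 inside e-SC):

# Shared-library analogue: fetch the reference genome once, reuse it
# from the local cache on every later workflow run
LIB_DIR="$HOME/.esc-libs/hg19-ucsc"
if [ ! -d "$LIB_DIR" ]; then
  mkdir -p "$LIB_DIR"
  curl -L http://example.org/hg19.tar.gz | tar -xz -C "$LIB_DIR"
fi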
Sync design: the subworkflows of each step are executed in parallel but synchronously over a number of samples. This means that the top-level workflow submits N subworkflow invocations for a particular step and waits until all of them complete.
The primary advantage of the discussed synchronous design is that the structure of the pipeline is modular and clearly represented by the top-level orchestrating workflow, whilst the parallelisation is managed by e-SC automatically. The top-level workflow mainly includes blocks to run subworkflows, which are independent parts implementing only the actual work done by a particular step. The control blocks take care of the interaction with the system to submit the subworkflows and also suspend the parent invocation until all of them complete.
The current execution model is synchronous.
Each sample included 2-lane, pair-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
3 workflow engines perform better than our HPC benchmark on larger sample sizes