A Comparative Evaluation of Galaxy and Ruffus-Based
Scripting Workflows for a DNA-seq Analysis Pipeline
Supervisor: Prof Alan Christoffels Mentor: Peter van Heusden
Brief Background
In human genome sequencing data analysis, secondary analysis is by far the
most computationally intensive step. This is due to:
• The size of the files that must be manipulated,
• The complexity of determining the optimal alignment of millions of reads to the human
reference genome, and
• The subsequent use of the pipelines for variant calling and genotyping.
Literature Review
This research focuses on a comparative evaluation of different workflow
systems, specifically Galaxy and Ruffus-based scripting. A number of
workflow management systems exist for use in bioinformatics.
Frameworks like Galaxy (Goecks et al., 2010), Arvados (Zaranek, 2013) or
Mobyle (Neron et al., 2009) have been developed mainly for application in the
life sciences (Marc and Uif, 2013). Some workflow systems consume substantial
computing resources and lack data provenance and management; others tend
to be CPU-bound and memory-intensive.
Research Aims
• To identify the optimal workflow system for analysing NGS data, comparing
Galaxy with a collection of scripts.
• To present a community resource that allows users of workflow
systems to decide which workflow to use for NGS data analysis
with the least time consumption.
• To obtain detailed task-level performance metrics and present them graphically.
Research Objectives
• Implementation of the DNA-seq Galaxy workflow and the Ruffus script pipeline.
• Graphical representation of the Galaxy and Ruffus script pipelines.
• Benchmarking the performance of the Galaxy workflow and the Ruffus script pipeline
using one of the Linux benchmarking tools.
• Tabulation of performance metrics (e.g. runtime, memory usage, etc.) for
Galaxy and the collection-of-scripts pipeline.
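As an illustration of the kind of per-command metrics listed above, the following is a minimal sketch in Python using only the standard library on a Unix-like system. It is not the benchmarking tool used in this project; the function name `measure` and the returned keys are assumptions made for this example.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run a command and return wall time, CPU time and peak memory.

    CPU time is the change in RUSAGE_CHILDREN across the call, so this
    assumes no other child process finishes in between.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall,
        "cpu_seconds": (after.ru_utime - before.ru_utime)
        + (after.ru_stime - before.ru_stime),
        # Largest peak resident set size among all children waited for so
        # far (kilobytes on Linux), not a value specific to this command.
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Hypothetical usage: time a trivial Python child process.
    print(measure([sys.executable, "-c", "pass"]))
```

Because `ru_maxrss` is a running maximum over all child processes, a per-stage memory figure would need each stage measured in a fresh process.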
Research Design and Methodology
Researchers at SANBI carried out the assembly and functional annotation of a
relatively large dataset of 400 Mycobacterium tuberculosis genomes using a
reference-based assembly approach. Since it is more challenging to gather all the analysis steps
within a single graphical user interface, we use Galaxy and a Ruffus scripting pipeline to
build an automated workflow for a DNA-seq variant calling pipeline that runs efficiently
on the SANBI HPC cluster.
Data and Requirement Analysis
• This project uses DNA-seq data for the analysis steps of the
comparative evaluation.
• Concrete design implementations, namely the Galaxy and Ruffus scripting-based
workflow systems, were used as the design methodology.
The use case for this experiment is the DNA-seq Analysis Pipeline.
DNA-seq Analysis Requirements Design
[Figure: flowchart of the DNA-seq analysis pipeline, from Start to End. Stages (tools) and their outputs:]
• Inputs: Raw Reads (fastq.gz); Reference Genome (Fasta)
• Trim and Align (Trimmomatic + (NovoAlign, BWA-MEM)): Trimmed QC reads (R1/R2.Fastq.paired.gz); Aligned reads per sample (sorted.bam)
• Mark as duplicates (Picard): Aligned reads, duplicates marked, per sample (sorted.dup.bam)
• Indel realigner (GATK + Picard): Realigned reads per sample (realigned.sorted.bam)
• Fixmate (Picard): Mate-fixed reads per sample (matefixed.sorted.bam)
• Recalibration (GATK): Aligned reads recalibrated per sample (sorted.dup.recal.bam)
• Sample Metrics (Picard)
• SNP and INDEL BCF/FilterBCF (Samtools mpileup + bcftools + vcfutils varfilter): BCF per region per sample; VCF per sample (sample.*.vcf)
• Filter (Perl script): Filtered VCF per sample (Nfiltered.vcf)
• Flag mappability (vcftools): VCF with mappability flag per sample (…mil.vcf)
• SNP ID Annotation (SnpEff): VCF variants annotated with custom annotation (…SNPid.vcf)
• SnpEff Annotation, effect prediction tool (SnpEff): VCF SNPs annotated as high/moderate/low impact (snpid.snpeff.vcf)
• Final Report (HTML)
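The file suffixes in the flowchart suggest how each BAM-producing stage's output could be named per sample. The sketch below is only an illustration: the stage order, the stage keys, and the `<sample>.<suffix>` naming scheme are assumptions, and `expected_outputs` and `first_missing` are hypothetical helpers, not part of the project's code.

```python
# Output suffix of each BAM-producing stage. The order is an assumption
# made for this example; the suffixes are those shown in the flowchart.
BAM_STAGES = [
    ("align", "sorted.bam"),
    ("mark_duplicates", "sorted.dup.bam"),
    ("indel_realign", "realigned.sorted.bam"),
    ("fixmate", "matefixed.sorted.bam"),
    ("recalibrate", "sorted.dup.recal.bam"),
]


def expected_outputs(sample):
    """Map each stage to the file name it is assumed to write for a sample."""
    return {stage: f"{sample}.{suffix}" for stage, suffix in BAM_STAGES}


def first_missing(sample, existing_files):
    """Return the first stage whose output is absent, or None if all exist."""
    for stage, filename in expected_outputs(sample).items():
        if filename not in existing_files:
            return stage
    return None


if __name__ == "__main__":
    # Hypothetical sample name; only the alignment output exists so far.
    print(first_missing("sample1", {"sample1.sorted.bam"}))
```

A check of this kind is one way to decide where an interrupted run could resume.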
On a farm far, far away, in the countryside of Hong
Kong, Ruffus the shy creature was born.
(The banded krait image is from Wikimedia.)
Ruffus is a shy creature, and pretends to be a
cobra or a banded krait by putting up its red tail
and ducking its head in its coils when startled.
But they did not know that Ruffus
had a secret......
Pipeline Workflow Scripting
Ruffus: Pipeline Flowchart
This project uses the Ruffus framework. The pipeline was created using the
object-oriented method and is manipulated directly: this project uses task
objects rather than decorators.
[Figure: module diagram with boxes Program Name, Main, Configuration, Logger, Pipeline, Runner, Stages, State, Util, Error_4_DRMMA, Pipeline_Config, JobScripts]
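To illustrate what building a pipeline from task objects (rather than decorating functions) can look like, here is a toy sketch in plain Python. It deliberately does not use the Ruffus API; `Task`, `ToyPipeline`, `add_task` and the stage names are all hypothetical stand-ins for this example.

```python
class Task:
    """One pipeline step: a function plus the tasks it depends on."""

    def __init__(self, name, func, follows):
        self.name = name
        self.func = func
        self.follows = follows


class ToyPipeline:
    """A pipeline assembled by adding task objects explicitly."""

    def __init__(self, name):
        self.name = name
        self.tasks = {}

    def add_task(self, name, func, follows=()):
        # Parents must already exist (KeyError otherwise), so insertion
        # order is always a valid dependency order.
        task = Task(name, func, [self.tasks[parent] for parent in follows])
        self.tasks[name] = task
        return task

    def run(self, initial):
        """Run every task; root tasks receive the initial input."""
        results = {}
        for task in self.tasks.values():
            inputs = [results[p.name] for p in task.follows] or [initial]
            results[task.name] = task.func(*inputs)
        return results


if __name__ == "__main__":
    pipeline = ToyPipeline("dnaseq")
    pipeline.add_task("trim", lambda reads: reads + ".trimmed")
    pipeline.add_task("align", lambda reads: reads + ".bam", follows=["trim"])
    print(pipeline.run("sample1")["align"])
```

Because the pipeline is an ordinary object, it can be inspected, extended or passed to other modules before it is run, which is the property the object-oriented style is chosen for here.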
Ruffus: Script-Based Implementation Design
Main Script Module
• Builds the pipeline workflow
• Runs or prints the workflow
• Ends the DRMAA session
Configuration Module
Uses configuration options written in YAML, through which users supply input
Logger Module
Provides instant, concurrent logging
Pipeline Module
• Integrates all stages by constructing them
• Builds the pipeline workflow
Runner Module
• Sets up DRMAA in the Python virtual environment
• Runs the pipeline workflow stages (locally or on the cluster)
Stage Module
• Implements the individual stages of this project
• As functions from input files to output files
• Integrated run-stage functions that:
  • Provide the state parameter for the Runner
  • Access the pipeline module, config options, DRMAA and Logger
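The relationship between the stage, state, logger and runner modules can be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: `State`, `run_stage`, `Stages`, the `dry_run` flag and the command template key are hypothetical, the configuration is a plain dictionary instead of a YAML file, and only local execution is shown (no DRMAA submission).

```python
import logging
import shlex
import subprocess
from dataclasses import dataclass, field


@dataclass
class State:
    """Bundle handed to every stage: options plus a shared logger."""

    config: dict
    logger: logging.Logger = field(
        default_factory=lambda: logging.getLogger("pipeline")
    )
    dry_run: bool = True


def run_stage(state, stage_name, command):
    """Log a stage's command and run it locally unless dry_run is set."""
    state.logger.info("stage %s: %s", stage_name, command)
    if state.dry_run:
        return 0
    return subprocess.run(shlex.split(command), check=True).returncode


class Stages:
    """Each method is one stage: a function from input file to output file."""

    def __init__(self, state):
        self.state = state

    def mark_duplicates(self, bam_in, bam_out):
        # The command template comes from the configuration (hypothetical key).
        template = self.state.config["mark_duplicates"]
        command = template.format(input=bam_in, output=bam_out)
        return run_stage(self.state, "mark_duplicates", command)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    state = State(config={"mark_duplicates": "echo {input} {output}"})
    Stages(state).mark_duplicates("s1.sorted.bam", "s1.sorted.dup.bam")
```

Keeping the command templates in the configuration means a stage method only decides which files go in and out, while the runner decides where the command executes.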
Galaxy: DNA-seq Analysis Pipeline Workflow System
• This project customises Galaxy to run jobs on a cluster, specifically through a virtual local
job runner on the cluster.
• Galaxy uses a shared filesystem between its application server and the cluster nodes.
• The Galaxy frontend application runs on a single server as usual; however, tools are
run on cluster nodes instead.
• Galaxy talks to the distributed resource manager (DRM) at the backend through
the Distributed Resource Management Application API (DRMAA) interface.
Galaxy Project: Main Components of the Operating System Architecture Design
For this project, Galaxy was configured and organised using the following virtual
layers: drivers, core components, and high-level tools.
[Figure: layered architecture diagram with boxes Platform (SANBI Physical Hardware), Hypervisor (Virtual Hardware), Physical Hardware Drivers, Drivers, Cores, Administrator tools, Users, OS, Blocks, Memory + Disk Size + CPU, SANBI Infrastructure Management, Physical Tools]
Galaxy: DNA-seq Analysis Workflow Implementation
Results: Pipeline Comparison Pros
The following summarises the merits of using one or the other of the workflows:
• Jobs are submitted via python-drmaa
• DNA-seq pipeline case study
• Uses Sun Grid Engine (SGE) as the runner configuration manager
• Pipelines can be created on the fly
• Multiple tasks can share the same Python function
• Reusable and reproducible common sub-pipelines
Shared sub-pipelines can be created from discrete Python modules and joined
together as needed. Bioinformaticists may have “mapping”, “aligning”,
“variant-calling” sub-pipelines, etc.
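The two ideas above, several tasks sharing one Python function and sub-pipelines joined as needed, can be sketched in a few lines. This is a simplified illustration under stated assumptions: `run_tool` is a stand-in that only renames a file name instead of running a tool, the sub-pipelines live in one file rather than in separate modules, and all names are hypothetical.

```python
from functools import partial


def run_tool(tool, filename):
    """Stand-in for running a tool: records the tool in the file name."""
    return f"{filename}.{tool}"


def mapping_subpipeline():
    # Both tasks share the same Python function with different parameters.
    return [
        ("trim", partial(run_tool, "trimmed")),
        ("align", partial(run_tool, "aligned")),
    ]


def variant_calling_subpipeline():
    return [
        ("call", partial(run_tool, "called")),
        ("filter", partial(run_tool, "filtered")),
    ]


def join(*subpipelines):
    """Concatenate sub-pipelines into one ordered list of steps."""
    steps = []
    for sub in subpipelines:
        steps.extend(sub)
    return steps


def run(steps, filename):
    """Feed each step's output to the next step."""
    for _name, func in steps:
        filename = func(filename)
    return filename


if __name__ == "__main__":
    steps = join(mapping_subpipeline(), variant_calling_subpipeline())
    print(run(steps, "sample1"))
```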
Pipeline Comparison Cons
Ruffus-Script Based | Galaxy Project
Only based on Python | Galaxy allows integration of different programming languages
A task cannot be stopped | A task can be paused and restarted using the History refresh button
Only tested on SGE, using DRMAA for job submission | The Galaxy application server host can only be configured in the DRM as a submit host, regardless of the scheduler engine
Prone to syntax errors: tasks fail at any instance of ambiguity in a pipeline analysis step | The Galaxy application server and worker nodes must run the same version of Python
Ruffus's normal process-based mode allows submission of multiple local and distributed jobs | A shared filesystem and absolute pathnames are among the limitations of the Galaxy workflow system
Further Results
This project characterises the execution profiles of Galaxy
and Ruffus-based scripting. The performance
comparison strategies are yet to be explored; however, the project's
experimental use case has been tested. The result statistics are yet
to be collected and graphed. They will include:
• Analysis construction performance,
• Task start and end execution times,
• Pipeline runtime,
• Disk and CPU usage summaries, and
• Memory usage.
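Once task start and end times are collected, per-task durations and the overall pipeline runtime follow directly. The sketch below shows one way to derive them; the record layout (`task`, `start`, `end` as ISO 8601 strings) is an assumption, and the timestamps in the usage example are made-up illustrative values, not measurements from this project.

```python
from datetime import datetime


def summarise(records):
    """Return (per-task durations in seconds, total pipeline runtime).

    records: list of dicts with 'task', 'start' and 'end' keys, where the
    times are ISO 8601 strings (an assumed layout for this example).
    """
    durations = []
    starts = []
    ends = []
    for record in records:
        start = datetime.fromisoformat(record["start"])
        end = datetime.fromisoformat(record["end"])
        durations.append((record["task"], (end - start).total_seconds()))
        starts.append(start)
        ends.append(end)
    # Pipeline runtime spans the earliest start to the latest end.
    runtime = (max(ends) - min(starts)).total_seconds()
    return durations, runtime


if __name__ == "__main__":
    # Illustrative values only.
    example = [
        {"task": "trim", "start": "2016-01-01T10:00:00", "end": "2016-01-01T10:05:00"},
        {"task": "align", "start": "2016-01-01T10:05:00", "end": "2016-01-01T10:35:00"},
    ]
    for task, seconds in summarise(example)[0]:
        print(f"{task}\t{seconds:.0f} s")
```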
Future Work
• To complete the advanced DNA-seq analysis on Galaxy with the
SANBI custom tool: Novocraft tools
• To deploy the sanbi-ruffus scripting-based pipeline experiment on
the Docker platform
• To produce a graphical representation of the comparative evaluation and
performance metric analysis using Highcharts and the collectl
utility on the cluster.
References
• Terkhorn, Nov 2011. Strengths and weaknesses of sub-workflow interoperability.
• Ekblom, R. & Galindo, J. 2011. Applications of next generation sequencing in molecular ecology of
non-model organisms. Heredity (Edinb.), 107, 1-15.
• Goecks, J., Nekrutenko, A., Taylor, J. & The Galaxy Team. 2010. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational research in the life sciences. Genome
Biol., 11, R86.
• Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., Couger, M. B., Eccles,
D., Li, B., Lieber, M., MacManes, M. D., Ott, M., Orvis, J., Pochet, N., Strozzi, F., Weeks, N., Westerman,
R., William, T., Dewey, C. N., Henschel, R., LeDuc, R. D., Friedman, N. & Regev, A. 2013. De novo
transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation
and analysis. Nat. Protoc., 8, 1494-512.
• Goodstadt, L. 2010. Ruffus: a lightweight Python library for computational pipelines.
Bioinformatics, 26(21), 2778-2779.
• Zaranek, A. W. 2013. Lightning: the first component of the Arvados project to be
(re)written in "Go". Zenodo. 10.5281/zenodo.7133
• Pope, B., Sloggett, C., Philip, G. & Wakefield, M. Complexo a pipeline for calling variants
People Who Contributed to this Project:
• Zahra
• Hocine
• Combat-TB Teams
THANK YOU!

More Related Content

What's hot

Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
balmanme
ย 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Paolo Missier
ย 
Hadoop 3
Hadoop 3Hadoop 3
Hadoop 3
shams03159691010
ย 
Quality Control of NGS Data Solutions
Quality Control of NGS Data  SolutionsQuality Control of NGS Data  Solutions
Quality Control of NGS Data Solutions
Surya Saha
ย 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
DataWorks Summit
ย 
Overview of the Data Processing Error Analysis System (DPEAS)
Overview of the Data Processing Error Analysis System (DPEAS)Overview of the Data Processing Error Analysis System (DPEAS)
Overview of the Data Processing Error Analysis System (DPEAS)
The HDF-EOS Tools and Information Center
ย 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
Christian Frech
ย 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
Bart Vandewoestyne
ย 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
Nicolas Poggi
ย 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
Boris Glavic
ย 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
Zvi Avraham
ย 
Docker poster bsb2015-print
Docker poster bsb2015-printDocker poster bsb2015-print
Docker poster bsb2015-print
Genomika Diagnรณsticos
ย 
LDV: Light-weight Database Virtualization
LDV: Light-weight Database VirtualizationLDV: Light-weight Database Virtualization
LDV: Light-weight Database Virtualization
Tanu Malik
ย 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
Skills Matter Talks
ย 
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Rafael Ferreira da Silva
ย 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
Rafael Ferreira da Silva
ย 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
Kyong-Ha Lee
ย 
LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96Boards
Linaro
ย 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
IRJET Journal
ย 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
Crai Macdonald
ย 

What's hot (20)

Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
ย 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
ย 
Hadoop 3
Hadoop 3Hadoop 3
Hadoop 3
ย 
Quality Control of NGS Data Solutions
Quality Control of NGS Data  SolutionsQuality Control of NGS Data  Solutions
Quality Control of NGS Data Solutions
ย 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
ย 
Overview of the Data Processing Error Analysis System (DPEAS)
Overview of the Data Processing Error Analysis System (DPEAS)Overview of the Data Processing Error Analysis System (DPEAS)
Overview of the Data Processing Error Analysis System (DPEAS)
ย 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
ย 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
ย 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
ย 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
ย 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
ย 
Docker poster bsb2015-print
Docker poster bsb2015-printDocker poster bsb2015-print
Docker poster bsb2015-print
ย 
LDV: Light-weight Database Virtualization
LDV: Light-weight Database VirtualizationLDV: Light-weight Database Virtualization
LDV: Light-weight Database Virtualization
ย 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
ย 
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
Performance Analysis of an I/O-Intensive Workflow executing on Google Cloud a...
ย 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
ย 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
ย 
LAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96BoardsLAS16-305: Smart City Big Data Visualization on 96Boards
LAS16-305: Smart City Big Data Visualization on 96Boards
ย 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
ย 
Declarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrierDeclarative Experimentation in Information Retrieval using PyTerrier
Declarative Experimentation in Information Retrieval using PyTerrier
ย 

Similar to 3rd presentation

Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
Rusif Eyvazli
ย 
Runtime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsRuntime Behavior of JavaScript Programs
Runtime Behavior of JavaScript Programs
IRJET Journal
ย 
eResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software developmenteResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software development
Andrea Wiggins
ย 
2017 nov reflow sbtb
2017 nov reflow sbtb2017 nov reflow sbtb
2017 nov reflow sbtb
mariuseriksen4
ย 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
Carole Goble
ย 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
Yahoo Developer Network
ย 
Pegasus-Poster-2016-final-v2
Pegasus-Poster-2016-final-v2Pegasus-Poster-2016-final-v2
Pegasus-Poster-2016-final-v2
Samrat Jha
ย 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...
jaliyae
ย 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
ย 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
ijceronline
ย 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
ijceronline
ย 
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Yahoo Developer Network
ย 
A Collaborative Research Proposal To The NSF Research Accelerator For Multip...
A Collaborative Research Proposal To The NSF  Research Accelerator For Multip...A Collaborative Research Proposal To The NSF  Research Accelerator For Multip...
A Collaborative Research Proposal To The NSF Research Accelerator For Multip...
Scott Donald
ย 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
Daniel S. Katz
ย 
Raminder kaur presentation_two
Raminder kaur presentation_twoRaminder kaur presentation_two
Raminder kaur presentation_two
ramikaurraminder
ย 
IEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and AbstractIEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and Abstract
tsysglobalsolutions
ย 
Spark
SparkSpark
Spark
newmooxx
ย 
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013
anpawlik
ย 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
Ian Foster
ย 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
Ian Foster
ย 

Similar to 3rd presentation (20)

Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
ย 
Runtime Behavior of JavaScript Programs
Runtime Behavior of JavaScript ProgramsRuntime Behavior of JavaScript Programs
Runtime Behavior of JavaScript Programs
ย 
eResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software developmenteResearch workflows for studying free and open source software development
eResearch workflows for studying free and open source software development
ย 
2017 nov reflow sbtb
2017 nov reflow sbtb2017 nov reflow sbtb
2017 nov reflow sbtb
ย 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
ย 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
ย 
Pegasus-Poster-2016-final-v2
Pegasus-Poster-2016-final-v2Pegasus-Poster-2016-final-v2
Pegasus-Poster-2016-final-v2
ย 
Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...Architecture and Performance of Runtime Environments for Data Intensive Scala...
Architecture and Performance of Runtime Environments for Data Intensive Scala...
ย 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
ย 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
ย 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
ย 
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...
ย 
A Collaborative Research Proposal To The NSF Research Accelerator For Multip...
A Collaborative Research Proposal To The NSF  Research Accelerator For Multip...A Collaborative Research Proposal To The NSF  Research Accelerator For Multip...
A Collaborative Research Proposal To The NSF Research Accelerator For Multip...
ย 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
ย 
Raminder kaur presentation_two
Raminder kaur presentation_twoRaminder kaur presentation_two
Raminder kaur presentation_two
ย 
IEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and AbstractIEEE Parallel and distributed system 2016 Title and Abstract
IEEE Parallel and distributed system 2016 Title and Abstract
ย 
Spark
SparkSpark
Spark
ย 
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013
ย 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
ย 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
ย 

Recently uploaded

skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)
Mohammad Al-Dhahabi
ย 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
MJDuyan
ย 
Educational Technology in the Health Sciences
Educational Technology in the Health SciencesEducational Technology in the Health Sciences
Educational Technology in the Health Sciences
Iris Thiele Isip-Tan
ย 
Geography as a Discipline Chapter 1 __ Class 11 Geography NCERT _ Class Notes...
Geography as a Discipline Chapter 1 __ Class 11 Geography NCERT _ Class Notes...Geography as a Discipline Chapter 1 __ Class 11 Geography NCERT _ Class Notes...
Geography as a Discipline Chapter 1 __ Class 11 Geography NCERT _ Class Notes...
ImMuslim
ย 
Haunted Houses by H W Longfellow for class 10
Haunted Houses by H W Longfellow for class 10Haunted Houses by H W Longfellow for class 10
Haunted Houses by H W Longfellow for class 10
nitinpv4ai
ย 
How to Predict Vendor Bill Product in Odoo 17
How to Predict Vendor Bill Product in Odoo 17How to Predict Vendor Bill Product in Odoo 17
How to Predict Vendor Bill Product in Odoo 17
Celine George
ย 
Bossa Nโ€™ Roll Records by Ismael Vazquez.
Bossa Nโ€™ Roll Records by Ismael Vazquez.Bossa Nโ€™ Roll Records by Ismael Vazquez.
Bossa Nโ€™ Roll Records by Ismael Vazquez.
IsmaelVazquez38
ย 
Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...
PsychoTech Services
ย 
Nutrition Inc FY 2024, 4 - Hour Training
Nutrition Inc FY 2024, 4 - Hour TrainingNutrition Inc FY 2024, 4 - Hour Training
Nutrition Inc FY 2024, 4 - Hour Training
melliereed
ย 
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptxRESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
zuzanka
ย 
Pharmaceutics Pharmaceuticals best of brub
Pharmaceutics Pharmaceuticals best of brubPharmaceutics Pharmaceuticals best of brub
Pharmaceutics Pharmaceuticals best of brub
danielkiash986
ย 
A Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two HeartsA Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two Hearts
Steve Thomason
ย 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Henry Hollis
ย 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
RidwanHassanYusuf
ย 
MDP on air pollution of class 8 year 2024-2025
MDP on air pollution of class 8 year 2024-2025MDP on air pollution of class 8 year 2024-2025
MDP on air pollution of class 8 year 2024-2025
khuleseema60
ย 
Oliver Asks for More by Charles Dickens (9)
Oliver Asks for More by Charles Dickens (9)Oliver Asks for More by Charles Dickens (9)
Oliver Asks for More by Charles Dickens (9)
nitinpv4ai
ย 
CIS 4200-02 Group 1 Final Project Report (1).pdf
CIS 4200-02 Group 1 Final Project Report (1).pdfCIS 4200-02 Group 1 Final Project Report (1).pdf
CIS 4200-02 Group 1 Final Project Report (1).pdf
blueshagoo1
ย 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
ย 
Juneteenth Freedom Day 2024 David Douglas School District
Juneteenth Freedom Day 2024 David Douglas School DistrictJuneteenth Freedom Day 2024 David Douglas School District
Juneteenth Freedom Day 2024 David Douglas School District
David Douglas School District
ย 
HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.
deepaannamalai16
ย 

3rd presentation

  • 1. A Comparative Evaluation Among Galaxy And Ruffus-based Scripting Workflows For DNA-seq Analysis Pipeline workflow. Supervisor: Prof Alan Christoffels. Mentor: Peter van Heusden.
  • 2. Brief Background: In human genome sequencing data analysis, secondary analysis is by far the most computationally intensive step. This is due to:
      - the size of the files that must be manipulated,
      - the complexity of determining the optimal alignment of millions of reads to the human reference genome, and
      - the subsequent use of pipelines for variant calling and genotyping.
  • 3. Literature Review: This research focuses on a comparative evaluation of workflow systems, specifically Galaxy and Ruffus-based scripting. A number of workflow management systems exist for use in bioinformatics. Frameworks like Galaxy (Goecks et al., 2010), Arvados (Zaranek, 2013) or Mobyle (Neron et al., 2009) have been developed mainly for application in the life sciences (Marc and Uif, 2013). Some workflow systems consume substantial computing resources and lack data provenance and management. Others tend to be CPU-bound and memory-intensive.
  • 4. Research Aims
      - To identify the optimal workflow system for analysing NGS data, comparing Galaxy with a collection of scripts.
      - To present a community resource that helps users of workflow systems decide which workflow to use for NGS data analysis with less time consumption.
      - To obtain detailed task-level performance metrics and present them graphically.
  • 5. Research Objectives
      - Implement the DNA-seq Galaxy workflow and the Ruffus-script pipeline.
      - Represent the Galaxy and Ruffus-script pipelines graphically.
      - Benchmark the performance of the Galaxy workflow and the Ruffus script using one of the Linux benchmarking tools.
      - Tabulate performance metrics (e.g. runtime, memory usage) for Galaxy and the collection-of-scripts pipeline.
  • 6. Research Design and Methodology: Researchers at SANBI carried out the assembly and functional annotation of a relatively large dataset of 400 Mycobacterium tuberculosis genomes using a reference-based assembly approach. Since it is challenging to gather all the analysis steps within a single graphical user interface, we use Galaxy and a Ruffus-scripting pipeline to build automated workflows for a DNA-seq variant calling pipeline that runs efficiently on the SANBI HPC clusters.
  • 7. Data and Requirement Analysis
      - This project requires and uses DNA-seq data for the analysis steps that carry out the comparative evaluation.
      - Concrete implementations, namely the Galaxy and Ruffus-scripting based workflow systems, were used as the design methodology.
      - Use case for this experiment: the DNA-seq analysis pipeline.
  • 8. DNA-seq Analysis Requirements Design (flowchart labels, from Start to End)
      Processing steps (tool):
      - Trim and Align (Trimmomatic + NovoAlign or BWA-MEM)
      - Indel realigner (GATK + Picard)
      - Sample Metrics (Picard)
      - Fixmate (Picard)
      - Mark as duplicates (Picard)
      - Recalibration (GATK)
      - SNP and INDEL BCF/FilterBCF (Samtools mpileup + bcftools + vcfutils varfilter)
      - Filter (Perl script)
      - Flag mappability (vcftools)
      - SNP ID annotation (SnpEff)
      - SnpEff annotation: effect prediction tool (SnpEff)
      - Final Report (HTML)
      Data files:
      - Reference Genome (Fasta)
      - Raw Reads (fastq.gz)
      - Trimmed QC reads (R1/R2.Fastq.paired.gz)
      - Aligned reads per sample (sorted.bam)
      - Aligned reads, duplicates marked, per sample (sorted.dup.bam)
      - Realigned reads per sample (realigned.sorted.bam)
      - Mate-fixed reads per sample (matefixed.sorted.bam)
      - Aligned reads recalibrated per sample (sorted.dup.recal.bam)
      - BCF per region per sample
      - VCF per sample (sample.*.vcf)
      - Filtered VCF per sample (Nfiltered.vcf)
      - VCF with mappability flag per sample (…mil.vcf)
      - VCF variants annotated with custom annotation (…SNPid.vcf)
      - VCF SNPs annotated as high/moderate/low impact (snpid.snpeff.vcf)
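As an illustration of how such a flowchart can be turned into an execution order, the sketch below encodes the stages as a dependency graph and sorts them with Python's standard `graphlib` module. The stage names and the strictly linear ordering are assumptions for the example; the slide text does not spell out the arrows, and this is not the project's code.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# ASSUMPTION: a linear chain in the usual DNA-seq order; the real
# flowchart may branch (e.g. sample metrics) or order steps differently.
STAGE_DEPENDENCIES = {
    "trim": set(),
    "align": {"trim"},
    "mark_duplicates": {"align"},
    "indel_realign": {"mark_duplicates"},
    "fixmate": {"indel_realign"},
    "recalibrate": {"fixmate"},
    "call_variants": {"recalibrate"},
    "filter_variants": {"call_variants"},
    "flag_mappability": {"filter_variants"},
    "annotate_snp_ids": {"flag_mappability"},
    "predict_effects": {"annotate_snp_ids"},
    "report": {"predict_effects"},
}


def execution_order(dependencies):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(STAGE_DEPENDENCIES)
for position, stage in enumerate(order, start=1):
    print(position, stage)
```

A workflow system such as Ruffus or Galaxy performs this kind of ordering internally; writing the graph down explicitly mainly helps when comparing the two implementations stage by stage.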
  • 9. On a farm far, far away, in the countryside of Hong Kong, Ruffus the shy creature was born. The banded krait image is from Wikimedia.
  • 10. Ruffus is a shy creature, and pretends to be a cobra or a banded krait by putting up its red tail and ducking its head in its coils when startled.
  • 11. But they did not know that Ruffus had a secret... Pipeline Workflow Scripting
  • 12. Ruffus: Pipeline Flowchart. This project uses the Ruffus framework. The pipeline was created using the object-oriented method and manipulated directly: the project uses task objects instead of decorators.
  • 13. Ruffus: Script-Based Implementation Design (diagram labels): Program Name, Pipeline, Runner, State, Util, Main, Configuration, Logger, Stages, Error_4_DRMMAPipeline_Config, JobScripts
  • 14. Main Script Module
      - Builds the pipeline workflow
      - Runs or prints the workflow
      - Ends the DRMAA session
  • 15. Configuration Module: Uses configuration options written in YAML, through which users supply input.
  • 16. Logger Module: Provides an instant, concurrent logging facility.
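One common way to get concurrent logging in Python is to route records from every worker through a queue to a single listener. The sketch below shows that pattern with the standard library only. It is an assumption about how such a logger could look, not the project's Logger module: threads stand in for pipeline stages, and `ListHandler`, `make_logger` and `run_stage` are hypothetical names.

```python
import logging
import logging.handlers
import queue
import threading


class ListHandler(logging.Handler):
    """Hypothetical sink that keeps messages in memory (a file handler would be typical)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def make_logger(name, log_queue):
    """Create a logger whose records are only pushed onto the shared queue."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger


log_queue = queue.Queue()
sink = ListHandler()
# A single listener thread drains the queue, so workers never block on output.
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()
logger = make_logger("pipeline_sketch", log_queue)


def run_stage(stage_name):
    # Stand-in for a real pipeline stage.
    logger.info("finished %s", stage_name)


threads = [threading.Thread(target=run_stage, args=(s,))
           for s in ("trim", "align", "fixmate")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
listener.stop()  # processes everything still queued, then stops the listener
```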
  • 17. Pipeline Module
      - Integrates all stages by constructing them
      - Builds the pipeline workflow from the constructed stages
  • 18. Runner Module
      - Sets up DRMAA for the Python virtual environment
      - Runs the pipeline workflow stages (locally or on the cluster)
  • 19. Stage Module
      - Implements the individual stages of this project as functions from input files to output files
      - Integrates run-stage functions that:
          - provide the state parameter for the Runner
          - give access to the pipeline module, config options, DRMAA and the Logger
  • 20. Galaxy: DNA-seq Analysis Pipeline Workflow System
      - This project customises Galaxy to run jobs on a cluster, specifically through a virtual local job runner on the cluster.
      - Galaxy uses a shared filesystem between its application server and the cluster nodes.
      - The Galaxy front-end application runs on a single server as usual, but tools are run on the cluster nodes.
      - Galaxy talks to a distributed resource manager (DRM) at the back end through the Distributed Resource Management Application API (DRMAA) interface.
  • 21. Galaxy Project: Main Components of the Operating System Architecture Design. For this project, Galaxy was configured and organised using the following virtual layers: drivers, core components, and high-level tools. Diagram labels: Users; Administrator tools; OS Blocks; Cores; Drivers; Hypervisor (Virtual Hardware); Platform (SANBI Physical Hardware); Physical Hardware; Memory + Disk Size + CPU; SANBI Infrastructure Management; Physical Tools.
  • 22. Galaxy: DNA-seq Analysis Workflow Implementation
  • 23. Results: Pipeline Comparison Pros. The following summarises the merits of the workflows:
      - Jobs are submitted via python-drmaa
      - DNA-seq pipeline case study
      - Uses Sun Grid Engine (SGE) as the runner configuration manager
      - Pipelines can be created on the fly
      - Multiple tasks can share the same Python function
      - Reusable and reproducible common sub-pipelines: shared sub-pipelines can be created from discrete Python modules and joined together as needed. Bioinformaticians may have “mapping”, “aligning”, “variant-calling” sub-pipelines, etc.
  • 24. Pipeline Comparison Cons (Ruffus-script based vs. Galaxy project)
      - Ruffus: based only on Python. Galaxy: allows integration of different programming languages.
      - Ruffus: a task cannot be stopped. Galaxy: a task can be paused and restarted using the History refresh button.
      - Ruffus: only tested on SGE, using DRMAA for job submission. Galaxy: the application server host can only be configured in the DRM as a submit host, regardless of the scheduler engine.
      - Ruffus: prone to syntax errors, where tasks blow up at any instance of ambiguity in a pipeline analysis step. Galaxy: the application server and worker nodes must run the same version of Python.
      - Ruffus: normal process-based execution allows multiple local and distributed job submissions. Galaxy: the shared filesystem and absolute pathnames are among the limitations of the Galaxy workflow system.
  • 25. Further Results: This project characterises the execution profile of Galaxy and Ruffus-based scripting. The performance comparison strategies are yet to be explored; however, the experimental use case has been tested. The result statistics are yet to be collected and graphed. Results will include:
      - analysis construction performance,
      - task start and end execution times,
      - pipeline runtime,
      - disk and CPU usage summaries, and
      - memory usage.
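The per-task figures listed above (start and end time, runtime, CPU and memory usage) could be gathered along the following lines. This is only a sketch using the Python standard library on Linux or another Unix system, not the project's collection method; `profile_command` and the demo command are hypothetical.

```python
import resource
import subprocess
import sys
import time


def profile_command(name, argv):
    """Run argv as a child process and return wall-clock and resource figures."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.time()
    subprocess.run(argv, check=True)  # raises if the command fails
    end = time.time()
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "task": name,
        "start": start,
        "end": end,
        "runtime_s": end - start,
        # CPU time (user + system) consumed by child processes during this call
        "cpu_s": (after.ru_utime - before.ru_utime)
                 + (after.ru_stime - before.ru_stime),
        # Peak resident set size of the largest child waited for so far,
        # not only this one (KiB on Linux)
        "max_rss": after.ru_maxrss,
    }


# Demo with a trivial command standing in for a real pipeline stage.
metrics = profile_command("demo_task", [sys.executable, "-c", "sum(range(100000))"])
print(metrics)
```

Collecting one such record per stage gives a table that can be compared across the two workflow systems, although jobs dispatched to cluster nodes would need the scheduler's own accounting or a tool such as collectl instead.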
  • 26. Future Work
      - To complete the advanced DNA-seq analysis on Galaxy with the SANBI custom tool: Novocraft tools.
      - To deploy the sanbi-ruffus scripting-based pipeline experiment on the Docker platform.
      - To produce a graphical representation of the comparative evaluation and performance metric analysis using Highcharts and the collectl utility on the cluster.
  • 27. References
      - Terkhorn, Nov 2011. Strength and weaknesses of sub-workflow interoperability.
      - Ekblom, R. & Galindo, J. 2011. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity (Edinb.), 107, 1-15.
      - Goecks, J., Nekrutenko, A., Taylor, J. & The Galaxy Team 2010. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol., 11, R86.
      - Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., Couger, M. B., Eccles, D., Li, B., Lieber, M., Macmanes, M. D., Ott, M., Orvis, J., Pochet, N., Strozzi, F., Weeks, N., Westerman, R., William, T., Dewey, C. N., Henschel, R., Leduc, R. D., Friedman, N. & Regev, A. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc., 8, 1494-512.
      - Goodstadt, L. 2010. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics, 26(21), 2778-2779.
      - Zaranek, A. W. 2013. Lightning: the first component of the Arvados project to be (re)written in "Go". Zenodo. 10.5281/zenodo.7133
      - Pope, B., Sloggett, C., Philip, G. & Wakefield, M. Complexo a pipeline for calling variants
  • 28. Contributors to this Project: Zahra, Hocine, Combat-TB Teams. THANK YOU!