A MapReduce based Workflow Management System
Lukas Forer and Sebastian Schönherr
Division of Genetic Epidemiology
Medical University of Innsbruck, Austria
UPPNEX Workshop - January 2015
• Next Generation Sequencing (NGS)
– Sequencing the whole genome at low cost
– Gigabytes of produced data per experiment
– Allows data production at high scale
• Data generation is no longer the bottleneck
• Data processing is the current bottleneck
– Single Workstation not sufficient
– Super-Computers too expensive
• Commodity computing
– Parallel computing on a large number of low-budget machines
– Parallelization framework
– Enables analyzing large data
– User writes map/reduce function
– Framework takes care of fault tolerance, data distribution and load balancing
– Apache Hadoop: Open Source implementation
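As a toy illustration of the programming model described above (not Hadoop's actual Java API; every name in the sketch is made up for illustration), counting k-mers in sequencing reads looks like this:

```python
from collections import defaultdict

# Toy MapReduce run: count k-mers across a set of reads.
# map_fn and reduce_fn are the only user-written parts; a real
# framework (e.g. Hadoop) would shard, shuffle and retry them.

def map_fn(read, k=3):
    """Emit (k-mer, 1) pairs for one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def reduce_fn(key, values):
    """Sum the counts collected for one k-mer."""
    return key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)            # the "shuffle" phase
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["GATTAC", "ATTACA"], map_fn, reduce_fn)
print(counts["ATT"])  # "ATT" occurs once in each read -> 2
```

Because `map_fn` and `reduce_fn` are stateless, the framework is free to run many copies of each in parallel, which is exactly why the model scales.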
MapReduce in Bioinformatics (1)
• Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
• Processing NGS data with Apache Pig; presenting UDFs for frequent tasks; uses Hadoop-BAM
• BioPig: processing NGS data with Apache Pig; presenting UDFs
• MapReduce suite for sequence alignments / manipulation of aligned records; written in …
• Based on RMAP (seed-and-extend algorithm); Map: extracting k-mers of the reference and non-overlapping k-mers of the reads (as keys); Reduce: end-to-end alignments of seeds
• Based on BWA (version 0.5.9); Map: alignment using BWA (on a previously created internal file format); Reduce: remove duplicates (optional)
• Based on Bowtie / SOAPsnp; Map: executing Bowtie on chunks; Reduce: SNP calling using SOAPsnp
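The seed-and-extend mapping used by the RMAP-based tool above can be sketched in a few lines; this shows only the seeding step (the reduce phase would extend each candidate into a full alignment), and all function names are illustrative:

```python
def overlapping_kmers(seq, k):
    """All k-mers of the reference with their offsets (map phase, reference side)."""
    return [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]

def nonoverlapping_kmers(seq, k):
    """Non-overlapping k-mers of a read with their offsets (map phase, read side)."""
    return [(seq[i:i + k], i) for i in range(0, len(seq) - k + 1, k)]

# Shared k-mers act as MapReduce keys; each match proposes a
# candidate placement of the read on the reference.
ref_seeds = overlapping_kmers("ACGTACGTGG", 4)
read_seeds = nonoverlapping_kmers("ACGTGG", 4)
candidates = [g_off - r_off
              for kmer, r_off in read_seeds
              for g, g_off in ref_seeds if g == kmer]
print(candidates)  # the read's seed "ACGT" matches reference offsets 0 and 4
```

Grouping by k-mer is what makes this a natural MapReduce job: the shuffle phase brings together every read seed and reference position sharing the same key.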
RNA Analysis
• MyRNA: pipeline for calculating differential gene expression in RNA; including Bowtie
• FX: RNA-Seq analysis tool
• Eoulsan: RNA-Seq analysis tool
• MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
MapReduce in Bioinformatics (2)
• Bioinformatics MapReduce Applications
– Available only on a per-tool basis
– Cover one aspect of a larger data analysis pipeline
– Hard to use for scientists without a background in computer science
• Popular workflow systems
– Enable this level of abstraction for the traditional tools
– Do not support tools based on MapReduce
Missing: a system which enables building MapReduce workflows
• System to execute MapReduce programs graphically
and combine them to workflows
• One platform – many programs
– Integration of existing MapReduce programs without
source code adaptations
– Create workflows using MapReduce, Apache Pig, R or
Unix command-line programs
• Runs in your browser
MapReduce Workflow Manager
Bioinformatics Workflows
• Requires a compatible cluster to execute workflows
– Small/medium-sized research institutes can hardly afford their own clusters
– Cloud computing: rent computer hardware from different
providers (e.g. Amazon, HP)
MapReduce Workflow Manager
Bioinformatics Workflows
• New MapReduce algorithms can be integrated easily
• Integration of existing MapReduce algorithms without
adaptations in source code
• Cloudgene uses its own workflow language
• Workflow Definition Language (WDL)
– Formal description of tasks and workflow steps
– Property-based and uses the YAML syntax
– Supports heterogeneous software components
(MapReduce, R and unix command-line programs)
– Basic workflow control patterns (loops and conditions)
• Example of a simple WDL manifest file
– Inputs are set by the user through the web interface
– Outputs are created by the workflow steps
• The user interface is created automatically from the manifest
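A minimal sketch of what such a manifest could look like; the exact keys and their spelling vary across Cloudgene versions, so take the names below as illustrative rather than authoritative:

```yaml
name: Lower Case Reads
version: 1.0.0
workflow:
  inputs:
    - id: input
      description: Input Reads (FASTQ)
      type: hdfs-folder
  steps:
    - name: Convert Reads
      jar: lowercase-reads.jar
      params: $input $output
  outputs:
    - id: output
      description: Converted Reads
      type: hdfs-folder
```

The `$input` and `$output` placeholders are resolved from the user's form values; the web form itself is generated from the `inputs` and `outputs` sections, which is why no UI code has to be written per tool.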
Workflow Execution Engine
1. Creates a dependency graph based on the WDL file and user input
2. Optimizes the graph to minimize the execution time (e.g. by caching)
3. Schedules and submits jobs to the Hadoop Cluster
4. Provides access to result files through the browser
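The first three steps can be pictured as a topological sort over the task graph with cached tasks pruned out; this is a simplified stand-in for the real engine, not Cloudgene code:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Tasks and their dependencies, as they might be derived from a
# WDL file plus user input (names are hypothetical):
# align depends on import; call on align; report on call.
graph = {
    "align": {"import"},
    "call": {"align"},
    "report": {"call"},
}

# A cached task (e.g. an unchanged data import) can be dropped
# before scheduling -- the optimizer's caching step in miniature.
cached = {"import"}
plan = [t for t in TopologicalSorter(graph).static_order() if t not in cached]
print(plan)  # ['align', 'call', 'report']
```

The resulting execution plan is then submitted job by job to the Hadoop cluster, each task starting only after its predecessors have finished.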
• Apache Hadoop MapReduce
• Apache Pig
• R (RMarkdown)
– Ideal to generate HTML files with charts, statistics, …
• Unix command line programs
– Cloudgene automatically exports all HDFS files
– No manual file staging between HDFS and POSIX filesystem
Advantage: Composition of
hybrid Workflows possible
• Authentication and User-Management
• Parameter Tracking
• HDFS Workspace
– Hides the HDFS filesystem from the end-user
– Importing Data from Amazon S3 Buckets,
HTTP and (S)FTP Servers, File Uploads, ...
– Facilitates the management of datasets on HDFS
Preview: Cloudgene 2.0
• Interface for web-services
– Same WDL file, but different interface
– User Registration
– Intelligent Queuing
– User Notification
Preview: Cloudgene 2.0
• Generic data analysis platform
– Integration of additional data processing models
– MapReduce, Spark, Giraph, …
• Virtual Machine
• Getting started
• Developer Guide
– Lukas Forer (@lukfor) and Sebastian Schoenherr
• Imputation with Minimac
– Goncalo Abecasis, Christian Fuchsberger
– Hansi Weißensteiner
• Univ.-Prof. Florian Kronenberg
– Head of the Division of Genetic Epidemiology,
Medical University of Innsbruck
Welcome everybody to the defense of my PhD thesis.
In the next 20 minutes I will give you an overview of the results and outcomes of my thesis.
The main topic is the efficient analysis of data in the field of bioinformatics.
NGS enables sequencing of the whole genome.
This is done in an extremely parallel way and enables sequencing the genome at low cost and high scale.
This has the consequence that more and more data will be produced.
So the bottleneck is no longer the data production in the lab, but its analysis.
This is because one experiment produces gigabytes of data.
Therefore, one single workstation is not sufficient for the data analysis, and supercomputers are often too expensive!
So one solution for that problem is to use commodity computing.
That means we take a large number of normal, cheap computing components and use them to perform our analysis in parallel.
And one approach which was developed specifically for that kind of infrastructure is MapReduce.
It is a parallelization framework developed by Google in 2004 and enables analyzing large data efficiently in parallel.
The user writes only the map and the reduce function, and the framework takes care of fault tolerance, data distribution and load balancing.
All the stuff we need in a parallel computing environment.
The map and reduce functions are stateless and can be executed in parallel, and therefore this approach scales very well!
Apache Hadoop is an open source implementation of MapReduce.
As this table shows, there already exist several MapReduce apps in the field of bioinformatics, and there is high potential.
For example, there are algorithms available for mapping short reads to a reference or for RNA analysis.
But the problem with such approaches is that they are available only on a per-tool basis.
In genetics we often need large workflows which consist of several steps to analyze data.
But those tools cover only one aspect of such a pipeline.
Moreover, for biologists without a background in computer science it is very hard to use them.
Most popular workflow systems such as Galaxy enable this abstraction only for traditional tools and not for MapReduce.
So a system which enables building such MapReduce workflows is missing.
So the aims of my thesis can be classified into two parts:
First, developing a system to compose complex workflows out of multiple MapReduce tools. This is done by abstracting all the technical details.
Second, evaluating this system by applying it to bioinformatics. For that reason I adapted three different workflows to MapReduce.
The first workflow is for genotype imputation, the second for genome-wide association studies and the last one detects copy number variations.
The first aim was solved by implementing a workflow execution engine called Cloudgene MapRed.
And on top of this I have integrated the three workflows.
Cloudgene MapRed requires a compatible cluster to execute the pipeline.
Especially for small research institutes it can be hard to afford and maintain their own cluster.
So a possible solution is cloud computing, which enables renting computer hardware from different providers, for example Amazon.
So they can use the rented resources on demand.
To overcome these issues, Sebastian developed in his thesis an infrastructure manager which enables launching and managing a Hadoop cluster through the browser.
So it is possible to run the same workflows on a local cluster, on a private cloud or on a public cloud.
This whole system is called Cloudgene, and in my presentation today I talk about the workflow execution engine and one of the three workflows.
And this workflow is called Imputation Server.
On this slide you can see the advantages of Cloudgene compared with the manual approach.
This workflow manager assists scientists in executing and monitoring workflows.
The core of the architecture is the execution engine.
As you can see in this picture, the workflow execution engine operates on a Hadoop cluster.
Therefore data reliability and fault tolerance are provided.
The workflow engine contains an optimizer which tries to minimize the execution time by using caching mechanisms.
Moreover, it contains a data manager for importing and exporting datasets.
The system has a REST API in order to communicate with clients.
In our case the client is a web application.
The workflow composition in Cloudgene was developed with two aims in mind:
First, it should be possible to implement new algorithms easily.
Second, it should be possible to integrate existing algorithms without source code adaptations.
For that reason, I developed a new workflow language which is called WDL and is used by Cloudgene.
It enables a formal description of workflows and their tasks.
It is property-based and uses a human-readable syntax.
It supports different software components as tasks.
And it supports some basic control patterns like conditions and loops.
Here is a very simple example of such a workflow written in WDL.
I don't want to go into too much detail, but you can define inputs and outputs and then reuse them in your tasks.
Based on this manifest file, we automatically create a user interface which can be used to submit the job with different parameters and datasets.
And when the user clicks on the submit button, the workflow engine comes into play.
We have the WDL manifest file with the workflow structure, and the user input which is used to execute it.
Based on this information a graph is created which contains all tasks and their dependencies.
Then the optimizer tries to minimize the graph by using caching.
And finally, based on this graph, a task execution plan is created which is used to submit the jobs to the cluster.
Once the job is submitted, we can monitor the progress.
When the job is complete, we can download the result files directly through the browser, and all used parameters are tracked.
Besides the Hadoop technologies, we also support other useful technologies.
For example, RMarkdown to create HTML reports.
Or any other Unix command-line program. In this case Cloudgene automatically exports files from HDFS to the local filesystem, so an intuitive combination of these technologies is possible.
The next step of this project is to turn Cloudgene into a more generic data analysis cloud platform.
Therefore we plan to integrate additional big data computation models so that Cloudgene is not limited to MapReduce.
One possibility is to integrate YARN, the resource-management layer introduced in the new version of Hadoop, which sits between the cluster and computation frameworks such as MapReduce.
So we can also support other models, for example for graph data processing and in-memory calculations.