Cloudgene - A MapReduce based Workflow Management System

Lukas Forer and Sebastian Schönherr
Division of Genetic Epidemiology
Medical University of Innsbruck, Austria
UPPNEX Workshop - January 2015
Page 2
Motivation: Bioinformatics
• Next Generation Sequencing (NGS)
– Sequencing the whole genome at low cost
– Gigabytes of produced data per experiment
– Allows data production at high scale
• Data generation is not the bottleneck anymore
• Data processing is the current bottleneck
– Single Workstation not sufficient
– Super-Computers too expensive
Page 3
MapReduce
• Commodity computing
– Parallel computing on a large number of low budget
components
• MapReduce
– Parallelization framework
– Enables analyzing large data
– User writes map/reduce function
– Framework takes care of
fault tolerance, data distribution, load balancing
– Apache Hadoop: Open Source implementation
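The division of labour on this slide can be illustrated with a minimal, single-process sketch in Python. It is not Hadoop code: the `shuffle` step stands in for what the framework does between the two phases, and the function names and the k-mer counting task are assumptions chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map: emit (k-mer, 1) for every overlapping k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key; on a cluster the framework does this."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: sum the counts collected for one k-mer."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return dict(reduce_counts(key, vals) for key, vals in shuffle(mapped).items())


counts = run_job(["ACGTAC", "GTACGT"])
```

Because `map_kmers` and `reduce_counts` keep no state between calls, each call could run on a different machine, which is the property that lets the real framework distribute the work.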
Page 4
MapReduce in Bioinformatics (1)
• Hadoop MapReduce libraries for Bioinformatics
– Hadoop-BAM: Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
– SeqPig: Processing NGS data with Apache Pig; presenting UDFs for frequent tasks; using Hadoop-BAM
– BioPig: Processing NGS data with Apache Pig; presenting UDFs
– Biodoop: MapReduce suite for sequence alignments / manipulation of aligned records; written in Python
• DNA alignment algorithms based on Hadoop
– CloudBurst: Based on RMAP (seed-and-extend algorithm). Map: Extracting k-mers of reference, non-overlapping k-mers of reads (as keys). Reduce: End-to-end alignments of seeds
– Seal: Based on BWA (version 0.5.9). Map: Alignment using BWA (on a previously created internal file format). Reduce: Remove duplicates (optional)
– Crossbow: Based on Bowtie / SOAPsnp. Map: Executing Bowtie on chunks. Reduce: SNP calling using SOAPsnp
• RNA analysis based on Hadoop
– MyRNA: Pipeline for calculating differential gene expression in RNA; including Bowtie
– FX: RNA-Seq analysis tool
– Eoulsan: RNA-Seq analysis tool
• Non-Hadoop-based approaches
– GATK: MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
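The seed step described for CloudBurst above (k-mers of the reference and non-overlapping k-mers of the reads used as keys) can be sketched in a few lines of Python. This is a simplified illustration and not CloudBurst's implementation: the seed length `K`, the function names and the toy sequences are assumptions, and the reduce step only reports candidate start positions instead of extending the seeds into end-to-end alignments.

```python
from collections import defaultdict

K = 4  # seed length, chosen only for this toy example


def map_reference(ref, k=K):
    """Emit every overlapping k-mer of the reference with its position."""
    for pos in range(len(ref) - k + 1):
        yield ref[pos:pos + k], ("ref", pos)


def map_read(read_id, read, k=K):
    """Emit the non-overlapping k-mers of a read with their offset in the read."""
    for off in range(0, len(read) - k + 1, k):
        yield read[off:off + k], ("read", read_id, off)


def reduce_seeds(values):
    """Pair reference and read hits that share one k-mer into candidate seeds."""
    ref_hits = [v[1] for v in values if v[0] == "ref"]
    read_hits = [(v[1], v[2]) for v in values if v[0] == "read"]
    # Each result is (read id, implied start of the read on the reference).
    return [(rid, pos - off) for pos in ref_hits for rid, off in read_hits]


def find_seeds(ref, reads):
    """Group all emitted pairs by k-mer, then reduce each group."""
    groups = defaultdict(list)
    for key, value in map_reference(ref):
        groups[key].append(value)
    for rid, read in reads.items():
        for key, value in map_read(rid, read):
            groups[key].append(value)
    seeds = []
    for values in groups.values():
        seeds.extend(reduce_seeds(values))
    return sorted(set(seeds))


seeds = find_seeds("TTACGTGGCATT", {"r1": "ACGTGGCA"})
```

Both seeds of read `r1` point to the same start position on the reference, so they collapse into a single candidate that a real aligner would then extend and score.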
Page 5
MapReduce in Bioinformatics (2)
• Bioinformatics MapReduce Applications
– Available only on a per-tool basis
– Cover one aspect of a larger data analysis pipeline
– Hard to use for scientists without background in
Computer Science
• Popular workflow systems
– Enable this level of abstraction for the traditional tools
– Do not support tools based on MapReduce
Missing: System which enables building
MapReduce workflows
Page 6
Cloudgene
• System to execute MapReduce programs graphically
and combine them into workflows
• One platform – many programs
– Integration of existing MapReduce programs without
source code adaptations
– Create workflows using MapReduce, Apache Pig, R or
Unix command-line programs
• Runs in your browser
Page 7
Cloudgene: Overview
Cloudgene-MapRed
MapReduce Workflow Manager
Bioinformatics Workflows
• Requires a compatible cluster to execute workflows
– Small and medium-sized research institutes can hardly
afford their own clusters
– Cloud computing: rent computer hardware from different
providers (e.g. Amazon, HP)
Page 8
Cloudgene: Overview
Cloudgene-MapRed
MapReduce Workflow Manager
Cloudgene-Cluster
Infrastructure Manager
Bioinformatics Workflows
Page 9
Cloudgene: Advantages
Page 10
Architecture
Page 11
Workflow Composition
• New MapReduce algorithms can be integrated easily
• Integration of existing MapReduce algorithms without
adaptations in source code
• Cloudgene uses its own workflow language
• Workflow Definition Language (WDL)
– Formal description of tasks and workflow steps
– Property-based and uses the YAML syntax
– Supports heterogeneous software components
(MapReduce, R and unix command-line programs)
– Basic workflow control patterns (loops and conditions)
Page 13
Workflow Composition
• Example of a simple WDL-Manifest file
Command line
parameters
Inputs:
Are set by the user
through the web
interface
Outputs:
are created by
tasks (intermediate
or persistent)
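The manifest shown on the original slide is not part of this transcript. As a rough sketch of the idea (declared inputs and outputs that are reused in the command-line parameters of a task), the following Python snippet models a manifest as a dictionary and resolves its placeholders. The field names, the `$name` placeholder style and the `count-lines` tool are all hypothetical and should not be read as Cloudgene's actual WDL schema.

```python
from string import Template

# Hypothetical manifest; field names are illustrative, not Cloudgene's WDL schema.
manifest = {
    "name": "line-count",
    "inputs": [{"id": "reads", "type": "hdfs-file"}],
    "outputs": [{"id": "report", "type": "local-file", "persistent": True}],
    "tasks": [
        {"name": "count", "cmd": "count-lines", "params": "--in $reads --out $report"},
    ],
}


def resolve(manifest, user_inputs, workspace):
    """Substitute input and output placeholders in each task's parameters."""
    values = dict(user_inputs)
    # Outputs are not set by the user; they get a path inside the job workspace.
    for out in manifest["outputs"]:
        values[out["id"]] = f"{workspace}/{out['id']}"
    missing = [i["id"] for i in manifest["inputs"] if i["id"] not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    return [
        [task["cmd"], *Template(task["params"]).substitute(values).split()]
        for task in manifest["tasks"]
    ]


commands = resolve(manifest, {"reads": "data/sample.fastq"}, "job-1")
```

Keeping the declarations separate from the task parameters is also what would allow a form to be generated from the `inputs` list alone.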
Page 14
Workflow Composition
• The user interface is created automatically
Page 15
Workflow Execution Engine
1. Creates a dependency graph based on the WDL file and user input
2. Optimizes the graph to minimize the execution time (e.g. by caching)
3. Schedules and submits jobs to the Hadoop Cluster
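The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Cloudgene's engine: the task table, the dataset names and the caching rule (skip a task when all of its outputs already exist) are hypothetical, and the "submission" is reduced to returning an ordered list.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each names the datasets it consumes and produces.
tasks = {
    "align":  {"inputs": ["reads"],    "outputs": ["aligned"]},
    "call":   {"inputs": ["aligned"],  "outputs": ["variants"]},
    "report": {"inputs": ["variants"], "outputs": ["html"]},
}


def build_graph(tasks):
    """Step 1: map each task to the tasks that produce its inputs."""
    producer = {out: name for name, t in tasks.items() for out in t["outputs"]}
    return {
        name: {producer[d] for d in t["inputs"] if d in producer}
        for name, t in tasks.items()
    }


def plan(tasks, cached):
    """Steps 2 and 3: order the tasks and drop those whose outputs are cached."""
    order = TopologicalSorter(build_graph(tasks)).static_order()
    return [n for n in order if not all(o in cached for o in tasks[n]["outputs"])]


full_plan = plan(tasks, cached=set())
cached_plan = plan(tasks, cached={"aligned"})
```

A production engine would also have to invalidate cached outputs when an upstream task reruns or its parameters change; that bookkeeping is left out here.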
Page 16
Web Interface
Page 17
Workflow Results
Used
Parameters
Download links
to result files
Page 18
Supported Technologies
• Apache Hadoop MapReduce
• Apache Pig
• RMarkdown
– Ideal to generate html files with charts, statistics, …
• Unix command line programs
– Cloudgene automatically exports all HDFS files
– No manual file staging between HDFS and POSIX filesystem
needed!
Advantage: Composition of
hybrid Workflows possible
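To show what the automatic export saves the user from doing by hand, here is a small sketch of the staging step for a Unix command-line task. It assumes a Hadoop client on the PATH and uses the standard `hdfs dfs -get` shell command; the function names and paths are hypothetical, and this is not how Cloudgene itself is implemented.

```python
import subprocess


def staging_plan(hdfs_inputs, local_dir, tool_cmd):
    """Build the commands: copy each HDFS file locally, then run the tool on the copies."""
    commands, local_paths = [], []
    for path in hdfs_inputs:
        local = f"{local_dir}/{path.rsplit('/', 1)[-1]}"
        commands.append(["hdfs", "dfs", "-get", path, local])
        local_paths.append(local)
    commands.append([*tool_cmd, *local_paths])
    return commands


def run(commands):
    """Execute the plan in order; stops at the first failing command."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


steps = staging_plan(["/user/demo/a.vcf"], "/tmp/job-1", ["wc", "-l"])
```

Separating the plan from its execution keeps the sketch inspectable without a cluster; calling `run(steps)` would perform the actual copy and tool invocation.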
Page 19
Other Features
• Authentication and User-Management
• Parameter Tracking
• HDFS Workspace
– Hides the HDFS filesystem from the end user
– Importing Data from Amazon S3 Buckets,
HTTP and (S)FTP Servers, File Uploads, ...
– Facilitates the management of datasets on
the cluster
Page 20
Preview: Cloudgene 2.0
• Interface for web-services
– Same WDL file, but different interface
– User Registration
– Intelligent Queuing
– User Notification
• Examples:
– https://imputationserver.sph.umich.edu
– http://mtdna-server.uibk.ac.at
Page 21
Preview: Cloudgene 2.0
• Generic data analysis platform
– Integration of additional data processing models
Cloudgene
Hadoop 1.0
MapReduce
Cloudgene
Hadoop 2.0
YARN
MapReduce Spark Giraph …
Page 22
Conclusion
• Website
– http://cloudgene.uibk.ac.at
• Virtual Machine
– https://bioimg.org/cloudgene
• Getting started
– http://cloudgene.uibk.ac.at/getting-started
• Developer Guide
– http://cloudgene.uibk.ac.at/developer-guide
Page 23
Acks
• Cloudgene
– Lukas Forer (@lukfor) and Sebastian Schoenherr
(@seppinho)
• Imputation with Minimac
– Goncalo Abecasis, Christian Fuchsberger
• mtDNA-Server
– Hansi Weißensteiner
• Univ.-Prof. Florian Kronenberg
– Head of the Division of Genetic Epidemiology,
Medical University of Innsbruck

Editor's Notes

  1. Welcome, everybody, to the defense of my PhD thesis. In the next 20 minutes I will give you an overview of the results and outcomes of my thesis. The main topic is the efficient analysis of data in the field of bioinformatics.
  2. NGS makes it possible to sequence the whole genome. This is done in an extremely parallel way, which allows sequencing at low cost and high scale. As a consequence, more and more data is produced. So the bottleneck is no longer data production in the lab but its analysis, because one experiment produces gigabytes of data. Therefore, a single workstation is not sufficient for the data analysis, and supercomputers are often too expensive!
  3. So one solution for that problem is commodity computing. That means we take a large number of ordinary, cheap computing components and use them to perform our analysis in parallel. One approach that was developed specifically for that kind of infrastructure is MapReduce. It is a parallelization framework developed by Google in 2004 that makes it possible to analyze large data efficiently in parallel. The user writes only the map and the reduce function, and the framework takes care of fault tolerance, data distribution and load balancing: all the things we need in a parallel computing environment. The map and reduce functions are stateless and can be executed in parallel, and therefore this approach scales very well! Apache Hadoop is an open-source implementation of MapReduce.
  4. As this table shows, several MapReduce applications already exist in the field of bioinformatics, and the approach has high potential. For example, there are algorithms available for mapping short reads to a reference or for RNA analysis.
  5. But the problem with such approaches is that they are available only on a per-tool basis. In genetics we often need large workflows that consist of several steps to analyze data, but those tools cover only one aspect of such a pipeline. Moreover, for biologists without a background in computer science they are very hard to use. Most popular workflow systems, such as Galaxy, provide this abstraction only for traditional tools and not for MapReduce. So a system that enables building such MapReduce workflows is missing.
  6. So the aims of my thesis can be divided into two parts. First, developing a system to compose complex workflows of multiple MapReduce tools. This is done by abstracting all the technical details. Second, evaluating this system by applying it to bioinformatics. For that reason I adapted three different workflows to MapReduce. The first workflow is for genotype imputation, the second for genome-wide association studies, and the last one detects copy number variations.
  7. The first aim was solved by implementing a workflow execution engine called Cloudgene-MapRed, and on top of this I integrated the three workflows. Cloudgene-MapRed requires a compatible cluster to execute the pipeline. Especially for small research institutes it can be hard to afford and maintain their own cluster. A possible solution is cloud computing, which makes it possible to rent computer hardware from different providers, for example Amazon, and to use the rented resources on demand.
  8. To overcome these issues, Sebastian developed in his thesis an infrastructure manager that makes it possible to launch and manage a Hadoop cluster through the browser. So it is possible to run the same workflows on a local cluster, on a private cloud or on a public cloud. This whole system is called Cloudgene, and in my presentation today I talk about the workflow execution engine and one of the three workflows. This workflow is called the imputation server.
  9. On this slide you can see the advantages of Cloudgene compared with the manual approach.
  10. This workflow manager assists scientists in executing and monitoring workflows. The core of the architecture is the execution engine. As you can see in this picture, the workflow execution engine operates on a Hadoop cluster, so data reliability and fault tolerance are provided. The workflow engine contains an optimizer that tries to minimize the execution time by using caching mechanisms. Moreover, it contains a data manager for importing and exporting datasets. The system has a REST API in order to communicate with clients. In our case the client is a web application.
  11. The workflow composition in Cloudgene was developed with two aims in mind: first, it should be possible to implement new algorithms easily; second, it should be possible to integrate existing algorithms without source code adaptations. For that reason I developed a new workflow language, called WDL, which is used by Cloudgene. It enables a formal description of workflows and their tasks, is property-based and uses a human-readable syntax, supports different software components as tasks, and supports some basic control patterns like conditions and loops.
  12. Here is a very simple example of such a workflow written in WDL. I don't want to go into too much detail, but you can define inputs and outputs and then reuse them in your tasks.
  13. Based on this manifest file, we automatically create a user interface that can be used to submit the job with different parameters and datasets. When the user clicks the submit button, the workflow engine comes into play.
  14. We have the WDL manifest file with the workflow structure and the user input that is used to execute it. Based on this information, a graph is created that contains all tasks and their dependencies. Then the optimizer tries to minimize the graph by using caching. Finally, based on this graph, a task execution plan is created, which is used to submit the jobs to the cluster.
  15. Once the job is submitted, we can monitor the progress.
  16. When the job is complete, we can download the result files directly through the browser, and all used parameters are tracked.
  17. Besides the Hadoop technologies, we also support other useful technologies, for example RMarkdown to create HTML reports, or any other Unix command-line program. In this case Cloudgene automatically exports files from HDFS to the local filesystem. So an intuitive combination of these technologies is possible.
  18. The next step of this project is to turn Cloudgene into a more generic data analysis cloud platform. We therefore plan to integrate additional big data computation models so that Cloudgene is not limited to MapReduce. One possibility is to integrate YARN, the resource-management layer introduced with Hadoop 2.0, which sits between the cluster and processing models such as MapReduce. That way we can also support other models, for example for graph data processing and in-memory computation.