SlideShare a Scribd company logo
1 of 20
Download to read offline
Analyze Genomes:
Modeling and Executing Genome Data Processing Pipelines
Cindy Perscheid
Festival of Genomics
London, Jan 19, 2016
Perscheid,
Schapranow
Recap Analyze Genomes
Real-time Analysis of Big Medical Data
In-Memory Database
Extensions for Life Sciences
Data Exchange,
App Store
Access Control,
Data Protection
Fair Use
Statistical
Tools
Real-time
Analysis
App-spanning
User Profiles
Combined and Linked Data
Genome
Data
Cellular
Pathways
Genome
Metadata
Research
Publications
Pipeline and
Analysis Models
Drugs and
Interactions
Modeling and
Executing Genome
Data Processing
Pipelines
Drug Response
Analysis
Pathway Topology
Analysis
Medical
Knowledge CockpitOncolyzer
Clinical Trial
Recruitment
Cohort
Analysis
...
Indexed
Sources
From Raw Genome Data to Analysis Results
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
■  Sequencing: Acquire digital DNA
data
■  Alignment: Reconstruction of
complete genome with snippets
■  Variant Calling: Identification of
genetic variants
■  Data Annotation: Linking genetic
variants with research findings
■  Not standardized
■  Not exchangeable
■  Concatenation of bash scripts reading from and writing to files
■  Requires IT expertise for
□  Setup
□  Error handling, and
□  Efficient processing and parallelization
■  Objective: Model, configure, and execute pipelines without involving IT
experts
Genome Data Processing Pipelines
State of the Art
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
bwa aln ref.fa sample.fastq | bwa samse ref.fa – sample.fastq | samtools view -Su - | samtools sort …
■  Modeling Genome Data Processing Pipelines
■  Pipeline Execution in the Worker Framework
Agenda
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
■  Modeling Genome Data Processing Pipelines
■  Pipeline Execution in the Worker Framework
Agenda
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Modeling Genome Data Processing Pipelines
BPMN 2.0 Example
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Start
Event
End
Event
Annotated Data
Object
Parallel
Gateway
Collapsed
Subtask
Task
Multiple Task Instances
Executed in Parallel
■  Graphical modeling notation
■  Compliant with BPMN 2.0 extended by
□  Modular structure
□  Degree of parallelization
□  Parameters and variables
■  Model descriptions (XPDL) are stored in IMDB
■  Model instances are transformed into graph
structure executed by our worker framework
Modeling Genome Data Processing Pipelines
BPMN 2.0 Extensions
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Modeling Genome Data Processing Pipelines
From Model to Execution
1.  Design time (researcher, process expert)
□  Definition of parameterized process model
□  Uses graphical editor and available jobs
2.  Configuration time (researcher, lab assistant)
□  Select model and specify parameters
□  Results in model instance stored in database
3.  Execution time (researcher)
□  Select model instance
□  Specify execution parameters, e.g. input files
Modeling and
Executing Genome
Data Processing
Pipelines
Perscheid,
Schapranow
■  Results are imported into IMDB
■  Optimization reduced execution time by >50%
Modeling Genome Data Processing Pipelines
Traditional vs. Optimized Approach
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
■  Modeling of Genome Data Processing Pipelines
■  Worker Framework
Agenda
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Pipeline Execution
Overview
Modeling and Executing
Genome Data
Processing Pipelines
In-Memory Database
Tasks
Scheduler
ID Pipeline Params
12 BWA xyz.fastq
13 BWAmem abc.fastq
14 Bowtie2 xyz.fastq
Worker
Worker
Subtasks
Task ID Job Status Params
12 97 Split done xyz.fastq
12 98 Import todo abc.vcf
12 98 Import done abc.vcf
Webservice
. . .
1. Trigger task execution
2. Schedule subtasks
3. Execute subtasks
Pipeline Execution
Software Components and Communication
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
PipelinesNode
WorkerWorkerWorker
IMDB
Node
WorkerWorkerWorker
IMDB
Node
WorkerWorkerWorker
IMDB
Scheduler
Node
WorkerWorkerWorker
IMDB
Transmitter
Node
WorkerWorkerWorker
IMDB
...
Site I Site IIVPN
UDP
TCP
Shared File System Shared File System
Node
WorkerWorkerWorker
IMDB
■  Workers execute jobs one by one
■  Subtask execution status in IMDB:
□  Ready (0),
□  In Progress (1),
□  Done (2), or
□  Erroneous (3).
■  Jobs implemented as Python modules/classes
□  Can contain arbitrary code
□  Have access to IMDB
□  Can read/write to shared working directory
Pipeline Execution
Runtime Layer - Worker
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Pipeline Execution
Coordination Layer - Scheduler
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Scheduling Algorithm
Pipeline Executor
Ressource
Allocator
SubtaskSubtaskSubtask i
Subtask k
Subtask m
m..
■  Scheduling algorithms are plug-in software modules
□  “User-/Group-based” to let users execute their tasks on their local site
only
□  “Priority First” to prefer important users
□  “High Throughput”, i.e. “shortest task first” to deal with high load
■  Scheduling algorithms can also be composed hierarchically
Pipeline Execution
Scheduling Algorithms
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
PipelinesPriority-based
High-throughput High-throughput
High-throughput
Site I Site II
Prio A Prio *
Group-based
■  Maintains lists of running and idle nodes
■  Idle worker requests new sub task for its assigned groups
■  If there is no matching sub task, it sleeps until a new sub task gets ready
Pipeline Execution
Resource Allocator
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Node
WorkerWorkerWorker
IMDB
Site 1 Site 2
Node
WorkerWorkerWorker
IMDB
Node
WorkerWorkerWorker
IMDB
Working Working Waiting
Pipeline Execution
Flexibility
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
PipelinesNode
IMDB
Node
WorkerWorkerWorker
Node
WorkerWorkerWorker
IMDB
Scheduler
UDP
TCP
File System Share
■  All execution data is stored in
IMDB
■  Temporary files on a shared file
system
■  In case of any failure, the
system-wide state can be
restored
Pipeline Execution
Recoverability
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
IMDB
Pipeline TasksScheduler
Worker
Worker
Worker
Worker
Pipeline Subtasks
Events
Data
Thanks!
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
August-Bebel-Str. 88
14482 Potsdam, Germany
Dr. Matthieu-P. Schapranow
schapranow@hpi.de
http://we.analyzegenomes.com/
Cindy Perscheid, M. Sc.
cindy.perscheid@hpi.de
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines

More Related Content

What's hot

Analyze Genomes Services for Precision Medicine
Analyze Genomes Services for Precision MedicineAnalyze Genomes Services for Precision Medicine
Analyze Genomes Services for Precision MedicineMatthieu Schapranow
 
Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...
Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...
Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...Matthieu Schapranow
 
A Platform for Integrated Genome Data Analysis
A Platform for Integrated Genome Data AnalysisA Platform for Integrated Genome Data Analysis
A Platform for Integrated Genome Data AnalysisMatthieu Schapranow
 
The Driver of the Healthcare System in the 21st Century: Real-world Applicati...
The Driver of the Healthcare System in the 21st Century: Real-world Applicati...The Driver of the Healthcare System in the 21st Century: Real-world Applicati...
The Driver of the Healthcare System in the 21st Century: Real-world Applicati...Matthieu Schapranow
 
BioNRW: Big Medical Data: Challenge or Potential
BioNRW: Big Medical Data: Challenge or PotentialBioNRW: Big Medical Data: Challenge or Potential
BioNRW: Big Medical Data: Challenge or PotentialMatthieu Schapranow
 
A Federated In-Memory Database System for Life Sciences
A Federated In-Memory Database System for Life SciencesA Federated In-Memory Database System for Life Sciences
A Federated In-Memory Database System for Life SciencesMatthieu Schapranow
 
Big Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesBig Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesMatthieu Schapranow
 
Processing of Big Medical Data in Personalized Medicine: Challenge or Potential
Processing of Big Medical Data in Personalized Medicine: Challenge or PotentialProcessing of Big Medical Data in Personalized Medicine: Challenge or Potential
Processing of Big Medical Data in Personalized Medicine: Challenge or PotentialMatthieu Schapranow
 
In-Memory Apps for Precision Medicine
In-Memory Apps for Precision MedicineIn-Memory Apps for Precision Medicine
In-Memory Apps for Precision MedicineMatthieu Schapranow
 
Analyze Genomes: Drug Response Analysis
Analyze Genomes: Drug Response AnalysisAnalyze Genomes: Drug Response Analysis
Analyze Genomes: Drug Response AnalysisMatthieu Schapranow
 
Festival of Genomics 2016 London: What to take home?
Festival of Genomics 2016 London: What to take home?Festival of Genomics 2016 London: What to take home?
Festival of Genomics 2016 London: What to take home?Matthieu Schapranow
 
AnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital Health
AnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital HealthAnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital Health
AnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital HealthMatthieu Schapranow
 
Festival of Genomics 2016 London: Analyze Genomes: Real-world Examples
Festival of Genomics 2016 London: Analyze Genomes: Real-world ExamplesFestival of Genomics 2016 London: Analyze Genomes: Real-world Examples
Festival of Genomics 2016 London: Analyze Genomes: Real-world ExamplesMatthieu Schapranow
 
Patient Journey in Oncology 2025: Molecular Tumour Boards in Practice
Patient Journey in Oncology 2025: Molecular Tumour Boards in PracticePatient Journey in Oncology 2025: Molecular Tumour Boards in Practice
Patient Journey in Oncology 2025: Molecular Tumour Boards in PracticeMatthieu Schapranow
 
Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...
Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...
Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...Matthieu Schapranow
 
How will AI affect the patient journey of the future?
How will AI affect the patient journey of the future?How will AI affect the patient journey of the future?
How will AI affect the patient journey of the future?Matthieu Schapranow
 
Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...
Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...
Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...Matthieu Schapranow
 
Festival of Genomics 2016 London: Agenda
Festival of Genomics 2016 London: AgendaFestival of Genomics 2016 London: Agenda
Festival of Genomics 2016 London: AgendaMatthieu Schapranow
 

What's hot (20)

Analyze Genomes Services for Precision Medicine
Analyze Genomes Services for Precision MedicineAnalyze Genomes Services for Precision Medicine
Analyze Genomes Services for Precision Medicine
 
Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...
Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...
Gesundheit geht uns alle an: Smart Data ermöglicht passendere Entscheidungen...
 
A Platform for Integrated Genome Data Analysis
A Platform for Integrated Genome Data AnalysisA Platform for Integrated Genome Data Analysis
A Platform for Integrated Genome Data Analysis
 
The Driver of the Healthcare System in the 21st Century: Real-world Applicati...
The Driver of the Healthcare System in the 21st Century: Real-world Applicati...The Driver of the Healthcare System in the 21st Century: Real-world Applicati...
The Driver of the Healthcare System in the 21st Century: Real-world Applicati...
 
"When time matters..."
"When time matters...""When time matters..."
"When time matters..."
 
BioNRW: Big Medical Data: Challenge or Potential
BioNRW: Big Medical Data: Challenge or PotentialBioNRW: Big Medical Data: Challenge or Potential
BioNRW: Big Medical Data: Challenge or Potential
 
Big Data in Life Sciences
Big Data in Life SciencesBig Data in Life Sciences
Big Data in Life Sciences
 
A Federated In-Memory Database System for Life Sciences
A Federated In-Memory Database System for Life SciencesA Federated In-Memory Database System for Life Sciences
A Federated In-Memory Database System for Life Sciences
 
Big Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesBig Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and Challenges
 
Processing of Big Medical Data in Personalized Medicine: Challenge or Potential
Processing of Big Medical Data in Personalized Medicine: Challenge or PotentialProcessing of Big Medical Data in Personalized Medicine: Challenge or Potential
Processing of Big Medical Data in Personalized Medicine: Challenge or Potential
 
In-Memory Apps for Precision Medicine
In-Memory Apps for Precision MedicineIn-Memory Apps for Precision Medicine
In-Memory Apps for Precision Medicine
 
Analyze Genomes: Drug Response Analysis
Analyze Genomes: Drug Response AnalysisAnalyze Genomes: Drug Response Analysis
Analyze Genomes: Drug Response Analysis
 
Festival of Genomics 2016 London: What to take home?
Festival of Genomics 2016 London: What to take home?Festival of Genomics 2016 London: What to take home?
Festival of Genomics 2016 London: What to take home?
 
AnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital Health
AnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital HealthAnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital Health
AnalyzeGenomes.com: A Federated In-Memory Database Platform for Digital Health
 
Festival of Genomics 2016 London: Analyze Genomes: Real-world Examples
Festival of Genomics 2016 London: Analyze Genomes: Real-world ExamplesFestival of Genomics 2016 London: Analyze Genomes: Real-world Examples
Festival of Genomics 2016 London: Analyze Genomes: Real-world Examples
 
Patient Journey in Oncology 2025: Molecular Tumour Boards in Practice
Patient Journey in Oncology 2025: Molecular Tumour Boards in PracticePatient Journey in Oncology 2025: Molecular Tumour Boards in Practice
Patient Journey in Oncology 2025: Molecular Tumour Boards in Practice
 
Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...
Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...
Festival of Genomics 2016 London: Mining and Processing of Unstructured Medic...
 
How will AI affect the patient journey of the future?
How will AI affect the patient journey of the future?How will AI affect the patient journey of the future?
How will AI affect the patient journey of the future?
 
Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...
Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...
Algorithmen statt Ärzte: Algorithmen statt Ärzte: Ersetzt Big Data künftig ...
 
Festival of Genomics 2016 London: Agenda
Festival of Genomics 2016 London: AgendaFestival of Genomics 2016 London: Agenda
Festival of Genomics 2016 London: Agenda
 

Viewers also liked

Interviewing - why some questions are off limits
Interviewing - why some questions are off limitsInterviewing - why some questions are off limits
Interviewing - why some questions are off limitsAnn Loraine
 
Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)Deepak Singh
 
Human Genome and Big Data Challenges
Human Genome and Big Data ChallengesHuman Genome and Big Data Challenges
Human Genome and Big Data ChallengesPhilip Bourne
 
Bioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December NewsletterBioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December NewsletterKelly Wickham
 
Visualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotationsVisualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotationsAnn Loraine
 
Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017Tracxn
 
Hannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric CommunitiesHannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric CommunitiesHannes Smárason
 
Hannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in GenomicsHannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in GenomicsHannes Smárason
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
Hannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in GenomicsHannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in GenomicsHannes Smárason
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMMatt Massie
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016Tracxn
 
Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016Tracxn
 
Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017Tracxn
 
Tracxn Research — Legal Tech Landscape, December 2016
Tracxn Research — Legal Tech Landscape, December 2016Tracxn Research — Legal Tech Landscape, December 2016
Tracxn Research — Legal Tech Landscape, December 2016Tracxn
 
Tracxn Research - Smart Cars Landscape, January 2017
Tracxn Research - Smart Cars Landscape, January 2017Tracxn Research - Smart Cars Landscape, January 2017
Tracxn Research - Smart Cars Landscape, January 2017Tracxn
 
Genomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big OpportunitiesGenomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big OpportunitiesHannes Smárason
 
Tracxn Research - Wind Energy Landscape, January 2017
Tracxn Research - Wind Energy Landscape, January 2017Tracxn Research - Wind Energy Landscape, January 2017
Tracxn Research - Wind Energy Landscape, January 2017Tracxn
 

Viewers also liked (20)

Interviewing - why some questions are off limits
Interviewing - why some questions are off limitsInterviewing - why some questions are off limits
Interviewing - why some questions are off limits
 
Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)Big Data & the networked future of Science (at Ignite Seattle 7)
Big Data & the networked future of Science (at Ignite Seattle 7)
 
Human Genome and Big Data Challenges
Human Genome and Big Data ChallengesHuman Genome and Big Data Challenges
Human Genome and Big Data Challenges
 
Bioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December NewsletterBioinformatics & Genomics December Newsletter
Bioinformatics & Genomics December Newsletter
 
Visualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotationsVisualizing the genome: Techniques for presenting genome data and annotations
Visualizing the genome: Techniques for presenting genome data and annotations
 
Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017Daily Snapshot - 2nd March 2017
Daily Snapshot - 2nd March 2017
 
Hannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric CommunitiesHannes Smarason: Genomics: Forging Patient-Centric Communities
Hannes Smarason: Genomics: Forging Patient-Centric Communities
 
Hannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in GenomicsHannes Smarason: Progress & Prospects in Genomics
Hannes Smarason: Progress & Prospects in Genomics
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
Hannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in GenomicsHannes Smarason: 2015 = An Inflection Point in Genomics
Hannes Smarason: 2015 = An Inflection Point in Genomics
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Genome Big Data
Genome Big DataGenome Big Data
Genome Big Data
 
Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016Tracxn Research — Home Improvements Landscape, December 2016
Tracxn Research — Home Improvements Landscape, December 2016
 
Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016Tracxn Research — New Space Landscape, December 2016
Tracxn Research — New Space Landscape, December 2016
 
Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017Tracxn Research - PR Tech Landscape, January 2017
Tracxn Research - PR Tech Landscape, January 2017
 
Tracxn Research — Legal Tech Landscape, December 2016
Tracxn Research — Legal Tech Landscape, December 2016Tracxn Research — Legal Tech Landscape, December 2016
Tracxn Research — Legal Tech Landscape, December 2016
 
Tracxn Research - Smart Cars Landscape, January 2017
Tracxn Research - Smart Cars Landscape, January 2017Tracxn Research - Smart Cars Landscape, January 2017
Tracxn Research - Smart Cars Landscape, January 2017
 
Genomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big OpportunitiesGenomics: Big Data Leading to Big Opportunities
Genomics: Big Data Leading to Big Opportunities
 
Tracxn Research - Wind Energy Landscape, January 2017
Tracxn Research - Wind Energy Landscape, January 2017Tracxn Research - Wind Energy Landscape, January 2017
Tracxn Research - Wind Energy Landscape, January 2017
 

Similar to Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...
A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...
A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...Matthieu Schapranow
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 
Instrumentation and measurement
Instrumentation and measurementInstrumentation and measurement
Instrumentation and measurementDr.M.Prasad Naidu
 
Integrative information management for systems biology
Integrative information management for systems biologyIntegrative information management for systems biology
Integrative information management for systems biologyNeil Swainston
 
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...StampedeCon
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesIan Foster
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...Robert Grossman
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
Labmatrix
LabmatrixLabmatrix
Labmatrixjwppz
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016Robert Grossman
 
Making project data avalialble eNanomapper through Database
Making project data avalialble eNanomapper through  DatabaseMaking project data avalialble eNanomapper through  Database
Making project data avalialble eNanomapper through DatabaseNina Jeliazkova
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotechAdam Muise
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralPaolo Missier
 
Data analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsData analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsAltuna Akalin
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Ola Spjuth
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...HostedbyConfluent
 
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence ReadsPipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence ReadsAdam Bradley
 

Similar to Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines (20)

A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...
A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...
A Federated In-Memory Database Computing Platform Enabling Real-Time Analysis...
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Instrumentation and measurement
Instrumentation and measurementInstrumentation and measurement
Instrumentation and measurement
 
Integrative information management for systems biology
Integrative information management for systems biologyIntegrative information management for systems biology
Integrative information management for systems biology
 
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
Labmatrix
LabmatrixLabmatrix
Labmatrix
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
 
Making project data avalialble eNanomapper through Database
Making project data avalialble eNanomapper through  DatabaseMaking project data avalialble eNanomapper through  Database
Making project data avalialble eNanomapper through Database
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
 
Data analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomicsData analysis patterns, tools and data types in genomics
Data analysis patterns, tools and data types in genomics
 
Raptor user manual3.0
Raptor user manual3.0Raptor user manual3.0
Raptor user manual3.0
 
XSEDE15_PhastaGateway
XSEDE15_PhastaGatewayXSEDE15_PhastaGateway
XSEDE15_PhastaGateway
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
 
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence ReadsPipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
Pipeline Scripting for the Parallel Alignment of Genomic Short Sequence Reads
 

Recently uploaded

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

  • 1. Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines Cindy Perscheid Festival of Genomics London, Jan 19, 2016
  • 2. Perscheid, Schapranow Recap Analyze Genomes Real-time Analysis of Big Medical Data In-Memory Database Extensions for Life Sciences Data Exchange, App Store Access Control, Data Protection Fair Use Statistical Tools Real-time Analysis App-spanning User Profiles Combined and Linked Data Genome Data Cellular Pathways Genome Metadata Research Publications Pipeline and Analysis Models Drugs and Interactions Modeling and Executing Genome Data Processing Pipelines Drug Response Analysis Pathway Topology Analysis Medical Knowledge CockpitOncolyzer Clinical Trial Recruitment Cohort Analysis ... Indexed Sources
  • 3. From Raw Genome Data to Analysis Results Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines ■  Sequencing: Acquire digital DNA data ■  Alignment: Reconstruction of complete genome with snippets ■  Variant Calling: Identification of genetic variants ■  Data Annotation: Linking genetic variants with research findings
  • 4. ■  Not standardized ■  Not exchangeable ■  Concatenation of bash scripts reading from and writing to files ■  Requires IT expertise for □  Setup □  Error handling, and □  Efficient processing and parallelization ■  Objective: Model, configure, and execute pipelines without involving IT experts Genome Data Processing Pipelines State of the Art Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines bwa aln ref.fa sample.fastq | bwa samse ref.fa – sample.fastq | samtools view -Su - | samtools sort …
  • 5. ■  Modeling Genome Data Processing Pipelines ■  Pipeline Execution in the Worker Framework Agenda Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines
  • 6. ■  Modeling Genome Data Processing Pipelines ■  Pipeline Execution in the Worker Framework Agenda Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines
  • 7. Modeling Genome Data Processing Pipelines BPMN 2.0 Example Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines Start Event End Event Annotated Data Object Parallel Gateway Collapsed Subtask Task Multiple Task Instances Executed in Parallel
  • 8. ■  Graphical modeling notation ■  Compliant with BPMN 2.0 extended by □  Modular structure □  Degree of parallelization □  Parameters and variables ■  Model descriptions (XPDL) are stored in IMDB ■  Model instances are transformed into graph structure executed by our worker framework Modeling Genome Data Processing Pipelines BPMN 2.0 Extensions Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines
  • 9. Modeling Genome Data Processing Pipelines From Model to Execution 1.  Design time (researcher, process expert) □  Definition of parameterized process model □  Uses graphical editor and available jobs 2.  Configuration time (researcher, lab assistant) □  Select model and specify parameters □  Results in model instance stored in database 3.  Execution time (researcher) □  Select model instance □  Specify execution parameters, e.g. input files Modeling and Executing Genome Data Processing Pipelines Perscheid, Schapranow
  • 10. ■  Results are imported into IMDB ■  Optimization reduced execution time by >50% Modeling Genome Data Processing Pipelines Traditional vs. Optimized Approach Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines
  • 11. ■  Modeling of Genome Data Processing Pipelines ■  Worker Framework Agenda Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines
  • 12. Pipeline Execution Overview Modeling and Executing Genome Data Processing Pipelines In-Memory Database Tasks Scheduler ID Pipeline Params 12 BWA xyz.fastq 13 BWAmem abc.fastq 14 Bowtie2 xyz.fastq Worker Worker Subtasks Task ID Job Status Params 12 97 Split done xyz.fastq 12 98 Import todo abc.vcf 12 98 Import done abc.vcf Webservice . . . 1. Trigger task execution 2. Schedule subtasks 3. Execute subtasks
  • 13. Pipeline Execution Software Components and Communication Perscheid, Schapranow Modeling and Executing Genome Data Processing PipelinesNode WorkerWorkerWorker IMDB Node WorkerWorkerWorker IMDB Node WorkerWorkerWorker IMDB Scheduler Node WorkerWorkerWorker IMDB Transmitter Node WorkerWorkerWorker IMDB ... Site I Site IIVPN UDP TCP Shared File System Shared File System
  • 14. Node WorkerWorkerWorker IMDB ■  Workers execute jobs one by one ■  Subtask execution status in IMDB: □  Ready (0), □  In Progress (1), □  Done (2), or □  Erroneous (3). ■  Jobs implemented as Python modules/classes □  Can contain arbitrary code □  Have access to IMDB □  Can read/write to shared working directory Pipeline Execution Runtime Layer - Worker Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines
  • 15. Pipeline Execution Coordination Layer - Scheduler Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines Scheduling Algorithm Pipeline Executor Ressource Allocator SubtaskSubtaskSubtask i Subtask k Subtask m m..
  • 16. ■  Scheduling algorithms are plug-in software modules □  “User-/Group-based” to let users execute their tasks on their local site only □  “Priority First” to prefer important users □  “High Throughput”, i.e. “shortest task first” to deal with high load ■  Scheduling algorithms can also be composed hierarchically Pipeline Execution Scheduling Algorithms Perscheid, Schapranow Modeling and Executing Genome Data Processing PipelinesPriority-based High-throughput High-throughput High-throughput Site I Site II Prio A Prio * Group-based
  • 17. ■  Maintains lists of running and idle nodes ■  Idle worker requests new sub task for its assigned groups ■  If there is no matching sub task, it sleeps until a new sub task gets ready Pipeline Execution Resource Allocator Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines Node WorkerWorkerWorker IMDB Site 1 Site 2 Node WorkerWorkerWorker IMDB Node WorkerWorkerWorker IMDB Working Working Waiting
  • 18. Pipeline Execution Flexibility Perscheid, Schapranow Modeling and Executing Genome Data Processing PipelinesNode IMDB Node WorkerWorkerWorker Node WorkerWorkerWorker IMDB Scheduler UDP TCP File System Share
  • 19. ■  All execution data is stored in IMDB ■  Temporary files on a shared file system ■  In case of any failure, the system-wide state can be restored Pipeline Execution Recoverability Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines IMDB Pipeline TasksScheduler Worker Worker Worker Worker Pipeline Subtasks Events Data
  • 20. Thanks! Hasso Plattner Institute Enterprise Platform & Integration Concepts August-Bebel-Str. 88 14482 Potsdam, Germany Dr. Matthieu-P. Schapranow schapranow@hpi.de http://we.analyzegenomes.com/ Cindy Perscheid, M. Sc. cindy.perscheid@hpi.de Perscheid, Schapranow Modeling and Executing Genome Data Processing Pipelines