Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Analyze Genomes:
Modeling and Executing Genome Data Processing Pipelines
Cindy Perscheid
Festival of Genomics
London, Jan 19, 2016

Perscheid,
Schapranow
Recap Analyze Genomes
Real-time Analysis of Big Medical Data
In-Memory Database
Extensions for Life Sciences
Data Exchange,
App Store
Access Control,
Data Protection
Fair Use
Statistical
Tools
Real-time
Analysis
App-spanning
User Proﬁles
Combined and Linked Data
Genome
Data
Cellular
Pathways
Genome
Metadata
Research
Publications
Pipeline and
Analysis Models
Drugs and
Interactions
Modeling and
Executing Genome
Data Processing
Pipelines
Drug Response
Analysis
Pathway Topology
Analysis
Medical
Knowledge CockpitOncolyzer
Clinical Trial
Recruitment
Cohort
Analysis
...
Indexed
Sources

From Raw Genome Data to Analysis Results
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
■  Sequencing: Acquire digital DNA
data
■  Alignment: Reconstruction of
complete genome with snippets
■  Variant Calling: Identification of
genetic variants
■  Data Annotation: Linking genetic
variants with research findings

■  Not standardized
■  Not exchangeable
■  Concatenation of bash scripts reading from and writing to files
■  Requires IT expertise for
□  Setup
□  Error handling, and
□  Efficient processing and parallelization
■  Objective: Model, configure, and execute pipelines without involving IT
experts
Genome Data Processing Pipelines
State of the Art
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
bwa aln ref.fa sample.fastq | bwa samse ref.fa – sample.fastq | samtools view -Su - | samtools sort …

■  Modeling Genome Data Processing Pipelines
■  Pipeline Execution in the Worker Framework
Agenda
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines

Modeling Genome Data Processing Pipelines
BPMN 2.0 Example
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Start
Event
End
Event
Annotated Data
Object
Parallel
Gateway
Collapsed
Subtask
Task
Multiple Task Instances
Executed in Parallel

■  Graphical modeling notation
■  Compliant with BPMN 2.0 extended by
□  Modular structure
□  Degree of parallelization
□  Parameters and variables
■  Model descriptions (XPDL) are stored in IMDB
■  Model instances are transformed into graph
structure executed by our worker framework
BPMN 2.0 Extensions
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines

From Model to Execution
1.  Design time (researcher, process expert)
□  Definition of parameterized process model
□  Uses graphical editor and available jobs
2.  Configuration time (researcher, lab assistant)
□  Select model and specify parameters
□  Results in model instance stored in database
3.  Execution time (researcher)
□  Select model instance
□  Specify execution parameters, e.g. input files
Modeling and
Executing Genome
Data Processing
Pipelines
Perscheid,
Schapranow

■  Results are imported into IMDB
■  Optimization reduced execution time by >50%
Traditional vs. Optimized Approach
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines

■  Modeling of Genome Data Processing Pipelines
■  Worker Framework
Agenda
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines

Pipeline Execution
Overview
Modeling and Executing
Genome Data
Processing Pipelines
In-Memory Database
Tasks
Scheduler
ID Pipeline Params
12 BWA xyz.fastq
13 BWAmem abc.fastq
14 Bowtie2 xyz.fastq
Worker
Worker
Subtasks
Task ID Job Status Params
12 97 Split done xyz.fastq
12 98 Import todo abc.vcf
12 98 Import done abc.vcf
Webservice
. . .
1. Trigger task execution
2. Schedule subtasks
3. Execute subtasks

Pipeline Execution
Software Components and Communication
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
PipelinesNode
WorkerWorkerWorker
IMDB
Node
WorkerWorkerWorker
IMDB
Node
WorkerWorkerWorker
IMDB
Scheduler
Node
WorkerWorkerWorker
IMDB
Transmitter
Node
WorkerWorkerWorker
IMDB
...
Site I Site IIVPN
UDP
TCP
Shared File System Shared File System

Node
WorkerWorkerWorker
IMDB
■  Workers execute jobs one by one
■  Subtask execution status in IMDB:
□  Ready (0),
□  In Progress (1),
□  Done (2), or
□  Erroneous (3).
■  Jobs implemented as Python modules/classes
□  Can contain arbitrary code
□  Have access to IMDB
□  Can read/write to shared working directory
Pipeline Execution
Runtime Layer - Worker
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines

Pipeline Execution
Coordination Layer - Scheduler
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Scheduling Algorithm
Pipeline Executor
Ressource
Allocator
SubtaskSubtaskSubtask i
Subtask k
Subtask m
m..

■  Scheduling algorithms are plug-in software modules
□  “User-/Group-based” to let users execute their tasks on their local site
only
□  “Priority First” to prefer important users
□  “High Throughput”, i.e. “shortest task first” to deal with high load
■  Scheduling algorithms can also be composed hierarchically
Pipeline Execution
Scheduling Algorithms
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
PipelinesPriority-based
High-throughput High-throughput
High-throughput
Site I Site II
Prio A Prio *
Group-based

■  Maintains lists of running and idle nodes
■  Idle worker requests new sub task for its assigned groups
■  If there is no matching sub task, it sleeps until a new sub task gets ready
Pipeline Execution
Resource Allocator
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
Node
WorkerWorkerWorker
IMDB
Site 1 Site 2
Node
WorkerWorkerWorker
IMDB
Node
WorkerWorkerWorker
IMDB
Working Working Waiting

Pipeline Execution
Flexibility
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
PipelinesNode
IMDB
Node
WorkerWorkerWorker
Node
WorkerWorkerWorker
IMDB
Scheduler
UDP
TCP
File System Share

■  All execution data is stored in
IMDB
■  Temporary files on a shared file
system
■  In case of any failure, the
system-wide state can be
restored
Pipeline Execution
Recoverability
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines
IMDB
Pipeline TasksScheduler
Worker
Worker
Worker
Worker
Pipeline Subtasks
Events
Data

Thanks!
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
August-Bebel-Str. 88
14482 Potsdam, Germany
Dr. Matthieu-P. Schapranow
schapranow@hpi.de
http://we.analyzegenomes.com/
Cindy Perscheid, M. Sc.
cindy.perscheid@hpi.de
Perscheid,
Schapranow
Modeling and
Executing Genome
Data Processing
Pipelines

Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Similar to Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines (20)

Recently uploaded

Recently uploaded (20)

Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines