Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

These slides are from the talk "Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines", given at the 2016 Festival of Genomics workshop "Big Medical Data in Precision Medicine: Challenges or Opportunities?" on Jan 19, 2016 in London.

Published in: Technology

  1. Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines
     Cindy Perscheid
     Festival of Genomics London, Jan 19, 2016
  2. Recap: Analyze Genomes, Real-time Analysis of Big Medical Data
     Perscheid, Schapranow: Modeling and Executing Genome Data Processing Pipelines
     [Architecture diagram; labels as extracted]
     ■ Drug Response Analysis, Pathway Topology Analysis, Medical Knowledge Cockpit, Oncolyzer, Clinical Trial Recruitment, Cohort Analysis, ...
     ■ In-Memory Database: Extensions for Life Sciences; Data Exchange, App Store; Access Control, Data Protection; Fair Use; Statistical Tools; Real-time Analysis; App-spanning User Profiles
     ■ Combined and Linked Data: Genome Data, Cellular Pathways, Genome Metadata, Research Publications, Pipeline and Analysis Models, Drugs and Interactions
     ■ Indexed Sources
  3. From Raw Genome Data to Analysis Results
     ■ Sequencing: Acquire digital DNA data
     ■ Alignment: Reconstruction of complete genome with snippets
     ■ Variant Calling: Identification of genetic variants
     ■ Data Annotation: Linking genetic variants with research findings
  4. Genome Data Processing Pipelines: State of the Art
     ■ Not standardized
     ■ Not exchangeable
     ■ Concatenation of bash scripts reading from and writing to files
     ■ Requires IT expertise for
       □ Setup
       □ Error handling, and
       □ Efficient processing and parallelization
     ■ Objective: Model, configure, and execute pipelines without involving IT experts
     Example: bwa aln ref.fa sample.fastq | bwa samse ref.fa - sample.fastq | samtools view -Su - | samtools sort …
  5. Agenda
     ■ Modeling Genome Data Processing Pipelines
     ■ Pipeline Execution in the Worker Framework
  6. Agenda
     ■ Modeling Genome Data Processing Pipelines
     ■ Pipeline Execution in the Worker Framework
  7. Modeling Genome Data Processing Pipelines: BPMN 2.0 Example
     [Diagram: example pipeline model. Labeled elements: Start Event, End Event, Annotated Data Object, Parallel Gateway, Collapsed Subtask, Task, Multiple Task Instances Executed in Parallel]
  8. Modeling Genome Data Processing Pipelines: BPMN 2.0 Extensions
     ■ Graphical modeling notation
     ■ Compliant with BPMN 2.0 extended by
       □ Modular structure
       □ Degree of parallelization
       □ Parameters and variables
     ■ Model descriptions (XPDL) are stored in IMDB
     ■ Model instances are transformed into graph structure executed by our worker framework
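The last bullet of the slide above (a model instance becomes a graph that the worker framework executes) can be illustrated with a minimal sketch. The slides do not show the actual graph representation, so the dictionary layout, the job names, and the `execution_waves` helper are assumptions made for illustration. The sketch uses the standard-library `graphlib` module (Python 3.9+) to group jobs into waves that could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical model instance: each job maps to the set of jobs it depends on.
# Job names and the two-way split are illustrative, not taken from the slides.
pipeline = {
    "split": set(),
    "align_part_1": {"split"},
    "align_part_2": {"split"},
    "merge": {"align_part_1", "align_part_2"},
    "variant_calling": {"merge"},
    "import_results": {"variant_calling"},
}


def execution_waves(graph):
    """Group jobs into waves; jobs in the same wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all jobs whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


waves = execution_waves(pipeline)
for number, jobs in enumerate(waves, start=1):
    print(number, jobs)
```

Here the two alignment jobs land in the same wave, which corresponds to the "degree of parallelization" a model could express.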
  9. Modeling Genome Data Processing Pipelines: From Model to Execution
     1. Design time (researcher, process expert)
        □ Definition of parameterized process model
        □ Uses graphical editor and available jobs
     2. Configuration time (researcher, lab assistant)
        □ Select model and specify parameters
        □ Results in model instance stored in database
     3. Execution time (researcher)
        □ Select model instance
        □ Specify execution parameters, e.g. input files
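The three phases on the slide above can be sketched as three small steps in code. Everything here is hypothetical: the class names, the `configure` and `render` helpers, and the `align` and `import` command templates are placeholders, not the project's real data model or real command-line tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessModel:
    """Design time: a parameterized model (hypothetical structure)."""
    name: str
    steps: tuple        # command templates with {placeholders}
    parameters: tuple   # names that must be fixed at configuration time


@dataclass(frozen=True)
class ModelInstance:
    """Configuration time: a model plus concrete parameter values."""
    model: ProcessModel
    values: dict


def configure(model, **values):
    """Bind all model parameters; refuse incomplete configurations."""
    missing = set(model.parameters) - set(values)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return ModelInstance(model, values)


def render(instance, **execution_parameters):
    """Execution time: add per-run values such as input files."""
    context = {**instance.values, **execution_parameters}
    return [step.format(**context) for step in instance.model.steps]


# "align" and "import" are placeholder commands, not real CLI tools.
model = ProcessModel(
    name="alignment",
    steps=("align --threads {threads} {reference} {input_file}",
           "import {input_file}.vcf"),
    parameters=("threads", "reference"),
)
instance = configure(model, threads=8, reference="ref.fa")
commands = render(instance, input_file="sample.fastq")
print(commands)
```

Separating the model from its instances means one design can be reused with different parameter sets, and one instance can be run on many input files.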
  10. Modeling Genome Data Processing Pipelines: Traditional vs. Optimized Approach
      ■ Results are imported into IMDB
      ■ Optimization reduced execution time by >50%
  11. Agenda
      ■ Modeling of Genome Data Processing Pipelines
      ■ Worker Framework
  12. Pipeline Execution: Overview
      [Diagram: Webservice, Scheduler, Workers, and In-Memory Database]
      1. Trigger task execution
      2. Schedule subtasks
      3. Execute subtasks

      Tasks (In-Memory Database)
      ID  Pipeline  Params
      12  BWA       xyz.fastq
      13  BWAmem    abc.fastq
      14  Bowtie2   xyz.fastq

      Subtasks (In-Memory Database)
      Task  ID  Job     Status  Params
      12    97  Split   done    xyz.fastq
      12    98  Import  todo    abc.vcf
      12    98  Import  done    abc.vcf
      ...
  13. Pipeline Execution: Software Components and Communication
      [Diagram labels: Site I and Site II, VPN, UDP, TCP, a Shared File System per site, Scheduler, Transmitter, and five nodes, each with three Workers and an IMDB]
  14. Pipeline Execution: Runtime Layer - Worker
      ■ Workers execute jobs one by one
      ■ Subtask execution status in IMDB:
        □ Ready (0),
        □ In Progress (1),
        □ Done (2), or
        □ Erroneous (3).
      ■ Jobs implemented as Python modules/classes
        □ Can contain arbitrary code
        □ Have access to IMDB
        □ Can read/write to shared working directory
      [Diagram: Node with three Workers and IMDB]
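A minimal sketch of the worker behavior described on the slide above: jobs are Python classes, and each subtask moves through the four status codes. The status values come from the slide. The `Job` base class, its `run` method, the example jobs, and the in-memory list standing in for the IMDB subtask table are assumptions, since the real job interface is not shown.

```python
from enum import IntEnum


class Status(IntEnum):
    """Status codes as listed on the slide."""
    READY = 0
    IN_PROGRESS = 1
    DONE = 2
    ERRONEOUS = 3


class Job:
    """Hypothetical base class; the real job interface is not shown on the slides."""

    def run(self, params):
        raise NotImplementedError


class CountLines(Job):
    """Toy job that counts the lines of a text parameter."""

    def run(self, params):
        return len(params["text"].splitlines())


class FailingJob(Job):
    """Toy job that always fails, to exercise the error path."""

    def run(self, params):
        raise RuntimeError("simulated failure")


def work(subtasks):
    """Execute ready subtasks one by one and record the status of each."""
    for subtask in subtasks:
        if subtask["status"] != Status.READY:
            continue
        subtask["status"] = Status.IN_PROGRESS
        try:
            subtask["result"] = subtask["job"].run(subtask["params"])
            subtask["status"] = Status.DONE
        except Exception as error:
            subtask["error"] = str(error)
            subtask["status"] = Status.ERRONEOUS


# A plain list stands in for the subtask table in the IMDB.
subtasks = [
    {"job": CountLines(), "params": {"text": "a\nb\nc"}, "status": Status.READY},
    {"job": FailingJob(), "params": {}, "status": Status.READY},
]
work(subtasks)
print([subtask["status"].name for subtask in subtasks])
```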
  15. Pipeline Execution: Coordination Layer - Scheduler
      [Diagram: Scheduler components Scheduling Algorithm, Pipeline Executor, and Resource Allocator, operating on subtasks i, k, ..., m]
  16. Pipeline Execution: Scheduling Algorithms
      ■ Scheduling algorithms are plug-in software modules
        □ “User-/Group-based” to let users execute their tasks on their local site only
        □ “Priority First” to prefer important users
        □ “High Throughput”, i.e. “shortest task first” to deal with high load
      ■ Scheduling algorithms can also be composed hierarchically
      [Diagram: hierarchical composition. Labels: Group-based, Site I, Site II, Priority-based, Prio A, Prio *, High-throughput (three times)]
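One way to picture pluggable, hierarchically composed scheduling algorithms is as plain functions that take a list of ready subtasks and return a filtered or reordered list. This is a sketch under assumptions, not the framework's plug-in interface: the `Subtask` fields, the convention that a lower priority number is more important, and the `compose` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    site: str
    priority: int            # assumption: lower number = more important
    estimated_minutes: int


def group_based(site):
    """Keep only subtasks that belong to the given site."""
    def select(subtasks):
        return [s for s in subtasks if s.site == site]
    return select


def priority_first(subtasks):
    """Order subtasks by priority."""
    return sorted(subtasks, key=lambda s: s.priority)


def high_throughput(subtasks):
    """Shortest task first."""
    return sorted(subtasks, key=lambda s: s.estimated_minutes)


def compose(*algorithms):
    """Combine algorithms, listed from the top of the hierarchy to the bottom.

    The lowest level is applied first. Because sorted() is stable, a
    higher-level ordering keeps the lower-level order among its ties.
    """
    def schedule(subtasks):
        for algorithm in reversed(algorithms):
            subtasks = algorithm(subtasks)
        return subtasks
    return schedule


ready = [
    Subtask("a", "Site I", 1, 30),
    Subtask("b", "Site I", 0, 50),
    Subtask("c", "Site II", 0, 5),
    Subtask("d", "Site I", 0, 10),
    Subtask("e", "Site I", 1, 20),
]
site_one = compose(group_based("Site I"), priority_first, high_throughput)
order = [s.name for s in site_one(ready)]
print(order)
```

In this toy setup, Site I sees only its own subtasks, ordered by priority and, within one priority, shortest first.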
  17. Pipeline Execution: Resource Allocator
      ■ Maintains lists of running and idle nodes
      ■ Idle worker requests new subtask for its assigned groups
      ■ If there is no matching subtask, it sleeps until a new subtask gets ready
      [Diagram: nodes at Site 1 and Site 2, each with three Workers and IMDB; states shown: Working, Working, Waiting]
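The request-and-sleep behavior on the slide above can be sketched with a condition variable. This covers only the last two bullets; it does not track running and idle nodes. The `ResourceAllocator` class, its method names, and the group labels are hypothetical, and a real deployment would coordinate across nodes rather than across threads in one process.

```python
import threading


class ResourceAllocator:
    """Hands ready subtasks to workers, matched by group (hypothetical sketch)."""

    def __init__(self):
        self._ready = []                         # (group, subtask name) pairs
        self._condition = threading.Condition()

    def add_subtask(self, group, name):
        """Register a ready subtask and wake any sleeping workers."""
        with self._condition:
            self._ready.append((group, name))
            self._condition.notify_all()

    def _find(self, groups):
        """Return the index of the first subtask matching one of the groups."""
        for index, (group, _) in enumerate(self._ready):
            if group in groups:
                return index
        return None

    def request_subtask(self, groups, timeout=None):
        """Return a matching subtask, sleeping until one is ready.

        Returns None if the timeout expires first.
        """
        with self._condition:
            found = self._condition.wait_for(
                lambda: self._find(groups) is not None, timeout)
            if not found:
                return None
            return self._ready.pop(self._find(groups))


allocator = ResourceAllocator()
received = []
worker = threading.Thread(
    target=lambda: received.append(
        allocator.request_subtask({"site-1"}, timeout=5)))
worker.start()
allocator.add_subtask("site-2", "import abc.vcf")    # no match for this worker
allocator.add_subtask("site-1", "split xyz.fastq")   # match: worker wakes up
worker.join()
print(received)
```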
  18. Pipeline Execution: Flexibility
      [Diagram labels: a node with IMDB only, a node with three Workers only, a node with three Workers and IMDB, Scheduler, UDP, TCP, File System Share]
  19. Pipeline Execution: Recoverability
      ■ All execution data is stored in IMDB
      ■ Temporary files on a shared file system
      ■ In case of any failure, the system-wide state can be restored
      [Diagram: Scheduler and four Workers around the IMDB, which holds Pipeline Tasks, Pipeline Subtasks, Events, and Data]
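Because all execution state lives in the database, one plausible recovery step after a crash is to reset subtasks that were left "in progress" so they get scheduled again. The slides do not describe the actual recovery procedure, so this is an assumption. The sketch uses an in-memory SQLite database as a stand-in for the IMDB; the table layout, the `recover` function, and the "Align" row are hypothetical.

```python
import sqlite3

# Status codes as listed on the worker slide.
READY, IN_PROGRESS, DONE, ERRONEOUS = 0, 1, 2, 3


def recover(connection):
    """Reset subtasks interrupted by a crash so they are scheduled again.

    Returns the number of subtasks that were reset.
    """
    with connection:  # commits on success, rolls back on error
        cursor = connection.execute(
            "UPDATE subtasks SET status = ? WHERE status = ?",
            (READY, IN_PROGRESS),
        )
    return cursor.rowcount


# SQLite stands in for the IMDB; the schema is an illustrative guess.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE subtasks ("
    "id INTEGER PRIMARY KEY, task_id INTEGER, job TEXT, "
    "status INTEGER, params TEXT)"
)
connection.executemany(
    "INSERT INTO subtasks VALUES (?, ?, ?, ?, ?)",
    [
        (97, 12, "Split", DONE, "xyz.fastq"),
        (98, 12, "Align", IN_PROGRESS, "xyz.fastq"),  # interrupted by the crash
        (99, 12, "Import", READY, "abc.vcf"),
    ],
)
connection.commit()

reset = recover(connection)
statuses = [row[0] for row in
            connection.execute("SELECT status FROM subtasks ORDER BY id")]
print(reset, statuses)
```

Completed subtasks are left untouched, so only interrupted work is repeated. That is safe only if jobs can be rerun against the temporary files on the shared file system.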
  20. Thanks!
      Hasso Plattner Institute, Enterprise Platform & Integration Concepts
      August-Bebel-Str. 88, 14482 Potsdam, Germany
      Dr. Matthieu-P. Schapranow, schapranow@hpi.de
      Cindy Perscheid, M. Sc., cindy.perscheid@hpi.de
      http://we.analyzegenomes.com/
