Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

These slides accompany the talk "Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences", given at the 2016 Festival of Genomics workshop "Big Medical Data in Precision Medicine: Challenges or Opportunities?" on Jan 19, 2016 in London.

Published in: Technology

  1. 1. Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences
     Dr. Matthieu-P. Schapranow
     Festival of Genomics, London, U.K., Jan 19, 2016
  2. 2. Recap: we.analyzegenomes.com, Real-time Analysis of Big Medical Data
     [Diagram: platform overview, labels from top to bottom]
     ■ Drug Response Analysis, Pathway Topology Analysis, Medical Knowledge Cockpit, Oncolyzer, Clinical Trial Recruitment, Cohort Analysis, ...
     ■ In-Memory Database
        □ Extensions for Life Sciences: Data Exchange, App Store; Access Control, Data Protection; Fair Use; Statistical Tools; Real-time Analysis; App-spanning User Profiles
        □ Combined and Linked Data: Genome Data, Cellular Pathways, Genome Metadata, Research Publications, Pipeline and Analysis Models, Drugs and Interactions
     ■ Indexed Sources
     Schapranow, Festival of Genomics, Jan 19, 2016, Challenges of Big Medical Data
  3. 3. Our Technology: In-Memory Database Technology
     ■ Combined column and row store
     ■ Map/Reduce
     ■ Single and multi-tenancy
     ■ Lightweight compression
     ■ Insert only for time travel
     ■ Real-time replication
     ■ Working on integers
     ■ SQL interface on columns and rows
     ■ Active/passive data store
     ■ Minimal projections
     ■ Group key
     ■ Reduction of software layers
     ■ Dynamic multi-threading
     ■ Bulk load of data
     ■ Object-relational mapping
     ■ Text retrieval and extraction engine
     ■ No aggregate tables
     ■ Data partitioning
     ■ Any attribute as index
     ■ No disk
     ■ On-the-fly extensibility
     ■ Analytics on historical data
     ■ Multi-core/parallelization
  4. 4. In-Memory Data Management Overview
     ■ Advances in Hardware
        □ 64-bit address space: 4 TB in current server boards
        □ 4 MB/ms/core data throughput
        □ Cost-performance ratio rapidly declining
        □ Multi-core architecture (6 x 12 core CPU per blade)
        □ Parallel scaling across blades
        □ 1 blade ≈ 50k USD = 1 enterprise class server
     ■ Advances in Software
        □ Row and Column Store
        □ Compression
        □ Partitioning
        □ Insert Only
        □ Parallelization
        □ Active & Passive Data Stores
  5. 5. Lightweight Compression
     ■ Main memory access is the new bottleneck
     ■ Lightweight compression can reduce this bottleneck because it
        □ is lossless
        □ improves usage of data bus capacity
        □ allows working directly on compressed data
     ■ Typical compression factor of 10:1 for enterprise software
     ■ In financial applications up to 50:1

     Table
     RecId  Code   Value   …
     1      C18.0  Colon   091487
     2      C32.0  Larynx  357982
     3      C00.9  Lip     123489
     4      C18.0  Colon   998711
     5      C20.0  Rectum  215678
     6      C20.0  Rectum  647912
     7      C50.9  Mama    167898
     8      C18.0  Colon   646470

     Attribute Vector (RecId: value)
     1: C18.0, 2: C32.0, 3: C00.9, 4: C18.0, 5: C20.0, 6: C20.0, 7: C50.9, 8: C18.0

     Data Dictionary (ValueId: Value)
     1: Larynx, 2: Lip, 3: Rectum, 4: Colon, 5: Mama

     Inverted Index (ValueId: RecIdList)
     1: 2
     2: 3
     3: 5, 6
     4: 1, 4, 8
     5: 7
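The three structures on the compression slide can be rebuilt in a few lines. The following is a minimal sketch, not the platform's implementation: the function name `dictionary_encode` is hypothetical, and value IDs are assigned in sorted order, which is an assumption and differs from the numbering shown on the slide.

```python
from collections import defaultdict

# Diagnosis column from the slide, in RecId order 1..8.
column = ["Colon", "Larynx", "Lip", "Colon", "Rectum", "Rectum", "Mama", "Colon"]


def dictionary_encode(values):
    """Return (dictionary, attribute_vector, inverted_index) for one column."""
    # Assumption: value IDs start at 1 and follow sorted order of the distinct values.
    dictionary = {value: vid for vid, value in enumerate(sorted(set(values)), start=1)}
    # The attribute vector stores one small integer per record instead of the string.
    attribute_vector = [dictionary[value] for value in values]
    # The inverted index maps each value ID to the record IDs that contain it.
    inverted_index = defaultdict(list)
    for rec_id, vid in enumerate(attribute_vector, start=1):
        inverted_index[vid].append(rec_id)
    return dictionary, attribute_vector, dict(inverted_index)


dictionary, attribute_vector, inverted_index = dictionary_encode(column)

# Query on compressed data: resolve the string once, then work on integers only.
colon_id = dictionary["Colon"]
colon_records = inverted_index[colon_id]

# Lossless: the original column can be restored from dictionary and vector.
id_to_value = {vid: value for value, vid in dictionary.items()}
decoded = [id_to_value[vid] for vid in attribute_vector]
```

Here `colon_records` holds the same record IDs the slide lists for Colon (1, 4, 8), and `decoded` equals the original column, which illustrates the lossless property.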
  6. 6. Combined Column and Row Store
     ■ Row stores are designed for operational workloads, e.g.
        □ Create and maintain meta data for tests
        □ Access a complete record of a trial or test series
     ■ Column stores are designed for analytical workloads, e.g.
        □ Evaluate the number of positive test results
        □ Identify correlations or test candidates
     ■ In-memory approach: combination of both stores
        □ Increased performance for analytical work
        □ Operational performance remains interactive
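The difference between the two layouts can be illustrated with plain Python containers. This is a sketch only; the test records and column names (`test_id`, `sample`, `positive`) are hypothetical and not taken from the slides.

```python
# Row layout: each record is stored contiguously, which suits operational access
# such as reading or updating one complete test record.
rows = [
    (1, "sample-a", True),
    (2, "sample-b", False),
    (3, "sample-c", True),
    (4, "sample-d", True),
]

# Column layout: each attribute is stored contiguously, which suits analytical
# queries that touch only a few attributes of many records.
column_names = ("test_id", "sample", "positive")
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}

# Operational access: fetch one complete record from the row layout.
complete_record = rows[1]

# Analytical access: count positive results by scanning a single column.
positive_count = sum(columns["positive"])
```

The analytical query reads only the `positive` column, while the operational lookup reads one tuple; a combined store keeps both access paths fast.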
  7. 7. Insert-Only / Append-Only
     ■ Traditional databases allow four data operations:
        □ INSERT, SELECT and
        □ DELETE, UPDATE
     ■ Insert-only requires only INSERT and SELECT to maintain a complete history (as in bookkeeping systems)
     ■ Insert-only enables time travel, e.g. to
        □ Trace changes and reconstruct decisions
        □ Document the complete history of changes, therapies, etc.
        □ Enable statistical observations
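A minimal sketch of the insert-only idea follows. The class `InsertOnlyTable`, its methods, and the patient data are hypothetical; a monotonically increasing version number stands in for a timestamp, which is an assumption made to keep the example deterministic.

```python
class InsertOnlyTable:
    """Append-only table: rows are never updated or deleted, only superseded."""

    def __init__(self):
        self._log = []  # list of (version, key, value), in insertion order

    def insert(self, key, value):
        """Append a new version of `key` and return its version number."""
        version = len(self._log) + 1
        self._log.append((version, key, value))
        return version

    def select(self, key, as_of=None):
        """Return the latest value of `key`, or its value as of a given version."""
        for version, logged_key, value in reversed(self._log):
            if logged_key == key and (as_of is None or version <= as_of):
                return value
        return None

    def history(self, key):
        """Return every recorded (version, value) pair for `key`."""
        return [(version, value) for version, k, value in self._log if k == key]


therapies = InsertOnlyTable()
first = therapies.insert("patient-1", "chemotherapy")
therapies.insert("patient-2", "surgery")
therapies.insert("patient-1", "radiotherapy")  # a change is a new insert

current = therapies.select("patient-1")
earlier = therapies.select("patient-1", as_of=first)
```

An "update" is expressed as a further insert, so `select` with `as_of` can reconstruct what was known at an earlier point, and `history` documents every change.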
  8. 8. Partitioning
     ■ Horizontal Partitioning
        □ Cut long tables into shorter segments
        □ E.g. to group samples with same relevance
     ■ Vertical Partitioning
        □ Split off columns to individual resources
        □ E.g. to separate personalized data from experiment data
     ■ Partitioning is the basis for
        □ Parallel execution of database queries
        □ Implementation of data aging and data retention management
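Both partitioning schemes can be sketched on a list of dictionaries. The helper names and the sample records below are hypothetical; the sketch assumes the key column is not listed among the split-off columns.

```python
def horizontal_partition(rows, size):
    """Cut a long table into segments of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def vertical_partition(rows, key, split_columns):
    """Split off `split_columns` into their own table; both parts keep `key`."""
    split_off = [{key: r[key], **{c: r[c] for c in split_columns}} for r in rows]
    remainder = [{c: v for c, v in r.items() if c not in split_columns} for r in rows]
    return split_off, remainder


samples = [
    {"sample_id": 1, "patient_name": "A. Example", "result": 0.7},
    {"sample_id": 2, "patient_name": "B. Example", "result": 0.2},
    {"sample_id": 3, "patient_name": "C. Example", "result": 0.9},
]

segments = horizontal_partition(samples, size=2)
personal, experiment = vertical_partition(samples, "sample_id", ["patient_name"])
```

The horizontal segments can be scanned independently, and the vertical split keeps person-related columns apart from experiment columns while `sample_id` still links the two parts.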
  9. 9. Multi-core and Parallelization
     ■ Modern server systems consist of x CPUs, each with y CPU cores, e.g. x = 6 and y = 12
     ■ Consider each of the x*y CPU cores as an individual worker, e.g. 6x12 = 72 workers
     ■ Each worker performs one task at a time, in parallel with the other workers
     ■ A full scan of a database table with 1M entries takes about 1/(x*y) of the sequential search time when the table is traversed in parallel
        □ Reduced response time
        □ No need for pre-aggregated totals and redundant data
        □ Improved usage of hardware
        □ Instant analysis of data
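The scan described on this slide splits the table into one chunk per worker and merges the partial results. The sketch below shows that structure with Python's `concurrent.futures`; the function names and the predicate are hypothetical. It uses threads to stay self-contained, so in CPython it demonstrates the partition-and-merge pattern rather than the ideal 1/(x*y) speed-up, which assumes truly independent cores.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk, predicate):
    """Sequentially count the entries of one chunk that satisfy the predicate."""
    return sum(1 for value in chunk if predicate(value))


def parallel_count(values, predicate, workers):
    """Split `values` into about `workers` chunks, scan them concurrently, merge counts."""
    size = max(1, -(-len(values) // workers))  # ceiling division, at least 1
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda chunk: scan_chunk(chunk, predicate), chunks)
        return sum(partial_counts)


x, y = 6, 12                      # CPUs and cores per CPU, as in the slide's example
table = list(range(1_000_000))    # stand-in for a table with 1M entries
matches = parallel_count(table, lambda value: value % 1000 == 0, workers=x * y)
```

With 72 workers each chunk holds roughly 13,889 entries, so every worker scans only a small slice of the table before the partial counts are summed.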
  10. 10. Active and Passive Data Store
      ■ Consider two categories of data stores: active and passive
      ■ Active data are accessed frequently and updates are expected, e.g.
         □ Most recent experiment results, e.g. from the last two weeks
         □ Samples that have not been processed yet
      ■ Passive data are used for analytical and statistical purposes, e.g.
         □ Samples that were processed 5 years ago
         □ Meta data about seeds that are no longer produced
      ■ Moving passive data to slower storage
         □ Reduces main memory demands
         □ Improves performance for active data
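A simple aging rule along these lines could look as follows. This is a sketch under stated assumptions: the function `split_active_passive`, the record layout, and the sample data are hypothetical, and the 14-day window mirrors the "last two weeks" example from the slide.

```python
from datetime import date, timedelta


def split_active_passive(records, today, active_days=14):
    """Partition records into active and passive sets by processing date."""
    cutoff = today - timedelta(days=active_days)
    active, passive = [], []
    for record in records:
        processed_on = record["processed_on"]  # None means not processed yet
        if processed_on is None or processed_on >= cutoff:
            active.append(record)    # stays in main memory
        else:
            passive.append(record)   # candidate for slower storage
    return active, passive


today = date(2016, 1, 19)
records = [
    {"sample": "s1", "processed_on": date(2016, 1, 12)},  # processed last week
    {"sample": "s2", "processed_on": None},               # not processed yet
    {"sample": "s3", "processed_on": date(2011, 1, 19)},  # processed 5 years ago
]
active, passive = split_active_passive(records, today)
```

Recent and unprocessed samples remain active, while the five-year-old sample lands in the passive set that could be moved off main memory.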
  11. 11. Reduction of Application Layers
      ■ Layers are introduced to abstract from complexity
      ■ Each layer offers complete functionality, e.g. meta data of samples
      ■ Fewer layers result in
         □ Less code to maintain
         □ More specific code
         □ Reduced resource demands
         □ Improved application performance, because obsolete processing steps are eliminated
  12. 12. In-Memory Database Technology: Hardware Characteristics at HPI FSOC Lab
      ■ 1,000 core cluster at Hasso Plattner Institute with 25 TB main memory
      ■ 25 nodes, each consisting of:
         □ 40 cores
         □ 1 TB main memory
         □ Intel® Xeon® E7-4870
         □ 2.40 GHz
         □ 30 MB cache
  13. 13. In-Memory Database Technology, Use Case: Analysis of Genomic Data

      Analysis of Genomic Data   Alignment and Variant Calling   Analysis of Annotations in World-wide DBs
      Bound To                   CPU Performance                 Memory Capacity
      Duration                   Hours to Days                   Weeks
      HPI                        Minutes                         Real-time
      In-Memory Technology       Multi-Core                      Partitioning & Compression
  14. 14. Federated In-Memory Database (FIMDB): Incorporating Local Compute Resources
      [Diagram: federated in-memory database instances across sites]
      ■ Site A: FIMDB A.1, FIMDB A.2, FIMDB A.3, FIMDB A.4, FIMDB A.5
      ■ Site B: FIMDB B.1, FIMDB B.2, FIMDB B.3
      ■ Cloud Service Provider: FIMDB C.1
      ■ Federated In-Memory Database Instance, Algorithms, and Applications Managed by Service Provider
      ■ Federated In-Memory Database Instances: Master Data Managed by Service Provider
      ■ Sensitive Data reside at Site
  15. 15. Multiple Sites Forming the Federated In-Memory Database System
      [Diagram: Federated In-Memory Database System spanning Site A, Cloud Provider, and Site B]
      ■ Cloud Provider: Cloud IMDB Instance with Master Data and Shared Algorithms
      ■ Site A: Local IMDB Instance with Sensitive Data, e.g. Patient Data
      ■ Site B: Local IMDB Instance with Sensitive Data, e.g. Patient Data
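The principle shown in the two federation slides, shared algorithms travel to the sites while sensitive data stay local, can be sketched as follows. This is an illustration of the idea only and makes no claim about how the FIMDB is implemented; the `Site` class, the `count_by_diagnosis` algorithm, and all records are hypothetical.

```python
from collections import Counter


class Site:
    """A site with a local data store; raw patient records never leave it."""

    def __init__(self, name, patient_records):
        self.name = name
        self._patient_records = patient_records  # sensitive, kept local

    def run(self, algorithm):
        """Execute a shared algorithm locally and return only its result."""
        return algorithm(self._patient_records)


def count_by_diagnosis(records):
    """Shared algorithm: aggregate counts contain no patient identifiers."""
    return Counter(record["diagnosis"] for record in records)


def federated_query(sites, algorithm):
    """Send the algorithm to every site and merge the aggregated results."""
    total = Counter()
    for site in sites:
        total += site.run(algorithm)
    return total


site_a = Site("Site A", [
    {"patient": "a-1", "diagnosis": "C18.0"},
    {"patient": "a-2", "diagnosis": "C20.0"},
])
site_b = Site("Site B", [
    {"patient": "b-1", "diagnosis": "C18.0"},
])

totals = federated_query([site_a, site_b], count_by_diagnosis)
```

Only the per-diagnosis counts cross site boundaries; the merged `totals` answer the cohort question without any site exposing its patient records.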
  16. 16. Software Architecture
      ■ Front-End
         □ Any user interface supported, e.g. UI5, JavaScript, iOS
         □ Communication protocol: Ajax/JSON
      ■ API Server
         □ Ruby on Rails
         □ Translates into database queries
      ■ Platform
         □ IMDB with extensions for Life Sciences, e.g. worker framework, scheduler, third-party tools, etc.

      [Diagram: High-Performance In-Memory Genome Cloud Platform]
      ■ Users: Researcher with Web Browser, Clinician with Mobile Device (via Internet and Intranet)
      ■ Application Layer: Specific Cloud Applications, Specific Mobile Application
      ■ Platform Layer: Analytical Data Processing, Graph Processing, Scheduling and Worker Framework, Annotation and Updater Framework, Accounting, Security, Extensions
      ■ Data Layer: In-Memory Database Instances
         □ Master Data, e.g. Reference Genomes, Annotations, and Clinical Trials
         □ Transactional Data, e.g. Genome Reads, Patient EMR Data, and Worker Status
  17. 17. Keep in contact with us!
      Dr. Matthieu-P. Schapranow
      Program Manager E-Health
      Enterprise Platform & Integration Concepts (EPIC)
      Hasso Plattner Institute
      August-Bebel-Str. 88, 14482 Potsdam, Germany
      schapranow@hpi.de
      http://we.analyzegenomes.com/
