Introduction to High-performance In-memory Genome Project at HPI

766 views
648 views

Published on

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
766
On SlideShare
0
From Embeds
0
Number of Embeds
7
Actions
Shares
0
Downloads
22
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Introduction to High-performance In-memory Genome Project at HPI

  1. 1. Challenges of Big Data Processing in Personalized Medicine Trends and Concepts, Potsdam Nov. 28, 2013 Dr. Matthieu Schapranow Hasso Plattner Institute
  2. 2. Our Vision Personalized Medicine 2 Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  3. 3. Our Vision Personalized Medicine 3 Desirability ■  Leveraging directed customer services ■  Portfolio of integrated services for clinicians, researchers, and patients ■  Include latest research results, e.g. most effective therapies Viability Feasibility ■  Enable personalized medicine also in faroff regions and developing countries to exceed customer base ■  HiSeq 2500 enables highcoverage whole genome sequencing in ≈1d ■  Share data via the Internet to get feedback from word-wide experts (costsaving) ■  IMDB enables allele frequency determination of 12B records within <1s ■  Combine research data (publications, annotations, genome data) from international databases in a single knowledge base ■  Identification of relevant annotations out of 80M <1s ■  Data preparation as a service reduces TCO Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  4. 4. Our Challenge Data Challenges 4 Human genome/biological data 600GB per full genome 15PB+ in databases of leading institutes Hospital information systems Often more than 50GB Cancer patient records >160k records at NCT Prescription data 1.5B records from 10,000 doctors and 10M Patients (100 GB) Human proteome 160M data points (2.4GB) per sample >3TB raw proteome data in ProteomicsDB PubMed database >23M articles Medical sensor data Scan of a single organ in 1s creates 10GB of raw data Clinical trials Currently more than 30k recruiting on ClinicalTrials.gov Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  5. 5. Our Approach In-Memory Technology ● ● ● ● Read Event Read Event Repositories Repositories Combined column and row store Insert only for time travel + + + ++ A up to 2.000 up to 2.000 requests requests per second per second Minimal projections Any attribute as index Bulk load Discovery Service Discovery Service Multi-core/ parallelization SAP HANA SAP HANA P Active/passive data store Dynamic multithreading within nodes No aggregate tables +++ up to 8.000 read up to 8.000 read event notifications event notifications per second per second On-the-fly extensibility Map reduce P P t A A Lightweight Compression Partitioning SQL Analytics on historical data Single and multi-tenancy Object to relational mapping Group Key SQL interface on columns & rows Reduction of layers x x 5 Verification Verification Services Services T Text Retrieval and Extraction No disk Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  6. 6. High-Performance In-Memory Genome Project Architectural Overview 6 Cohort Analysis Pathway Finder Analytical Tools Paper Search Access Control Clinical Trial Finder App Store Extensions Billing Genome Data Papers Pathways Pipeline ... Genome Metadata Data Pipeline Editor Pipeline Models ... In-Memory Database Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013 ...
  7. 7. High-Performance In-Memory Genome Project Selected Research Topics 7 Improving Analyses: ■  Medical Knowledge Cockpit, Oncolyzer ■  Genome Browser enables deep dive into the genome ■  Cohort Analysis, e.g. clustering of patient cohorts ■  Combined Search, e.g. in clinical trials and side-effect databases ■  Pathways Topology Analysis, e.g. to identify cause/effect Improving Data Preparations: ■  Graphical modeling of Genome Data Processing (GDP) pipelines ■  Scheduling and execution of multiple GPD pipelines in parallel ■  App store for medical knowledge (bring algorithms to data) ■  Exchange of sensitive data, e.g. history-based access control ■  Billing processes for intellectual property and services Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  8. 8. High-Performance In-Memory Genome Project Medical Knowledge Cockpit ■  Search for affected genes in distributed and heterogeneous data sources 8 ■  Immediate exploration of relevant information, such as □  Gene descriptions, □  Molecular impact and related pathways, Unified access to structured and unstructured data sources Automatic clinical trial matching using HANA text analysis features □  Scientific publications, and □  Suitable clinical trials. ■  No manual searching for hours or days – In-memory technology translates it into interactive finding Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  9. 9. High-Performance In-Memory Genome Project Genome Browser ■  Genome Browser: Comparison of multiple genomes with reference 9 ■  Combined knowledge base: latest relevant annotations and literature, e.g. NCBI, dbSNP, and UCSC ■  Detailed exploration of genome locations and existing associations Unified access to multiple formerly disjoint data sources Matching of variants with relevant international annotations ■  Ranked variants, e.g. accordingly to known diseases ■  Links to more detailed sources enable fast identification of relevant information while eliminating longlasting searches. Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  10. 10. High-Performance In-Memory Genome Project Analysis of Patient Cohorts ■  In a patient cohort, a subset does not respond to therapy – why? 10 ■  Clustering using various statistical algorithms, such as k-means or hierarchical clustering ■  Calculation of all locus combinations in which at least 5% of all TCGA participants have mutations: 200ms for top 20 combinations Fast clustering directly inside HANA ■  Individual clusters are calculated in parallel directly within the database ■  K-means algorithm: 50ms (PAL) vs. 500ms (R) Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  11. 11. High-Performance In-Memory Genome Project Search in Structured and Unstructured Medical Data ■  Extended text analysis feature by medical terminology 11 □  Genes (122,975 + 186,771 synonyms) □  medical terms and categories (98,886 diseases + 48,561 synonyms, 47 categories) Unified access to structured and unstructured data sources Clinical trial matching using text analysis features □  pharmaceutical ingredients (7,099 + 5,561 synonyms) ■  Imported clinicaltrials.gov database (145k trials/30,138 recruiting) ■  Extracted, e.g., 320k genes, 161k ingredients, 30k periods ■  Select all studies based on multiple filters in <500ms Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  12. 12. High-Performance In-Memory Genome Project Pathway Analysis ■  Search in pathways is limited to “is a certain element contained” today. 12 ■  Integrated >1,5k pathways from international sources, e.g. KEGG, HumanCyc, and WikiPathways, into HANA Unified access to multiple formerly disjoint data sources ■  Implemented graph-based topology exploration and ranking based on patient specifics ■  Enables interactive identification of possible dysfunctions affecting the course of a therapy before its start Pathway analysis of genetic variations with Graph Engine Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  13. 13. Our Vision 13 For researchers: ■  Enable real-time analysis of genome data ■  Automatic scan of pathways to identify cellular impact of mutations ■  Free-text search in publications, diagnosis, and EMR data (structured and unstructured data) For clinicians: ■  Preventive diagnostics to identify risk patients ■  Indicate pharmacokinetic correlations ■  Scan for comparable patient cases For patients: ■  Identify relevant clinical trials / experts ■  Start most appropriate therapy early based on all evidences and latest knowledge Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013
  14. 14. Thank you for your interest! Keep in contact with us. 14 Dr. Matthieu-P. Schapranow schapranow@hpi.uni-potsdam.de http://we.analyzegenomes.com/ Hasso Plattner Institute Enterprise Platform & Integration Concepts Dr. Matthieu-P. Schapranow August-Bebel-Str. 88 14482 Potsdam, Germany Challenges of Big Data Processing in Personalized Medicine, Schapranow, Nov. 28, 2013

×