Enabling Real-Time Genome Data Research with In-Memory Database Technology (Illumina GIA Meeting)

  1. Enabling Real-Time Genome Data Research with In-Memory Database Technology
     May 30, 2013
     Dr. Matthieu Schapranow, Hasso Plattner Institute
     Dr. Anja Bog, SAP Labs LLC
  2. Numbers You Should Know: Comparison of Costs
     Enabling Real-Time Genome Data Research, GIA Meeting, Dr. Schapranow and Dr. Bog, May 30, 2013
     [Chart: Comparison of Costs for Main Memory and Genome Analysis. Y-axis: costs in USD, with ticks from 0.01 to 10,000. X-axis: January 2001 to January 2012. Series: costs per megabyte RAM, costs per megabase sequencing.]
  3. In-Memory Technology: A Toolbox for Big Data Analysis
     ■ Any attribute as index
     ■ Insert only for time travel
     ■ Combined column and row store
     ■ No aggregate tables
     ■ Minimal projections
     ■ Partitioning
     ■ Analytics on historical data
     ■ Single and multi-tenancy
     ■ SQL interface on columns & rows
     ■ Reduction of layers
     ■ Lightweight compression
     ■ Multi-core/parallelization
     ■ On-the-fly extensibility
     ■ Active/passive data store
     ■ Bulk load
     ■ Text retrieval and extraction
     ■ Object to relational mapping
     ■ Dynamic multi-threading within nodes
     ■ Map reduce
     ■ No disk
     ■ Group key
     [Figure labels: Discovery Service, Read Event Repositories, Verification Services, SAP HANA; up to 8,000 read event notifications per second; up to 2,000 requests per second]
  4. High-Performance In-Memory Genome Project: Challenges of Genome Data Analysis
     Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
     Bound To | CPU Performance | Memory Capacity
     Duration | Hours – Days | Weeks
     HPI | Minutes | Real-time
     In-Memory Technology | Multi-Core | Partitioning & Compression
  5. High-Performance In-Memory Genome Project: Challenges of Genome Data Analysis
     Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
     Bound To | CPU Performance | Memory Capacity
     Duration | Hours – Days | Weeks
     HPI & SAP | Minutes – Hours | Interactively
     In-Memory Technology | Multi-Core | Partitioning & Compression
  6. High-Performance In-Memory Genome Project: Selected Research Topics
     Improving Analyses:
     ■ Clustering of patient cohorts, e.g. k-means clustering
     ■ Combined search, e.g. in clinical trials and side-effect databases
     ■ Ad-hoc analysis of genetic pathways, e.g. to identify cause/effect
     Improving Data Preparations:
     ■ Graphical modeling of Genome Data Processing (GDP) pipelines
     ■ Scheduling and execution of multiple GDP pipelines in parallel
     ■ App store for medical knowledge (bring algorithms to data)
     ■ Exchange of sensitive data, e.g. history-based access control
     ■ Billing processes for intellectual property and services
  7. Genomics Analysis
     Loaded part of the 1,000 genomes pre-phase 1 dataset:
     ■ Chromosome 1 of 629 individuals from the 1,000 genomes project
     ■ 12 billion entries in largest database table
     ■ 293 GB of data (compressed in HANA)
     Results:
     ■ Report SNPs failing quality control: UCSC 102.47 sec | SAP HANA 1.25 sec (82x faster)
     ■ Compute the alternative allele frequency for each variant/region: VCFtools 259 sec | SAP HANA 0.43 sec (600x faster)
     ■ Compute the total number of missing genotypes per individual: VCFtools 548 sec | SAP HANA 2 sec (270x faster)
     Supported by Dr. Carlos Bustamante lab
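The two VCFtools-style computations named on this slide can be illustrated with a minimal Python sketch. It is not the SAP HANA or VCFtools implementation; the sample names, variant IDs, and genotype values are hypothetical toy data, and genotypes are assumed to use VCF-style strings where `./.` marks a missing call.

```python
# Hypothetical toy data: one row per variant, one genotype string per individual.
# "./." marks a missing genotype, following the VCF convention.
SAMPLES = ["ind1", "ind2", "ind3"]
GENOTYPES = {
    "rs_a": ["0/0", "0/1", "1/1"],
    "rs_b": ["0/1", "./.", "0/0"],
    "rs_c": ["./.", "./.", "1/1"],
}


def alt_allele_frequency(calls):
    """Return the fraction of called alleles that are non-reference.

    Missing alleles (".") are excluded from the denominator.
    Returns None when no allele was called at all.
    """
    alleles = [
        allele
        for genotype in calls
        for allele in genotype.replace("|", "/").split("/")
        if allele != "."
    ]
    if not alleles:
        return None
    return sum(allele != "0" for allele in alleles) / len(alleles)


def missing_per_individual(genotypes, samples):
    """Count, for each individual, the variants with a missing genotype."""
    counts = {sample: 0 for sample in samples}
    for calls in genotypes.values():
        for sample, genotype in zip(samples, calls):
            if "." in genotype:
                counts[sample] += 1
    return counts


if __name__ == "__main__":
    for variant, calls in GENOTYPES.items():
        print(variant, alt_allele_frequency(calls))
    print(missing_per_individual(GENOTYPES, SAMPLES))
```

Both functions scan a single attribute (the genotype) across many rows, which is the access pattern a column-oriented store is designed for.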
  8. Working With Big Data
     Loaded the entire 1,000 genomes pre-phase 1 dataset:
     ■ Queries on all chromosomes for all 629 individuals
     ■ 136 billion entries in largest database table
     ■ ≈1.2 TB (compressed in HANA)
     Query results using R connectivity: report all variations in BRCA1 and BRCA2
     [Chart axis labels: Chromosome, Absolute frequency, Number of Alleles]
     Supported by Dr. Carlos Bustamante lab
  9. High-Performance In-Memory Genome Project: Analysis of Patient Cohorts
     Why does a therapy work in only 80% of the patient cases?
     ■ Columnar storage optimizes space requirements while enhancing calculation performance
     ■ Single k-means clustering: R 470 ms vs. HANA 30 ms (15:1)
     ■ >60k clusters are calculated in <2 s on a 1,000-core cluster
     ■ → Interactive exploration of clusters becomes possible
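For readers unfamiliar with k-means, the clustering step mentioned on this slide can be sketched in plain Python as follows. This is a generic Lloyd's-algorithm sketch, not the R or HANA implementation that was benchmarked; the two-feature patient vectors are hypothetical, and the farthest-first initialization is an assumption chosen only to keep the example deterministic.

```python
import math

# Hypothetical patient feature vectors (two numeric features per patient).
POINTS = [
    (1.0, 1.0), (1.2, 0.8), (5.0, 5.0),
    (5.2, 4.8), (0.9, 1.1), (4.9, 5.1),
]


def farthest_first(points, k):
    """Deterministic initialization: start with the first point, then
    repeatedly add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = farthest_first(points, k)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return centroids, assignment


if __name__ == "__main__":
    centers, labels = kmeans(POINTS, k=2)
    print(centers)
    print(labels)
```

Each run is independent of the others, which is why many clusterings with different parameters can be distributed across cores.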
  10. High-Performance In-Memory Genome Project: Integration of Genetic Pathways
     What are the known effects of a somatic mutation?
     ■ Storing and accessing graph data within the in-memory database (Active Information Store)
     ■ 263 KEGG pathways with 6,481 genetic components, 32,784 vertices, and 90,682 edges
     ■ Rank all pathways by evaluation of node connections: IMDB <350 ms
     ■ >5.5k rankings can be calculated in <2 s on a 1,000-core cluster
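The slide does not say how node connections are evaluated. As one possible reading, the sketch below ranks pathways by how many edges touch a given gene (its degree within each pathway). The scoring rule, the pathway names, and the gene identifiers are all hypothetical and serve only to illustrate the idea of ranking graph-structured pathway data.

```python
# Hypothetical toy pathways, each stored as a list of (source, target) edges.
PATHWAYS = {
    "pathway_A": [("G1", "G2"), ("G2", "G3")],
    "pathway_B": [("G1", "G4"), ("G1", "G5"), ("G6", "G1")],
    "pathway_C": [("G7", "G8")],
}


def rank_pathways(pathways, gene):
    """Rank pathways by the number of edges that involve `gene`.

    Pathways that do not contain the gene are omitted. Ties are broken
    alphabetically so that the result is deterministic.
    """
    scores = {}
    for name, edges in pathways.items():
        degree = sum(1 for source, target in edges if gene in (source, target))
        if degree:
            scores[name] = degree
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(rank_pathways(PATHWAYS, "G1"))
```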
  11. High-Performance In-Memory Genome Project: Combined Search in Structured and Unstructured Data
     Which clinical trials are relevant for an individual patient?
     ■ In-memory technology enables entity extraction, e.g. age, genes, and drugs
     ■ Integrated 30k free-text documents from clinicaltrials.gov
     ■ Relational search on entities enables interactive comparison
     ■ Results are rated by relevant search criteria
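The combination of entity extraction and relational search can be sketched as follows. This is a deliberately simple dictionary-and-regex approach, not the text analysis engine used in the project; the trial texts, trial IDs, and drug names are hypothetical, and the age pattern is an assumption about how ages might be phrased.

```python
import re

# Hypothetical vocabularies; a real system would use curated dictionaries.
GENE_TERMS = {"BRCA1", "BRCA2"}
DRUG_TERMS = {"drug_x", "drug_y"}

# Assumed phrasing for age ranges, e.g. "18 to 65 years" or "40-80 years".
AGE_PATTERN = re.compile(r"(\d+)\s*(?:-|to)\s*(\d+)\s*years")

# Hypothetical free-text trial descriptions.
TRIALS = {
    "trial_1": "Patients aged 18 to 65 years with BRCA1 mutation receive drug_x.",
    "trial_2": "Adults 40-80 years carrying BRCA2 variants; comparator drug_y.",
}


def extract_entities(text):
    """Turn free text into a structured record of genes, drugs, and age range."""
    tokens = set(re.findall(r"[A-Za-z0-9_]+", text))
    age = AGE_PATTERN.search(text)
    return {
        "genes": sorted(tokens & GENE_TERMS),
        "drugs": sorted(t for t in tokens if t.lower() in DRUG_TERMS),
        "age_range": (int(age.group(1)), int(age.group(2))) if age else None,
    }


# Extract once, then search the structured records instead of the raw text.
INDEX = {trial_id: extract_entities(text) for trial_id, text in TRIALS.items()}


def matching_trials(index, gene, age):
    """Return IDs of trials that mention `gene` and whose age range covers `age`."""
    hits = []
    for trial_id, entities in index.items():
        age_range = entities["age_range"]
        if gene in entities["genes"] and age_range and age_range[0] <= age <= age_range[1]:
            hits.append(trial_id)
    return sorted(hits)


if __name__ == "__main__":
    print(INDEX["trial_1"])
    print(matching_trials(INDEX, "BRCA1", 50))
```

Once entities are stored as structured attributes, questions such as "trials for a 50-year-old BRCA1 carrier" become ordinary filters rather than full-text scans.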
  12. High-Performance In-Memory Genome Project: Architectural Overview
     [Architecture diagram labels: Cohort Analysis, Pathway Finder, Paper Search, Clinical Trial Finder, Pipeline Editor, ...; Extensions; App Store, Access Control, Billing, ...; In-Memory Database; Pipeline Data, Genome Data, Pathways, Genome Metadata, Papers, Pipeline Models, Analytical Tools, ...]
  13. The Future: Combined Information Requirements
     Enable clinicians to:
     ■ Make evidence-based therapy decisions at the patient’s bed
     ■ Exchange latest patient data with international experts
     Enable researchers to:
     ■ Investigate genomes of patient cohorts to derive new knowledge
     ■ Analyze results in real-time
     Enable patients to:
     ■ Identify risk factors long before they turn into diseases
     ■ Identify experts and similar patient cases to bring up alternatives for individual therapies
  14. Thank you for your interest! Keep in contact with us.
     SAP Labs LLC
     Dr. Anja Bog
     3410 Hillview Avenue
     94304 Palo Alto, CA
     anja.bog@sap.com
     Hasso Plattner Institute
     Enterprise Platform & Integration Concepts
     Dr. Matthieu-P. Schapranow
     August-Bebel-Str. 88
     14482 Potsdam, Germany
     schapranow@hpi.uni-potsdam.de
     http://j.mp/schapranow
