Enabling Real-Time Genome Data Research with In-Memory Database Technology (Illumina GIA Meeting)
 

    Presentation Transcript

    • Enabling Real-Time Genome Data Research with In-Memory Database Technology
      May 30, 2013
      Dr. Matthieu Schapranow, Hasso Plattner Institute
      Dr. Anja Bog, SAP Labs LLC
    • Numbers You Should Know
      Comparison of Costs
      Enabling Real-Time Genome Data Research, GIA Meeting, Dr. Schapranow and Dr. Bog, May 30, 2013
      [Figure: Comparison of Costs for Main Memory and Genome Analysis. Y-axis: costs in USD (logarithmic, 0.01 to 10,000). X-axis: January 2001 to January 2012. Series: costs per megabyte RAM; costs per megabase sequencing.]
    • In-Memory Technology
      A Toolbox for Big Data Analysis
      ■ Any attribute as index
      ■ Insert only for time travel
      ■ Combined column and row store
      ■ No aggregate tables
      ■ Minimal projections
      ■ Partitioning
      ■ Analytics on historical data
      ■ Single and multi-tenancy
      ■ SQL interface on columns & rows
      ■ Reduction of layers
      ■ Lightweight compression
      ■ Multi-core/parallelization
      ■ On-the-fly extensibility
      ■ Active/passive data store
      ■ Bulk load
      ■ Text retrieval and extraction
      ■ Object to relational mapping
      ■ Dynamic multi-threading within nodes
      ■ Map reduce
      ■ No disk
      ■ Group key
      [Diagram labels: Discovery Service, Read Event Repositories, Verification Services, SAP HANA; up to 8,000 read event notifications per second; up to 2,000 requests per second]
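To make two of the toolbox items above more concrete (column storage and lightweight compression), the following is a minimal Python sketch of dictionary encoding for a single column. It is an illustration under stated assumptions, not SAP HANA's implementation: the function names and the toy genotype column are hypothetical.

```python
from typing import List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[List[str], List[int]]:
    """Return (sorted dictionary of distinct values, one integer ID per row)."""
    dictionary = sorted(set(column))
    position = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in column]


def count_equal(dictionary: List[str], value_ids: List[int], value: str) -> int:
    """Count rows equal to `value` by scanning the integer IDs only."""
    if value not in dictionary:
        return 0
    target = dictionary.index(value)
    return sum(1 for value_id in value_ids if value_id == target)


# Hypothetical toy column: one genotype call per row.
genotype_column = ["0|0", "0|1", "0|0", "1|1", "0|0", "0|1"]
dictionary, value_ids = dictionary_encode(genotype_column)
homozygous_reference_rows = count_equal(dictionary, value_ids, "0|0")
```

Each distinct value is stored once and rows hold only small integers, so a column with few distinct values shrinks considerably and scans compare integers rather than strings.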
    • High-Performance In-Memory Genome Project
      Challenges of Genome Data Analysis
      Analysis of Genomic Data   Alignment and Variant Calling   Analysis of Annotations in World-wide DBs
      Bound To                   CPU Performance                 Memory Capacity
      Duration                   Hours – Days                    Weeks
      HPI                        Minutes                         Real-time
      In-Memory Technology       Multi-Core                      Partitioning & Compression
    • High-Performance In-Memory Genome Project
      Challenges of Genome Data Analysis
      Analysis of Genomic Data   Alignment and Variant Calling   Analysis of Annotations in World-wide DBs
      Bound To                   CPU Performance                 Memory Capacity
      Duration                   Hours – Days                    Weeks
      HPI & SAP                  Minutes – Hours                 Interactively
      In-Memory Technology       Multi-Core                      Partitioning & Compression
    • High-Performance In-Memory Genome Project
      Selected Research Topics
      Improving Analyses:
      ■ Clustering of patient cohorts, e.g. k-means clustering
      ■ Combined search, e.g. in clinical trials and side-effect databases
      ■ Ad-hoc analysis of genetic pathways, e.g. to identify cause/effect
      Improving Data Preparations:
      ■ Graphical modeling of Genome Data Processing (GDP) pipelines
      ■ Scheduling and execution of multiple GDP pipelines in parallel
      ■ App store for medical knowledge (bring algorithms to data)
      ■ Exchange of sensitive data, e.g. history-based access control
      ■ Billing processes for intellectual property and services
    • Genomics Analysis
      Loaded part of the 1,000 Genomes pre-phase 1 dataset
      ■ Chromosome 1 of 629 individuals from the 1,000 Genomes Project
      ■ 12 billion entries in largest database table
      ■ 293 GB of data (compressed in HANA)
      Results
      ■ Report SNPs failing quality control:
        UCSC 102.47 sec | SAP HANA 1.25 sec (82x faster)
      ■ Compute the alternative allele frequency for each variant/region:
        VCFtools 259 sec | SAP HANA 0.43 sec (600x faster)
      ■ Compute the total number of missing genotypes per individual:
        VCFtools 548 sec | SAP HANA 2 sec (270x faster)
      Supported by Dr. Carlos Bustamante's lab
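For readers unfamiliar with the last two benchmark queries above, the following Python sketch shows what they compute on a tiny, invented genotype matrix. It is not the HANA or VCFtools implementation. The variant IDs, the data, and the function names are hypothetical; the sketch also assumes that any non-reference allele counts as "alternative" and that a genotype is missing if any of its alleles is ".".

```python
from typing import Dict, List


def split_alleles(call: str) -> List[str]:
    """Split a VCF-style genotype such as '0|1' or './.' into its alleles."""
    return call.replace("|", "/").split("/")


def alt_allele_frequency(calls: List[str]) -> float:
    """Fraction of called (non-missing) alleles that are not the reference '0'."""
    alt = total = 0
    for call in calls:
        for allele in split_alleles(call):
            if allele == ".":
                continue  # missing alleles do not count toward the total
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else 0.0


def missing_genotypes_per_individual(
    genotypes: Dict[str, List[str]], n_individuals: int
) -> List[int]:
    """Count, per individual (column), the variants with a missing genotype."""
    missing = [0] * n_individuals
    for calls in genotypes.values():
        for i, call in enumerate(calls):
            if "." in split_alleles(call):
                missing[i] += 1
    return missing


# Invented toy data: two variants (rows) for four individuals (columns).
genotypes = {
    "variant_a": ["0|0", "0|1", "1|1", "./."],
    "variant_b": ["0|1", "./.", "0|0", "./."],
}
frequencies = {v: alt_allele_frequency(calls) for v, calls in genotypes.items()}
missing = missing_genotypes_per_individual(genotypes, 4)
```

The first query aggregates along rows (per variant) and the second along columns (per individual), which is why both benefit from a storage layout that can scan a single attribute quickly.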
    • Working With Big Data
      Loaded entire 1,000 Genomes pre-phase 1 dataset
      ■ Queries on all chromosomes for all 629 individuals
      ■ 136 billion entries in largest database table
      ■ ≈1.2 TB (compressed in HANA)
      [Figure: query results using R connectivity, "Report all variations in BRCA1 and BRCA2". Axis labels: chromosome, absolute frequency, number of alleles.]
      Supported by Dr. Carlos Bustamante's lab
    • High-Performance In-Memory Genome Project
      Analysis of Patient Cohorts
      Why does a therapy work in only 80% of patient cases?
      ■ Columnar storage optimizes space requirements while enhancing calculation performance
      ■ Single k-means clustering: R 470 ms vs. HANA 30 ms (15:1)
      ■ >60k clusters are calculated in <2 s on a 1,000-core cluster
      ■ → Interactive exploration of clusters becomes possible
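The slide above names k-means clustering but does not show it. As a minimal sketch, the standard Lloyd iteration can be written in plain Python as follows. This is not the R or HANA code that was benchmarked: the toy patient vectors are invented, and seeding the centroids with the first k points is an assumption made only to keep the example deterministic.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points: Sequence[Point], k: int, max_iterations: int = 100):
    """Lloyd's algorithm; returns (centroids, cluster index per point)."""
    centroids: List[Point] = list(points[:k])  # assumption: deterministic seeding
    assignment: List[int] = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignment


# Invented toy cohort: two numeric features per patient.
patients = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignment = k_means(patients, 2)
```

Each iteration is a full scan over one numeric feature at a time followed by an aggregation, which is the access pattern columnar storage is meant to accelerate.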
    • High-Performance In-Memory Genome Project
      Integration of Genetic Pathways
      What are known effects for a somatic mutation?
      ■ Storing and accessing graph data within the in-memory database (Active Information Store)
      ■ 263 KEGG pathways with 6,481 genetic components, 32,784 vertices, and 90,682 edges
      ■ Rank all pathways by evaluation of node connections: IMDB <350 ms
      ■ >5.5k rankings can be calculated in <2 s on a 1,000-core cluster
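The slide above does not say how node connections are evaluated, so the following Python sketch uses one plausible stand-in: each pathway is scored by the number of its edges that touch a gene of interest. The scoring rule, the function name, and the placeholder pathway and gene names are all assumptions for illustration, not the project's ranking method or real KEGG data.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def rank_pathways(
    pathways: Dict[str, List[Edge]], genes_of_interest: Set[str]
) -> List[Tuple[str, int]]:
    """Rank pathways by how many edges touch a gene of interest (assumed score)."""
    scores = {
        name: sum(
            1
            for source, target in edges
            if source in genes_of_interest or target in genes_of_interest
        )
        for name, edges in pathways.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Placeholder graph data: pathway name -> list of directed edges.
pathways = {
    "pathway_1": [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_C", "GENE_D")],
    "pathway_2": [("GENE_B", "GENE_E"), ("GENE_E", "GENE_F")],
    "pathway_3": [("GENE_X", "GENE_Y")],
}
ranking = rank_pathways(pathways, {"GENE_B"})
```

Because every pathway is scored independently, the work splits cleanly across cores, which fits the parallel ranking figures quoted on the slide.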
    • High-Performance In-Memory Genome Project
      Combined Search in Structured and Unstructured Data
      What clinical trials are relevant for an individual patient?
      ■ In-memory technology enables entity extraction, e.g. age, genes, and drugs
      ■ Integrated 30k free-text documents from clinicaltrials.gov
      ■ Relational search on entities enables interactive comparison
      ■ Results are rated by relevant search criteria
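To illustrate the idea of entity extraction followed by relational search on the extracted entities, here is a minimal Python sketch using regular expressions and fixed vocabularies. It is a simplification, not the text-analysis engine used in the project: the vocabularies, the age pattern, the function names, and the sample trial text are all invented for this example.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical vocabularies; a real system would use much larger dictionaries.
GENES = {"BRCA1", "BRCA2", "EGFR"}
DRUGS = {"tamoxifen", "erlotinib"}
AGE_PATTERN = re.compile(r"(\d+)\s*(?:-|to)\s*(\d+)\s*years", re.IGNORECASE)


def extract_entities(text: str) -> Dict[str, object]:
    """Pull genes, drugs, and an age range out of free text."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", text))
    lowered = {token.lower() for token in tokens}
    age = AGE_PATTERN.search(text)
    age_range: Optional[Tuple[int, int]] = (
        (int(age.group(1)), int(age.group(2))) if age else None
    )
    return {
        "genes": sorted(GENES & tokens),
        "drugs": sorted(DRUGS & lowered),
        "age_range": age_range,
    }


def matches_patient(entities: Dict[str, object], gene: str, patient_age: int) -> bool:
    """Structured filter on the extracted entities (the 'relational search')."""
    age_range = entities["age_range"]
    in_range = age_range is None or age_range[0] <= patient_age <= age_range[1]
    return gene in entities["genes"] and in_range


# Invented sample text, not taken from clinicaltrials.gov.
trial_text = "Patients aged 18 to 65 years with BRCA1 mutations receive tamoxifen."
entities = extract_entities(trial_text)
relevant = matches_patient(entities, "BRCA1", 52)
```

Once the entities are stored as ordinary columns, questions such as "which trials mention this gene and admit a patient of this age" become plain filters rather than full-text scans.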
    • High-Performance In-Memory Genome Project
      Architectural Overview
      [Diagram labels: Cohort Analysis, Pathway Finder, Paper Search, Clinical Trial Finder, Pipeline Editor, ...; Extensions: App Store, Access Control, Billing, ...; In-Memory Database: Pipeline Data, Genome Data, Pathways, Genome Metadata, Papers, Pipeline Models, Analytical Tools, ...]
    • The Future: Combined Information Requirements
      Enable clinicians to:
      ■ Make evidence-based therapy decisions at the patient's bed
      ■ Exchange latest patient data with international experts
      Enable researchers to:
      ■ Investigate genomes of patient cohorts to derive new knowledge
      ■ Analyze results in real-time
      Enable patients to:
      ■ Identify risk factors long before they turn into diseases
      ■ Identify experts and similar patient cases to bring up alternatives for individual therapies
    • Thank you for your interest!
      Keep in contact with us.
      SAP Labs LLC
      Dr. Anja Bog
      3410 Hillview Avenue
      94304 Palo Alto, CA
      anja.bog@sap.com
      Hasso Plattner Institute
      Enterprise Platform & Integration Concepts
      Dr. Matthieu-P. Schapranow
      August-Bebel-Str. 88
      14482 Potsdam, Germany
      schapranow@hpi.uni-potsdam.de
      http://j.mp/schapranow