Enabling Real-time Genome Data Research with In-memory Database Technology (SAP Life Science Forum 2013)


Transcript

  • 1. In-Memory Database Technology Enables Real-Time Genome Data Research
    SAP Life Science Forum, Dublin, June 04, 2013
    Dr. Matthieu Schapranow, Hasso Plattner Institute
  • 2. Agenda
    ■ Numbers You Should Know
    ■ Personalized Medicine
    ■ High-Performance In-Memory Genome (HIG) Project
    ■ Outlook
    In-Memory Technology Enables Genome Data Research, Dr. Schapranow, June 04, 2013
  • 3. Agenda
    ■ Numbers You Should Know
    ■ Personalized Medicine
    ■ High-Performance In-Memory Genome (HIG) Project
    ■ Outlook
  • 4. Numbers You Should Know: Conventional Cancer Therapies
    [Figure: bars on a 0% to 100% scale. Men and women: will develop cancer vs. will never develop cancer (source: American Cancer Society, Surveillance Research, 2012). Chemotherapies: fail vs. work.]
  • 5. Numbers You Should Know: The Human Genome Project
    ■ 1990: Human Genome (HG) project started with 3B USD funding
    ■ 2000: 1st draft of the HG announced
    ■ 10 years until first HG version; thousands of institutes involved
    ■ 2013: Latest Next-Generation Sequencing (NGS) device "Illumina HiSeq 2500" costs ≈700k USD, which enables whole genome sequencing in <2 days for <10k USD per run
    ■ But: analysis takes up to weeks
    ■ What's next? Real-time analysis of genome data!
    Source: http://www.molecularecologist.com/next-gen-table-3a/
  • 6. Numbers You Should Know: Comparison of Costs
    [Figure: Comparison of Costs for Main Memory and Genome Sequencing. Y-axis: costs in USD, logarithmic, 0.001 to 10,000. X-axis: dates from 01.01.01 to 01.01.13. Series: costs per megabyte RAM; costs per megabase sequencing.]
  • 7. Agenda
    ■ Numbers You Should Know
    ■ Personalized Medicine
    ■ High-Performance In-Memory Genome (HIG) Project
    ■ Outlook
  • 8. Personalized Medicine: Our Motivation
    ■ Today, analysis of genome data, e.g. for personalized treatment, takes 4-6 weeks (incl. biopsy, biological preparation, sequencing, alignment, variant calling, full analysis, and evaluation)
    ■ In-memory technology is suitable to accelerate genome analysis
      □ Highly parallel alignment / variant calling (data preparation)
      □ Real-time analysis of individual patient and cohort data
      □ Combined search in structured / unstructured data
    ■ Challenge: Can we analyze the entire data of a patient, incl. Electronic Medical Record (EMR) and genome data, during a doctor's visit?
  • 9. Personalized Medicine: Our Vision
  • 10. Personalized Medicine: Our Vision
    Desirability
    ■ Integrated portfolio of specialized services for clinicians, researchers, and patients
    ■ Include latest research results, e.g. most effective therapies
    Viability
    ■ Share data via the Internet to get feedback from world-wide experts (cost-saving)
    ■ Combine research data (publications, annotations, genome data) from international databases in a single knowledge base
    ■ Enable personalized medicine also in far-off regions and developing countries
    Feasibility
    ■ Allele frequency count of 12B records in <1s
    ■ Identification of relevant annotations out of 80M in <1s
    ■ Integrated alignment and variant calling within hours instead of days
  • 11. Personalized Medicine: User Requirements
    For researchers
    ■ Enable real-time analysis of genome data
    ■ Automatic scan of pathways to identify cellular impact of mutations
    ■ Free-text search in publications, diagnosis, and EMR data (structured and unstructured data)
    For clinicians
    ■ Preventive diagnostics to identify risk patients
    ■ Indicate pharmacokinetic correlations
    ■ Scan for comparable patient cases
    For patients
    ■ Identify relevant clinical trials / experts
    ■ Start most appropriate therapy early based on all evidence and latest knowledge
  • 12. Agenda
    ■ Numbers You Should Know
    ■ Personalized Medicine
    ■ High-Performance In-Memory Genome (HIG) Project
    ■ Outlook
  • 13. High-Performance In-Memory Genome Project: Integration of Genomic Data
    ■ Once DNA sequences are generated by NGS devices, HIG comes into play
    ■ Preprocessing of DNA (alignment, variant calling) can be modeled and is executed as an integrated process
    ■ Results are stored in the in-memory database to enable instant analysis
  • 14. High-Performance In-Memory Genome Project: The In-Memory Technology Toolbox
    ■ Any attribute as index
    ■ Insert only for time travel
    ■ Combined column and row store
    ■ No aggregate tables
    ■ Minimal projections
    ■ Partitioning
    ■ Analytics on historical data
    ■ Single and multi-tenancy
    ■ SQL interface on columns & rows
    ■ Reduction of layers
    ■ Lightweight compression
    ■ Multi-core / parallelization
    ■ On-the-fly extensibility
    ■ Active/passive data store
    ■ Bulk load
    ■ Text retrieval and extraction
    ■ Object to relational mapping
    ■ Dynamic multi-threading within nodes
    ■ Map reduce
    ■ No disk
    ■ Group key
    [Figure residue: a diagram labeled Discovery Service, Read Event Repositories, Verification Services, and SAP HANA, annotated "up to 8,000 read event notifications per second" and "up to 2,000 requests per second".]
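
Two toolbox items, lightweight compression and column storage, can be illustrated with dictionary encoding, a technique commonly used in column stores. The following is a minimal sketch only: it does not reflect SAP HANA internals, and the function names and the sample column are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value by a small integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    index = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in column]


def count_equal(dictionary, codes, value):
    """Count matching rows by comparing integer codes instead of strings."""
    try:
        code = dictionary.index(value)
    except ValueError:
        return 0  # value never occurs in the column
    return sum(1 for c in codes if c == code)


# Hypothetical column of a variant table: one chromosome name per row.
CHROMOSOME_COLUMN = ["chr1", "chr2", "chr1", "chr1", "chrX", "chr2"]

dictionary, codes = dictionary_encode(CHROMOSOME_COLUMN)
print(dictionary)                              # distinct values, stored once
print(codes)                                   # compact integer codes per row
print(count_equal(dictionary, codes, "chr1"))  # scan over codes only
```

Columns with few distinct values, such as chromosome names, compress well this way, and scans operate on integers rather than on the original strings.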
  • 15. High-Performance In-Memory Genome Project: Challenges of Genome Data Analysis
    Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
    Bound to                 | CPU performance               | Memory capacity
    Duration                 | Hours to days                 | Weeks
    HPI                      | Minutes                       | Real-time
    In-memory technology     | Multi-core                    | Partitioning & compression
  • 16. High-Performance In-Memory Genome Project: Challenges of Genome Data Analysis
    Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
    Bound to                 | CPU performance               | Memory capacity
    Duration                 | Hours to days                 | Weeks
    HPI & SAP                | Minutes to hours              | Interactively
    In-memory technology     | Multi-core                    | Partitioning & compression
  • 17. High-Performance In-Memory Genome Project: Selected Research Topics
    Improving Analyses:
    ■ Clustering of patient cohorts, e.g. k-means clustering
    ■ Combined search, e.g. in clinical trials and side-effect databases
    ■ Ad-hoc analysis of genetic pathways, e.g. to identify cause/effect
    Improving Data Preparations:
    ■ Graphical modeling of Genome Data Processing (GDP) pipelines
    ■ Scheduling and execution of multiple GDP pipelines in parallel
    ■ App store for medical knowledge (bring algorithms to data)
    ■ Exchange of sensitive data, e.g. history-based access control
    ■ Billing processes for intellectual property and services
  • 18. High-Performance In-Memory Genome Project: Genomics Analysis
    Loaded part of 1,000 genomes pre-phase 1 dataset
    ■ Chromosome 1 of 629 individuals from the 1,000 genomes project
    ■ 12 billion entries in largest database table
    ■ 293 GB of data (compressed in HANA)
    Results
    ■ Report SNPs failing quality control: UCSC 102.47 sec | SAP HANA 1.25 sec (82x faster)
    ■ Compute the alternative allele frequency for each variant/region: VCFtools 259 sec | SAP HANA 0.43 sec (600x faster)
    ■ Compute the total number of missing genotypes per individual: VCFtools 548 sec | SAP HANA 2 sec (270x faster)
    Supported by Dr. Carlos Bustamante lab
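
The two VCFtools-style computations named on this slide, the alternative allele frequency per variant and the number of missing genotypes per individual, can be sketched on toy data as follows. This is an illustration only: the variant identifiers and genotype calls are made up, the function names are hypothetical, and the code is neither the VCFtools nor the SAP HANA implementation.

```python
# Toy genotype matrix (made-up data): one row per variant, one genotype
# string per individual. "0" is the reference allele, "1" an alternative
# allele, and "." a missing call, following the VCF GT field convention.
GENOTYPES = {
    "variant_a": ["0|0", "0|1", "1|1", "./."],
    "variant_b": ["0|1", "./.", "0|0", "./."],
}


def split_alleles(genotype):
    """Split a phased ("|") or unphased ("/") genotype into its alleles."""
    return genotype.replace("/", "|").split("|")


def alt_allele_frequency(calls):
    """Fraction of non-missing alleles that differ from the reference."""
    alleles = [a for gt in calls for a in split_alleles(gt) if a != "."]
    if not alleles:
        return None  # no called alleles for this variant
    return sum(1 for a in alleles if a != "0") / len(alleles)


def missing_per_individual(genotypes):
    """For each individual (column), count variants with a missing call."""
    n_individuals = len(next(iter(genotypes.values())))
    missing = [0] * n_individuals
    for calls in genotypes.values():
        for i, gt in enumerate(calls):
            if "." in split_alleles(gt):
                missing[i] += 1
    return missing


for name, calls in GENOTYPES.items():
    print(name, alt_allele_frequency(calls))
print(missing_per_individual(GENOTYPES))
```

Both functions are plain aggregations over a variant-by-individual table, which is the kind of scan a column store can parallelize well.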
  • 19. High-Performance In-Memory Genome Project: Working With Big Data
    Loaded entire 1,000 genomes pre-phase 1 dataset
    ■ Queries on all chromosomes for all 629 individuals
    ■ 136 billion entries in largest database table
    ■ ≈1.2 TB (compressed in HANA)
    Query results using R connectivity: Report all variations in BRCA1 and BRCA2
    [Figure: chart with the axis labels Chromosome, Absolute frequency, and Number of Alleles.]
    Supported by Dr. Carlos Bustamante lab
  • 20. High-Performance In-Memory Genome Project: Analysis of Patient Cohorts
    ■ Columnar storage optimizes space requirements while enhancing calculation performance
    ■ Single k-means clustering: R 470ms vs. HANA 30ms (15:1)
    ■ >60k clusters are calculated in <2s on 1,000 core cluster
    ■ → Interactive exploration of clusters becomes possible
    Why is a therapy only working in 80% of the patient cases?
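
For readers unfamiliar with k-means clustering, the following minimal sketch shows the textbook algorithm (Lloyd's algorithm) on a toy cohort. The patient data, feature choice, and seeding strategy are assumptions made for illustration; this is not the R or SAP HANA implementation benchmarked on the slide.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(values) / len(members) for values in zip(*members)
                )
    return centroids, assignment


# Hypothetical cohort: (age in years, biomarker level) per patient.
# Features are left unscaled here, so age dominates the distance.
PATIENTS = [(30, 1.0), (62, 4.1), (33, 1.2), (65, 3.9), (29, 0.8), (60, 4.3)]

centroids, labels = kmeans(PATIENTS, k=2)
print(labels)
print(centroids)
```

In practice the features would be normalized first and the seeding randomized or chosen more carefully; both are omitted to keep the sketch deterministic.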
  • 21. High-Performance In-Memory Genome Project: Integration of Genetic Pathways
    ■ Storing and accessing graph data within in-memory database (Active Information Store)
    ■ 263 KEGG pathways with 6,481 genetic components, 32,784 vertices, and 90,682 edges
    ■ Rank all pathways by evaluation of node connections: IMDB <350ms
    ■ >5.5k rankings can be calculated in <2s on 1,000 core cluster
    What are known effects for a somatic mutation?
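
The slide does not state how node connections are evaluated. As one possible reading, the sketch below scores each pathway by the number of edges that touch a gene of interest and sorts pathways by that score. The scoring rule, the pathway names, and the gene placeholders G1 to G7 are all assumptions; none of this is KEGG data or the Active Information Store API.

```python
# Made-up pathways: name -> list of directed edges between components.
PATHWAYS = {
    "pathway_A": [("G1", "G2"), ("G2", "G3"), ("G1", "G4")],
    "pathway_B": [("G5", "G6"), ("G6", "G7")],
    "pathway_C": [("G4", "G2"), ("G1", "G2")],
}


def connection_score(edges, genes):
    """Count edges that touch at least one gene of interest."""
    return sum(1 for src, dst in edges if src in genes or dst in genes)


def rank_pathways(pathways, genes):
    """Return (name, score) pairs, highest score first, ties by name."""
    scored = [
        (name, connection_score(edges, genes))
        for name, edges in pathways.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Which pathways are most connected to a mutated gene G1?
print(rank_pathways(PATHWAYS, {"G1"}))
```

Each ranking is independent of the others, which is consistent with the slide's point that many rankings can be computed in parallel on a large cluster.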
  • 22. High-Performance In-Memory Genome Project: Search in Structured / Unstructured Data
    ■ In-memory technology enables entity extraction, e.g. age, genes, and drugs
    ■ Integrated 30k free text documents from clinicaltrials.gov
    ■ Relational search on entities enables interactive comparison
    ■ Results are rated by relevant search criteria
    What clinical trials are relevant for an individual patient?
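
The idea of extracting entities (age, genes, drugs) from free text and then searching them relationally can be sketched with regular expressions and small vocabularies. This is a simplified stand-in for a real text-analysis engine: the trial texts, the drug names, and the function names are hypothetical, and only the gene names BRCA1 and BRCA2 come from the slides.

```python
import re

GENE_TERMS = {"BRCA1", "BRCA2"}    # genes named in the slides
DRUG_TERMS = {"drug_x", "drug_y"}  # placeholder drug vocabulary

# Matches age ranges such as "40 to 65 years" or "18-30 years".
AGE_RANGE = re.compile(r"(\d+)\s*(?:-|to)\s*(\d+)\s*years", re.IGNORECASE)


def extract_entities(text):
    """Turn free text into a small structured record."""
    tokens = set(re.findall(r"[A-Za-z0-9_]+", text))
    lowered = {token.lower() for token in tokens}
    match = AGE_RANGE.search(text)
    return {
        "genes": sorted(g for g in GENE_TERMS if g in tokens),
        "drugs": sorted(d for d in DRUG_TERMS if d in lowered),
        "age": (int(match.group(1)), int(match.group(2))) if match else None,
    }


# Made-up trial descriptions standing in for free-text documents.
TRIALS = {
    "trial_1": "Women aged 40 to 65 years with BRCA1 mutation receive drug_x.",
    "trial_2": "Adults 18-30 years; BRCA2 carriers; drug_y versus placebo.",
}


def matching_trials(trials, age, gene):
    """Relational filter on the extracted entities."""
    hits = []
    for trial_id, text in trials.items():
        entities = extract_entities(text)
        age_range = entities["age"]
        if gene in entities["genes"] and age_range is not None:
            if age_range[0] <= age <= age_range[1]:
                hits.append(trial_id)
    return hits


print(extract_entities(TRIALS["trial_1"]))
print(matching_trials(TRIALS, age=50, gene="BRCA1"))
```

Once the entities are stored as structured columns, the filter in `matching_trials` corresponds to an ordinary relational query, which is what makes interactive comparison feasible.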
  • 23. High-Performance In-Memory Genome Project: Architectural Overview
    [Figure: architecture diagram with the labels Cohort Analysis, Pathway Finder, Paper Search, Clinical Trial Finder, Pipeline Editor, In-Memory Database, Extensions, App Store, Access Control, Billing, Pipeline Data, Genome Data, Pathways, Genome Metadata, Papers, Pipeline Models, Analytical Tools, ...]
  • 24. Agenda
    ■ Numbers You Should Know
    ■ Personalized Medicine
    ■ High-Performance In-Memory Genome (HIG) Project
    ■ Outlook
  • 25. The Vision: Combined Data and Expert's Knowledge
  • 26. The Future: Combined Information
    Enable clinicians to:
    ■ Make evidence-based therapy decisions at the patient's bed
    ■ Exchange latest patient data with international experts
    Enable researchers to:
    ■ Investigate genomes of patient cohorts to derive new knowledge
    ■ Analyze results in real-time
    Enable patients to:
    ■ Identify risk factors long before they turn into diseases
    ■ Identify experts and similar patient cases to bring up alternatives for individual therapies
  • 27. Thank you for your interest! Keep in contact with us.
    Dr. Matthieu-P. Schapranow
    Hasso Plattner Institute, Enterprise Platform & Integration Concepts
    August-Bebel-Str. 88, 14482 Potsdam, Germany
    schapranow@hpi.uni-potsdam.de
    http://j.mp/schapranow