2014.06.30 - Renaissance in Medicine - Singapore Management University - Data Science SG

First draft of the upcoming Hadoop World presentation "Renaissance in Medicine", which gives an overview of the upcoming changes in medical practice that are enabled by BigData technologies and details specific algorithmic techniques that enable this use case.

  1. 1. © 2014 MapR Technologies. Primary Sequence Analysis (ETL), MapReduce style: .fastq => short read alignment => .bam => genotype calling => .vcf. Stage labels: MAP; MAP; REDUCE (rotate matrix 90º). Costs shown: (O(mn)) / 1 and (O(mn) + O(n log n)) / s. Hello!
  2. 2. Renaissance in Medicine (Draft 1)
  3. 3. High-Level Biomedical Goal: Improve Fitness. Therapeutics => Diagnostics => Prognostics • Therapeutics => traditional medicine • Diagnostics => personalized medicine – NextGen public health – Requires hi-res mechanical knowledge – Reverse engineer how genetic variation leads to (un)desired traits • Prognostics => GATTACA (dys/eu)topia – Managed populations / NextGen eugenics
  4. 4. Star Wars III: Revenge of the Sith
  5. 5. Star Wars V: The Empire Strikes Back
  6. 6. © 2014 MapR Technologies 6
  7. 7. Many DNA-Based Apps Coming*… • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs [Bar chart: 2014 vs. 2020, Clinical vs. Non-Clinical, axis 0 to 25] * Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
  8. 8. (Even) Moore’s Law [Chart: Storage (MB/$) vs. DNA (bp/$); annotations: “(Even) Moore’s” begins in 2004 with Solexa (acquired by ILMN 2007); ILMN HiSeq XTen (Jan 2014), $1000 Genome] Stein. 2010. The case for cloud computing in genome informatics
  9. 9. Trends and Events: ILMN HiSeq XTen Specs • Sold in sets of 10 units ONLY (XTen = 10 sequencers) ~ $10 million/XTen, shipments began in Jan 2014 • XTen produces 600 GBases/day @ 30x oversampling = 1.8 TBases per 3-day cycle = 54 TBytes per 3-day cycle = $1000 per genome = 18,000 genomes/year/XTen • ~ 4,000,000 births/year (US, 2012) => Neonatal sequencing is a reality (with ~200 of today’s systems)
  10. 10. Summary: Major Impact on Social Fabric. Diseases soon to be gone: • Muscular dystrophy • Cystic fibrosis • Albinism • Phenylketonuria • Hemophilia. Paternity Tests: fact: US paternity fraud rate is 1 in 25. More Troubling: Huntington’s Disease: allow? Inherited Cancers (10% !!!): allow? Sources: http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979 http://en.wikipedia.org/wiki/Paternity_fraud http://www.cancer.org/cancer/cancercauses/geneticsandcancer/heredity-and-cancer
  11. 11. Singapore: Government Sponsored Matchmaking • Some people have more desirable genes than others. • “Our government wants smart ladies to meet smart guys to get smart children.” ~ Annie Chan, Club2040 (Singapore matchmaking agency) http://www.nytimes.com/2008/04/29/world/asia/29iht-sing.1.12428974.html
  12. 12. © 2014 MapR Technologies 12
  13. 13. © 2014 MapR Technologies 13
  14. 14. Why hasn’t this happened yet?
  15. 15. The Evolving Genomics Workload: DNA Specimen => DNA Sequencing => Primary Analytics => Apps
  16. 16. DNA Sequencing Value Chain [Chart: % Effort (0 to 100) for Pre-NGS ~2000, Now, Future ~2020] Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  17. 17. Bottleneck @ Primary Analytics: DNA Specimen => DNA Sequencing => Primary Analytics (Fix this) => Apps
  18. 18. Sequence is Becoming Free • DNA sequencing effectively becomes free • Commoditization pattern • Huge influx of inexpensive data • Creates new medical and biotech use-cases [Chart: % Effort (0 to 100) for Pre-NGS ~2000, Now, Future ~]
  19. 19. Experiment Design and “Downstream” Analytics • Specialization will grow to 100% effort • This is the desirable scenario • Biologists ought to be doing biology [Chart: % Effort (0 to 100) for Pre-NGS ~2000, Now, Future ~; region labeled ANALYTICS]
  20. 20. Data Management (1º Analytics) Bottleneck • Time currently being spent on BigData problems • Not ideal • Physicians & Biologists need help from CS & SW Engineers [Chart: % Effort (0 to 100) for Pre-NGS ~2000, Now, Future ~2020]
  21. 21. Just Remember the Diamond [Chart: % Effort (0 to 100) for Pre-NGS ~2000, Now, Future ~2020]
  22. 22. DNA Sequencing Meets MapReduce
  23. 23. Parallelize Primary Analytics: .fastq => short read alignment => reads & mappings => genotype calling => .vcf
  24. 24. Sequence Analysis, Quick Overview. fragment1: […] G A C T A G A; fragment2: A C A G T T T A C A; fragment3: A G A T A - - A G A; fragment4: A A C A G C T T A C A […]; fragment5: C T A T A G A T A A; referenceDNA: […] G A T T A C A G A T T A C A G A T T A C A […]; sampleDNA: […] G A C T A C A G A T A A C A G A T T A C A […]
  25. 25. What is the (Probable) Color of Each Column?
  26. 26. Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row. O(rows*cols) + O(1 col) memory
  27. 27. Which Columns are (probably) Not White? Strategy 2: examine foreach row, keep running tallies. O(rows) + O(rows*cols) memory
  28. 28. Which Columns are (probably) Not White? Strategy 3: rotate matrix, examine foreach column. O(rows log rows) + O(cols) + O(1 col) memory
  29. 29. Comparison of Strategies. Strategy 1: • Low mem req • Random access pattern, many ops • O(rows*cols) + O(1 col) memory. Strategy 2: • High mem req • Sequential access pattern • O(rows) + O(rows*cols) memory. Strategy 3: • Low mem req • Sequential access pattern • Requires Sort • O(rows log rows) + O(cols) + O(1 col) memory
  30. 30. Comparison of Strategies. Strategy 1: • Low mem req • Random access pattern, many ops • O(rows*cols) + O(1 col) memory. Strategy 2: • High mem req • Sequential access pattern • O(rows) + O(rows*cols) memory. Strategy 3: • Low mem req • Sequential access pattern • Requires Sort • O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory. As the number of rows & columns increases, Strategy 3 becomes more attractive
  31. 31. Primary Sequence Analysis (ETL), MapReduce style: .fastq => short read alignment => .bam => genotype calling => .vcf. Stage labels: MAP; MAP; REDUCE (rotate matrix 90º). Costs shown: (O(mn)) / 1 and (O(mn) + O(n log n)) / s. Hello!
  32. 32. See also: Twitter Algebird – Parallel Linear Algebra Library for Scala / MapReduce
  33. 33. First App You’ll Likely See: Clinical Genomics
  34. 34. Clinical Sequencing Business Process Workflow. Diagram labels: Physician, Patient, Clinic, blood/saliva, Clinical Lab, Analytics, extract
  35. 35. One Bad MTHFR. MTHFR C677T: Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected, and if one knows they have an MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
  36. 36. What’s the Impact on Human Evolution? More Reading: The Red Queen: Sex and the Evolution of Human Nature
  37. 37. Clinical Sequencing Business Process Workflow. Diagram labels: Physician, Patient, Clinic, blood/saliva, Clinical Lab, Analytics, extract
  38. 38. Clinical Genomics, Information Systems Perspective. Diagram labels: Compressed Structured Base4 Data; extract; Base4=>Base2 Converter [[ DE-STRUCTURES ]]; Uncompressed Unstructured Base2 Data; “BI” Reporting and Visualization tools; Physician, Patient, Analyst, Stakeholder
  39. 39. Clinical Genomics, Information Systems Perspective. Diagram labels: ETL; Data Store; Analytics; Reporting and Viz; Physician, Patient, Analyst, Stakeholder
  40. 40. Clinical Genomics, Information Systems Perspective. Diagram labels: ETL; Data Store; Analytics (1º analytics, 2º analytics); Reporting and Viz; Physician, Patient, Analyst, Stakeholder. Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  41. 41. Clinical Applications: Performance Matters. Diagram labels: DNA Sequencer (x3); Raw DNA; NFS; MapR Filesystem; 1º Analytics; SNP calls; Static Clinical Reporting; Physician, Patient; Reference DBs; SNP DB; ETL; 2º Analytics; Researcher, Subject
  42. 42. Variant Collection Enables Downstream Apps • GWAS Association Studies • Versioned, Personalized Medicine • Companion Diagnostics. Diagram labels: SNP DB; 2º Analytics; New Markets. Hello! More linear algebra => [Spark, Summingbird, Lambda Architecture Slides]
  43. 43. First Bottleneck Removed. Now What? [Chart: % Effort (0 to 100) for Pre-NGS ~2000, Now, Future ~; region labeled ANALYTICS]
  44. 44. Next Bottleneck, Of Course!
  45. 45. Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  46. 46. Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design-specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise. At risk of over-simplifying as business-level concept: phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4
  47. 47. HUGE PROBLEM: COMBINATORIAL EXPLOSION
  48. 48. What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental prioritized updates • No batch processing • Decouple computational results from data size. Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  49. 49. Solution: Percolate. Inputs: SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies. Diagram labels: Denormalize and Percolate; (re)prioritize & (re)process; buffer; service queries; drive dashboards; create reports; denormalize for display; New models. @allenday on percolators: http://slidesha.re/1qSXCKw
  50. 50. If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building. NPR. 2011. The Search For Analysts To Make Sense Of ‘Big Data’ http://www.npr.org/2011/11/30/142893065
  51. 51. If they were unlabeled, would you know which is which? Twitter’s Business Model: • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • $$$$ Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
  52. 52. Robot Scientist. Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  53. 53. Robot (Data?) Scientist. Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  54. 54. © 2014 MapR Technologies 55
  55. 55. Q&A @allenday allenday@mapr.com allenday linkedin.com/in/allenday
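
Slide 9's claim that neonatal sequencing is within reach can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on that slide; the variable names are illustrative, and it assumes every XTen runs at the quoted throughput all year, which is an idealization.

```python
import math

# Figures quoted on slide 9 (HiSeq XTen specs; US births, 2012).
GENOMES_PER_YEAR_PER_XTEN = 18_000
US_BIRTHS_PER_YEAR = 4_000_000
COST_PER_XTEN_USD = 10_000_000  # "~ $10 million/XTen"

# Systems needed to sequence every newborn, assuming full utilization.
systems_needed = math.ceil(US_BIRTHS_PER_YEAR / GENOMES_PER_YEAR_PER_XTEN)

# Up-front instrument cost only; reagents, staff, and analytics are excluded.
capital_cost_usd = systems_needed * COST_PER_XTEN_USD

print(systems_needed)    # 223, which the slide rounds to 200
print(capital_cost_usd)  # 2230000000
```

The result is the same order of magnitude as the slide's "200 of today's systems".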
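
Strategies 2 and 3 from slides 27 to 31 can be sketched side by side. This is a minimal illustration and not the presenter's implementation: the reference and sample strings come from slide 24, while the read offsets, the function names, and the simple majority-vote consensus are assumptions made for the example. The in-memory `sorted` call stands in for the distributed shuffle/sort that a MapReduce framework would provide.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

REFERENCE = "GATTACAGATTACAGATTACA"  # referenceDNA from slide 24
SAMPLE = "GACTACAGATAACAGATTACA"     # sampleDNA from slide 24

# Hypothetical aligned reads: (start column, bases), cut from SAMPLE.
READS = [(s, SAMPLE[s:s + 8]) for s in (0, 2, 5, 8, 11, 13)]


def variants_running_tally(reads, reference):
    """Strategy 2: one pass over rows, keeping a tally for every column."""
    tallies = {}
    for start, bases in reads:
        for offset, base in enumerate(bases):
            tallies.setdefault(start + offset, Counter())[base] += 1
    variants = {}
    for col, tally in sorted(tallies.items()):
        consensus = tally.most_common(1)[0][0]
        if consensus != reference[col]:
            variants[col] = consensus
    return variants


def map_read(read):
    """MAP: emit one (column, base) pair per aligned base."""
    start, bases = read
    return [(start + offset, base) for offset, base in enumerate(bases)]


def variants_rotated(reads, reference):
    """Strategy 3: rotate the matrix by sorting on column, then scan columns."""
    # Stand-in for the shuffle/sort phase: group all pairs by column.
    pairs = sorted(pair for read in reads for pair in map_read(read))
    variants = {}
    # REDUCE: only one column's tally is held at a time.
    for col, group in groupby(pairs, key=itemgetter(0)):
        consensus = Counter(base for _, base in group).most_common(1)[0][0]
        if consensus != reference[col]:
            variants[col] = consensus
    return variants


print(variants_running_tally(READS, REFERENCE))  # {2: 'C', 10: 'A'}
print(variants_rotated(READS, REFERENCE))        # {2: 'C', 10: 'A'}
```

Both functions report the two columns where the sample differs from the reference. The difference is in what they hold in memory: the running tally keeps a counter for every column, while the rotated version needs a sort but only one column's counter during the scan, which is the trade-off the comparison slides describe.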
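
The business-level model on slide 46, phenotype = α·genotype + β + ε, can be illustrated with an ordinary least-squares fit for a single SNP. The data below are made up for the example (genotype coded as a 0/1/2 allele count is an assumption, as are all names), and the four ε terms are lumped into a single residual rather than modeled separately.

```python
# Hypothetical data: allele count at one SNP and a measured trait value.
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]


def fit_line(x, y):
    """Ordinary least squares for y = alpha * x + beta."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    alpha = s_xy / s_xx
    beta = mean_y - alpha * mean_x
    return alpha, beta


alpha, beta = fit_line(genotype, phenotype)

# Residuals play the role of the combined epsilon terms in this sketch.
residuals = [p - (alpha * g + beta) for g, p in zip(genotype, phenotype)]

print(round(alpha, 3), round(beta, 3))
```

Running a fit like this for every SNP against every phenotype and context is where the combinatorial explosion of slide 47 comes from.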
