2014.06.30 - Renaissance in Medicine - Singapore Management University - Data Science SG

First draft for upcoming Hadoop World presentation "Renaissance in Medicine" that gives an overview of the upcoming changes in medical practice that are enabled by BigData technologies. Specific algorithmic techniques are detailed that enable this use case.


Transcript

  • 1. © 2014 MapR Technologies 1 Primary Sequence Analysis (ETL), MapReduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90° (O(mn)) / 1 (O(mn) + O(n log n)) / s Hello!
  • 2. Renaissance in Medicine (Draft 1)
  • 3. High-Level Biomedical Goal: Improve Fitness. Therapeutics => Diagnostics => Prognostics • Therapeutics => traditional medicine • Diagnostics => personalized medicine – NextGen public health – Requires hi-res mechanical knowledge – Reverse engineer how genetic variation leads to (un)desired traits • Prognostics => GATTACA (dys/eu)topia – Managed populations / NextGen eugenics
  • 4. Star Wars III: Revenge of the Sith
  • 5. Star Wars V: The Empire Strikes Back
  • 6.
  • 7. Many DNA-Based Apps Coming*… • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs * Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning [Chart: 2014 vs. 2020, axis 0 to 25, Clinical vs. Non-Clinical]
  • 8. (Even) Moore’s Law. Stein. 2010. The case for cloud computing in genome informatics. “(Even) Moore’s” begins in 2004 with Solexa (acquired by ILMN 2007). [Chart: Storage: MB/$ and DNA: bp/$; annotations: ILMN HiSeq XTen (Jan 2014), $1000 Genome]
  • 9. Trends and Events: ILMN HiSeq XTen Specs • Sold in sets of 10 units ONLY (XTen = 10 sequencers), ~$10 million/XTen, shipments began in Jan 2014 • XTen produces 600 GBases/day @ 30x oversampling = 1.8 TBases per 3-day cycle = 54 TBytes per 3-day cycle = $1000 per genome = 18,000 genomes/year/XTen ~ 4,000,000 births/year (US, 2012) => Neonatal sequencing is a reality (with 200 of today’s systems)
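The slide 9 figures can be checked with a few lines of arithmetic. This is a minimal sketch that only re-derives numbers already on the slide. The variable names are illustrative, and the bytes-per-base value is simply what the slide's 54 TBytes and 1.8 TBases imply, not a documented instrument spec.

```python
import math

# Figures quoted on slide 9 (variable names are illustrative).
GBASES_PER_DAY = 600
CYCLE_DAYS = 3
TBYTES_PER_CYCLE = 54
GENOMES_PER_YEAR_PER_XTEN = 18_000
US_BIRTHS_PER_YEAR = 4_000_000

# 600 GBases/day over a 3-day cycle, expressed in TBases.
tbases_per_cycle = GBASES_PER_DAY * CYCLE_DAYS / 1000

# Storage overhead implied by the slide's own numbers (an inference, not a spec).
implied_bytes_per_base = TBYTES_PER_CYCLE / tbases_per_cycle

# Number of XTen installations needed to sequence every US newborn.
xten_needed = math.ceil(US_BIRTHS_PER_YEAR / GENOMES_PER_YEAR_PER_XTEN)

print(tbases_per_cycle)               # TBases per cycle
print(round(implied_bytes_per_base))  # bytes stored per sequenced base
print(xten_needed)                    # 223
```

The result of 223 installations is in line with the slide's rounded "200 of today's systems".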
  • 10. Summary: Major Impact on Social Fabric. Diseases soon to be gone: • Muscular dystrophy • Cystic fibrosis • Albinism • Phenylketonuria • Hemophilia. Paternity Tests: fact: US paternity fraud rate is 1 in 25. More Troubling: Huntington’s Disease: allow? Inherited Cancers (10% !!!): allow? http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979 http://en.wikipedia.org/wiki/Paternity_fraud http://www.cancer.org/cancer/cancercauses/geneticsandcancer/heredity-and-cancer
  • 11. Singapore: Government Sponsored Matchmaking • Some people have more desirable genes than others. • “Our government wants smart ladies to meet smart guys to get smart children.” ~ Annie Chan, Club2040 (Singapore matchmaking agency) http://www.nytimes.com/2008/04/29/world/asia/29iht-sing.1.12428974.html
  • 12.
  • 13.
  • 14. Why hasn’t this happened yet?
  • 15. The Evolving Genomics Workload: DNA Specimen, DNA Sequencing, Primary Analytics, Apps
  • 16. DNA Sequencing Value Chain [Chart: % Effort (0 to 100); Pre-NGS (~2000), Now, Future (~2020)] Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  • 17. Bottleneck @ Primary Analytics: DNA Specimen, DNA Sequencing, Primary Analytics [Fix this], Apps
  • 18. Sequence is Becoming Free • DNA sequencing effectively becomes free • Commoditization pattern • Huge influx of inexpensive data • Creates new medical and biotech use-cases [Chart: % Effort (0 to 100); Pre-NGS (~2000), Now, Future ~]
  • 19. Experiment Design and “Downstream” Analytics • Specialization will grow to 100% effort • This is the desirable scenario • Biologists ought to be doing biology [Chart: % Effort (0 to 100); Pre-NGS (~2000), Now, Future ~; ANALYTICS]
  • 20. Data Management (1º Analytics) Bottleneck • Time currently being spent on BigData problems • Not ideal • Physicians & Biologists need help from CS & SW Engineers [Chart: % Effort (0 to 100); Pre-NGS (~2000), Now, Future (~2020)]
  • 21. Just Remember the Diamond [Chart: % Effort (0 to 100); Pre-NGS (~2000), Now, Future (~2020)]
  • 22. DNA Sequencing Meets MapReduce
  • 23. Parallelize Primary Analytics: .fastq, short read alignment, reads & mappings, genotype calling, .vcf
  • 24. Sequence Analysis, Quick Overview. fragment1: […] GACTAGA; fragment2: ACAGTTTACA; fragment3: AGATA--AGA; fragment4: AACAGCTTACA […]; fragment5: CTATAGATAA; referenceDNA: […] GATTACAGATTACAGATTACA […]; sampleDNA: […] GACTACAGATAACAGATTACA […]
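The difference between the reference and sample sequences on slide 24 can be made concrete with a position-by-position comparison. This is a minimal sketch: the two strings are the referenceDNA and sampleDNA rows from the slide, while the function name and the zero-based coordinates are illustrative choices. Real genotype calling works from many overlapping fragments rather than a finished sample sequence.

```python
# referenceDNA and sampleDNA rows from slide 24 (spaces removed).
reference = "GATTACAGATTACAGATTACA"
sample = "GACTACAGATAACAGATTACA"

def mismatches(ref, smp):
    """Return (position, reference base, sample base) for every differing column."""
    return [(i, r, s) for i, (r, s) in enumerate(zip(ref, smp)) if r != s]

print(mismatches(reference, sample))  # [(2, 'T', 'C'), (10, 'T', 'A')]
```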
  • 25. What is the (Probable) Color of Each Column?
  • 26. Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row. O(rows*cols) + O(1 col) memory
  • 27. Which Columns are (probably) Not White? Strategy 2: examine foreach row, keep running tallies. O(rows) + O(rows*cols) memory
  • 28. Which Columns are (probably) Not White? Strategy 3: rotate matrix, examine foreach column. O(rows log rows) + O(cols) + O(1 col) memory
  • 29. Comparison of Strategies. Strategy 1: • Low mem req • Random access pattern, many ops • O(rows*cols) + O(1 col) memory. Strategy 2: • High mem req • Sequential access pattern • O(rows) + O(rows*cols) memory. Strategy 3: • Low mem req • Sequential access pattern • Requires Sort • O(rows log rows) + O(cols) + O(1 col) memory
  • 30. Comparison of Strategies. Strategy 1: • Low mem req • Random access pattern, many ops • O(rows*cols) + O(1 col) memory. Strategy 2: • High mem req • Sequential access pattern • O(rows) + O(rows*cols) memory. Strategy 3: • Low mem req • Sequential access pattern • Requires Sort • O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory. As # of rows & columns increases, Strategy 3 becomes more attractive
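Strategies 2 and 3 from the slides above can be sketched side by side. This is an illustrative sketch, not the presenter's code: the reads are hypothetical, "not white" is taken to mean "the majority base in a column differs from the reference", and the in-memory `sorted` call stands in for the external or distributed sort a real pipeline would need.

```python
from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter

REFERENCE = "GATTACAGATTACA"
# Hypothetical aligned reads as (start column, bases); not taken from the slides.
READS = [
    (0, "GACTACA"),
    (1, "ACTACAG"),
    (2, "CTACAGA"),
    (5, "CAGATTA"),
    (7, "GATTACA"),
]

def cells(reads):
    """Yield one (column, base) cell per aligned base, in row order."""
    for start, bases in reads:
        for offset, base in enumerate(bases):
            yield start + offset, base

def strategy2(reads, reference):
    """Scan row by row, keeping a running tally for every column at once."""
    tallies = defaultdict(Counter)
    for col, base in cells(reads):
        tallies[col][base] += 1
    return sorted(col for col, tally in tallies.items()
                  if tally.most_common(1)[0][0] != reference[col])

def strategy3(reads, reference):
    """Rotate the matrix (sort cells by column), then tally one column at a time."""
    variant_cols = []
    rotated = sorted(cells(reads), key=itemgetter(0))
    for col, group in groupby(rotated, key=itemgetter(0)):
        tally = Counter(base for _, base in group)
        if tally.most_common(1)[0][0] != reference[col]:
            variant_cols.append(col)
    return variant_cols

print(strategy2(READS, REFERENCE))  # [2]
print(strategy3(READS, REFERENCE))  # [2]
```

Both functions give the same answer. The difference is in what must be held in memory: Strategy 2 keeps a tally for every column for the whole scan, while Strategy 3 only ever holds the tally for the column it is currently examining, at the cost of the sort.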
  • 31. Primary Sequence Analysis (ETL), MapReduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90° (O(mn)) / 1 (O(mn) + O(n log n)) / s Hello!
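The MAP, MAP, REDUCE shape on this slide can be imitated in a single process. The sketch below is an assumption-laden toy, not a Hadoop job: the reads, the shard count, the fewest-mismatches "aligner", and all function names are hypothetical, and the shuffle step plays the role of the matrix rotation by routing every (column, base) pair to the shard that owns that column.

```python
from collections import Counter, defaultdict

REFERENCE = "GATTACAGATTACA"
READS = ["GACTACA", "CTACAGA", "CAGATTA", "GATTACA"]  # hypothetical reads
NUM_SHARDS = 2                                        # hypothetical shard count

def map_align(read):
    """MAP: toy alignment that picks the offset with the fewest mismatches."""
    def cost(off):
        return sum(a != b for a, b in zip(read, REFERENCE[off:off + len(read)]))
    return min(range(len(REFERENCE) - len(read) + 1), key=cost)

def map_cells(read, offset):
    """MAP: emit one (column, base) pair per aligned base."""
    return [(offset + i, base) for i, base in enumerate(read)]

def shuffle(pairs):
    """Rotate: route each pair to a shard by column and group bases per column."""
    shards = [defaultdict(list) for _ in range(NUM_SHARDS)]
    for col, base in pairs:
        shards[col % NUM_SHARDS][col].append(base)
    return shards

def reduce_call(col, bases):
    """REDUCE: call the majority base; report it only if it differs from the reference."""
    called = Counter(bases).most_common(1)[0][0]
    return (col, REFERENCE[col], called) if called != REFERENCE[col] else None

pairs = [pair for read in READS for pair in map_cells(read, map_align(read))]
calls = []
for shard in shuffle(pairs):          # each shard could run on a separate worker
    for col, bases in shard.items():
        call = reduce_call(col, bases)
        if call is not None:
            calls.append(call)
calls.sort()
print(calls)  # [(2, 'T', 'C')]
```

Because each column lives in exactly one shard, the reduce work divides across shards, which is the point of the "÷ shards" terms on the comparison slide.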
  • 32. See also: Twitter Algebird – Parallel Linear Algebra Library for Scala / MapReduce
  • 33. First App You’ll Likely See: Clinical Genomics
  • 34. Clinical Sequencing Business Process Workflow: Physician, Patient, Clinic, blood/saliva, Clinical Lab, Analytics, extract
  • 35. One Bad MTHFR: MTHFR C677T. Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
  • 36. What’s the Impact on Human Evolution? More Reading: The Red Queen: Sex and the Evolution of Human Nature
  • 37. Clinical Sequencing Business Process Workflow: Physician, Patient, Clinic, blood/saliva, Clinical Lab, Analytics, extract
  • 38. Clinical Genomics, Information Systems Perspective: Compressed Structured Base4 Data; Uncompressed Unstructured Base2 Data; extract; Base4=>Base2 Converter [[ DE-STRUCTURES ]]; “BI” Reporting and Visualization tools; Physician, Patient, Analyst, Stakeholder
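One way to read the Base4 versus Base2 framing on this slide is that each nucleotide carries exactly two bits of information. The sketch below illustrates that reading only; the A/C/G/T bit assignment and the function names are arbitrary assumptions, and real sequencer output formats store far more than the bases themselves.

```python
# Arbitrary (assumed) 2-bit code for the four nucleotides.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    return "".join(BASES[(value >> (2 * i)) & 0b11] for i in reversed(range(length)))

packed = pack("GATTACA")
print(bin(packed))        # 0b10001111000100
print(unpack(packed, 7))  # GATTACA
```

The length is passed to `unpack` explicitly because leading `A` bases encode as zero bits and would otherwise be lost.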
  • 39. Clinical Genomics, Information Systems Perspective: Physician, Patient, Analyst, Stakeholder; ETL; Reporting and Viz; Data Store; Analytics
  • 40. Clinical Genomics, Information Systems Perspective: Physician, Patient, Analyst, Stakeholder; ETL; Reporting and Viz; Data Store; Analytics; 1º analytics; 2º analytics. Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  • 41. Clinical Applications: Performance Matters. MapR Filesystem; NFS; DNA Sequencer (x3); Raw DNA (x3); 1º Analytics; Raw DNA, Raw DNA, SNP calls; Static Clinical Reporting; Physician, Patient; Reference DBs; SNP DB; ETL; 2º Analytics; Researcher, Subject
  • 42. Variant Collection Enables Downstream Apps • GWAS Association Studies • Versioned, Personalized Medicine • Companion Diagnostics. SNP DB; 2º Analytics; New Markets. Hello! More linear algebra [Spark, Summingbird, Lambda Architecture Slides]
  • 43. First Bottleneck Removed. Now What? [Chart: % Effort (0 to 100); Pre-NGS (~2000), Now, Future ~; ANALYTICS]
  • 44. Next Bottleneck, Of Course!
  • 45. Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  • 46. Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design-specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise. phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4. At risk of over-simplifying as business-level concept…
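The linear model on slide 46 can be fitted with ordinary least squares. This is a deliberately simplified sketch under stated assumptions: genotype is coded as an allele count (0, 1 or 2), the data are made up for illustration, and the four error terms ε1 to ε4 are lumped into a single residual rather than modelled separately.

```python
def fit_line(genotypes, phenotypes):
    """Least-squares estimates of alpha and beta in phenotype = alpha*genotype + beta."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    alpha = sxy / sxx
    beta = mean_p - alpha * mean_g
    return alpha, beta

# Made-up data: allele counts and a measured trait for six individuals.
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

alpha, beta = fit_line(genotypes, phenotypes)
# Whatever the line does not explain is the combined epsilon term.
residuals = [p - (alpha * g + beta) for g, p in zip(genotypes, phenotypes)]
print(round(alpha, 3), round(beta, 3))
```

Separating the residual into background, design, environment and assay components needs extra covariates per individual, which is where the combinatorial growth discussed next comes from.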
  • 47. HUGE PROBLEM: COMBINATORIAL EXPLOSION
  • 48. What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental prioritized updates • No batch processing • Decouple computational results from data size. Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • 49. Solution: Percolate. Denormalize and Percolate: SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies; (re)prioritize & (re)process; service queries; drive dashboards; create reports; denormalize for display; buffer; New models. @allenday on percolators: http://slidesha.re/1qSXCKw
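The "iterative, incremental prioritized updates" idea can be illustrated with a small in-memory toy. This is not Google Percolator's API or the presenter's system: the class, its methods, the SNP identifiers and the priority values are all hypothetical. It only shows the pattern of touching the keys affected by new data and reprocessing them in priority order instead of rerunning a batch job.

```python
import heapq
from collections import defaultdict

class IncrementalIndex:
    """Toy incremental index: each observation touches only its own key."""

    def __init__(self):
        self.counts = defaultdict(int)  # (snp, phenotype) -> observations seen
        self.dirty = []                 # heap of (-priority, key) awaiting reprocessing
        self.report = {}                # denormalized view served to queries

    def observe(self, snp, phenotype, priority=1):
        """Record one new observation and mark its key as needing reprocessing."""
        key = (snp, phenotype)
        self.counts[key] += 1
        heapq.heappush(self.dirty, (-priority, key))

    def process(self, budget):
        """Refresh at most `budget` dirty keys, highest priority first.

        Refreshing is idempotent, so duplicate heap entries are harmless.
        """
        done = []
        while self.dirty and len(done) < budget:
            _, key = heapq.heappop(self.dirty)
            self.report[key] = self.counts[key]
            done.append(key)
        return done

index = IncrementalIndex()
index.observe("snp_a", "trait_x", priority=1)   # hypothetical identifiers
index.observe("snp_b", "trait_x", priority=5)
index.observe("snp_a", "trait_x", priority=1)

print(index.process(budget=1))  # [('snp_b', 'trait_x')]
print(index.report)             # {('snp_b', 'trait_x'): 1}
```

The cost of each `process` call depends on the budget and the number of dirty keys, not on the total amount of data stored, which is one reading of "decouple computational results from data size".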
  • 50. If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building. NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data' http://www.npr.org/2011/11/30/142893065
  • 51. If they were unlabeled, would you know which is which? • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • $$$$ Twitter’s Business Model. Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
  • 52. Robot Scientist. Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 53. Robot (Data?) Scientist. Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • 54.
  • 55. Q&A @allenday allenday@mapr.com allenday linkedin.com/in/allenday