
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data Science SG


First draft for upcoming Hadoop World presentation "Renaissance in Medicine" that gives an overview of the upcoming changes in medical practice that are enabled by BigData technologies. Specific algorithmic techniques are detailed that enable this use case.

  1. © 2014 MapR Technologies. Primary Sequence Analysis (ETL), MapReduce style
     [Diagram: .fastq => short read alignment => .bam => genotype calling => .vcf]
     Stage labels: MAP, MAP, REDUCE (rotate matrix 90º)
     Cost labels: (O(mn)) / 1; (O(mn) + O(n log n)) / s
     Hello!
  2. Renaissance in Medicine (Draft 1)
  3. High-Level Biomedical Goal: Improve Fitness
     Therapeutics => Diagnostics => Prognostics
     • Therapeutics => traditional medicine
     • Diagnostics => personalized medicine
       – NextGen public health
       – Requires hi-res mechanical knowledge
       – Reverse engineer how genetic variation leads to (un)desired traits
     • Prognostics => GATTACA (dys/eu)topia
       – Managed populations / NextGen eugenics
  4. Star Wars III: Revenge of the Sith
  5. Star Wars V: The Empire Strikes Back
  6. [No text on slide]
  7. Many DNA-Based Apps Coming*…
     • 2014: US$ 2B, mostly research, mostly chemical costs
     • 2020: US$ 20B, mostly clinical, mostly analytics costs
     * Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
     [Bar chart: 2014 vs. 2020, Clinical vs. Non-Clinical; vertical axis 0 to 25]
  8. (Even) Moore’s Law
     Stein. 2010. The case for cloud computing in genome informatics
     “(Even) Moore’s” begins in 2004 with Solexa (acquired by ILMN 2007)
     [Chart: Storage (MB/$) and DNA (bp/$); annotations: ILMN HiSeq XTen (Jan 2014), $1000 Genome]
  9. Trends and Events: ILMN HiSeq XTen Specs
     • Sold in sets of 10 units ONLY (XTen = 10 sequencers)
       ~ $10 million/XTen, shipments began in Jan 2014
     • XTen produces 600 GBases/day @ 30x oversampling
       = 1.8 TBases per 3-day cycle
       = 54 TBytes per 3-day cycle
       = $1000 per genome
       = 18,000 genomes/year/XTen
     ~ 4,000,000 births/year (US, 2012)
     => Neonatal sequencing is a reality (with 200 of today’s systems)
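The "200 of today's systems" figure can be checked from the slide's own numbers. The sketch below takes those figures at face value (the variable names are illustrative, and the slide does not say whether the daily output is per sequencer or per set of ten), and it gives roughly 222 sets, which the slide rounds to 200.

```python
# Figures copied from the slide; names are illustrative only.
GBASES_PER_DAY = 600
CYCLE_DAYS = 3
GENOMES_PER_YEAR_PER_XTEN = 18_000
US_BIRTHS_PER_YEAR = 4_000_000  # US, 2012

# 600 GBases/day over a 3-day cycle, expressed in TBases.
tbases_per_cycle = GBASES_PER_DAY * CYCLE_DAYS / 1000

# XTen sets needed to sequence every US newborn once.
xten_sets_needed = US_BIRTHS_PER_YEAR / GENOMES_PER_YEAR_PER_XTEN

print(f"{tbases_per_cycle} TBases per cycle")
print(f"{xten_sets_needed:.0f} XTen sets for all US births")
```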
  10. Summary: Major Impact on Social Fabric
      Diseases soon to be gone
      • Muscular dystrophy
      • Cystic fibrosis
      • Albinism
      • Phenylketonuria
      • Hemophilia
      Paternity Tests
      fact: US paternity fraud rate is 1 in 25
      More Troubling:
      Huntington’s Disease: allow?
      Inherited Cancers (10% !!!): allow?
      http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in
      http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979
      http://en.wikipedia.org/wiki/Paternity_fraud
      http://www.cancer.org/cancer/cancercauses/geneticsandcancer/heredity-and-cancer
  11. Singapore: Government Sponsored Matchmaking
      • Some people have more desirable genes than others.
      • “Our government wants smart ladies to meet smart guys to get smart children.” ~ Annie Chan, Club2040 (Singapore matchmaking agency)
      http://www.nytimes.com/2008/04/29/world/asia/29iht-sing.1.12428974.html
  12. [No text on slide]
  13. [No text on slide]
  14. Why hasn’t this happened yet?
  15. The Evolving Genomics Workload
      [Diagram labels: DNA Specimen, DNA Sequencing, Primary Analytics, Apps]
  16. DNA Sequencing Value Chain
      [Chart: % effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~2020)]
      Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  17. Bottleneck @ Primary Analytics
      [Diagram labels: DNA Specimen, DNA Sequencing, Primary Analytics, Apps]
      Fix this
  18. DNA sequencing effectively becomes free
      Commoditization pattern
      Huge influx of inexpensive data
      Creates new medical and biotech use-cases
      Sequence is Becoming Free
      [Chart: % effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~)]
  19. Specialization will grow to 100% effort
      This is the desirable scenario
      Biologists ought to be doing biology
      Experiment Design and “Downstream” Analytics
      [Chart: % effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~); highlighted: ANALYTICS]
  20. Time currently being spent on BigData problems
      Not ideal
      Physicians & Biologists need help from CS & SW Engineers
      Data Management (1º Analytics) Bottleneck
      [Chart: % effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~2020)]
  21. Just Remember the Diamond
      [Chart: % effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~2020)]
  22. DNA Sequencing Meets MapReduce
  23. Parallelize Primary Analytics
      [Diagram: .fastq => short read alignment => reads & mappings => genotype calling => .vcf]
  24. Sequence Analysis, Quick Overview
      […] G A C T A G A   fragment1
      A C A G T T T A C A   fragment2
      A G A T A - - A G A   fragment3
      A A C A G C T T A C A […]   fragment4
      C T A T A G A T A A   fragment5
      […] G A T T A C A G A T T A C A G A T T A C A […]   referenceDNA
      […] G A C T A C A G A T A A C A G A T T A C A […]   sampleDNA
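To make the goal of the analysis concrete, the sketch below compares the slide's referenceDNA and sampleDNA strings position by position. In practice the sample sequence is not known up front and has to be inferred from the fragments; the function name and the 0-based offsets are illustrative choices, not something the slides specify.

```python
# Sequences copied from the slide, with the spaces and […] markers removed.
reference = "GATTACAGATTACAGATTACA"
sample = "GACTACAGATAACAGATTACA"

def substitutions(ref, smp):
    """Return (offset, ref_base, sample_base) for every mismatching position."""
    return [(i, r, s) for i, (r, s) in enumerate(zip(ref, smp)) if r != s]

variants = substitutions(reference, sample)
print(variants)
```

The two reported positions are the single-base differences that genotype calling is meant to recover from the overlapping fragments.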
  25. What is the (Probable) Color of Each Column?
  26. Which Columns are (probably) Not White?
      Strategy 1: examine foreach column, foreach row
      O(rows*cols) + O(1 col) memory
  27. Which Columns are (probably) Not White?
      Strategy 2: examine foreach row. keep running tallies
      O(rows) + O(rows*cols) memory
  28. Which Columns are (probably) Not White?
      Strategy 3: rotate matrix. examine foreach column
      O(rows log rows) + O(cols) + O(1 col) memory
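Strategies 2 and 3 can be sketched side by side on a toy matrix. This is a minimal illustration under assumptions: each row stands for a read, each cell holds a one-letter color code with "W" for white, and the "probable" color of a column is taken to be its most common cell. The toy sort runs over every cell, so its cost is not meant to match the slide's bounds exactly.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical toy matrix: one string per row, one color code per cell.
rows = [
    "WWRWW",
    "WWRWB",
    "WWRWW",
    "WWWWW",
]

def strategy2(rows):
    """One pass over the rows, keeping a running tally for every column."""
    tallies = [Counter() for _ in rows[0]]
    for row in rows:
        for col, cell in enumerate(row):
            tallies[col][cell] += 1
    return [t.most_common(1)[0][0] for t in tallies]

def strategy3(rows):
    """Rotate: emit (column, cell) pairs, sort by column, scan one column at a time."""
    cells = sorted(
        ((col, cell) for row in rows for col, cell in enumerate(row)),
        key=itemgetter(0),
    )
    return [
        Counter(cell for _, cell in group).most_common(1)[0][0]
        for _, group in groupby(cells, key=itemgetter(0))
    ]

calls = strategy3(rows)
assert calls == strategy2(rows)  # both strategies agree on the toy input
not_white = [col for col, color in enumerate(calls) if color != "W"]
print(calls, not_white)
```

Strategy 2 holds a tally for every column at once, while strategy 3 only needs the cells of the current column in memory once the data is sorted, which is the trade-off the slides compare.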
  29. Comparison of Strategies
      Strategy 1: O(rows*cols) + O(1 col) memory
      • Low mem req
      • Random access pattern, many ops
      Strategy 2: O(rows) + O(rows*cols) memory
      • High mem req
      • Sequential access pattern
      Strategy 3: O(rows log rows) + O(cols) + O(1 col) memory
      • Low mem req
      • Sequential access pattern
      • Requires Sort
  30. Comparison of Strategies
      Strategy 1: O(rows*cols) + O(1 col) memory
      • Low mem req
      • Random access pattern, many ops
      Strategy 2: O(rows) + O(rows*cols) memory
      • High mem req
      • Sequential access pattern
      Strategy 3: O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory
      • Low mem req
      • Sequential access pattern
      • Requires Sort
      As # of rows & columns increases Strategy 3 becomes more attractive
  31. Primary Sequence Analysis (ETL), MapReduce style
      [Diagram: .fastq => short read alignment => .bam => genotype calling => .vcf]
      Stage labels: MAP, MAP, REDUCE (rotate matrix 90º)
      Cost labels: (O(mn)) / 1; (O(mn) + O(n log n)) / s
      Hello!
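The map, rotate, and reduce stages can be imitated in a single process. The sketch below is not the deck's implementation: it assumes the reads are already aligned (the alignment step is skipped), uses hypothetical reads cut from the slide 24 sample sequence, picks an arbitrary shard count, and calls a variant by simple majority vote.

```python
from collections import Counter, defaultdict

REFERENCE = "GATTACAGATTACAGATTACA"
# (offset, bases): hypothetical reads, assumed to be aligned already.
ALIGNED_READS = [
    (0, "GACTACA"),
    (2, "CTACAGAT"),
    (5, "CAGATAACA"),
    (7, "GATAACAG"),
    (10, "AACAGATT"),
]
SHARDS = 3  # hypothetical number of reducers

def map_read(offset, bases):
    """MAP: turn one row (read) into (column, base) pairs."""
    for i, base in enumerate(bases):
        yield offset + i, base

def shuffle(pairs, shards):
    """Rotate the matrix: route each column to one shard and group its bases."""
    partitions = [defaultdict(list) for _ in range(shards)]
    for pos, base in pairs:
        partitions[pos % shards][pos].append(base)
    return partitions

def reduce_column(pos, bases):
    """REDUCE: majority base for one column; None if it matches the reference."""
    call = Counter(bases).most_common(1)[0][0]
    return (pos, REFERENCE[pos], call) if call != REFERENCE[pos] else None

pairs = [p for offset, bases in ALIGNED_READS for p in map_read(offset, bases)]
variants = []
for partition in shuffle(pairs, SHARDS):
    for pos, bases in partition.items():
        variant = reduce_column(pos, bases)
        if variant is not None:
            variants.append(variant)
variants.sort()
print(variants)
```

Each shard only ever sees the columns routed to it, which is where the division by s in the slide's cost label comes from.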
  32. See also: Twitter Algebird – Parallel Linear Algebra Library for Scala / MapReduce
  33. First App You’ll Likely See: Clinical Genomics
  34. Clinical Sequencing Business Process Workflow
      [Diagram labels: Physician, Patient, Clinic, blood/saliva, Clinical Lab, Analytics, extract]
  35. One Bad MTHFR
      MTHFR C677T
      Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others.
      Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves.
      http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
  36. What’s the Impact on Human Evolution?
      More Reading: The Red Queen: Sex and the Evolution of Human Nature
  37. Clinical Sequencing Business Process Workflow
      [Diagram labels: Physician, Patient, Clinic, blood/saliva, Clinical Lab, Analytics, extract]
  38. Clinical Genomics, Information Systems Perspective
      [Diagram labels: Compressed Structured Base4 Data; Uncompressed Unstructured Base2 Data; extract; Base4=>Base2 Converter [[ DE-STRUCTURES ]]; “BI” Reporting and Visualization tools; Physician; Patient; Analyst; Stakeholder]
  39. Clinical Genomics, Information Systems Perspective
      [Diagram labels: Physician, Patient, Analyst, Stakeholder, ETL, Reporting and Viz, Data Store, Analytics]
  40. Clinical Genomics, Information Systems Perspective
      [Diagram labels: Physician, Patient, Analyst, Stakeholder, ETL, Reporting and Viz, Data Store, Analytics, 1º analytics, 2º analytics]
      Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  41. Clinical Applications: Performance Matters
      [Diagram labels: MapR Filesystem; NFS; DNA Sequencer (x3); Raw DNA; 1º Analytics; SNP calls; Static Clinical Reporting; Physician; Patient; Reference DBs; SNP DB; ETL; 2º Analytics; Researcher; Subject]
  42. Variant Collection Enables Downstream Apps
      • GWAS Association Studies
      • Versioned, Personalized Medicine
      • Companion Diagnostics
      [Diagram labels: SNP DB, 2º Analytics, New Markets]
      Hello! More linear algebra :) [Spark, Summingbird, Lambda Architecture Slides]
  43. First Bottleneck Removed. Now What?
      [Chart: % effort (0 to 100) across Pre-NGS (~2000), Now, and Future (~); highlighted: ANALYTICS]
  44. Next Bottleneck, Of Course!
  45. Example GWAS/SNP Analysis
      • Find me related SNPs…
        – From other experiments
      • Given a phenotype…
        – And an associated SNP from my experiment
      • That elucidate genetic basis of phenotype…
      • And rank order them by impact/likelihood/etc
  46. Example GWAS/SNP Analysis
      • Find me related SNPs…
        – From other experiments
      • Given a phenotype…
        – And an associated SNP from my experiment
      • That elucidate genetic basis of phenotype…
      • And rank order them by impact/likelihood/etc
      • In context of, e.g.
        – ε1: Racial, etc. background
        – ε2: Experimental design-specific concerns (e.g. familial IBD/IBS)
        – ε3: Environmental factors and penetrance
        – ε4: Assay-specific biases and noise
      phenotype = α × genotype + β + ε1 + ε2 + ε3 + ε4
      At risk of over-simplifying as business-level concept…
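The linear model on this slide can be fitted with ordinary least squares. The sketch below is a simplification under stated assumptions: the data are made up, genotype is coded as a count of alternate alleles (0, 1, or 2), and the four ε terms are lumped into a single residual, since separating them would need covariates the slide does not define.

```python
# Hypothetical data: genotype as alternate-allele count, phenotype as a measured trait.
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [1.1, 0.9, 3.2, 2.8, 5.1, 4.9]

def fit_line(x, y):
    """Ordinary least squares for y = alpha * x + beta."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    alpha = sxy / sxx
    return alpha, mean_y - alpha * mean_x

alpha, beta = fit_line(genotype, phenotype)

# Everything the line does not explain ends up here (ε1 + ε2 + ε3 + ε4 combined).
residuals = [p - (alpha * g + beta) for g, p in zip(genotype, phenotype)]
print(round(alpha, 3), round(beta, 3))
```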
  47. HUGE PROBLEM
      COMBINATORIAL EXPLOSION
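A quick count shows how fast the search space grows. The slide does not say which combinations it has in mind; the sketch below assumes unordered k-way combinations of SNPs, and the figure of one million SNPs is an illustrative order of magnitude, not a number from the deck.

```python
from math import comb

N_SNPS = 1_000_000  # assumed for illustration; not a figure from the slides

# Number of unordered k-way SNP combinations for small k.
for k in (1, 2, 3):
    print(f"{k}-way SNP combinations: {comb(N_SNPS, k):,}")
```

Multiplying these counts by the number of phenotypes, experiments, and contexts from the previous slide only makes the totals larger.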
  48. What’s a Percolator?
      • Google Percolator
        – “Caffeine” update 2010
      • Iterative, incremental prioritized updates
      • No batch processing
      • Decouple computational results from data size
      Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  49. Solution: Percolate
      SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
      [Diagram labels: Denormalize and Percolate; (re)prioritize & (re)process; service queries; drive dashboards; create reports; denormalize for display; buffer; New models]
      @allenday on percolators: http://slidesha.re/1qSXCKw
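The idea of incremental, prioritized updates can be sketched with a priority queue. This is a toy, single-process illustration and not Google Percolator's API or design (the cited paper builds on distributed transactions and notifications). The class name, the SNP ids, and the choice of a running mean as the denormalized summary are all hypothetical.

```python
import heapq
from collections import defaultdict

class ToyPercolator:
    """Toy incremental store: new data marks a key dirty; work is drained by priority."""

    def __init__(self):
        self.observations = defaultdict(list)  # raw values per SNP id
        self.summary = {}                      # denormalized view served to queries
        self.queue = []                        # min-heap of (priority, snp_id)
        self.dirty = set()

    def observe(self, snp_id, value, priority=1):
        """Record a new data point and schedule only the affected key."""
        self.observations[snp_id].append(value)
        if snp_id not in self.dirty:
            self.dirty.add(snp_id)
            heapq.heappush(self.queue, (priority, snp_id))

    def process(self, budget):
        """Recompute at most `budget` dirty keys, lowest priority number first."""
        done = []
        while self.queue and len(done) < budget:
            _, snp_id = heapq.heappop(self.queue)
            self.dirty.discard(snp_id)
            values = self.observations[snp_id]
            self.summary[snp_id] = sum(values) / len(values)
            done.append(snp_id)
        return done

p = ToyPercolator()
p.observe("rs1", 0.25, priority=5)
p.observe("rs2", 1.0, priority=1)
p.observe("rs2", 0.5, priority=1)
print(p.process(budget=1), p.summary)
```

The amount of work per call depends on the budget and on how many keys changed, not on the total size of the stored data, which is the decoupling the slide refers to.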
  50. If they were unlabeled, would you know which is which?
      Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
      NPR. 2011. The Search For Analysts To Make Sense Of ‘Big Data’ http://www.npr.org/2011/11/30/142893065
  51. If they were unlabeled, would you know which is which?
      • Identify network structures
      • Label them
      • Observe stimulus=>response space mapping
      • Purposefully target
      • $$$$
      Twitter’s Business Model
      Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
  52. Robot Scientist
      Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  53. Robot (Data?) Scientist
      Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  54. [No text on slide]
  55. Q&A
      @allenday
      allenday@mapr.com
      allenday linkedin.com/in/allenday
