Hadoop as a Platform for Genomics - Strata 2015, San Jose

Personalized medicine holds much promise to improve the quality of human life.
However, personalizing medicine depends on genome analysis software that does not scale well. Given the potential impact on society, genomics takes first place among fields of science that can benefit from Hadoop.

A single human genome contains about 3 billion base pairs. This is less than 1 gigabyte of data, but the intermediate data that a DNA sequencer produces on the way to a sequenced human genome is many hundreds of times larger. Beyond the huge storage requirement, deep genomic analysis across large human populations also requires enormous computational capacity.
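
The "less than 1 gigabyte" figure can be checked with a short calculation. This is a minimal sketch that assumes each base (A, C, G, T) is packed into 2 bits with no compression or metadata; the multiplier used for the intermediate data is purely an illustrative assumption, since the text above only says "many hundreds of times larger".

```python
# Back-of-the-envelope size of one human genome.
# Assumption: 4 possible bases (A, C, G, T), so 2 bits per base, no metadata.
BASE_PAIRS = 3_000_000_000
BITS_PER_BASE = 2

genome_bytes = BASE_PAIRS * BITS_PER_BASE // 8
print(f"Packed genome: {genome_bytes / 1e6:.0f} MB")

# Hypothetical multiplier for sequencer intermediate data (illustrative only).
ASSUMED_MULTIPLIER = 500
intermediate_bytes = genome_bytes * ASSUMED_MULTIPLIER
print(f"Intermediate data at {ASSUMED_MULTIPLIER}x: {intermediate_bytes / 1e9:.0f} GB")
```

Under these assumptions the packed genome is 750 MB, which is consistent with the claim that the finished sequence is small compared with the data needed to produce it.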

Interestingly enough, while genome scientists have adopted the concept of MapReduce for parallelizing I/O, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally.
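
To make the vertical versus horizontal distinction concrete, here is a minimal MapReduce-style sketch in plain Python. It is not GATK's or Hadoop's implementation; the record shape `(position, base)` and the function names are hypothetical. The point is that when the map step only touches its own partition, the partitions could live on different machines and the reduce step would still produce the same answer.

```python
from collections import Counter
from functools import reduce

def partition(records, p):
    """Split records into p partitions (round-robin)."""
    return [records[i::p] for i in range(p)]

def map_partition(reads):
    """Map step: count observed bases per position within one partition."""
    counts = Counter()
    for position, base in reads:
        counts[(position, base)] += 1
    return counts

def merge_counts(left, right):
    """Reduce step: merge two partial count tables."""
    return left + right

# Hypothetical toy input: (position, observed base) pairs.
reads = [(0, "A"), (0, "A"), (1, "C"), (1, "T"), (2, "G"), (1, "T")]

partials = [map_partition(part) for part in partition(reads, 3)]
total = reduce(merge_counts, partials, Counter())
print(total)
```

Because each partition is processed independently, the result does not depend on the number of partitions, which is the property a horizontally scalable framework relies on.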

Published in: Health & Medicine

  1. Hadoop as a Platform for Genomics. @AllenDay, Chief Scientist; Sungwook Yoon, Data Scientist. Data Science @MapR
  2. DNA Sequencing, pre-2004 [Chart over years: CPU transistors/mm², HDD GB/mm², DNA bp/$ (pre-2004)] (© 2014 MapR Technologies)
  3. DNA Sequencing, 2004 Disruption [Chart over years: CPU transistors/mm², HDD GB/mm², DNA bp/$ (pre-2004), DNA bp/$ (post-2004)]
  4. DNA Sequencing, 2004 Disruption [Chart over years: CPU transistors/mm², HDD GB/mm², DNA bp/$ (pre-2004), DNA bp/$ (post-2004)] Similar disruption occurred for Internet traffic in mid-1990s
  5. Effect: Many DNA-Based Apps Coming… • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs. Source: Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning [Bar chart: 2014 vs 2020, Clinical vs Non-Clinical]
  6. 1. What Kind of Analytics Apps? 2. How do they Work?
  7. Target Audience • Fluency in computing, math • Basic knowledge of genetics, DNA …so expect some encapsulated complexity http://xkcd.com/803/
  8. Clinical Sequencing Business Process Workflow [Diagram labels: Physician, Patient, Clinic, blood/saliva, Clinical Lab, extract, Analytics]
  9. Step 1: Identify all the Single Nucleotide Polymorphisms (SNPs) • Currently ~12MM known SNPs • Each person has a unique Genotype – Typically 3-5MM SNPs – Relative to a reference human – diff this.human other.human, essentially • Inherited from parents • Inexpensive to find as sequencing costs have plummeted http://learn.genetics.utah.edu/content/pharma/snips/
  10. Step 2: Characterize all the SNPs (ML, AI) [Diagram: JOIN with other data & algorithms]
  11. Step 3: Use Genotype to Customize Therapy. Innovation Opportunities. “Nil nocere” (do no harm). Table (Pop. Freq / Drug A Response / Drug B Response): 10% / Good / Good; 30% / Poor / Fair; 30% / Excellent / Poor; 30% / Good, but Toxic / Fair
  12. Jan 30: Obama Unveils “Precision Medicine” Initiative. “Most medical treatments have been designed for the ‘average patient’ … treatments can be very successful for some patients but not for others.” http://www.msnbc.com/msnbc/obama-seeks-215-million-personalized-medicine
  13. Application: Forensic Analysis http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/ http://snapshot.parabon-nanolabs.com/ http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
  14. Moore’s Law #Dataviz: Lara Croft 230=>40,000 Polygons (1996-2014) http://steamcommunity.com/app/203160/discussions/0/846956188647169800/ http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
  15. 1. What Kind of Analytics Apps? 2. How do they Work?
  16. Genome Sequencing in a Nutshell [Diagram labels: Reference Human, De novo sequencing + assembly, Reference Genome; Patient, Resequencing, Patient Genotype]
  17. Population-Scale Genome Biobanking
  18. GATK: Typical Tool for DNA=>Genotype Conversion. Advantages: • No consensus alternative… yet • Works! • Already deployed and being used to save lives. Disadvantages: • Map-Reduce but not Hadoop (and no plans to support) • Compute context cannot span multiple nodes • Inefficient use of shared memory (even within one node) • Inefficient asymmetric joins. No leverage of context, data locality
  19. GATK: flat after chromosome split
  20. Big Picture • N DNA input records; personal input is fixed at N records and trivial to cut into P partitions • All-SNPs catalog still growing; genotype space huge (≫ 8E37) • A good implementation scales O(N) ~ F(N,P) • But GATK is SLOW: scales O(N) ~ F(Genotypes) • GATK parallelization metrics / DEAD END attempts: https://github.com/allenday/sequencing-utils
  21. Bigger Picture: Human Suffering • Widely disliked. Reduction of suffering is good business. Even Bigger: • Is it morally wrong to allow others to suffer? • If you agree, and there’s a way to reduce suffering, then… • We can argue there is a moral imperative to build the most efficient, dependable, inexpensive solution possible
  22. From Feasible to Easy & Efficient
  23. Two Phases of Genome Data Analysis • Batch Sequence Processing – Align the reads to the correct location – Detect variants correctly through statistical modeling • Genome / Phenome Data Analysis – Find relevant Genotypes for Phenotypes – Find relevant Phenotypes for Genotypes
  24. Genome Processing Requirements [Diagram labels: Big Storage, Big Memory, Algorithms (Sorting, Group By, Clustering, Sparse Matrix), Distributed Processing, 2TB per person, Affordable Hardware, Forward Backward] Which Free SW Has This Solution?
  25. Genome Processing Needs More Than Hadoop • Strong In Memory Computation • Strong Sparse Matrix Computation. Which Free SW Has This Solution?
  26. Still One More Genome Data Format Definition. Records: Record 1 = (A 1 Z), Record 2 = (B 1 Z), Record 3 = (C 1 Z). Row-based layout: A 1 Z, B 1 Z, C 1 Z. Column-based layout: A B C, 1 1 1, Z Z Z. [Diagram labels: Sorting, Group, MLLib]
  27. ADAM Pipeline [Diagram labels: Compute Engines: Aligner, ADAM, Avocado; Data Workflow: FastQ, BAM, ADAM, ADAM-VCF, VCF] Super Fast • In-memory • Scalable compute context. A pipeline in a genomics data workflow is a sequence of data transformations from DNA sequence reads to variant calls.
  28. Scale with Machines [Figure from the ADAM Tech Report]
  29. That’s a lot, but it is just a start • Why do we want sequencing? – To catch criminals?? • Police State?? • Deeper, wider genome study may reveal – Future medicine – Cure for diseases – Maybe… find Heroes??
  30. Variants Accumulate – Need a Scalable Variant Store [Diagram labels: ADAM, ADAM-VCF]
  31. Genome × Phenome Analysis. For a given population, a given SNP 𝛿, and a given phenotype ϕ: count the number of occurrences as the value of the matrix. [Figure: sparse matrix indexed by a billion+ genotypes (𝛿1, 𝛿3, 𝛿5) and a billion+ phenotypes (ϕ1, ϕ3, ϕ5)]
  32. Interpreting Genome × Phenome Matrix Factorization Result • Row vectors of X represent – Archetype set of phenotypes • Column vectors of Y represent – Archetype set of genotypes [Figure labels: 𝛿1, 𝛿3, 𝛿5; ϕ1, ϕ3, ϕ5; Principal Column Vector, Archetype Genotypes; Archetype Phenotypes, Principal Row Vector] Sparse Matrix Package is Actively Developed in Spark Community
  33. Toward Heroes: Genome × Phenome Tensor • Aggregating over individuals with a matrix could ignore the correlations among genotypes and phenotypes • Maintain individual identity [Figure labels: Variants, Phenotypes]
  34. Tensor Factorization (Parafac) [Figure labels: Genome, Variants, Phenome ≈ Principal Variants 1, Principal Phenotypes 1]
  35. From Imaginable to Possible
  36. Genome needs Hadoop [Diagram labels: Patient, DNA Sequencer, Reads, Reference Genome, Variant Calling, Medical Records, Genotype/Phenotype/Individual Matrix, Cure & Prevent Disease]
  37. Scalable Variant Store – Data Mining Model P ~ F(G). Fortunately, this has already been done… [Diagram labels: Genotypes; Med Record; Phenotypes, e.g. disease risk, drug response]
  38. Largest Biometric Database in the World: 1.2B PEOPLE
  39. Why Create Aadhaar? • India: 1.2 billion residents – 640,000 villages, ~60% live under $2/day – ~75% literacy, <3% pay income tax, <20% have bank accounts – ~800 million mobile, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40%. Standardize identity => Stop leakage
  40. Aadhaar Biometric Capture & Index [Diagram labels: Raw, Digital Fingerprint]
  41. Aadhaar Biometric ID Creation [Legend: F(x): unique features; G(x): uncommon features; H(x): other features] • 900MM people loaded in 4 years • In production – 1MM registrations/day – 200+ trillion lookups/day • All built on MapR-DB (HBase)
  42. How Does this Relate to Genomics? [Legend: F(x): unique features; G(x): uncommon features; H(x): other features] Same data shape and size • Aadhaar: 1B humans, 5MB minutia • Genome: 7B humans, ~3M variants
  43. How Does this Relate to Genomics? [Legend: F⁻¹(x): common features; F(x): unique features; G(x): uncommon features; H(x): other features] Same data shape and size • Aadhaar: 1B humans, 5MB minutia • Genome: 7B humans, ~3M variants • Genome: variant × phenotype • Common variant => effect-causing gene F⁻¹(x)! Same data set operations
  44. Genotype/Phenotype/Individual Matrix [Diagram labels: individuals, fingerprint minutiae, ≈, medical records, genetic variants] Find genetic basis of fingerprints. Find genetic basis of disease.
  45. Thanks! Questions? @allenday, @mapr aday@mapr.com, syoon@mapr.com linkedin.com/in/allenday
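
The genome × phenome matrix described on slide 31 can be sketched as a sparse co-occurrence count. This is a minimal illustration in plain Python, not the speakers' Spark implementation; the toy population, the identifiers (`d1`, `p1`, and so on), and the function name are hypothetical. Only non-zero cells are stored, which is what makes a billion-by-billion matrix tractable in principle.

```python
from collections import Counter
from itertools import product

# Hypothetical toy population: each individual carries a set of SNPs
# and exhibits a set of phenotypes.
population = {
    "person_1": {"snps": {"d1", "d3"}, "phenotypes": {"p1"}},
    "person_2": {"snps": {"d1"}, "phenotypes": {"p1", "p3"}},
    "person_3": {"snps": {"d5"}, "phenotypes": {"p5"}},
}

def genome_phenome_counts(population):
    """Count, for every (SNP, phenotype) pair, how many individuals have both.

    Only pairs that occur at least once are stored, so the result is sparse.
    """
    counts = Counter()
    for record in population.values():
        for snp, phenotype in product(record["snps"], record["phenotypes"]):
            counts[(snp, phenotype)] += 1
    return counts

matrix = genome_phenome_counts(population)
print(matrix[("d1", "p1")])  # individuals with both SNP d1 and phenotype p1
print(len(matrix))           # number of non-zero cells
```

Aggregating this way discards which individual contributed each count, which is the limitation slide 33 raises when it moves from a matrix to a tensor with an individual axis.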