
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI

Published in: Health & Medicine


  1. © 2015 MapR Technologies. Hadoop for Genomics: What you need to know
  2. Target Application: Alleviate / Prevent (Deterministic) Suffering. Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype / Phenotype / Individual Matrix; Cure & Prevent Disease; Medical Records; Patient
  3. DNA Sequencing, pre-2004. Chart (axis: years; series: CPU transistors/mm², HDD GB/mm², DNA bp/$ pre-2004)
  4. DNA Sequencing, 2004 Disruption. Chart (axis: years; series: CPU transistors/mm², HDD GB/mm², DNA bp/$ post-2004, DNA bp/$ pre-2004)
  5. DNA Sequencing, 2004 Disruption. Chart (axis: years; series: CPU transistors/mm², HDD GB/mm², DNA bp/$ post-2004, DNA bp/$ pre-2004). Similar disruption occurred for Internet traffic in the mid-1990s
  6. Effect: Many DNA-Based Apps Coming… • 2014: US$ 2B, mostly research, mostly chemical costs • 2020: US$ 20B, mostly clinical, mostly analytics costs. Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning. Chart (2014 vs. 2020; Clinical vs. Non-Clinical; scale 0 to 25)
  7. What Does Moore’s Law Feel Like? #Dataviz: Lara Croft 230 => 40,000 Polygons (1996-2014). http://steamcommunity.com/app/203160/discussions/0/846956188647169800/ http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
  8. Application: Forensics. http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/ http://snapshot.parabon-nanolabs.com/ http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
  9. Growth in Resource Capacity
  10. Disruption Circa 2000. Chart: NASDAQ Composite
  11. What Happened? What did winners do right to survive the .com recession? Chart: NASDAQ Composite
  12. Early 1990s: Early eCommerce Vendor Setup. Diagram labels: Storage; read/write (×2); Website; Back Office
  13. Late 1990s: Workload became too big. Diagram labels: Storage; read/write (×2); Website (×4); Back Office (×2)
  14. Google Publishes • 2003: Google Filesystem (aka GFS) – http://research.google.com/archive/gfs.html • 2004: MapReduce – http://research.google.com/archive/mapreduce.html • 2006: BigTable – http://research.google.com/archive/bigtable.html
  15. Scale-out with Google FS + MapReduce. Diagram labels: read/write (×2); Website (×4); Storage + Compute Cluster; Back Office (×2)
  16. Apache Software Foundation: Fast Follower of Google. MapReduce => Hadoop; Google FS => Hadoop FS; BigTable => HBase
  17. DNA Sequencing, post-2004. Chart: DNA Sequence; NASDAQ Composite
  18. DNA Sequencing, pre-2004. Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; Sequencer
  19. DNA Sequencing, post-2004. Diagram labels: Storage; write-only; read/write; High-Performance Compute Cluster; Coordinator / Edge Node; DNA Sequencer Cluster (e.g. Illumina X-Ten); HPC bottleneck; Sequencer back-pressure
  20. Solution: Implemented 2014 @ Sequencer Vendor (with MapR). Diagram labels: write-only; DNA Sequencer Cluster (e.g. Illumina X-Ten); Storage + Compute Cluster; Decentralize I/O (×2)
  21. Allows Secondary Analytics to Scale Out. Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype / Phenotype / Individual Matrix; Cure & Prevent Disease; Medical Records; Patient
  22. Allows Secondary Analytics to Scale Out. Chart: GATK / HPC method (flat after chromosome split); Hadoop / Spark method
  23. Secondary Analytics: Acute Pain Point. Diagram labels: FastQ Reads; Aligned Reads; Variants; ADAM + Avocado. Matrix rotation is very I/O intense. Local de novo is best… …only feasible with efficient rotations. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Zerbino & Birney. 2008
  24. Columnar Storage => Efficient Rotations. Genome Data Format Definition. Records: Record 1 = (A 1 Z); Record 2 = (B 1 Z); Record 3 = (C 1 Z). Row-based layout: A 1 Z | B 1 Z | C 1 Z. Column-based layout: A B C | 1 1 1 | Z Z Z. Other labels: Sorting; Group; MLlib
  25. Avro & Parquet • Apache fast followers of Google Protocol Buffers. • Application data is abstracted from structure. Storage and versioning efficiently handled internally. • Read/write codecs auto-generated for any language. • Avro: row-based records. • Parquet: columnar Avro. Improves compression and I/O profile. • ADAM: Genomics-specific formats in Parquet. Effectively optimized BAM and VCF for distributed computing.
  26. Downstream Analytics: GWAS/PheWAS. Diagram labels: FastQ Reads; Aligned Reads; Variants; Function; Phenotypes; ADAM + Avocado. Scalable GWAS/PheWAS: “Green Field” Territory
  27. Compute Engine Data Workflow: ADAM Pipeline. Diagram labels: FastQ; BAM; ADAM; ADAM-VCF; VCF; Avocado; ADAM; ADAM; Aligner. Super Fast • In-memory • Scalable compute context
  28. Target Application: Alleviate / Prevent Suffering. Diagram labels: Variant Calling; DNA Sequencer; Reads; Reference Genome; Genotype / Phenotype / Individual Matrix; Cure & Prevent Disease; Medical Records; Patient
  29. GWAS Overview (Genome-wide Association Study) • Which genome features are associated with phenotype X? https://en.wikipedia.org/wiki/Genome-wide_association_study
  30. PheWAS Overview (Phenome-wide …) • Which phenotypes are associated with genome variant X? http://www.tcpinnovations.com/drugbaron/phewas-the-tool-thats-revolutionizing-drug-development-that-youve-likely-never-heard-of/
  31. Genome × Phenome Analysis. For a given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix. Matrix axes: 𝛿1, 𝛿3, 𝛿5 (SPARSE, Billion+ Genotypes); ϕ1, ϕ3, ϕ5 (SPARSE, Billion+ Phenotypes)
  32. Disease Cause via Genome × Phenome Matrix Factorization • Row Eigenvectors of X represent – Sets of related phenotypes (by SNP) • Column Eigenvectors of Y represent – Sets of related SNPs (by phenotype). Diagram labels: 𝛿1, 𝛿3, 𝛿5; ϕ1, ϕ3, ϕ5; Principal Column Vector; Archetype Genotypes; Archetype Phenotypes; Principal Row Vector. Sparse Matrix Package is Actively Developed in Spark Community
  33. Generalized Approach: Genome × Phenome Tensor • Maintain individual identity • Aggregating individuals gives up statistical power • Leverage pedigrees – Individuals are not independent observations. Diagram labels: Variants (×2); Phenotypes (×2)
  34. Scalable Variant Store => Root out Disease Causes. Model P ~ F(G). Diagram labels: Genotypes; Med Record; Phenotypes, e.g. disease risk, drug response. Fortunately, this has already been done…
  35. Largest Biometric Database in the World: 1.2B PEOPLE
  36. Why Create Aadhaar? • India: 1.2 billion residents – 640,000 villages, ~60% live under $2/day – ~75% literacy, <3% pay income tax, <20% have bank accounts – ~800 million mobile, ~200-300 million migrant workers • Govt. spends about $25-40 billion on direct subsidies – Residents have no standard identity document – Most programs plagued with ghost and multiple identities causing leakage of 30-40%. Standardize identity => Stop leakage
  37. Aadhaar Biometric Capture & Index. Diagram label: Raw Digital Fingerprint
  38. Aadhaar Biometric ID Creation. F(x): unique features; G(x): uncommon features; H(x): other features • 900MM people loaded in 4 years • In production – 1MM registrations/day – 200+ trillion lookups/day • All built on MapR-DB (HBase). Diagram labels: Low Entropy + Unique; Low Entropy + Infrequent
  39. How Does this Relate to Genomics? F⁻¹(x): common features; F(x): unique features; G(x): uncommon features; H(x): other features. Same data shape and size • Aadhaar: 1B humans, 5MB minutiae • Genome: 7B humans, ~3M variants
  40. How Does this Relate to Genomics? F⁻¹(x): common features; F(x): unique features; G(x): uncommon features; H(x): other features. Phenotype: healthy or sick? Phenotype Partition => Low Entropy
  41. Takeaway 1: Don’t reinvent the wheel. Individuals and fingerprint minutiae (find rare minutiae to uniquely identify) ≈ medical records and genetic variants (find shared variants to get disease root cause)
  42. Takeaway 2: Evolution, not Revolution. Chart: DNA Sequence; NASDAQ Composite
  43. Thank You. @allenday // @mapr. Now a few slides about MapR’s product… …and proposed next actions
  44. The MapR Advantage • Scale Reliability Across the Enterprise – Advanced multi-tenancy – Business continuity – HA, DR • Speed – 2-7x faster than other Hadoop distros – Ultra-fast data ingest (100M data points per sec) – NFS & R/W file system • Real-time & Self-Service Data Exploration – On-the-fly SQL without up-front schema – Fast lookups and queries. Best Hadoop Platform for Data Warehouse Optimization & Analytics. Platform diagram labels: Security; Streaming; NoSQL & Search; Provisioning & coordination; ML, Graph; Workflow & Data Governance; Batch; SQL; INTEGRATED COMMERCIAL ENGINES; TOOLS; COMPUTE ENGINES; Batch; Interactive; Real-time; Online; Others; Management; Operations; Governance; Audits; Security; MapR-FS; MapR-DB; MapR Data Platform
  45. Genome Sequencing Quick Start Solution
  46. Quick Start Solutions: Speeding Time-to-Value. Components: SOLUTION TEMPLATE; KNOWLEDGE TRANSFER; DEPLOYMENT ARCHITECTURE. Solutions: Data Warehouse Optimization and Analytics; Security Log Analytics; Recommendation Engine; Genome Sequencing
  47. What’s in the Genome Sequencing Quick Start Solution? 6 nodes of MapR software; 3-4 week engagement; 3 Hadoop Professional Certifications
  48. Service Offering 1 – Resequencing with Hadoop: Reduces Storage Hardware Requirements; Accelerates Data Processing Time; Minimal impact to existing data pipelines. Service Offering 2 – Variant Analysis with NoSQL: Present data for exploration; Operationalize complex workflows; Web-scale performance
  49. Quick Start Service Engagement. Engagement includes: 1. Identification of data sources, transformations and reporting engines 2. Access and use of the solution template including source code 3. Training on customizing the solution template to the organization’s requirements 4. Deployment architecture document that enables a production deployment plan for the specific solution. Components: SOLUTION TEMPLATE; KNOWLEDGE TRANSFER; DEPLOYMENT ARCHITECTURE
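
Slide 24's contrast between row-based and column-based layouts can be sketched in a few lines of Python. This is a minimal illustration only, not how ADAM or Parquet are implemented: the field names (`name`, `count`, `flag`) and the helper functions are hypothetical, and simple run-length encoding stands in for the richer encodings a real columnar format applies.

```python
from typing import Dict, List, Tuple

# Hypothetical field names; the slide shows only the values (A 1 Z), (B 1 Z), (C 1 Z).
FIELDS = ("name", "count", "flag")

Record = Tuple[str, int, str]


def rows_to_columns(rows: List[Record]) -> Dict[str, list]:
    """Rotate row-based records into one contiguous list per field."""
    return {field: [row[i] for row in rows] for i, field in enumerate(FIELDS)}


def columns_to_rows(columns: Dict[str, list]) -> List[Record]:
    """Rotate the column-based layout back into row-based records."""
    return list(zip(*(columns[field] for field in FIELDS)))


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    encoded: List[Tuple[object, int]] = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


rows = [("A", 1, "Z"), ("B", 1, "Z"), ("C", 1, "Z")]
columns = rows_to_columns(rows)
encoded_count = run_length_encode(columns["count"])
encoded_flag = run_length_encode(columns["flag"])
```

Storing each field contiguously means a scan over one field reads only that column, and repeated values such as `1 1 1` collapse into a single run, which is the compression and I/O benefit slide 25 attributes to Parquet.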

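The genome × phenome count matrix from slide 31, and the GWAS and PheWAS questions from slides 29 and 30, can be sketched as a sparse dictionary of counts. This is an assumption-laden toy, not the deck's implementation: the individual, SNP, and phenotype identifiers are invented, and a plain `Counter` stands in for a distributed sparse-matrix store.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Cell = Tuple[str, str]  # (SNP id, phenotype id)


def genome_phenome_counts(
    genotypes: Dict[str, Set[str]], phenotypes: Dict[str, Set[str]]
) -> Counter:
    """Count, per (SNP, phenotype) cell, the individuals who have both."""
    counts: Counter = Counter()
    for individual, snps in genotypes.items():
        for phenotype in phenotypes.get(individual, set()):
            for snp in snps:
                counts[(snp, phenotype)] += 1
    return counts  # only non-zero cells are stored, so the matrix stays sparse


def phenotypes_for_snp(counts: Counter, snp: str) -> Dict[str, int]:
    """PheWAS-style lookup: which phenotypes co-occur with this SNP?"""
    return {p: n for (s, p), n in counts.items() if s == snp}


def snps_for_phenotype(counts: Counter, phenotype: str) -> Dict[str, int]:
    """GWAS-style lookup: which SNPs co-occur with this phenotype?"""
    return {s: n for (s, p), n in counts.items() if p == phenotype}


# Hypothetical toy population of three individuals.
genotypes = {"p1": {"snp1", "snp3"}, "p2": {"snp1"}, "p3": {"snp5"}}
phenotypes = {"p1": {"pheno1"}, "p2": {"pheno1", "pheno3"}, "p3": {"pheno5"}}

counts = genome_phenome_counts(genotypes, phenotypes)
```

Raw co-occurrence counts are only the matrix entries the slide describes; a real association study would add statistical testing on top, and slide 33 notes that aggregating individuals this way gives up statistical power.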