Genomics Crash Course for Data Engineers

Published in: Data & Analytics, Science

  1. © 2014 MapR Technologies 1
  2. Biomedical & Advertising Tech: Overarching Themes*
     • Eugenics & Determinism
     • Free will vs. Determinism
     • Media Tech & Privacy
     *Obligatory movie references… shout-out to my hometown LA
  3. Biomedical Research Goal: Therapeutics => Diagnostics => Prognostics
     • Therapeutics => traditional medicine
     • Diagnostics => personalized medicine
       – NextGen public health
       – Requires hi-res mechanical knowledge
       – Reverse engineer how genetic variation leads to (un)desired traits
     • Prognostics => GATTACA (dys/eu)topia
       – Managed populations / NextGen eugenics
  4. Star Wars III: Revenge of the Sith
  5. Star Wars V: The Empire Strikes Back
  6. Genetic Basis of Facial Features
     self-reported values of {sex, ancestry} + observer scores {race, sex} + 3D facial scan + genome scan
     => Allelic model of 20 genes that determine facial characteristics
     Claes, et al. 2014. Modeling 3D Facial Shape from DNA
  7. Genetic Basis of Facial Features (continued)
     Claes, et al. 2014. Modeling 3D Facial Shape from DNA
  8. So Get Ready… www.theness.com
  9. Genomics Crash Course for Data Engineers
  10. Me, Us
      • Allen Day, Principal Data Scientist, MapR
        5yr Hadoop Dev, R project contributor
        PhD, Human Genetics, UCLA Medicine
      • MapR
        Distributes open source components for Hadoop
        Adds major technology for performance, HA, industry standard APIs
      • See Also
        – “allenday” most places (twitter, github, etc.)
        – @mapR
  11. Clinical Sequencing Business Process Workflow
      [Diagram labels: Physician; Patient; Clinic; blood/saliva; Clinical Lab; Analytics; extract]
  12. One Bad MTHFR: MTHFR C677T
      Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others.
      Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves.
      http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
  16. Clinical Sequencing Business Process Workflow
      [Diagram labels: Physician; Patient; Clinic; blood/saliva; Clinical Lab; Analytics; extract]
  17. Clinical Genomics, Information Systems Perspective
      [Diagram labels: Compressed Structured Base4 Data; Uncompressed Unstructured Base2 Data; extract; Base4=>Base2 Converter [[ DE-STRUCTURES ]]; “BI” Reporting and Visualization tools; Physician; Patient; Analyst; Stakeholder]
  18. Clinical Genomics, Information Systems Perspective
      [Diagram labels: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics]
  19. Sequencing: “Even Moore’s Law”
      Stein. 2010. The case for cloud computing in genome informatics
  20. The Evolving Genomics Workload
      <= 1º analytics: “current high ROI use cases”
      <= 2º analytics: “next-gen high ROI use cases”
      Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  21. Clinical Genomics, Information Systems Perspective
      [Diagram labels: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics]
      • 1º analytics
      • 2º analytics: Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  22. Sequence Analysis, Quick Partial Details
      fragment1:    […] G A C T A G A
      fragment2:    A C A G T T T A C A
      fragment3:    A G A T A - - A G A
      fragment4:    A A C A G C T T A C A […]
      fragment5:    C T A T A G A T A A
      referenceDNA: […] G A T T A C A G A T T A C A G A T T A C A […]
      patient__DNA: […] G A C T A C A G A T A A C A G A T T A C A […]
  23. What is the (Probable) Color of Each Column?
  24. Which Columns are (probably) Not White?
      Strategy 1: examine foreach column, foreach row
      O(rows*cols) + O(1 col) memory
  25. Which Columns are (probably) Not White?
      Strategy 2: examine foreach row. keep running tallies
      O(rows) + O(rows*cols) memory
  26. Which Columns are (probably) Not White?
      Strategy 3: rotate matrix. examine foreach column
      O(rows log rows) + O(cols) + O(1 col) memory
  27. Comparison of Strategies
      • Strategy 1: O(rows*cols) + O(1 col) memory
        – Low mem req
        – Random access pattern, many ops
      • Strategy 2: O(rows) + O(rows*cols) memory
        – High mem req
        – Sequential access pattern
      • Strategy 3: O(rows log rows) + O(cols) + O(1 col) memory
        – Low mem req
        – Sequential access pattern
        – Requires Sort
  28. Comparison of Strategies
      • Strategy 1: O(rows*cols) + O(1 col) memory
        – Low mem req
        – Random access pattern, many ops
      • Strategy 2: O(rows) + O(rows*cols) memory
        – High mem req
        – Sequential access pattern
      • Strategy 3: O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory
        – Low mem req
        – Sequential access pattern
        – Requires Sort
      As # of rows & columns increases, Strategy 3 becomes more attractive
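The three strategies from slides 24 to 28 can be sketched on a toy pileup. This is only an illustration under stated assumptions: the reference string, the reads, and all function names below are hypothetical, a "not white" column is taken to mean a position where the majority base differs from the reference, and a simple majority vote stands in for real genotype calling.

```python
from collections import Counter
from itertools import groupby

# Hypothetical toy data: each read is (start_column, bases).
REFERENCE = "GATTACAGATTACA"
READS = [
    (0, "GACTACA"),
    (1, "ACTACAG"),
    (2, "CTACAGA"),
    (3, "TAGAGAT"),  # contains one sequencing error at column 5
    (5, "CAGATAA"),
    (7, "GATAACA"),
    (8, "ATAACA"),
]


def consensus(bases):
    """Majority vote over the bases observed in one column."""
    return Counter(bases).most_common(1)[0][0]


def strategy1(reads, n_cols):
    """Foreach column, foreach row: rescans every read for every column."""
    calls = {}
    for col in range(n_cols):
        column = [seq[col - start] for start, seq in reads
                  if start <= col < start + len(seq)]
        if column:
            calls[col] = consensus(column)
    return calls


def strategy2(reads):
    """Foreach row: one pass over the reads, tallies for all columns held at once."""
    tallies = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            tallies.setdefault(start + offset, Counter())[base] += 1
    return {col: tally.most_common(1)[0][0] for col, tally in tallies.items()}


def strategy3(reads):
    """Rotate the matrix: sort cells by column, then stream one column at a time."""
    cells = sorted((start + offset, base)
                   for start, seq in reads
                   for offset, base in enumerate(seq))
    return {col: consensus(base for _, base in group)
            for col, group in groupby(cells, key=lambda cell: cell[0])}


def variant_columns(calls, reference):
    """Columns whose consensus base differs from the reference."""
    return sorted(col for col, base in calls.items() if base != reference[col])


calls1 = strategy1(READS, len(REFERENCE))
calls2 = strategy2(READS)
calls3 = strategy3(READS)
variants = variant_columns(calls3, REFERENCE)
```

All three functions return the same calls on this data; they differ only in access pattern and in how much state is held at once, which is the trade-off the comparison slides describe.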
  29. 1º Sequence Analysis (ETL), MapReduce style
      • Files: .fastq => .bam => .vcf
      • Steps: short read alignment, genotype calling
      • Stages: MAP, MAP, REDUCE (rotate matrix 90º)
      • (O(mn)) / 1
      • (O(mn) + O(n log n)) / s
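The "REDUCE, rotate matrix 90º" stage can be imitated in plain Python to show where the sharding comes from. This is a rough single-process sketch, not Hadoop code: `map_read`, `shuffle`, `reduce_column`, the modulo partitioner, and the toy data are all assumptions made for illustration, and a majority vote again stands in for a real genotype caller.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each aligned read is (start_column, bases).
REFERENCE = "GATTACAGATTACA"
READS = [
    (0, "GACTACA"),
    (1, "ACTACAG"),
    (2, "CTACAGA"),
    (5, "CAGATAA"),
    (7, "GATAACA"),
    (8, "ATAACA"),
]


def map_read(start, seq):
    """MAP: emit one (column, base) pair per aligned base."""
    for offset, base in enumerate(seq):
        yield start + offset, base


def shuffle(pairs, n_shards):
    """Group pairs by column and assign each column to one shard.

    This is the 'rotate matrix' step: rows (reads) become columns (positions).
    A modulo partitioner is assumed here purely for simplicity.
    """
    shards = [defaultdict(list) for _ in range(n_shards)]
    for col, base in pairs:
        shards[col % n_shards][col].append(base)
    return shards


def reduce_column(col, bases, reference):
    """REDUCE: call the majority base; report it only if it differs from the reference."""
    call = Counter(bases).most_common(1)[0][0]
    if call != reference[col]:
        return (col, reference[col], call)
    return None


def call_variants(reads, reference, n_shards=3):
    pairs = (pair for start, seq in reads for pair in map_read(start, seq))
    records = []
    for shard in shuffle(pairs, n_shards):  # each shard could run on its own worker
        for col in sorted(shard):
            record = reduce_column(col, shard[col], reference)
            if record is not None:
                records.append(record)
    return sorted(records)


variants = call_variants(READS, REFERENCE)
```

Because each column lands in exactly one shard, the reducers never need to see each other's data, which is what lets the cost on the slide be divided by s.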
  30. Crossbow (MapReduce Strategy, implemented)
      Langmead, et al. 2009. Searching for SNPs with cloud computing
  31. Ion Flux (MapReduce Strategy, implemented for Enterprise)
      • Sequencing workflow in MapReduce (Hadoop, Cascading, Amazon Elastic M/R)
      • Integrated with Ion Torrent as a plugin to stream sequence to the cloud
      • Emphasis on scalability and latency
        – assay->clinical report turnaround in < 24h
      • Compare to fast-follower stack ILMN MiSeq+BaseSpace
      http://aws.amazon.com/solutions/case-studies/ion-flux/
      http://ionflux.com
  32. Non-Genomics Digression, 1 of 2: Data Warehouse ETL Offload
  33. The Problem
      • Major telecom vendor
      • Key step in billing pipeline handled by data warehouse (EDW)
      • EDW at maximum capacity
      • Multiple rounds of software optimization already done
      • Revenue limiting (= career limiting) bottleneck
  34. Three Options
      1. No more revenue growth
      2. Increase EDW size
         – Expensive
         – Known to not scale well
      3. Find a more scalable solution
  35. Original Flow – ELTL
      [Diagram labels: ETL; CDR billing records; Billing reports; Data Warehouse; Customer bills]
  36. Simplified Analysis – EDW Strategy
      • 70% of EDW consumed by ELTL processing
        – Caused by 10% of code (CDR transformations)
      • Growing EDW capacity to 200% adds capital cost of ~X
      • Indirect costs non-trivial (floor space, power)
      • Performance rises only to 150% (1.5x) because of poor division of labor
  37. With ETL Offload
      [Diagram labels: ETL; CDR billing records; Billing reports; Data Warehouse; Customer billing]
  38. Simplified Analysis – MapR Strategy
      • Hardware + MapR cost ~1/20X
      • ETL replacement development costs ~1/20X
      • Performance rises to 300% (3x)
  39. Price Performance
      • EDW strategy
        – 1.5x performance
        – Cost is X
      • MapR Strategy
        – 3x performance
        – Cost is 1/10X
      • 20x cost/performance advantage for MapR strategy
  40. Platform Advantages
      • Standard Hadoop ecosystem components allow efficient CDR parsing and ETL
      • MapR platform provides high availability, disaster recovery
      • MapR NFS interface allows direct load of transformed data
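The 20x figure follows directly from the numbers on slides 36 to 39. The short calculation below reproduces it with costs expressed in units of X; the helper name `perf_per_cost` is made up for this sketch.

```python
from fractions import Fraction


def perf_per_cost(performance, cost):
    """Performance delivered per unit of cost (cost in units of X)."""
    return Fraction(performance) / Fraction(cost)


# EDW strategy: 1.5x performance for a cost of X.
edw = perf_per_cost(Fraction(3, 2), 1)

# MapR strategy: 3x performance; hardware + MapR (~1/20 X) plus
# ETL replacement development (~1/20 X) gives a cost of 1/10 X.
mapr_cost = Fraction(1, 20) + Fraction(1, 20)
mapr = perf_per_cost(3, mapr_cost)

advantage = mapr / edw
print(f"EDW: {edw}, MapR: {mapr}, advantage: {advantage}x")
```

In other words, the MapR strategy delivers 30 units of performance per X against 1.5 for the EDW strategy, and 30 / 1.5 = 20.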
  41. Non-Genomics Digression, 2 of 2
  42. <Recommendation System. Redacted>
  43. Hybrid Use-Cases
  44. MapR Data Platform Advantage, Telecommunications
      [Diagram labels: CO-OCCURRENCE (MAHOUT); SOLR INDEXING; ETL; BILLING REPORTS; WEB TIER; DATA WAREHOUSE; CDR BILLING RECORDS; CUSTOMER BILLING; USER HISTORY; QUERY / CONTEXT; RECOMMENDATIONS; COMPLETE HISTORY (all users); ITEM META-DATA; INDEX SHARDS]
  45. MapR Data Platform Advantage, Clinical Genomics
      [Diagram labels: Epidemiological, Actuarial Analyses; Denormalization for Search, Viz, Research; ETL; Clinical Reporting; WEB TIER; Clinical Reporting Systems; CLINICAL TREATMENT OF PATIENTS; RESEARCHERS; National Pop. Database; INDEX SHARDS; Prognostic Capability]
  46. Bonus Round: 2º Analytics
  47. Clinical Genomics, Information Systems Perspective
      [Diagram labels: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics]
      • 2º analytics: Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  48. Matrices A (U*Q) and B (U*V)
      [Figure labels: A = Users × Query Terms; B = Users × Clicked Videos; Query Term = Clicked Term]
  49. Relate Q to V
      [Figure labels: Users; Query Terms]
  51. Relate Q to V: it’s a Cross-Recommender
      [Figure labels: Query Terms; Videos]
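One common way to relate Q to V from the two matrices on slide 48 is the cross-occurrence product AᵀB, which counts, for each query term and video, how many users did both. The slides do not spell out the computation, so treat the following as an assumed sketch: the data is invented, the helper names are hypothetical, and a production system would normally filter these raw counts with a significance test rather than use them directly.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


# Hypothetical interaction data for 4 users.
QUERY_TERMS = ["genome", "hadoop", "cats"]
VIDEOS = ["sequencing-101", "mapreduce-intro"]

# A (U*Q): A[u][q] == 1 if user u searched for query term q.
A = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]

# B (U*V): B[u][v] == 1 if user u clicked video v.
B = [
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]

# Cross-occurrence (Q*V): cross[q][v] = number of users who did both.
cross = matmul(transpose(A), B)


def best_video(term):
    """Video that co-occurs most often with the given query term."""
    row = cross[QUERY_TERMS.index(term)]
    return VIDEOS[row.index(max(row))]
```

With this data, `best_video("genome")` picks the video clicked by the users who searched for "genome", even though no user ever linked the two explicitly.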
  52. [Figure labels: Users; Query Terms]
  53. If they were unlabeled, would you know which is which?
      Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
      NPR. 2011. The Search For Analysts To Make Sense Of ‘Big Data’ http://www.npr.org/2011/11/30/142893065
  54. If they were unlabeled, would you know which is which?
      Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
      • Identify network structures
      • Label them
      • Observe stimulus=>response space mapping
      • Purposefully target
      • PROFIT ! ! ! !
