Human Genetics & Big Data [sans Ethics]
 


Genes => Traits => Behaviors => Fitness => Genes


Guest lecture @ University of Sydney, May 2014

  • The genomic position (x-axis) of probesets within a 6-megabase region centered on TTN, a gene known to be associated with LGMD2, is plotted against the Pearson correlation coefficient (y-axis) with a list of probesets targeting other genes known to be associated with LGMD2 (excluding TTN), across 11,636 HG-U133_Plus_2 microarrays. Solid circles: probesets targeting TTN; [marker symbol]: probesets for genes of unknown function; open circles: probesets for known genes in the interval.
  • Allen: this is the transitional slide from talking about more than one input to one step further: cross-recommendation. I doubt you want to use it as is, but I’ve included it FYI.
  • Allen: additional transitional slide
  • Allen: What do you plan to say about this? General example without anything proprietary?

Human Genetics & Big Data [sans Ethics]: Presentation Transcript

  • © 2014 MapR Technologies 1
  • © 2014 MapR Technologies 2 Biomedical Research Goal: Improve Fitness
    Therapeutics => Diagnostics => Prognostics
      • Therapeutics => traditional medicine
      • Diagnostics => personalized medicine
          – NextGen public health
          – Requires hi-res mechanical knowledge
          – Reverse engineer how genetic variation leads to (un)desired traits
      • Prognostics => GATTACA (dys/eu)topia
          – Managed populations / NextGen eugenics
  • © 2014 MapR Technologies 3 Biomedical & Advertising Tech: Overarching Themes*
      • Eugenics & Determinism
      • Free will vs. Determinism
      • Media Tech & Privacy
    *Obligatory movie references… shout-out to my hometown LA
  • © 2014 MapR Technologies 4 Star Wars III: Revenge of the Sith
  • © 2014 MapR Technologies 5 Star Wars V: The Empire Strikes Back
  • © 2014 MapR Technologies 6 Health ~ Fitness Genes => Traits => Behaviors => Fitness
  • © 2014 MapR Technologies 7
  • © 2014 MapR Technologies 8 Human Genetics & Big Data / Human Genetics & Ethics: today we talk about technology
  • © 2014 MapR Technologies 9 Me, Us
      • Allen Day, Principal Data Scientist, MapR
          – 5yr Hadoop Dev, R project contributor
          – PhD, Human Genetics, UCLA Medicine
      • MapR
          – Distributes open source components for Hadoop
          – Adds major technology for performance, HA, industry standard APIs
      • See Also
          – “allenday” most places (twitter, github, etc.)
          – @mapR
  • © 2014 MapR Technologies 10 Genetic Basis of Facial Features: self-reported values of {sex, ancestry} + observer scores of {race, sex} + 3D facial scan + genome scan => allelic model of 20 genes that determine facial characteristics. Claes, et al. 2014. Modeling 3D Facial Shape from DNA
  • © 2014 MapR Technologies 11 Genetic Basis of Facial Features Claes, et al. 2014. Modeling 3D Facial Shape from DNA
  • © 2014 MapR Technologies 12 So Get Ready… www.theness.com
  • © 2014 MapR Technologies 13 DTRA102-007 – Forensic DNA Analysis Kit for Genetic Intelligence • Sex • Blood type • Ancestry • Hair morphology • Dimples • Freckles • Shoe size • Flat-footedness • Vision correction • Ear lobe attachment • Ear lobe crease • 5th digit clinodactyly • Eye color, hair color, skin color • Height, handedness • Etc https://sbirsource.com/grantiq#/topics/85383
  • © 2014 MapR Technologies 14 DTRA102-007: Sex and Ancestry
  • © 2014 MapR Technologies 15 Trends & Events
  • © 2014 MapR Technologies 16 Trends and Events: Even Moore’s Law [figure: Storage (MB/$) and DNA (bp/$) over time, annotated with ILMN HiSeq XTen (Jan 2014) and the $1000 Genome] “Even Moore’s” begins in 2004 with Solexa (acquired by ILMN 2007). Stein. 2010. The case for cloud computing in genome informatics
  • © 2014 MapR Technologies 17 Trends and Events: US Federal Funding [figure: NIH Research Funding Trends, with a “You are here” marker] http://www.faseb.org/Policy-and-Government-Affairs/Data-Compilations/NIH-Research-Funding-Trends.aspx
  • © 2014 MapR Technologies 18 More Data Less Federal $
  • © 2014 MapR Technologies 19 Trends and Events: The $1000 Genome
      • Physicians want to use patient genomes to improve care
      • Scientists say personalized medicine breakthroughs require hundreds of thousands to millions of genomes
      • Healthcare mandates efficacy and efficiency (early majority)
    These forces converge at $1000 for a clinically usable genome
  • © 2014 MapR Technologies 20 Trends and Events: ILMN HiSeq XTen Specs
      • Sold in sets of 10 units ONLY (XTen = 10 sequencers)
          – ~ $10 million/XTen, shipments began in Jan 2014
      • XTen produces 600 GBases/day @ 30x oversampling
          = 1.8 TBases per 3-day cycle
          = 54 TBytes per 3-day cycle
          = $1000 per genome
          = 18,000 genomes/year/XTen
      • ~ 4,000,000 births/year (US, 2012)
    => Neonatal sequencing is a reality (with 200 of today’s systems)
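The neonatal-sequencing estimate on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide (18,000 genomes/year/XTen, about 4,000,000 US births, $1000 per genome, about $10 million per XTen). The constant and function names are illustrative, and treating the $1000 figure as the full per-genome cost is an assumption.

```python
import math

# Figures quoted on the slide; not independently verified here.
GENOMES_PER_YEAR_PER_XTEN = 18_000
US_BIRTHS_PER_YEAR = 4_000_000       # US, 2012, as quoted
COST_PER_GENOME_USD = 1_000          # assumed to be the full per-genome cost
PRICE_PER_XTEN_USD = 10_000_000      # "~ $10 million/XTen"
CYCLE_DAYS = 3


def xten_sets_needed(births, genomes_per_set):
    """Smallest whole number of XTen sets that can sequence every birth."""
    return math.ceil(births / genomes_per_set)


sets_needed = xten_sets_needed(US_BIRTHS_PER_YEAR, GENOMES_PER_YEAR_PER_XTEN)
capital_usd = sets_needed * PRICE_PER_XTEN_USD
sequencing_usd_per_year = US_BIRTHS_PER_YEAR * COST_PER_GENOME_USD
genomes_per_cycle = GENOMES_PER_YEAR_PER_XTEN / (365 / CYCLE_DAYS)

print(f"XTen sets needed:            {sets_needed}")
print(f"Capital outlay (USD):        {capital_usd:,}")
print(f"Sequencing cost/year (USD):  {sequencing_usd_per_year:,}")
print(f"Genomes per 3-day cycle:     {genomes_per_cycle:.0f}")
```

Rounding up gives a little over the slide's round figure of 200 systems, which is the same order of magnitude and supports the slide's point.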
  • © 2014 MapR Technologies 21 Summary: Major Impact on Social Fabric
    Soon to be gone:
      • Muscular dystrophy
      • Cystic fibrosis
      • Albinism
      • PKU (phenylketonuria)
      • Hemophilia
      • Huntington’s Disease (keep?)
      • Paternity Tests => http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in
    Fact: US paternity fraud rate is 1 in 25
    http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979
  • © 2014 MapR Technologies 22 Summary: A Perfect Storm
      • LESS public funding (NIH)
      • MORE DNA sequencing efficiency (HiSeq XTen)
      • Predicted DNA sequencing demand VALIDATED (medicine)
      • MORE VC investment ($1000/genome force confluence)
      • DNA sequencing capacity consolidating into genome “factories” (e.g. Broad, ILMN)
    => REQUIRES new infrastructure
  • © 2014 MapR Technologies 23 The Evolving Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think! <= 1º analytics “current high ROI use cases” <= 2º analytics “next-gen high ROI use cases”
  • © 2014 MapR Technologies 24 The Evolving Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think! <= 1º analytics “current high ROI use cases” <= 2º analytics “next-gen high ROI use cases”
  • © 2014 MapR Technologies 25 Clinical Application of Human Genetics
  • © 2014 MapR Technologies 26 Clinical Sequencing Business Process Workflow [diagram labels: Physician; Patient; Clinic; blood/saliva; Clinical Lab; Analytics; extract]
  • © 2014 MapR Technologies 27 One Bad MTHFR: MTHFR C677T. Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
  • © 2014 MapR Technologies 28 One Bad MTHFR: MTHFR C677T. Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
  • © 2014 MapR Technologies 29 Clinical Sequencing Business Process Workflow [diagram labels: Physician; Patient; Clinic; blood/saliva; Clinical Lab; Analytics; extract]
  • © 2014 MapR Technologies 30 Clinical Genomics, Information Systems Perspective [diagram labels: Compressed Structured Base4 Data; Uncompressed Unstructured Base2 Data; extract; Base4=>Base2 Converter [[ DE-STRUCTURES ]]; “BI” Reporting and Visualization tools; Physician; Patient; Analyst; Stakeholder]
  • © 2014 MapR Technologies 31 Clinical Genomics, Information Systems Perspective [diagram labels: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics]
  • © 2014 MapR Technologies 32 Clinical Genomics, Information Systems Perspective [diagram labels: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics; 1º analytics; 2º analytics] Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  • © 2014 MapR Technologies 33 1º Analytics: Why MapReduce?
  • © 2014 MapR Technologies 34 Clinical Genomics, Information Systems Perspective [diagram labels: Physician; Patient; Analyst; Stakeholder; ETL; Reporting and Viz; Data Store; Analytics; 1º analytics; 2º analytics] see also: http://slidesha.re/1sC2BOX
  • © 2014 MapR Technologies 35 The Essence of the Problem: What is the (Probable) Color of Each Column?
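One possible reading of this slide, assuming the "columns" stand for genome positions and the "colors" for the bases reported by overlapping sequencing reads, is a per-position consensus call. That maps naturally onto MapReduce: emit (position, base) pairs, group by position, then pick the most probable base per group. The sketch below is a toy single-process version; the reads and function names are hypothetical and not taken from the talk.

```python
from collections import Counter, defaultdict

# Hypothetical aligned reads as (start position, bases); not from the slides.
READS = [
    (0, "ACGT"),
    (1, "CGTT"),
    (2, "GTTA"),
    (2, "GATA"),   # disagrees with the other reads at position 3
    (3, "TTAC"),
]


def map_read(start, bases):
    """Map step: emit one (position, base) pair per aligned base."""
    for offset, base in enumerate(bases):
        yield start + offset, base


def reduce_column(bases):
    """Reduce step: most frequent base in a column and its observed share."""
    base, count = Counter(bases).most_common(1)[0]
    return base, count / len(bases)


def call_consensus(reads):
    """Group emitted pairs by position (the shuffle), then reduce each column."""
    columns = defaultdict(list)
    for start, bases in reads:
        for position, base in map_read(start, bases):
            columns[position].append(base)
    return {pos: reduce_column(col) for pos, col in sorted(columns.items())}


calls = call_consensus(READS)
consensus = "".join(base for base, _ in calls.values())
print(consensus)
print(calls[3])
```

Because each column is reduced independently, the work partitions cleanly by genome position, which is presumably the appeal of MapReduce for this step.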
  • © 2014 MapR Technologies 36 Next-Gen Human Genetics – Population Scale
  • © 2014 MapR Technologies 37 The Evolving Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think! <= 2º analytics “next-gen high ROI use cases”
  • © 2014 MapR Technologies 38 MapR Data Platform Advantage, Clinical Genomics [diagram labels: Epidemiological, Actuarial Analyses; Denormalization for Search, Viz, Research; ETL; Clinical Reporting; WEB TIER; Clinical Reporting Systems; CLINICAL TREATMENT OF PATIENTS; RESEARCHERS; National Pop. Database; INDEX SHARDS; Prognostic Capability]
  • © 2014 MapR Technologies 39 Co-expression (10K samples) + Linkage => Gene Annotation / Set Completion [figure: gene symbols listed along both axes, e.g. BMP6, BMP2, MMP3, ACAN, COL11A2, SOX5] Disease gene characterization through large-scale co-expression analysis. http://www.ncbi.nlm.nih.gov/pubmed/20046828
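The co-expression idea can be sketched as follows: score a candidate gene by how strongly its expression profile correlates, across many samples, with genes already known to be associated with a disease. The data below are synthetic, and averaging the Pearson coefficients is an assumption made for illustration; the cited paper's exact scoring may differ.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_correlation(profile, references):
    """Average Pearson r between one profile and each reference profile."""
    return sum(pearson(profile, ref) for ref in references) / len(references)


random.seed(0)
N_SAMPLES = 200  # stand-in for the ~10K arrays mentioned on the slide

# Synthetic data: "known disease genes" share one latent expression signal.
signal = [random.gauss(0, 1) for _ in range(N_SAMPLES)]


def noisy_copy(values, noise_sd=0.5):
    """Return the values with independent Gaussian noise added."""
    return [v + random.gauss(0, noise_sd) for v in values]


known_genes = [noisy_copy(signal) for _ in range(4)]
candidates = {
    "candidate_coexpressed": noisy_copy(signal),
    "candidate_unrelated": [random.gauss(0, 1) for _ in range(N_SAMPLES)],
}

scores = {name: mean_correlation(p, known_genes) for name, p in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:+.2f}")
```

In a positional setting, candidates in a linkage interval would be ranked this way and the top-scoring ones prioritized for follow-up.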
  • © 2014 MapR Technologies 40 If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building. NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data'. http://www.npr.org/2011/11/30/142893065
  • © 2014 MapR Technologies 41 If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
      • Identify network structures
      • Label them
      • Observe stimulus=>response space mapping
      • Purposefully target
      • $$$$
    Twitter’s Business Model
  • © 2014 MapR Technologies 42 These are Linear Algebra / Machine Learning Problems
  • © 2014 MapR Technologies 43 A Quick Digression: Recommender Systems
  • © 2014 MapR Technologies 44 HOW RECOMMENDATIONS WORK Behavior of a crowd helps us understand what individuals will do
  • © 2014 MapR Technologies 45 History Matrix (A) [table: rows Alice, Bob, Charles; seven ✔ marks; column layout not preserved]
  • © 2014 MapR Technologies 46 Co-occurrence Matrix (AᵀA) [matrix entries as extracted: 1 2 1 1 1 1 2 1; layout not preserved]
  • © 2014 MapR Technologies 47 <Normalize> (filter to identify only unusual co-occurrences)
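The three steps above (history matrix, co-occurrence matrix, normalization) can be sketched in plain Python. The history matrix here is hypothetical, since the slide's actual layout is not recoverable. The slides also do not name the normalization; the log-likelihood ratio (LLR) score used below is one common choice and should be read as an assumption.

```python
import math

USERS = ["Alice", "Bob", "Charles"]
# Hypothetical history matrix A: rows = users, columns = items, 1 = interaction.
A = [
    [1, 1, 1, 0],  # Alice
    [1, 0, 1, 0],  # Bob
    [0, 1, 0, 1],  # Charles
]


def cooccurrence(a):
    """A^T A: entry [i][j] counts users who interacted with both item i and item j."""
    n_items = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n_items)]
            for i in range(n_items)]


def xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score of a 2x2 contingency table (0 = independent)."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def llr_for_pair(cooc, i, j, n_users):
    """Build the 2x2 table for items i and j from the co-occurrence counts."""
    both = cooc[i][j]
    only_i = cooc[i][i] - both
    only_j = cooc[j][j] - both
    neither = n_users - both - only_i - only_j
    return llr(both, only_i, only_j, neither)


C = cooccurrence(A)
for row in C:
    print(row)
print(llr_for_pair(C, 0, 2, len(USERS)), llr_for_pair(C, 0, 1, len(USERS)))
```

Raw counts favor popular items; a score such as LLR keeps only pairs that co-occur more surprisingly than their individual popularity would predict, which is the "unusual co-occurrences" filter described on the slide.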
  • © 2014 MapR Technologies 48 HOW CROSS-RECOMMENDATIONS WORK Behavior of a crowd helps us understand what individuals will do
  • © 2014 MapR Technologies 49 Example Multi-modal Inputs • Overlap in restaurant visits is useful • Big spender cues • Cuisine as an indicator • Review text as an indicator
  • © 2014 MapR Technologies 50 People do more than one kind of thing
      • Different kinds of behaviors give different quality, quantity and kind of information
          – Restaurant visits
          – Movie reviews
      • We don’t have to do co-occurrence
      • We can do cross-occurrence
      • Result is cross-recommendation
  • © 2014 MapR Technologies 51 For example
      • Users enter queries (A)
          – (actor = user, item = query)
      • Users view videos (B)
          – (actor = user, item = video)
      • AᵀA gives query recommendation
          – “did you mean to ask for”
      • BᵀB gives video recommendation
          – “you might like these videos”
  • © 2014 MapR Technologies 52 The punch-line
      • BᵀA recommends videos in response to a query
          – (isn’t that a search engine?)
          – (not quite, it doesn’t look at content or meta-data)
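A minimal sketch of the cross-occurrence product, assuming A is a users-by-queries matrix and B is a users-by-videos matrix over the same users. All query names, video names, and matrix values are hypothetical, and the raw counts are left unnormalized to keep the example short.

```python
QUERIES = ["paco de lucia", "star wars", "hadoop"]          # hypothetical
VIDEOS = ["flamenco_guitar", "spanish_guitar", "space_opera"]  # hypothetical

# A: users x queries (1 = user issued the query)
A = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
]
# B: users x videos (1 = user viewed the video); same user order as A
B = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]


def cross_occurrence(b, a):
    """B^T A: entry [v][q] counts users who issued query q and viewed video v."""
    return [[sum(rb[v] * ra[q] for rb, ra in zip(b, a)) for q in range(len(a[0]))]
            for v in range(len(b[0]))]


def recommend_videos(query, bta, top_n=2):
    """Rank videos by their cross-occurrence count with the given query."""
    q = QUERIES.index(query)
    scored = [(bta[v][q], VIDEOS[v]) for v in range(len(VIDEOS)) if bta[v][q] > 0]
    scored.sort(key=lambda pair: -pair[0])
    return [name for _, name in scored[:top_n]]


BtA = cross_occurrence(B, A)
print(BtA)
print(recommend_videos("paco de lucia", BtA))
```

Nothing in the lookup inspects video titles or metadata; the link between a query and a video comes entirely from users who did both, which is the distinction from a conventional search engine drawn on the slide.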
  • © 2014 MapR Technologies 53 Real-life example
      • Query: “Paco de Lucia”
      • Conventional meta-data search results:
          – “hombres del paco” times 400
          – not much else
      • Recommendation based search:
          – Flamenco guitar and dancers
          – Spanish and classical guitar
          – Van Halen doing a classical/flamenco riff
  • © 2014 MapR Technologies 54 Real-life example
  • © 2014 MapR Technologies 55 Previous Click Histories [diagram labels: user1 to user5; items 1 to 8]
  • © 2014 MapR Technologies 56 Detect similar content: 2 & 8 [diagram labels: user1 to user5; items 1 to 8]
  • © 2014 MapR Technologies 57 Call to Action – Request Clicks. Show me more: sports, comedy, technology [diagram labels: user1 to user5; items 1 to 8; “Under Construction”]
  • © 2014 MapR Technologies 58 Build Navigational Ontology (estimate content labels): 4=sports; 2 & 7=comedy. Show me more: sports, comedy, technology [diagram labels: user1 to user5; items 1 to 8; “Under construction”]
  • © 2014 MapR Technologies 59 Matrices A (U*Q) and B (U*V): A = Users * Query Terms; B = Users * Clicked Videos [diagram label: Query Term = Clicked Term]
  • © 2014 MapR Technologies 60 Relate Q to V Users Query Terms
  • © 2014 MapR Technologies 61 Relate Q to V Users Query Terms
  • © 2014 MapR Technologies 62 Users Query Terms
  • © 2014 MapR Technologies 63 Relate Q to V: it’s a Cross-Recommender QueryTerms Videos
  • © 2014 MapR Technologies 64 Population-level Inference
  • © 2014 MapR Technologies 65 Typical Dimensions in Genetics/Medicine • Genotype • Gene Expression • Samples • Phenotypes (traits/behavior)
  • © 2014 MapR Technologies 66 Typical Dimensions in Behavioral Data • Genotype • Gene Expression • Samples Individuals • Phenotype – Traits – Behaviors
  • © 2014 MapR Technologies 67 Incidence/Co-occurrence in Behavioral Data
      • Individual * Individual
          – Genealogy
      • Trait * Behavior => [Netflix]
          – User/Content Topic Modeling
      • Genotype * Behavior => [Psychometrics]
          – Genetics of personality, intelligence, aptitude
      • Behavior * Outcome => [Korn/Ferry]
          – Job effectiveness
      • Phenotype (trait/behavior) * Outcome => [eHarmony]
          – Reproductive fitness
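The same matrix products carry over to the dimension pairs listed above. As a rough sketch, assuming a binary genotype matrix G (individuals by variants) and a binary behavior matrix H (individuals by behaviors), G Gᵀ counts variants shared by each pair of individuals, and Gᵀ H counts carriers of each variant who show each behavior. All data and names below are hypothetical, and raw counts like these are not evidence of association on their own; they would still need normalization and appropriate statistical controls.

```python
def transpose(m):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul_t(x, y):
    """X^T Y for two matrices that share their row dimension."""
    return [[sum(rx[i] * ry[j] for rx, ry in zip(x, y)) for j in range(len(y[0]))]
            for i in range(len(x[0]))]


# Hypothetical data; rows are individuals in the same order in both matrices.
GENOTYPE = [  # columns: variant_a, variant_b, variant_c (1 = carrier)
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
]
BEHAVIOR = [  # columns: behavior_x, behavior_y (1 = observed)
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
]

# Individual * Individual: number of variants shared by each pair (G G^T).
shared_variants = matmul_t(transpose(GENOTYPE), transpose(GENOTYPE))

# Genotype * Behavior: carriers of each variant who show each behavior (G^T H).
genotype_by_behavior = matmul_t(GENOTYPE, BEHAVIOR)

print(shared_variants)
print(genotype_by_behavior)
```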
  • © 2014 MapR Technologies 68 Traits and Behaviors: Content Topic Modeling / UX Personalization
  • © 2014 MapR Technologies 69 Behaviors and Outcomes: Economic Fitness (Korn/Ferry) Korn/Ferry ProSpective http://linkedin.kornferry.com Allen =>
  • © 2014 MapR Technologies 70 Genes Job Performance
  • © 2014 MapR Technologies 71 (Traits/Behaviors) and Outcomes Reproductive Fitness (eHarmony) eHarmony @ Hadoop World: Data Science of Love http://eharmony.com
  • © 2014 MapR Technologies 72 Genes Reproductive Outcomes
  • © 2014 MapR Technologies 73 Genes => Traits => Behaviors => Fitness [diagram labels: Job Performance; Psychometrics; Movie Preferences; Medicine; Forensics]
  • © 2014 MapR Technologies 74 Genes => Traits => Behaviors => Fitness [diagram labels: Job Performance; Psychometrics; Movie Preferences; Medicine; Forensics; Fitness; Reproductive Outcomes]
  • © 2014 MapR Technologies 75
  • © 2014 MapR Technologies 76 ENCODE http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312
  • © 2014 MapR Technologies 77 Robot Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • © 2014 MapR Technologies 78 Robot (Data?) Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • © 2014 MapR Technologies 79 Thanks