SlideShare a Scribd company logo
1 of 14
Download to read offline
FaST Algorithms for Extremely Large Genome-Wide Association Studies 
David Heckerman 
Christoph Lippert, Jennifer Listgarten, Bob Davidson, Carl Kadie 
Microsoft Research
eScience Research Group in MSR 
Tackling societal challenges with machine learning 
Vaccine design 
HIV, HCV, HRV 
Human genomics and personalized medicine 
Genome assembly of sugar cane
The Genomics Revolution
Personalized Medicine 
Use genetic markers to… 
•Understand causes of disease 
•Diagnose a disease 
•Infer propensity to get a disease 
•Predict favorable response to a drug 
•Predict bad reaction to a drug
GWAS: Genome-wide association studies 
C 
G 
C 
C 
C 
C 
G 
G 
C 
G 
G 
G 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… 
…actgccgcga 
actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa…
Genome-Wide Association Studies
GWAS issues 
•For common diseases, signals are spread out across the genome and weak 
•To pick up these signals, we need lots of data 
–deCode 
–23andMe 
–Kaiser 
•Large data  confounding 
–Multiple ethnicities 
–Related individuals
Challenge: Confounding factors 
•Suppose the set of cases has a different proportion of ethnicity X from control. 
•Suppose we use linear regression to look for SNP-trait correlations. 
•Then genetic markers that differ between X and other ethnicities in the study, Y, will appear artificially to be associated with disease. 
•Problem gets worse with more data. 
A 
A 
A 
A 
A 
A 
T 
A 
A 
A 
A 
A 
T 
T 
T 
T 
T 
T 
T 
T 
T 
T 
T 
T 
T 
A 
A 
A 
A 
A 
A 
A 
A 
A 
A 
A 
A 
A 
A 
A
One solution: Throw out data 
-Use one ethnicity 
-Discard data for individuals who are too closely related 
Bad idea: The most powerful comparisons are between closely- related individuals
FaST-LMM: Factored Spectrally Transformed Linear Mixed Models 
• 
• 
• 
nature | methods September 2011 
nature | methods June 2012 
FaST-LMM 
previous state of the art
Faster and more power: Really? 
•LMM captures confounding structure by computing similarities between all pairs of individuals 
•In this view, one would think all SNPs should be used 
•BUT: LMM is equivalent to performing linear regression with these SNPS as covariates 
•If someone gave you a million potential covariates, would you blindly use them all? 
•Solution: Variable selection (“feature selection”) 
•Result: ~100s of SNPs selected for similarity computation  less SNPs than individuals  O(N)
Application: Epistasis GWAS (SNP-SNP interactions) 
Seven common diseases: 
•Type I diabetes 
•Type II diabetes 
•Coronary artery disease 
•Hypertension 
•Crohn’s disease 
•Rheumatoid arthritis 
•Bipolar disorder With FaST-LMM and Azure, can look at all SNP pairs (about 60 billion of them) 1,000 compute years; 20 TB output – we did it in 13 days on Azure
New results for CAD 
PPARD 
LIMK1
Software freely available for academic use (just bing “FaST-LMM”)

More Related Content

Viewers also liked

Viewers also liked (15)

Fashion style is eternal
Fashion style is eternalFashion style is eternal
Fashion style is eternal
 
New Landscape Code 2011 - FINAL DRAFT2
New Landscape Code 2011 - FINAL DRAFT2New Landscape Code 2011 - FINAL DRAFT2
New Landscape Code 2011 - FINAL DRAFT2
 
Adapter Pattern (with Wreck It Ralph)
Adapter Pattern (with Wreck It Ralph)Adapter Pattern (with Wreck It Ralph)
Adapter Pattern (with Wreck It Ralph)
 
Thanushree
ThanushreeThanushree
Thanushree
 
HERRAMIENTAS EN LA APLICACIÓN DEL SISTEMA EDUCATIVO
HERRAMIENTAS EN LA APLICACIÓN DEL SISTEMA EDUCATIVOHERRAMIENTAS EN LA APLICACIÓN DEL SISTEMA EDUCATIVO
HERRAMIENTAS EN LA APLICACIÓN DEL SISTEMA EDUCATIVO
 
Dogs
DogsDogs
Dogs
 
тест знања Vi разред
тест знања Vi разредтест знања Vi разред
тест знања Vi разред
 
Passive voice
Passive voicePassive voice
Passive voice
 
My last vacations
My last vacationsMy last vacations
My last vacations
 
French_Beth Resume 2014
French_Beth Resume 2014French_Beth Resume 2014
French_Beth Resume 2014
 
brahmaiah Resume
brahmaiah  Resumebrahmaiah  Resume
brahmaiah Resume
 
Social Media Editorial Calendar
Social Media Editorial CalendarSocial Media Editorial Calendar
Social Media Editorial Calendar
 
CONTROL SYSTEM LAB MANUAL
CONTROL SYSTEM LAB MANUALCONTROL SYSTEM LAB MANUAL
CONTROL SYSTEM LAB MANUAL
 
Basics Electronics Engineering ppt.
Basics Electronics Engineering ppt.Basics Electronics Engineering ppt.
Basics Electronics Engineering ppt.
 
PowerPoint Thuyết trình Văn :)
PowerPoint Thuyết trình Văn :)PowerPoint Thuyết trình Văn :)
PowerPoint Thuyết trình Văn :)
 

Similar to MSR david-heckerman_genomics

ConstructPrecisePhenotypesBigDataChallenge
ConstructPrecisePhenotypesBigDataChallengeConstructPrecisePhenotypesBigDataChallenge
ConstructPrecisePhenotypesBigDataChallenge
Athula Herath
 
Systems Genetics of Cancer - big data and all that
Systems Genetics of Cancer - big data and all thatSystems Genetics of Cancer - big data and all that
Systems Genetics of Cancer - big data and all that
Florian Markowetz
 
Visualization Approaches for Biomedical Omics Data: Putting It All Together
Visualization Approaches for Biomedical Omics Data: Putting It All TogetherVisualization Approaches for Biomedical Omics Data: Putting It All Together
Visualization Approaches for Biomedical Omics Data: Putting It All Together
Nils Gehlenborg
 

Similar to MSR david-heckerman_genomics (20)

Amia tb-review-13
Amia tb-review-13Amia tb-review-13
Amia tb-review-13
 
ConstructPrecisePhenotypesBigDataChallenge
ConstructPrecisePhenotypesBigDataChallengeConstructPrecisePhenotypesBigDataChallenge
ConstructPrecisePhenotypesBigDataChallenge
 
Lecture 7 gwas full
Lecture 7 gwas fullLecture 7 gwas full
Lecture 7 gwas full
 
Construct precisephenotypesbigdatachallenge
Construct precisephenotypesbigdatachallengeConstruct precisephenotypesbigdatachallenge
Construct precisephenotypesbigdatachallenge
 
Amia tb-review-08
Amia tb-review-08Amia tb-review-08
Amia tb-review-08
 
Amia tb-review-15
Amia tb-review-15Amia tb-review-15
Amia tb-review-15
 
Crowdsourcing the Analysis of Genomes
Crowdsourcing the Analysis of GenomesCrowdsourcing the Analysis of Genomes
Crowdsourcing the Analysis of Genomes
 
Voskarides 2nd aging symposium-unic 240514
Voskarides 2nd aging symposium-unic 240514Voskarides 2nd aging symposium-unic 240514
Voskarides 2nd aging symposium-unic 240514
 
Sundaram et al. 2018 Presentation
Sundaram et al. 2018 PresentationSundaram et al. 2018 Presentation
Sundaram et al. 2018 Presentation
 
Dr. Christopher Yau (University of Birmingham) - Data-driven systems medicine
Dr. Christopher Yau (University of Birmingham) - Data-driven systems medicineDr. Christopher Yau (University of Birmingham) - Data-driven systems medicine
Dr. Christopher Yau (University of Birmingham) - Data-driven systems medicine
 
Systems Genetics of Cancer - big data and all that
Systems Genetics of Cancer - big data and all thatSystems Genetics of Cancer - big data and all that
Systems Genetics of Cancer - big data and all that
 
Math, Stats and CS in Public Health and Medical Research
Math, Stats and CS in Public Health and Medical ResearchMath, Stats and CS in Public Health and Medical Research
Math, Stats and CS in Public Health and Medical Research
 
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
Prediction, Big Data, and AI: Steyerberg, Basel Nov 1, 2019
 
Use of data
Use of dataUse of data
Use of data
 
Predicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learningPredicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learning
 
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
Machine Learning in Biology and Why It Doesn't Make Sense - Theo Knijnenburg,...
 
ICBO 2014, October 8, 2014
ICBO 2014, October 8, 2014ICBO 2014, October 8, 2014
ICBO 2014, October 8, 2014
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례
 
Visualization Approaches for Biomedical Omics Data: Putting It All Together
Visualization Approaches for Biomedical Omics Data: Putting It All TogetherVisualization Approaches for Biomedical Omics Data: Putting It All Together
Visualization Approaches for Biomedical Omics Data: Putting It All Together
 
NCI HTAN, cancer trajectories, precision oncology
NCI HTAN, cancer trajectories, precision oncologyNCI HTAN, cancer trajectories, precision oncology
NCI HTAN, cancer trajectories, precision oncology
 

MSR david-heckerman_genomics

  • 1. FaST Algorithms for Extremely Large Genome-Wide Association Studies David Heckerman Christoph Lippert, Jennifer Listgarten, Bob Davidson, Carl Kadie Microsoft Research
  • 2. eScience Research Group in MSR Tackling societal challenges with machine learning Vaccine design HIV, HCV, HRV Human genomics and personalized medicine Genome assembly of sugar cane
  • 4. Personalized Medicine Use genetic markers to… •Understand causes of disease •Diagnose a disease •Infer propensity to get a disease •Predict favorable response to a drug •Predict bad reaction to a drug
  • 5. GWAS: Genome-wide association studies C G C C C C G G C G G G …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa… …actgccgcga actgccgcgaggcctgactggcatccagtttagcggaactgccgcaa…
  • 7. GWAS issues •For common diseases, signals are spread out across the genome and weak •To pick up these signals, we need lots of data –deCode –23andMe –Kaiser •Large data  confounding –Multiple ethnicities –Related individuals
  • 8. Challenge: Confounding factors •Suppose the set of cases has a different proportion of ethnicity X from control. •Suppose we use linear regression to look for SNP-trait correlations. •Then genetic markers that differ between X and other ethnicities in the study, Y, will appear artificially to be associated with disease. •Problem gets worse with more data. A A A A A A T A A A A A T T T T T T T T T T T T T A A A A A A A A A A A A A A A
  • 9. One solution: Throw out data -Use one ethnicity -Discard data for individuals who are too closely related Bad idea: The most powerful comparisons are between closely- related individuals
  • 10. FaST-LMM: Factored Spectrally Transformed Linear Mixed Models • • • nature | methods September 2011 nature | methods June 2012 FaST-LMM previous state of the art
  • 11. Faster and more power: Really? •LMM captures confounding structure by computing similarities between all pairs of individuals •In this view, one would think all SNPs should be used •BUT: LMM is equivalent to performing linear regression with these SNPS as covariates •If someone gave you a million potential covariates, would you blindly use them all? •Solution: Variable selection (“feature selection”) •Result: ~100s of SNPs selected for similarity computation  less SNPs than individuals  O(N)
  • 12. Application: Epistasis GWAS (SNP-SNP interactions) Seven common diseases: •Type I diabetes •Type II diabetes •Coronary artery disease •Hypertension •Crohn’s disease •Rheumatoid arthritis •Bipolar disorder With FaST-LMM and Azure, can look at all SNP pairs (about 60 billion of them) 1,000 compute years; 20 TB output – we did it in 13 days on Azure
  • 13. New results for CAD PPARD LIMK1
  • 14. Software freely available for academic use (just bing “FaST-LMM”)