A short and naive introduction to epistasis in association studies

EpiFun network, SelGen metaprogramme, INRA
June 1st, Paris

Published in: Science

  1. 1. A short and naive introduction to epistasis in association studies Nathalie Villa-Vialaneix nathalie.villa-vialaneix@inra.fr http://www.nathalievilla.org EpiFun June 1st, 2018 - Paris Nathalie Villa-Vialaneix | Epistasis and GWAS 1/23
  3. 3. What is this presentation about? Standard GWAS: disease vs. healthy. What we are interested in: epistasis, where the interaction between two (or more) SNPs influences the phenotype although no single SNP does on its own. How can we detect SNP/SNP and gene/gene interactions?
  4. 4. Everything is easier with a picture
  6. 6. Disclaimer: this presentation is naive (but hopefully comprehensive); it aims at giving an overview rather than precise directions; it might contain errors, overclaims, missing references, badly understood concepts... to keep you awake. Two main reviews used to make these slides: [Neil et al., 2015, Stanislas, 2017, Emily, 2018]. Material: references at the end of the slides; these slides on my website, http://www.nathalievilla.org/seminars2018.html; most articles available online at http://nextcloud.nathalievilla.org/index.php/s/VLlheqpwhwD8eeZ (ask me to be granted write rights).
  7. 7. Evidence for epistasis. 1. Missing heritability: in GWAS, only a small part of the heritability of the phenotype is explained by the detected variants (with a “one locus at a time” strategy). 2. Small effect size of most SNPs. 3. Possible explanation from an evolutionary perspective: epistasis yields robust systems that are resistant to variations.
  9. 9. (a bit) More formal definition(s)... no consensus on the definition...! Biology/statistics [Neil et al., 2015]. Biological point of view: (originally) the effect of an allele at a given locus is hidden by the effect of another allele at a second locus; (more recently) the effect of an allele at a given locus depends on the presence or absence of a genetic variant at another locus. Statistical point of view [Fisher, 1918]: departure from additive effects of genetic variants with respect to their global contribution to the phenotype.
  11. 11. (a bit) More formal definition(s)... no consensus on the definition...! [Emily, 2018]: in the case of a phenotype $Y \in \{0, 1\}$ (cases and controls) and two loci with variants $\{A, a\}$ and $\{B, b\}$ respectively, definitions of epistasis can be stated: at the allele ((A, B), (A, b), (a, B), (a, b)) or genotype ((AA, BB), (Aa, BB), (Aa, Bb), ...) level; as a statistical definition (departure from linearity measured by odds ratios between cases and controls) or a biological one (measures of association assumed to be equal in cases and controls).
  12. 12. Back to pictures. [Figure; panel titles: “G hides effect of B”, “interaction”, “independent effects”; labels: “original definition”, “extension”, “lack of epistasis?”] [Cordell, 2002]: the statistical definition is less ambiguous, even though it is often hard to interpret from a biological point of view.
  13. 13. Challenges for epistasis detection. Statistical: “small n, large p” problems (at least at genome scale). Computational: complexity linear in n but combinatorial in p (the number of candidate interactions grows exponentially with the interaction order). Biological: gap between statistical and biological (functional) interpretations.
  14. 14. 1. A tentative definition; 2. SNP-SNP approaches; 3. SNPset-SNPset approaches; 4. GWAS
  16. 16. Background. Purpose: given two loci $X_1$ and $X_2$ (allelic or genotype level), how to detect their epistatic effect on $Y$ (cases/controls)? 1. Regression-based methods (mostly linear); 2. comparison of correlations in cases and controls (or odds-ratio differences); 3. information-theory-based methods. Other approaches, based on ROC analysis for instance, are not discussed.
  18. 18. Regression based methods. 1. {stat, allele} PLINK [Purcell et al., 2007]: logistic regression $\operatorname{logit} P(Y = 1 \mid (x_1, x_2)) = \alpha + \beta\, \mathbb{I}_{\{x_1 = A\}} + \gamma\, \mathbb{I}_{\{x_2 = B\}} + \delta\, \mathbb{I}_{\{(x_1, x_2) = (A, B)\}}$, where the terms in $\beta$ and $\gamma$ are the additive effects and the term in $\delta$ is the departure from additivity, and test of $\delta = 0$ (genotypic version in [Cordell, 2002]). 2. {stat, geno} BOOST [Wan et al., 2010]: Poisson GLM (same approach with a count model and boolean computations). Limits: computational optimization of the maximum likelihood (can be numerically unstable or difficult); only linear interactions.
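A minimal sketch of the test of δ = 0 for the allelic model above, in Python with the standard library only. It rests on one assumption worth stating: with two binary indicators and their product, the logistic model is saturated, so its maximum likelihood estimates can be read off the table of counts without iterative optimization. This is not how PLINK or BOOST are implemented; the function name and the toy counts are hypothetical.

```python
import math


def interaction_wald_test(cases, controls):
    """Wald test of delta = 0 in the saturated allelic logistic model.

    cases, controls: dicts mapping (i1, i2) to a count, where
    i1 = 1 if x1 == A else 0 and i2 = 1 if x2 == B else 0.
    Returns (delta_hat, wald_statistic, p_value).
    """
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    if any(cases[c] <= 0 or controls[c] <= 0 for c in cells):
        raise ValueError("every cell needs a positive count")
    # In a saturated model the fitted log-odds equal the observed ones.
    log_odds = {c: math.log(cases[c] / controls[c]) for c in cells}
    # (0,0): alpha, (1,0): alpha+beta, (0,1): alpha+gamma,
    # (1,1): alpha+beta+gamma+delta, hence delta is a double difference.
    delta = (log_odds[(1, 1)] - log_odds[(1, 0)]
             - log_odds[(0, 1)] + log_odds[(0, 0)])
    # Asymptotic variance of delta_hat: sum of reciprocal cell counts.
    variance = sum(1 / cases[c] + 1 / controls[c] for c in cells)
    wald = delta ** 2 / variance
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(wald / 2))
    return delta, wald, p_value


# Hypothetical counts: carriers of both A and B are over-represented in cases.
toy_cases = {(0, 0): 10, (0, 1): 10, (1, 0): 10, (1, 1): 40}
toy_controls = {(0, 0): 10, (0, 1): 10, (1, 0): 10, (1, 1): 10}
delta_hat, wald_stat, p = interaction_wald_test(toy_cases, toy_controls)
print(f"delta = {delta_hat:.3f}, W = {wald_stat:.3f}, p = {p:.3f}")
```

The closed form only holds for this two-indicator coding; a genotypic model with covariates would need a proper iterative fit, which is where the numerical difficulties mentioned on the slide appear.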
  20. 20. Wald-like test methods. Principle: test “$H_0$: $W = 0$” for a $W$ that measures “association” between $X_1$ and $X_2$ for the outcome $Y$, where (usually) $W \sim \chi^2$ under $H_0$. Example (the simplest): [Zhao et al., 2006] {bio, allele} $W = \frac{(r_1 - r_0)^2}{\operatorname{Var}(r_1) + \operatorname{Var}(r_0)}$ where $r_k = \operatorname{Cor}(\mathbb{I}_{\{X_1 = A\}}, \mathbb{I}_{\{X_2 = B\}} \mid Y = k)$. Other approaches are based on odds ratios [Emily, 2002] {bio, geno}.
  22. 22. Entropy based methods. Methods based on information theory [Shannon, 1948] (powerful to catch nonlinear interactions). Mutual information: $I(X_1, X_2) = \sum_{x_1 \in \{AA, Aa, aa\}} \sum_{x_2 \in \{BB, Bb, bb\}} p_{12} \log \frac{p_{12}}{p_1 p_2}$ with $p_{12} = P(X_1 = x_1, X_2 = x_2)$ and $p_j = P(X_j = x_j)$. Example [Fan et al., 2011]: $IG = I(X_1, X_2 \mid Y = 1) - I(X_1, X_2)$, with resampling methods to test significance. Limit: lack of a known distribution under $H_0$.
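The mutual information and the information gain above can be sketched with plug-in (empirical frequency) estimates, in Python with the standard library only. The function names are hypothetical, and the label-permutation scheme is one possible resampling choice made for illustration, not necessarily the one used in [Fan et al., 2011].

```python
import math
import random
from collections import Counter


def mutual_information(x1, x2):
    """Plug-in estimate of I(X1, X2) from two aligned genotype lists."""
    n = len(x1)
    joint = Counter(zip(x1, x2))
    m1, m2 = Counter(x1), Counter(x2)
    # p12 * log(p12 / (p1 * p2)), with p12 = c / n, p1 = m1 / n, p2 = m2 / n.
    return sum((c / n) * math.log(c * n / (m1[a] * m2[b]))
               for (a, b), c in joint.items())


def information_gain(x1, x2, y):
    """IG = I(X1, X2 | Y = 1) - I(X1, X2), with Y coded 0/1."""
    cases = [i for i, label in enumerate(y) if label == 1]
    mi_cases = mutual_information([x1[i] for i in cases],
                                  [x2[i] for i in cases])
    return mi_cases - mutual_information(x1, x2)


def permutation_p_value(x1, x2, y, n_perm=999, seed=0):
    """Two-sided permutation p-value for IG (assumed resampling scheme)."""
    rng = random.Random(seed)
    observed = abs(information_gain(x1, x2, y))
    labels = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # breaks the link between genotypes and status
        if abs(information_gain(x1, x2, labels)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: the two loci are independent overall but linked among cases.
g1 = ["AA", "AA", "aa", "aa"]
g2 = ["BB", "bb", "BB", "bb"]
status = [1, 0, 0, 1]
print(information_gain(g1, g2, status))
```

With so few individuals the plug-in estimate is strongly biased; the toy data only illustrate the mechanics.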
  24. 24. Background. Purpose: given two sets of SNPs (genes, haplotypes, ...) $X_1 = (X_{11}, \ldots, X_{1m_1})$ and $X_2 = (X_{21}, \ldots, X_{2m_2})$ (allelic or genotype level), how to detect a global epistatic effect on $Y$ (cases/controls)? ⇒ “summary” of SNP analyses. 1. Combination of tests (multiple testing or global test); 2. multidimensional analysis (regression models, tests, entropy-based methods at the set level); 3. kernel-based methods.
  27. 27. Combining tests. 1. Multiple testing: test all interactions $(X_{1j}, X_{2k})$ and obtain $m_1 m_2$ p-values (of non-independent tests), then apply a multiple testing procedure (Simes to control the intersection of null hypotheses and a “number of effective tests” to account for correlations): GATES [Li et al., 2011] (other approaches combining p-values have been proposed). 2. Global distribution of test statistics: with $W_{jk}$ the test statistic for logistic regression, $W = [W_{11}, \ldots, W_{m_1 m_2}] \sim \mathcal{N}(0, \Sigma)$; derive a p-value from $\mathcal{N}(0, \Sigma)$, with an estimation of $\Sigma$: minP [Emily, 2016]. Limits: only linear interactions; computational issues (both methods); hyper-parameter hard to set (effective number of tests; GATES).
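As an illustration of the first strategy, here is a minimal sketch of the plain Simes combination of the $m_1 m_2$ pairwise p-values into one set-level p-value, in Python. It deliberately omits the “number of effective tests” correction that GATES adds to account for correlations, so it is not GATES itself; the function name and the p-values are hypothetical.

```python
def simes_p_value(p_values):
    """Simes p-value for the intersection of the null hypotheses.

    Sort the m p-values, then return min over k of m * p_(k) / k,
    where p_(k) is the k-th smallest p-value.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    ordered = sorted(p_values)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical p-values for the pairwise tests between two small SNP sets.
pairwise = [0.01, 0.04, 0.03]
print(simes_p_value(pairwise))
```

The Simes p-value is never larger than the Bonferroni one (m times the smallest p-value), which is why it is a common starting point for gene-level combinations.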
  29. 29. Multidimensional methods I. 1. Dimension reduction: summarize a SNP set with a few numerical values (PCA, CCA...) and perform logistic regression with a test of the interaction on the summaries: $\operatorname{logit} P(Y = 1 \mid (x_1, x_2)) = \alpha + \beta\, \mathrm{PC}_1(x_1) + \gamma\, \mathrm{PC}_1(x_2) + \delta\, \mathrm{PC}_1(x_1) \mathrm{PC}_1(x_2)$ and test “$\delta = 0$” [Li et al., 2009, Stanislas et al., 2017]. 2. Tests: summarize the correlations of SNP sets in cases and controls (CCA) and compare these two quantities with a test: $\frac{z_1 - z_0}{\sqrt{\operatorname{Var}(z_1 - z_0)}} \sim_{H_0} \mathcal{N}(0, 1)$ for $z_k$ an adequate transformation of $\operatorname{Cor}(\mathrm{CCA}_1(X_1 \mid Y = k), \mathrm{CCA}_1(X_2 \mid Y = k))$ [Peng et al., 2010]; extensions to PLS, KCCA, ...
  31. 31. Multidimensional methods II. Here the purpose is a bit different: only one SNP set $X = (X_1, \ldots, X_m)$. Is this SNP set associated with the phenotype? (similar to what is done in genomic selection). 3. Kernel methods: SKAT [Wu et al., 2010]. What is a kernel? $K$ is a measure of association between individuals described by their SNP sets, $(x_1, \ldots, x_n)$: $K(x_i, x_j)$ measures a “resemblance” between $i$ and $j$. RKHS: under mild conditions, $K$ defines a unique Hilbert space, $\mathcal{H}$, and a unique mapping of the individuals into $\mathcal{H}$, $\Phi$, such that $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathcal{H}}$.
  32. 32. Multidimensional methods II. The only purpose of the previous slide was to finish off people not paying close enough attention to my talk.
  34. 34. Multidimensional methods II (again). Here the purpose is a bit different: only one SNP set $X = (X_1, \ldots, X_m)$. Is this SNP set associated with the phenotype? (similar to what is done in genomic selection). 3. Kernel methods: SKAT [Wu et al., 2010]. The fixed effect model in the RKHS, $\operatorname{logit}_i \sim \alpha + h(X_i)$ with $h \in \mathcal{H}$ to be estimated, is equivalent to a mixed effect model $\operatorname{logit}_i \sim \alpha + h_i$ with $(h_1, \ldots, h_n) \sim \mathcal{N}(0, \tau K)$, $\tau$ to be estimated, and tests of “$h(X) = 0$” can be performed using the kernel $K$. Idea: $h$ is able to capture high-order interactions between SNPs within the set $X$.
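To make the kernel idea concrete, here is a small Python sketch that builds an identity-by-state (IBS) kernel between individuals and the quadratic score statistic $Q = (y - \hat\mu)^\top K (y - \hat\mu)$ on which kernel tests of this family are based. It assumes genotypes coded as minor-allele counts (0, 1, 2) and a null model with an intercept only (so $\hat\mu$ is the proportion of cases); both functions are hypothetical helpers, not the SKAT software, and the p-value computation is left out.

```python
def ibs_kernel(genotypes):
    """IBS kernel: K(x_i, x_j) = sum_k (2 - |x_ik - x_jk|) / (2 m).

    genotypes: one list of m minor-allele counts (0, 1 or 2) per individual.
    """
    m = len(genotypes[0])
    return [[sum(2 - abs(a - b) for a, b in zip(gi, gj)) / (2 * m)
             for gj in genotypes]
            for gi in genotypes]


def score_statistic(y, kernel):
    """Q = r' K r with r = y - mean(y) (intercept-only null model)."""
    n = len(y)
    mean = sum(y) / n
    residuals = [value - mean for value in y]
    return sum(residuals[i] * kernel[i][j] * residuals[j]
               for i in range(n) for j in range(n))


# Three individuals described by a set of three SNPs; y is the case status.
snp_set = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
status = [1, 1, 0]
K = ibs_kernel(snp_set)
print(score_statistic(status, K))
```

A large Q means that individuals who resemble each other on the SNP set also tend to share their phenotype, whatever the (possibly nonlinear) form of $h$; swapping the kernel changes which resemblances, and hence which interactions, the test is sensitive to.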
  37. 37. Background. Purpose: how to detect epistatic effects genome-wide? Basics: combine information between SNP-SNP effects or SNPset-SNPset effects... but combinatorial issues arise, especially to catch high-order interactions. 1. Exhaustive approaches; 2. filtering; 3. machine learning.
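The combinatorial issue can be quantified with a few lines of Python: the number of order-k interactions among p SNPs is the binomial coefficient C(p, k). The panel size of 500,000 SNPs and the Bonferroni correction below are illustrative assumptions, not figures from the slides.

```python
import math


def n_interactions(p, order):
    """Number of distinct SNP subsets of the given size among p SNPs."""
    return math.comb(p, order)


def bonferroni_threshold(p, order, alpha=0.05):
    """Per-test significance level if every subset is tested once."""
    return alpha / n_interactions(p, order)


p_snps = 500_000  # hypothetical genome-wide panel size
for k in (2, 3):
    print(k, n_interactions(p_snps, k), bonferroni_threshold(p_snps, k))
```

Already at order 2 the number of tests is of the order of 10^11, which motivates the filtering and machine learning strategies listed above.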
  39. 39. Exhaustive approaches. 1. Exhaustive testing: PLINK (which multiple testing corrections?) or (penalized) regression [Wu et al., 2009] (Lasso, but not really genome-wide). Limit: mostly restricted to linear effects and pairwise interactions. 2. Multifactor Dimensionality Reduction (MDR) (nonparametric, model-free, can deal with high-order interactions) [Ritchie et al., 2001]. Limits: can fail to detect pure epistasis, strongly depends on several hyperparameters, overfits.
  42. 42. Filtering. Idea: filter SNPs or SNP pairs before exhaustive search. 1. Filtering on marginal effects (prevents the detection of pure epistasis) [Marchini et al., 2005]. 2. Relief: a genetic distance between individuals is used to compute a measure of the importance of each SNP, according to differences in the SNP between neighbors when they have common/different $Y$ [Robnik-Šikonja and Kononenko, 2003]. 3. Biofilter: combines information coming from 13 datasets that identify whether SNP sets are related to the same pathway, to proteins that interact (PPI), ... [Pendergrass et al., 2013]. Limit: strong bias toward the most documented genes/pathways.
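The Relief idea can be sketched as follows in Python. This is the basic single-neighbor Relief, not the ReliefF variant analyzed in the cited paper: each individual is compared with its nearest neighbor of the same class (the “hit”) and of the other class (the “miss”), and a SNP gains weight when it differs from the miss and loses weight when it differs from the hit. The mismatch-count distance, the function name and the toy data are assumptions made for illustration.

```python
def relief_weights(samples, labels):
    """Basic Relief importance weights for categorical features (SNPs)."""
    n, m = len(samples), len(samples[0])
    weights = [0.0] * m

    def distance(a, b):
        # Number of SNPs at which two individuals differ.
        return sum(u != v for u, v in zip(a, b))

    for i, (x, y) in enumerate(zip(samples, labels)):
        hits = [s for j, (s, l) in enumerate(zip(samples, labels))
                if j != i and l == y]
        misses = [s for s, l in zip(samples, labels) if l != y]
        if not hits or not misses:
            raise ValueError("each class needs at least two individuals")
        hit = min(hits, key=lambda s: distance(x, s))
        miss = min(misses, key=lambda s: distance(x, s))
        for k in range(m):
            weights[k] += ((x[k] != miss[k]) - (x[k] != hit[k])) / n
    return weights


# Pure epistasis toy example: Y = SNP0 XOR SNP1, SNP2 is noise.
# Neither SNP0 nor SNP1 has a marginal effect on Y.
toy = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
toy_labels = [a ^ b for a, b, _ in toy]
print(relief_weights(toy, toy_labels))
```

In this toy case the two interacting SNPs receive a larger weight than the noise SNP even though a filter on marginal effects would discard them, which is the contrast the slide draws between the first two filters.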
  44. 44. ML approaches. Idea: fit a ML model that predicts $Y$ given all SNPs and try to extract information about interactions: random forests (with conditional variable importance [Bureau et al., 2004, Strobl et al., 2008]), Bayesian networks (BEAM, [Zhang and Liu, 2007]), ... (I guess: evolutionary algorithms, deep NN, ant colony, ...). Limitations: n might be too small to make a nonparametric estimation affordable (from a statistical perspective).
  45. 45. no conclusion because this is just the beginning of the discussion... (and I was dead tired finishing my slides at 4 am this morning)
  46. 46. References
      Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P., and van Eerdewegh, P. (2004). Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 28(2):171–182.
      Cordell, H. J. (2002). Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Human Molecular Genetics, 11(20):2463–2468.
      Emily, M. (2002). IndOR: a new statistical procedure to test for SNP-SNP epistasis in genome-wide association studies. Statistics in Medicine, 31(21):2359–2373.
      Emily, M. (2016). AGGrEGATOr: a gene-based gene-gene interaction test for case-control association studies. Statistical Applications in Genetics and Molecular Biology, 15(2):151–171.
      Emily, M. (2018). A survey of statistical methods for gene-gene interaction in case-control genome-wide association studies. Journal de la Société Française de Statistique, 159(1):27–67.
      Fan, R., Zhong, M., Wang, S., Andrew, A., Karagas, M., Chen, H., Amos, C., Xiong, M., and Moore, J. (2011). Entropy-based information gain approaches to detect and to characterize gene-gene and gene-environment interactions/correlations of complex diseases. Genetic Epidemiology, 35:706–721.
      Fisher, R. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(9):399–433.
      Li, J., Tang, R., Biernacka, J. M., and de Andrade, M. (2009). Identification of gene-gene interaction using principal components. BMC Proceedings, 3(Suppl 7):S78.
      Li, M.-X., Gui, H.-S., Kwan, J. S. H., and Sham, P. C. (2011). GATES: a rapid and powerful gene-based association test using extended Simes procedure. The American Journal of Human Genetics, 88(3):283–293.
      Marchini, J., Donnelly, P., and Cardon, L. R. (2005). Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics, 37:413–417.
      Neil, C., Sinoquet, C., Dina, C., and Rocheleau, G. (2015). A survey about methods dedicated to epistasis detection. Frontiers in Genetics.
      Pendergrass, S. A., Frase, A., Wallace, J., Wolfe, D., Katiyar, N., Moore, C., and Ritchie, M. D. (2013). Genomic analyses with biofilter 2.0: knowledge driven filtering, annotation, and model development. BioData Mining, 6:25.
      Peng, Q., Zhao, J., and Xue, F. (2010). A gene-based method for detecting gene-gene co-association in a case-control association study. European Journal of Human Genetics, 18:582–587.
      Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., and Sklar, P. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575.
      Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F., and Moore, J. H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. The American Journal of Human Genetics, 69(1):138–147.
      Robnik-Šikonja, M. and Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23–69.
      Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:347–423 and 623–656.
      Stanislas, V. (2017). Approches statistiques pour la détection d’épistasie dans les études d’associations pangénomiques. PhD thesis, Université Paris Saclay, Paris, France.
      Stanislas, V., Dalmasso, C., and Ambroise, C. (2017). Eigen-epistasis for detecting gene-gene interactions. BMC Bioinformatics, 18:54.
      Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9:307.
      Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N. L., and Yu, W. (2010). BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. The American Journal of Human Genetics, 87(3):325–340.
      Wu, M. C., Kraft, P., Epstein, M. P., Taylor, D. M., Chanock, S. J., Hunter, D. J., and Lin, X. (2010). Powerful SNP-set analysis for case-control genome-wide association studies. The American Journal of Human Genetics, 86(6):929–942.
      Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721.
      Zhang, Y. and Liu, J. S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nature Genetics, 39:1167–1173.
      Zhao, J., Jin, L., and Xiong, M. (2006). Test for interaction between two unlinked loci. The American Journal of Human Genetics, 79(5):831–845.
