Report

Share

•1 like•342 views

•1 like•342 views

Report

Share

Download to read offline

Introduction: Since the introduction of the LASSO, computational approaches to variable selection have been rigorously developed in the statistical literature. The need for such methods has become increasingly important with the advent of high-throughput technologies in genomics and brain imaging studies where it is believed that the number of truly important variables is small relative to the total number of variables. While the focus of these methods has been on additive models, there are several applications where interaction models can reflect biological phenomena and improve statistical power. For example, genome wide association studies (GWAS) have been unable to explain a large proportion of heritability (the variance in phenotype attributable to genetic variants) and it has been suggested that this missing heritability may in part be due to gene-environment interactions. Furthermore, diseases are now thought to be the result of entire biological networks whose states are affected by environmental factors. These systemic changes can induce or eliminate strong correlations between elements in a network without necessarily affecting their mean levels. Methods: Therefore, we propose a multivariate penalization procedure for detecting interactions between high dimensional data ($p >> n$) and an environmental factor, where the effect of this environmental factor on the high dimensional data is widespread and plays a role in predicting the response. Our approach improves on existing procedures for detecting such interactions in several ways; 1) it simultaneously performs model selection and estimation 2) it automatically enforces the strong heredity property, i.e., an interaction term can only be included in the model if the corresponding main effects are in the model 3) it reduces the dimensionality of the problem and leverages the high correlations by transforming the input feature space using network connectivity measures and 4) it leads to interpretable models which are biologically meaningful. Results: An extensive simulation study shows that our method outperforms LASSO, Elastic Net and Group LASSO in terms of both prediction accuracy and feature selection. We apply our methods to the NIH pediatric brain development study to refine estimates of which regions of the frontal cortex are associated with intelligence scores, and a sample of mother-child pairs from a prospective birth cohort to identify epigenetic marks observed at birth that help predict childhood obesity. Our method is implemented in an \texttt{R} package- http://sahirbhatnagar.com/eclust/

- 1. A Model for Interpretable High Dimensional Interactions Sahir Rai Bhatnagar Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood Poster Number 67
- 2. Motivation
- 3. one predictor variable at a time Predictor Variable Phenotype
- 4. one predictor variable at a time Predictor Variable Phenotype Test 1 Test 2 Test 3 Test 4 Test 5 1
- 5. a network based view Predictor Variable Phenotype
- 6. a network based view Predictor Variable Phenotype
- 7. a network based view Predictor Variable Phenotype Test 1 2
- 8. system level changes due to environment Predictor Variable PhenotypeEnvironment A B
- 9. system level changes due to environment Predictor Variable PhenotypeEnvironment A B Test 1 3
- 10. Motivating Dataset: Newborn epigenetic adaptations to gesta- tional diabetes exposure (Luigi Bouchard, Sherbrooke) Environment Gestational Diabetes Large Data Child’s epigenome (p ≈ 450k) Phenotype Obesity measures 4
- 11. Diﬀerential Correlation between environments (a) Gestational diabetes aﬀected pregnancy (b) Controls 5
- 12. formal statement of initial problem • n: number of subjects 6
- 13. formal statement of initial problem • n: number of subjects • p: number of predictor variables 6
- 14. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) 6
- 15. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype 6
- 16. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread eﬀect on X and can modify the relation between X and Y 6
- 17. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread eﬀect on X and can modify the relation between X and Y Objective • Which elements of X that are associated with Y , depend on E? 6
- 18. Methods
- 19. ECLUST - our proposed method: 3 phases Original Data
- 20. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
- 21. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
- 22. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation
- 23. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1
- 24. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1 3) Penalized Regression Yn×1∼ + ×E 7
- 25. the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data. - Sir R. A. Fisher, 1922 7
- 26. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main eﬀects + α1E (X1E) + · · · + αpE (XpE) interactions 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
- 27. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main eﬀects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
- 28. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main eﬀects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . Strong heredity principle2 : ˆαjE = 0 ⇒ ˆβj = 0 and ˆβE = 0 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
- 29. Strong Heredity Model with Penalization arg min β0,β,γ 1 2 Y − g(µ) 2 + λβ (w1β1 + · · · + wqβq + wE βE ) + λγ (w1E γ1E + · · · + wqE γqE ) wj = 1 ˆβj , wjE = ˆβj ˆβE ˆαjE 9
- 30. Results
- 31. Simulation Study: Jaccard Index and test set MSE 10
- 32. Open source software • Software implementation in R: http://sahirbhatnagar.com/eclust/ • Allows user speciﬁed interaction terms • Automatically determines the optimal tuning parameters through cross validation • Can also be applied to genetic data 11
- 33. Conclusions
- 34. Conclusions and Contributions • Large system-wide changes are observed in many environments 12
- 35. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data 12
- 36. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. 12
- 37. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations 12
- 38. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model 12
- 39. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model • R software: http://sahirbhatnagar.com/eclust/ 12
- 40. Limitations • There must be a high-dimensional signature of the exposure 13
- 41. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised 13
- 42. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised • Two tuning parameters 13
- 43. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised • Two tuning parameters • Need more samples . . . Got data? (Poster 67) 13
- 44. acknowledgements • Dr. Celia Greenwood • Dr. Blanchette and Dr. Yang • Dr. Luigi Bouchard, Andr´e Anne Houde • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest • Greg Voisin, Dr. Forgetta, Dr. Klein • Mothers and children from the study 14