## Just for you: FREE 60-day trial to the world’s largest digital library.

The SlideShare family just got bigger. Enjoy access to millions of ebooks, audiobooks, magazines, and more from Scribd.

Cancel anytime.Free with a 14 day trial from Scribd

- 1. A Model for Interpretable High Dimensional Interactions Sahir Rai Bhatnagar Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood Poster Number 67
- 2. Motivation
- 3. one predictor variable at a time Predictor Variable Phenotype
- 4. one predictor variable at a time Predictor Variable Phenotype Test 1 Test 2 Test 3 Test 4 Test 5 1
- 5. a network based view Predictor Variable Phenotype
- 6. a network based view Predictor Variable Phenotype
- 7. a network based view Predictor Variable Phenotype Test 1 2
- 8. system level changes due to environment Predictor Variable PhenotypeEnvironment A B
- 9. system level changes due to environment Predictor Variable PhenotypeEnvironment A B Test 1 3
- 10. Motivating Dataset: Newborn epigenetic adaptations to gesta- tional diabetes exposure (Luigi Bouchard, Sherbrooke) Environment Gestational Diabetes Large Data Child’s epigenome (p ≈ 450k) Phenotype Obesity measures 4
- 11. Diﬀerential Correlation between environments (a) Gestational diabetes aﬀected pregnancy (b) Controls 5
- 12. formal statement of initial problem • n: number of subjects 6
- 13. formal statement of initial problem • n: number of subjects • p: number of predictor variables 6
- 14. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) 6
- 15. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype 6
- 16. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread eﬀect on X and can modify the relation between X and Y 6
- 17. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread eﬀect on X and can modify the relation between X and Y Objective • Which elements of X that are associated with Y , depend on E? 6
- 18. Methods
- 19. ECLUST - our proposed method: 3 phases Original Data
- 20. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
- 21. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
- 22. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation
- 23. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1
- 24. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1 3) Penalized Regression Yn×1∼ + ×E 7
- 25. the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data. - Sir R. A. Fisher, 1922 7
- 26. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main eﬀects + α1E (X1E) + · · · + αpE (XpE) interactions 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
- 27. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main eﬀects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
- 28. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main eﬀects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . Strong heredity principle2 : ˆαjE = 0 ⇒ ˆβj = 0 and ˆβE = 0 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
- 29. Strong Heredity Model with Penalization arg min β0,β,γ 1 2 Y − g(µ) 2 + λβ (w1β1 + · · · + wqβq + wE βE ) + λγ (w1E γ1E + · · · + wqE γqE ) wj = 1 ˆβj , wjE = ˆβj ˆβE ˆαjE 9
- 30. Results
- 31. Simulation Study: Jaccard Index and test set MSE 10
- 32. Open source software • Software implementation in R: http://sahirbhatnagar.com/eclust/ • Allows user speciﬁed interaction terms • Automatically determines the optimal tuning parameters through cross validation • Can also be applied to genetic data 11
- 33. Conclusions
- 34. Conclusions and Contributions • Large system-wide changes are observed in many environments 12
- 35. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data 12
- 36. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. 12
- 37. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations 12
- 38. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model 12
- 39. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model • R software: http://sahirbhatnagar.com/eclust/ 12
- 40. Limitations • There must be a high-dimensional signature of the exposure 13
- 41. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised 13
- 42. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised • Two tuning parameters 13
- 43. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised • Two tuning parameters • Need more samples . . . Got data? (Poster 67) 13
- 44. acknowledgements • Dr. Celia Greenwood • Dr. Blanchette and Dr. Yang • Dr. Luigi Bouchard, Andr´e Anne Houde • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest • Greg Voisin, Dr. Forgetta, Dr. Klein • Mothers and children from the study 14