Strong Heredity Models in High Dimensional Data


  1. A Model for Interpretable High Dimensional Interactions. Sahir Rai Bhatnagar. Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood. McGill University. sahirbhatnagar.com
  2. Motivation
  3-4. one predictor variable at a time (figure labels: Predictor Variable, Phenotype, Test 1 to Test 5)
  5-7. a network based view (figure labels: Predictor Variable, Phenotype, Test 1)
  8-9. system level changes due to environment (figure labels: Predictor Variable, Phenotype, Environment A and B, Test 1)
  10. Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
      • Environment: gestational diabetes
      • Large data: child's epigenome (p ≈ 450k)
      • Phenotype: obesity measures
  11. Differential Correlation between environments: (a) gestational diabetes affected pregnancy, (b) controls
  12. Differential Networking
  13-18. formal statement of initial problem
      • $n$: number of subjects
      • $p$: number of predictor variables
      • $X_{n \times p}$: high dimensional data set ($p \gg n$)
      • $Y_{n \times 1}$: phenotype
      • $E_{n \times 1}$: environmental factor that has a widespread effect on $X$
      Objective
      • Which elements of $X$ that are associated with $Y$ depend on $E$?
  19. Methods
  20-25. ECLUST - our proposed method: 3 phases
      • Original data, split by environment (E = 0 and E = 1)
      • 1) Gene similarity, for E = 0 and for E = 1
      • 2) Cluster representation: each cluster is summarized by an n × 1 vector
      • 3) Penalized regression: $Y_{n \times 1} \sim$ (cluster representations) + (cluster representations) $\times E$
  26. "the objective of statistical methods is the reduction of data. A quantity of data ... is to be replaced by relatively few quantities which shall adequately represent ... the relevant information contained in the original data." - Sir R. A. Fisher, 1922
  27. Underlying model
      $$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon \qquad (1)$$
      $$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E) \qquad (2)$$
      • $U$: unobserved latent variable
      • $X$: observed data, which is a function of $U$
      • $\Sigma_E$: environment sensitive correlation matrix
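To make equations (1) and (2) concrete, here is a minimal simulation sketch. Everything not on the slide is an assumption chosen for illustration: U is standard normal, E is a fair coin flip, F is Gaussian, and the environment changes only the feature noise level (sd 1.0 when E = 0, sd 0.5 when E = 1). Under these choices the correlation between two features is 0.5 in expectation when E = 0 and 0.8 when E = 1, which is one simple way to obtain an environment sensitive correlation structure. The function names are hypothetical.

```python
import math
import random


def simulate(n=2000, p=4, beta=(0.0, 1.0, 2.0), alpha=(0.0, 1.0), seed=1):
    """Draw (X, E, Y) from a simplified Gaussian version of equations (1)-(2).

    Assumed for illustration only: U ~ N(0, 1), E is a fair coin flip,
    and feature noise has sd 1.0 when E = 0 and sd 0.5 when E = 1.
    """
    rng = random.Random(seed)
    b0, b1, b2 = beta
    a0, a1 = alpha
    X, E, Y = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)        # latent variable U (never returned)
        e = rng.randint(0, 1)          # binary environment
        sd = 0.5 if e == 1 else 1.0    # environment-dependent noise level
        X.append([a0 + a1 * u + rng.gauss(0.0, sd) for _ in range(p)])
        E.append(e)
        Y.append(b0 + b1 * u + b2 * u * e + rng.gauss(0.0, 1.0))  # equation (1)
    return X, E, Y


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def feature_correlation(X, E, env, i=0, j=1):
    """Correlation between features i and j among subjects with E == env."""
    rows = [x for x, e in zip(X, E) if e == env]
    return pearson([r[i] for r in rows], [r[j] for r in rows])


if __name__ == "__main__":
    X, E, Y = simulate()
    print("cor | E=0:", round(feature_correlation(X, E, 0), 2))
    print("cor | E=1:", round(feature_correlation(X, E, 1), 2))
```

The latent U is deliberately not returned: only X, E and Y are observed, which is why the method works with similarities among the columns of X.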
  28. Measure of similarity: topological overlap matrix (TOM)
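The slide names the TOM without giving a formula. The sketch below assumes the standard unsigned definition from weighted co-expression network analysis: with adjacency $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\text{power}}$ and connectivity $k_i = \sum_{u \neq i} a_{iu}$, the overlap is $(\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$. The soft-threshold power of 6 and the function names are assumptions, not taken from the slides.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjacency(columns, power=6):
    """Unsigned adjacency a_ij = |cor|^power with a zero diagonal.

    `columns` is a list of feature vectors (one list per predictor).
    """
    p = len(columns)
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            a = abs(pearson(columns[i], columns[j])) ** power
            A[i][j] = A[j][i] = a
    return A


def tom(A):
    """Topological overlap matrix of a symmetric adjacency with zero diagonal."""
    p = len(A)
    k = [sum(A[i][u] for u in range(p) if u != i) for i in range(p)]
    T = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            # Strength of the connections that i and j share through other nodes.
            shared = sum(A[i][u] * A[u][j] for u in range(p) if u not in (i, j))
            t = (shared + A[i][j]) / (min(k[i], k[j]) + 1.0 - A[i][j])
            T[i][j] = T[j][i] = t
    return T
```

Two predictors get a high overlap when they are strongly connected to each other and to the same neighbours, which makes the measure less sensitive to a single noisy pairwise correlation.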
  29. Method to detect gene clusters
      Table 1: Method to detect gene clusters
      General Approach | Formula
      TOM Scoring      | $|\mathrm{TOM}_{E=1} - \mathrm{TOM}_{E=0}|$
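The scoring rule in Table 1 is an elementwise absolute difference between the two environment-specific overlap matrices. A minimal sketch follows; it takes the two matrices as given, and the helper that ranks the most changed pairs is a hypothetical addition for inspection, not a step described on the slides.

```python
def tom_difference(tom_exposed, tom_unexposed):
    """Elementwise |TOM_{E=1} - TOM_{E=0}| for two equally sized square matrices."""
    return [
        [abs(a - b) for a, b in zip(row1, row0)]
        for row1, row0 in zip(tom_exposed, tom_unexposed)
    ]


def most_changed_pairs(D, k=5):
    """Return up to k (score, i, j) triples with i < j, largest score first."""
    p = len(D)
    pairs = [(D[i][j], i, j) for i in range(p) for j in range(i + 1, p)]
    pairs.sort(reverse=True)
    return pairs[:k]


if __name__ == "__main__":
    tom_e1 = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 1.0]]
    tom_e0 = [[1.0, 0.25, 0.0], [0.25, 1.0, 0.75], [0.0, 0.75, 1.0]]
    D = tom_difference(tom_e1, tom_e0)
    print(most_changed_pairs(D, k=2))
```

Pairs with a large score are those whose network neighbourhood differs most between exposed and unexposed subjects, so clustering on this matrix groups predictors by how they respond to the environment rather than by how similar they are overall.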
  30. Cluster Representation
      Table 2: Methods to create cluster representations
      General Approach | Type
      Unsupervised     | average
      Unsupervised     | 1st principal component
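Both representations in Table 2 reduce the columns of one cluster to a single n × 1 vector. The sketch below assumes `X` holds only the columns of one cluster, as a list of n rows. The first principal component is computed by power iteration on the sample covariance matrix, which is a choice made here to stay dependency-free and not necessarily how the authors' R software does it. Power iteration can fail if the starting vector is orthogonal to the leading eigenvector, and the sign of the component is arbitrary.

```python
import math


def cluster_average(X):
    """Per-subject mean over the features of one cluster (n x 1 summary)."""
    return [sum(row) / len(row) for row in X]


def first_principal_component(X, iterations=200):
    """Scores and loadings of the first principal component of X.

    X is a list of n rows with p features each (n >= 2).
    Returns (scores, loadings): n scores and a unit-length loading vector.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centre columns
    # Sample covariance matrix (p x p).
    C = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)] for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p  # starting vector for power iteration
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all features constant: nothing to extract
            break
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
    return scores, v
```

The average weights every feature in the cluster equally, while the first principal component weights features by how much they contribute to the dominant direction of variation, so the two can differ when a cluster contains a few noisy members.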
  31-33. Model
      $$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$
      Reparametrization [1]: $\alpha_{jE} = \gamma_{jE}\,\beta_j\,\beta_E$.
      Strong heredity principle [2]: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
      [1] Choi et al. 2010, JASA
      [2] Chipman 1996, Canadian Journal of Statistics
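The reparametrization enforces strong heredity by construction: because each interaction coefficient is a product that includes both main-effect coefficients, it can only be nonzero when both of them are nonzero. A small sketch with made-up coefficient values (the function names and numbers are illustrative only):

```python
def interaction_coefficients(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return [g * b * beta_E for g, b in zip(gamma, beta)]


def satisfies_strong_heredity(beta, beta_E, alpha):
    """True if every nonzero alpha_jE has nonzero beta_j and nonzero beta_E."""
    return all(a == 0 or (b != 0 and beta_E != 0) for a, b in zip(alpha, beta))


if __name__ == "__main__":
    beta = [1.5, 0.0, -2.0]   # main effects; the second predictor is excluded
    beta_E = 0.5              # main effect of the environment
    gamma = [2.0, 3.0, 0.0]   # interaction parameters
    alpha = interaction_coefficients(beta, beta_E, gamma)
    # The second interaction is zero even though its gamma is 3.0,
    # because the corresponding main effect is zero.
    print(alpha, satisfies_strong_heredity(beta, beta_E, alpha))
```

An interaction specified directly, without the product form, can violate the principle, for example `satisfies_strong_heredity([1.5, 0.0], 0.5, [0.0, 1.0])` is false.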
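  34. Strong Heredity Model with Penalization
      $$\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2}\lVert Y - g(\mu)\rVert^2 + \lambda_\beta\left(w_1|\beta_1| + \cdots + w_q|\beta_q| + w_E|\beta_E|\right) + \lambda_\gamma\left(w_{1E}|\gamma_{1E}| + \cdots + w_{qE}|\gamma_{qE}|\right)$$
      $$w_j = \left|\frac{1}{\hat{\beta}_j}\right|, \qquad w_{jE} = \left|\frac{\hat{\beta}_j\,\hat{\beta}_E}{\hat{\alpha}_{jE}}\right|$$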
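To show how the pieces of the criterion fit together, the sketch below evaluates it for given parameter values. It assumes an identity link (so the fitted mean is the linear predictor itself), L1 penalties on the main effects and on the $\gamma$ parameters, and weights supplied by the caller. It only computes the objective value and does not perform the minimization; all names are hypothetical.

```python
def penalized_objective(Y, X, E, beta0, beta, beta_E, gamma,
                        w, w_E, w_int, lam_beta, lam_gamma):
    """Value of the penalized least-squares criterion under strong heredity.

    Y: n responses; X: n rows of q cluster representations; E: n environment values.
    beta, gamma, w, w_int: length-q lists; beta0, beta_E, w_E: scalars.
    Interactions enter as alpha_jE = gamma_jE * beta_j * beta_E.
    """
    rss = 0.0
    for y, x, e in zip(Y, X, E):
        mu = beta0 + beta_E * e
        for xj, bj, gj in zip(x, beta, gamma):
            mu += bj * xj                        # main effect of predictor j
            mu += gj * bj * beta_E * xj * e      # reparametrized interaction
        rss += (y - mu) ** 2
    # Weighted L1 penalty on main effects (including the environment).
    pen_main = sum(wj * abs(bj) for wj, bj in zip(w, beta)) + w_E * abs(beta_E)
    # Weighted L1 penalty on the interaction parameters gamma.
    pen_int = sum(wj * abs(gj) for wj, gj in zip(w_int, gamma))
    return 0.5 * rss + lam_beta * pen_main + lam_gamma * pen_int


def adaptive_weights(beta_hat, beta_E_hat, alpha_hat):
    """w_j = |1 / beta_hat_j| and w_jE = |beta_hat_j * beta_E_hat / alpha_hat_jE|.

    Assumes every initial estimate is nonzero.
    """
    w = [abs(1.0 / b) for b in beta_hat]
    w_int = [abs(b * beta_E_hat / a) for b, a in zip(beta_hat, alpha_hat)]
    return w, w_int
```

The two tuning parameters act on different parts of the model: `lam_beta` controls how many main effects survive, and `lam_gamma` controls how many of the surviving main effects also keep an interaction with the environment.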
  35. Results
  36. Simulation Study
  37. TOM based on all subjects (figure: $\mathrm{TOM}(X_{\text{all}})$)
  38. TOM based on unexposed subjects (figure: $\mathrm{TOM}(X_{E=0})$)
  39. TOM based on exposed subjects (figure: $\mathrm{TOM}(X_{E=1})$)
  40. Difference of TOMs (figure: $|\mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0})|$)
  41. Results: Test set MSE
  42. Results: Variable Selection
  43. Open source software
      • Software implementation in R: http://sahirbhatnagar.com/eclust/
      • Allows user specified interaction terms
      • Automatically determines the optimal tuning parameters through cross validation
      • Can also be applied to genetic data
  44. Conclusions
  45-47. Conclusions and Contributions
      • Large system-wide changes are observed in many environments
      • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
      • R software: http://sahirbhatnagar.com/eclust/
  48-52. Limitations
      • There must be a high-dimensional signature of the exposure
      • Clustering is unsupervised
      • Two tuning parameters
      • Cautionary note on simulation studies
      • Need more samples ... Got data?
  53. acknowledgements
      • Dr. Celia Greenwood
      • Dr. Blanchette and Dr. Yang
      • Dr. Luigi Bouchard, André Anne Houde
      • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
      • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
      • Greg Voisin, Dr. Forgetta, Dr. Klein
      • Mothers and children from the study
