Strong Heredity Models in High Dimensional Data

Presentation at IBC 2016 in Victoria, BC

Published in: Science

  1. A Model for Interpretable High Dimensional Interactions. Sahir Rai Bhatnagar. Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood, McGill University. sahirbhatnagar.com
  2. Motivation
  4. One predictor variable at a time. [Figure labels: Predictor Variable, Phenotype, Test 1, Test 2, Test 3, Test 4, Test 5]
  7. A network based view. [Figure labels: Predictor Variable, Phenotype, Test 1]
  9. System level changes due to environment. [Figure labels: Predictor Variable, Phenotype, Environment (A, B), Test 1]
  10. Motivating dataset: newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke). Environment: gestational diabetes. Large data: child's epigenome (p ≈ 450k). Phenotype: obesity measures.
  11. Differential correlation between environments. [Figure panels: (a) gestational diabetes affected pregnancy, (b) controls]
  12. Differential Networking
  18. Formal statement of initial problem
      • n: number of subjects
      • p: number of predictor variables
      • $X_{n \times p}$: high dimensional data set ($p \gg n$)
      • $Y_{n \times 1}$: phenotype
      • $E_{n \times 1}$: environmental factor that has widespread effect on X
      Objective: which elements of X that are associated with Y depend on E?
  19. Methods
  25. ECLUST, our proposed method, in 3 phases. Starting from the original data: 1) gene similarity, computed separately for E = 0 and E = 1; 2) cluster representation, giving an n × 1 summary per cluster; 3) penalized regression of $Y_{n \times 1}$ on the cluster representations and their interactions with E.
  26. "... the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data." (Sir R. A. Fisher, 1922)
  27. Underlying model
      $$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon \qquad (1)$$
      $$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E) \qquad (2)$$
      • U: unobserved latent variable
      • X: observed data, which is a function of U
      • $\Sigma_E$: environment-sensitive correlation matrix
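To make equations (1) and (2) concrete, here is a minimal simulation sketch. Several choices are assumptions made only for illustration and are not from the slides: F is taken to be multivariate normal, E is binary, $\Sigma_E$ is an exchangeable correlation matrix whose off-diagonal value depends on E, and the function name `simulate` and all default parameter values are hypothetical.

```python
import numpy as np

def simulate(n=200, p=6, beta=(0.0, 1.0, 2.0), alpha=(0.0, 1.0),
             rho_unexposed=0.1, rho_exposed=0.8, seed=0):
    """Draw (X, Y, E, U) from equations (1)-(2), assuming F is multivariate normal."""
    rng = np.random.default_rng(seed)
    E = rng.integers(0, 2, size=n)           # binary environment (assumption)
    U = rng.normal(size=n)                   # unobserved latent variable
    b0, b1, b2 = beta
    a0, a1 = alpha
    # Equation (1): the effect of U on Y changes with E.
    Y = b0 + b1 * U + b2 * U * E + rng.normal(size=n)
    X = np.empty((n, p))
    for e, rho in ((0, rho_unexposed), (1, rho_exposed)):
        idx = np.flatnonzero(E == e)
        # Sigma_E: exchangeable correlation matrix that depends on E (assumption).
        sigma = np.full((p, p), rho)
        np.fill_diagonal(sigma, 1.0)
        noise = rng.multivariate_normal(np.zeros(p), sigma, size=idx.size)
        # Equation (2): every column of X has mean alpha0 + alpha1 * U.
        X[idx] = (a0 + a1 * U[idx])[:, None] + noise
    return X, Y, E, U
```

With these defaults the columns of X are more strongly correlated among exposed subjects than among unexposed ones, which is the kind of environment-dependent correlation structure the later phases try to exploit.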
  28. Measure of similarity: topological overlap matrix (TOM)
  29. Method to detect gene clusters
      Table 1: Method to detect gene clusters
      General approach: TOM scoring. Formula: $\lvert \mathrm{TOM}_{E=1} - \mathrm{TOM}_{E=0} \rvert$
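The TOM difference score can be sketched as follows. The slides do not spell out how the TOM itself is computed, so this sketch assumes the commonly used unsigned definition from the weighted co-expression network literature: adjacency $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\text{power}}$ and $\mathrm{TOM}_{ij} = (l_{ij} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$, with $l_{ij} = \sum_{u \neq i,j} a_{iu} a_{uj}$ and $k_i = \sum_{u \neq i} a_{iu}$. The function names and the soft-threshold `power=6` are illustrative assumptions.

```python
import numpy as np

def tom(X, power=6):
    """Unsigned topological overlap between the columns of X (subjects in rows)."""
    # Soft-thresholded adjacency; the power is an assumed tuning choice.
    A = np.abs(np.corrcoef(X, rowvar=False)) ** power
    np.fill_diagonal(A, 0.0)
    shared = A @ A                 # l_ij: zero diagonal drops the u = i and u = j terms
    k = A.sum(axis=1)              # connectivity of each variable
    T = (shared + A) / (np.minimum.outer(k, k) + 1.0 - A)
    np.fill_diagonal(T, 1.0)
    return T

def tom_difference(X, E, power=6):
    """|TOM(X_{E=1}) - TOM(X_{E=0})| for a binary environment vector E."""
    return np.abs(tom(X[E == 1], power) - tom(X[E == 0], power))
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to a clustering routine as a similarity (or, after conversion, dissimilarity) between variables.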
  30. Cluster Representation
      Table 2: Methods to create cluster representations
      General approach: unsupervised. Types: average, 1st principal component.
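Both unsupervised representations reduce the columns belonging to one cluster to a single n × 1 vector. A minimal sketch, assuming `X_cluster` is the n × (cluster size) submatrix for one cluster; the function names are hypothetical and not taken from the eclust package:

```python
import numpy as np

def cluster_average(X_cluster):
    """Average of the cluster's variables for each subject (length-n vector)."""
    return X_cluster.mean(axis=1)

def cluster_first_pc(X_cluster):
    """Subject scores on the first principal component of the cluster (length-n vector)."""
    centered = X_cluster - X_cluster.mean(axis=0)
    # Thin SVD: columns of U scaled by singular values are the PC scores.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, 0] * s[0]
```

The sign of a principal component is arbitrary, so the first-PC representation is only defined up to a sign flip.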
  33. Model
      $$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$
      Reparametrization [1]: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$.
      Strong heredity principle [2]: $\hat\alpha_{jE} \neq 0 \Rightarrow \hat\beta_j \neq 0$ and $\hat\beta_E \neq 0$.
      [1] Choi et al. 2010, JASA
      [2] Chipman 1996, Canadian Journal of Statistics
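The reparametrization enforces strong heredity by construction: because $\alpha_{jE}$ is a product that includes $\beta_j$ and $\beta_E$, it can only be nonzero when both main effects are nonzero. The sketch below illustrates this, assuming an identity link for the linear predictor; the function names are hypothetical.

```python
import numpy as np

def interaction_coefficients(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return np.asarray(gamma, dtype=float) * np.asarray(beta, dtype=float) * beta_E

def linear_predictor(X, E, beta0, beta, beta_E, gamma):
    """Main effects plus X-by-E interactions under the reparametrization."""
    beta = np.asarray(beta, dtype=float)
    alpha = interaction_coefficients(beta, beta_E, gamma)
    return beta0 + X @ beta + beta_E * E + (X * E[:, None]) @ alpha

# Predictor 2 has no main effect, so its interaction vanishes whatever gamma is;
# predictor 3 has a main effect but gamma = 0, so its interaction is also zero.
alpha = interaction_coefficients(beta=[1.5, 0.0, -2.0], beta_E=0.5, gamma=[2.0, 3.0, 0.0])
```

Setting `beta_E` to zero removes every interaction at once, which is the second half of the strong heredity condition.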
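  34. Strong Heredity Model with Penalization
      $$\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda_\beta \left( w_1 |\beta_1| + \cdots + w_q |\beta_q| + w_E |\beta_E| \right) + \lambda_\gamma \left( w_{1E} |\gamma_{1E}| + \cdots + w_{qE} |\gamma_{qE}| \right)$$
      $$w_j = \left| \frac{1}{\hat\beta_j} \right|, \qquad w_{jE} = \left| \frac{\hat\beta_j \hat\beta_E}{\hat\alpha_{jE}} \right|$$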
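For a given set of coefficients, the penalized criterion can be evaluated directly. This is a sketch under stated assumptions rather than the eclust implementation: the link is taken to be the identity, the penalties are taken to be weighted absolute values, and the weight for the environment main effect is assumed to be $w_E = |1/\hat\beta_E|$ by analogy with $w_j$. The initial estimates used for the weights must all be nonzero. Function and argument names are hypothetical.

```python
import numpy as np

def adaptive_weights(beta_hat, beta_E_hat, alpha_hat):
    """Adaptive weights from initial (nonzero) estimates."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    w = np.abs(1.0 / beta_hat)                        # w_j
    w_E = abs(1.0 / beta_E_hat)                       # assumed by analogy with w_j
    w_int = np.abs(beta_hat * beta_E_hat / alpha_hat)  # w_jE
    return w, w_E, w_int

def penalized_objective(Y, X, E, beta0, beta, beta_E, gamma,
                        lam_beta, lam_gamma, w, w_E, w_int):
    """Half squared-error loss plus weighted L1 penalties on beta and gamma."""
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    alpha = gamma * beta * beta_E                     # reparametrized interactions
    fitted = beta0 + X @ beta + beta_E * E + (X * E[:, None]) @ alpha
    loss = 0.5 * np.sum((Y - fitted) ** 2)
    pen_main = lam_beta * (np.sum(w * np.abs(beta)) + w_E * abs(beta_E))
    pen_int = lam_gamma * np.sum(w_int * np.abs(gamma))
    return loss + pen_main + pen_int
```

The two tuning parameters `lam_beta` and `lam_gamma` control sparsity of the main effects and of the interactions separately, which is why both need to be chosen, for example by cross validation.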
  35. Results
  36. Simulation Study
  37. TOM based on all subjects. [Figure: (a) $\mathrm{TOM}(X_{\text{all}})$]
  38. TOM based on unexposed subjects. [Figure: (a) $\mathrm{TOM}(X_{E=0})$]
  39. TOM based on exposed subjects. [Figure: (a) $\mathrm{TOM}(X_{E=1})$]
  40. Difference of TOMs. [Figure: (a) $\lvert \mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0}) \rvert$]
  41. Results: Test set MSE
  42. Results: Variable Selection
  43. Open source software
      • Software implementation in R: http://sahirbhatnagar.com/eclust/
      • Allows user specified interaction terms
      • Automatically determines the optimal tuning parameters through cross validation
      • Can also be applied to genetic data
  44. Conclusions
  47. Conclusions and Contributions
      • Large system-wide changes are observed in many environments
      • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
      • R software: http://sahirbhatnagar.com/eclust/
  52. Limitations
      • There must be a high-dimensional signature of the exposure
      • Clustering is unsupervised
      • Two tuning parameters
      • Cautionary note on simulation studies
      • Need more samples . . . Got data?
  53. Acknowledgements
      • Dr. Celia Greenwood
      • Dr. Blanchette and Dr. Yang
      • Dr. Luigi Bouchard, André Anne Houde
      • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
      • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
      • Greg Voisin, Dr. Forgetta, Dr. Klein
      • Mothers and children from the study