Strong Heredity Models in High Dimensional Data

Presentation at IBC 2016 in Victoria, BC

Published in: Science

  1. A Model for Interpretable High Dimensional Interactions. Sahir Rai Bhatnagar. Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood. McGill University. sahirbhatnagar.com
  2. Motivation
  4. one predictor variable at a time [Figure: diagram labelled Predictor Variable and Phenotype, with Test 1 to Test 5]
  7. a network based view [Figure: network diagram labelled Predictor Variable and Phenotype, with Test 1]
  9. system level changes due to environment [Figure: network diagram labelled Predictor Variable, Phenotype and Environment (A, B), with Test 1]
  10. Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
      • Environment: Gestational Diabetes
      • Large Data: Child's epigenome (p ≈ 450k)
      • Phenotype: Obesity measures
  11. Differential Correlation between environments [Figure: (a) Gestational diabetes affected pregnancy, (b) Controls]
  12. Differential Networking
  18. formal statement of initial problem
      • n: number of subjects
      • p: number of predictor variables
      • $X_{n \times p}$: high dimensional data set ($p \gg n$)
      • $Y_{n \times 1}$: phenotype
      • $E_{n \times 1}$: environmental factor that has a widespread effect on X
      Objective
      • Which elements of X that are associated with Y depend on E?
  19. Methods
  25. ECLUST - our proposed method: 3 phases [Figure: Original Data split into E = 0 and E = 1; 1) Gene Similarity; 2) Cluster Representation (n × 1 vectors); 3) Penalized Regression of $Y_{n \times 1}$ on the representations and their products with E]
  26. "the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data." - Sir R. A. Fisher, 1922
  27. Underlying model
      $Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon$ (1)
      $X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E)$ (2)
      • U: unobserved latent variable
      • X: observed data, which is a function of U
      • $\Sigma_E$: environment-sensitive correlation matrix
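
To make the underlying model on slide 27 concrete, here is a minimal simulation sketch. The slides do not specify F, the distribution of E, or the form of $\Sigma_E$, so everything below is an assumption chosen for illustration: F is Gaussian, E is binary, and the noise in X is equicorrelated with correlation `rho0` when E = 0 and `rho1` when E = 1. The names `simulate`, `pearson` and `group_correlation` are hypothetical.

```python
import math
import random


def simulate(n, p, rho0=0.1, rho1=0.8, beta=(0.0, 1.0, 2.0),
             alpha=(0.0, 0.5), seed=1):
    """Draw (Y, X, E) from a toy version of equations (1) and (2).

    Assumed, not taken from the slides: Gaussian F, binary E, and an
    equicorrelated noise term whose correlation depends on E.
    """
    rng = random.Random(seed)
    b0, b1, b2 = beta
    a0, a1 = alpha
    Y, X, E = [], [], []
    for i in range(n):
        e = i % 2                      # alternate unexposed / exposed subjects
        u = rng.gauss(0.0, 1.0)        # unobserved latent variable U
        rho = rho1 if e == 1 else rho0
        shared = rng.gauss(0.0, 1.0)   # shared noise gives correlation rho
        row = [a0 + a1 * u + math.sqrt(rho) * shared
               + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in range(p)]      # equation (2)
        Y.append(b0 + b1 * u + b2 * u * e + rng.gauss(0.0, 1.0))  # equation (1)
        X.append(row)
        E.append(e)
    return Y, X, E


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def group_correlation(X, E, level, j=0, k=1):
    """Correlation between columns j and k of X among subjects with E == level."""
    rows = [x for x, e in zip(X, E) if e == level]
    return pearson([r[j] for r in rows], [r[k] for r in rows])
```

Calling `group_correlation(X, E, 1)` and `group_correlation(X, E, 0)` on a simulated data set shows the kind of differential correlation between environments that the motivation slides describe.
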
  28. Measure of similarity: topological overlap matrix (TOM)
  29. Method to detect gene clusters
      Table 1: Method to detect gene clusters
      General Approach    Formula
      TOM Scoring         $|TOM_{E=1} - TOM_{E=0}|$
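
The TOM scoring in Table 1 (slide 29) can be sketched as follows. The slides do not give the TOM formula, so this sketch assumes the commonly used unsigned definition $TOM_{ij} = (\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$ with connectivity $k_i = \sum_{u \neq i} a_{iu}$, applied to a symmetric adjacency matrix with entries in [0, 1] (for example an absolute correlation raised to a power). The function names are hypothetical, and this is not the eclust implementation.

```python
def tom(adj):
    """Topological overlap matrix of a symmetric adjacency matrix.

    adj is a list of lists with entries in [0, 1]; diagonal entries are ignored.
    """
    n = len(adj)
    # Connectivity k_i: total adjacency of node i to all other nodes.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    out = [[1.0] * n for _ in range(n)]   # TOM_ii is set to 1 by convention
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Adjacency shared through third nodes u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            out[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return out


def tom_difference(adj_e1, adj_e0):
    """Entrywise |TOM(E=1) - TOM(E=0)|, the score listed in Table 1."""
    t1, t0 = tom(adj_e1), tom(adj_e0)
    n = len(t1)
    return [[abs(t1[i][j] - t0[i][j]) for j in range(n)] for i in range(n)]


# Toy adjacencies for three genes: genes 2 and 3 are only connected when E = 1.
adj_unexposed = [[1.0, 0.5, 0.5],
                 [0.5, 1.0, 0.0],
                 [0.5, 0.0, 1.0]]
adj_exposed = [[1.0, 0.5, 0.5],
               [0.5, 1.0, 0.5],
               [0.5, 0.5, 1.0]]
score = tom_difference(adj_exposed, adj_unexposed)
```

Gene pairs with a large score are the ones whose network neighbourhood changes with the exposure, and clustering on this score groups such genes together.
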
  30. Cluster Representation
      Table 2: Methods to create cluster representations
      General Approach    Type
      Unsupervised        average
                          1st principal component
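
The two unsupervised representations in Table 2 (slide 30) each reduce a cluster of columns of X to a single n × 1 vector. A minimal sketch, assuming X is stored as a list of rows and a cluster is a list of column indices; the first principal component is obtained here by power iteration on the sample covariance matrix, which is a choice made for this sketch rather than something the slides state. Function names are hypothetical.

```python
import math


def cluster_average(X, cols):
    """Per-subject mean of the columns in one cluster (n x 1 summary)."""
    return [sum(row[j] for j in cols) / len(cols) for row in X]


def cluster_first_pc(X, cols, iters=200):
    """Scores on the first principal component of the columns in one cluster.

    Uses power iteration started from a constant vector, so it assumes the
    leading eigenvector is not orthogonal to that start. The sign of the
    scores is arbitrary, as with any principal component.
    """
    n, m = len(X), len(cols)
    means = [sum(row[j] for row in X) / n for j in cols]
    # Column-centred data restricted to the cluster.
    Z = [[row[j] - means[c] for c, j in enumerate(cols)] for row in X]
    # Sample covariance matrix (m x m).
    C = [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(m)]
         for a in range(m)]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:      # all columns constant: no direction to find
            break
        v = [x / norm for x in w]
    # Project each centred row onto the leading direction.
    return [sum(z[a] * v[a] for a in range(m)) for z in Z]
```

Either vector can then stand in for the whole cluster in the regression phase, which is how the number of predictors is reduced.
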
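  33. Model
      $g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
      Reparametrization [1]: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$
      Strong heredity principle [2]: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
      [1] Choi et al. 2010, JASA
      [2] Chipman 1996, Canadian Journal of Statistics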
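
The reparametrization on slide 33 enforces strong heredity by construction: because each interaction coefficient is the product $\gamma_{jE} \beta_j \beta_E$, it can only be nonzero when both main effects are nonzero. A small sketch of that arithmetic, with hypothetical function names and made-up coefficient values:

```python
def interaction_coefficients(beta, beta_e, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return [g * b * beta_e for g, b in zip(gamma, beta)]


def linear_predictor(x, e, beta0, beta, beta_e, gamma):
    """g(mu) for one subject: main effects plus X_j * E interactions."""
    alpha = interaction_coefficients(beta, beta_e, gamma)
    main = beta0 + sum(b * xj for b, xj in zip(beta, x)) + beta_e * e
    interactions = sum(a * xj * e for a, xj in zip(alpha, x))
    return main + interactions


# Illustrative values only: the second predictor has no main effect,
# so its interaction vanishes even though gamma is nonzero.
beta = [1.5, 0.0, -2.0]
gamma = [2.0, 3.0, 0.0]
alpha = interaction_coefficients(beta, 0.5, gamma)
```

Setting `beta_e` to zero removes every interaction at once, which is the other half of the strong heredity statement.
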
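  34. Strong Heredity Model with Penalization
      $\arg\min_{\beta_0, \beta, \gamma} \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda_\beta (w_1 \beta_1 + \cdots + w_q \beta_q + w_E \beta_E) + \lambda_\gamma (w_{1E} \gamma_{1E} + \cdots + w_{qE} \gamma_{qE})$
      $w_j = \frac{1}{\hat{\beta}_j}, \quad w_{jE} = \frac{\hat{\beta}_j \hat{\beta}_E}{\hat{\alpha}_{jE}}$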
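
As a rough illustration of the objective on slide 34, the sketch below evaluates it for given coefficients. Several things are assumed here rather than taken from the slide: the link g is the identity, the penalty terms are applied to absolute values of the coefficients (as in a weighted lasso), and the weights are passed in by the caller instead of being computed from initial estimates. The function only evaluates the objective; it does not minimize it, and its name and signature are hypothetical.

```python
def penalized_objective(Y, X, E, beta0, beta, beta_e, gamma,
                        lam_beta, lam_gamma, w, w_e, w_int):
    """Squared-error loss plus weighted penalties on beta and gamma.

    Y: responses, X: list of rows of the q predictors, E: exposure values.
    w, w_e, w_int: weights for beta_j, beta_E and gamma_jE respectively.
    """
    loss = 0.0
    for y, x, e in zip(Y, X, E):
        # Main effects.
        mu = beta0 + sum(b * xj for b, xj in zip(beta, x)) + beta_e * e
        # Interactions with alpha_jE = gamma_jE * beta_j * beta_E.
        mu += sum(g * b * beta_e * xj * e
                  for g, b, xj in zip(gamma, beta, x))
        loss += (y - mu) ** 2
    # Assumed absolute-value (L1) penalties, one tuning parameter each.
    pen_beta = sum(wj * abs(b) for wj, b in zip(w, beta)) + w_e * abs(beta_e)
    pen_gamma = sum(wj * abs(g) for wj, g in zip(w_int, gamma))
    return 0.5 * loss + lam_beta * pen_beta + lam_gamma * pen_gamma
```

Keeping `lam_beta` and `lam_gamma` as separate arguments mirrors the two tuning parameters that the slides later list as a limitation.
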
  35. Results
  36. Simulation Study
  37. TOM based on all subjects [Figure: (a) $TOM(X_{all})$]
  38. TOM based on unexposed subjects [Figure: (a) $TOM(X_{E=0})$]
  39. TOM based on exposed subjects [Figure: (a) $TOM(X_{E=1})$]
  40. Difference of TOMs [Figure: (a) $|TOM(X_{E=1}) - TOM(X_{E=0})|$]
  41. Results: Test set MSE
  42. Results: Variable Selection
  43. Open source software
      • Software implementation in R: http://sahirbhatnagar.com/eclust/
      • Allows user-specified interaction terms
      • Automatically determines the optimal tuning parameters through cross-validation
      • Can also be applied to genetic data
  44. Conclusions
  47. Conclusions and Contributions
      • Large system-wide changes are observed in many environments
      • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
      • R software: http://sahirbhatnagar.com/eclust/
  52. Limitations
      • There must be a high-dimensional signature of the exposure
      • Clustering is unsupervised
      • Two tuning parameters
      • Cautionary note on simulation studies
      • Need more samples . . . Got data?
  53. acknowledgements
      • Dr. Celia Greenwood
      • Dr. Blanchette and Dr. Yang
      • Dr. Luigi Bouchard, André Anne Houde
      • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
      • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
      • Greg Voisin, Dr. Forgetta, Dr. Klein
      • Mothers and children from the study
