Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
A Model for Interpretable High Dimensional
Interactions
Sahir Rai Bhatnagar
Joint work with Yi Yang, Mathieu Blanchette an...
Motivation
one predictor variable at a time
Predictor Variable Phenotype
one predictor variable at a time
Predictor Variable Phenotype
Test 1
Test 2
Test 3
Test 4
Test 5
1
a network based view
Predictor Variable Phenotype
a network based view
Predictor Variable Phenotype
a network based view
Predictor Variable Phenotype
Test 1
2
system level changes due to environment
Predictor Variable PhenotypeEnvironment
A
B
system level changes due to environment
Predictor Variable PhenotypeEnvironment
A
B
Test 1
3
Motivating Dataset: Newborn epigenetic adaptations to gesta-
tional diabetes exposure (Luigi Bouchard, Sherbrooke)
Environ...
Differential Correlation between environments
(a) Gestational diabetes affected pregnancy (b) Controls
5
formal statement of initial problem
• n: number of subjects
6
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
6
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional da...
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional da...
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional da...
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional da...
Methods
ECLUST - our proposed method: 3 phases
Original Data
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1
ECLUST - our proposed method: 3 phases
Original Data
E = 0
1) Gene Similarity
E = 1
2) Cluster
Representation
n × 1 n × 1
...
the objective of statistical
methods is the reduction of data.
A quantity of data . . . is to be
replaced by relatively fe...
Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main effects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
1Choi et al. 2010, JA...
Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main effects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
Reparametrization1
: ...
Model
g(µ) =β0 + β1X1 + · · · + βpXp + βE E
main effects
+ α1E (X1E) + · · · + αpE (XpE)
interactions
Reparametrization1
: ...
Strong Heredity Model with Penalization
arg min
β0,β,γ
1
2
Y − g(µ)
2
+
λβ (w1β1 + · · · + wqβq + wE βE ) +
λγ (w1E γ1E + ...
Results
Simulation Study: Jaccard Index and test set MSE
10
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction ...
Conclusions
Conclusions and Contributions
• Large system-wide changes are observed in many environments
12
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly...
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly...
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly...
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly...
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly...
Limitations
• There must be a high-dimensional signature of the exposure
13
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
13
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning paramet...
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning paramet...
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, Andr´e Anne
Houde
• Dr. Steele,...
Upcoming SlideShare
Loading in …5
×

A model for interpretable high dimensional interactions

153 views

Published on

Introduction: Since the introduction of the LASSO, computational approaches to variable selection have been rigorously developed in the statistical literature. The need for such methods has become increasingly important with the advent of high-throughput technologies in genomics and brain imaging studies where it is believed that the number of truly important variables is small relative to the total number of variables. While the focus of these methods has been on additive models, there are several applications where interaction models can reflect biological phenomena and improve statistical power. For example, genome wide association studies (GWAS) have been unable to explain a large proportion of heritability (the variance in phenotype attributable to genetic variants) and it has been suggested that this missing heritability may in part be due to gene-environment interactions. Furthermore, diseases are now thought to be the result of entire biological networks whose states are affected by environmental factors. These systemic changes can induce or eliminate strong correlations between elements in a network without necessarily affecting their mean levels.
Methods: Therefore, we propose a multivariate penalization procedure for detecting interactions between high dimensional data ($p >> n$) and an environmental factor, where the effect of this environmental factor on the high dimensional data is widespread and plays a role in predicting the response. Our approach improves on existing procedures for detecting such interactions in several ways; 1) it simultaneously performs model selection and estimation 2) it automatically enforces the strong heredity property, i.e., an interaction term can only be included in the model if the corresponding main effects are in the model 3) it reduces the dimensionality of the problem and leverages the high correlations by transforming the input feature space using network connectivity measures and 4) it leads to interpretable models which are biologically meaningful.
Results: An extensive simulation study shows that our method outperforms LASSO, Elastic Net and Group LASSO in terms of both prediction accuracy and feature selection. We apply our methods to the NIH pediatric brain development study to refine estimates of which regions of the frontal cortex are associated with intelligence scores, and a sample of mother-child pairs from a prospective birth cohort to identify epigenetic marks observed at birth that help predict childhood obesity. Our method is implemented in an \texttt{R} package- http://sahirbhatnagar.com/eclust/

Published in: Science
  • Be the first to comment

  • Be the first to like this

A model for interpretable high dimensional interactions

  1. 1. A Model for Interpretable High Dimensional Interactions Sahir Rai Bhatnagar Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood Poster Number 67
  2. 2. Motivation
  3. 3. one predictor variable at a time Predictor Variable Phenotype
  4. 4. one predictor variable at a time Predictor Variable Phenotype Test 1 Test 2 Test 3 Test 4 Test 5 1
  5. 5. a network based view Predictor Variable Phenotype
  6. 6. a network based view Predictor Variable Phenotype
  7. 7. a network based view Predictor Variable Phenotype Test 1 2
  8. 8. system level changes due to environment Predictor Variable PhenotypeEnvironment A B
  9. 9. system level changes due to environment Predictor Variable PhenotypeEnvironment A B Test 1 3
  10. 10. Motivating Dataset: Newborn epigenetic adaptations to gesta- tional diabetes exposure (Luigi Bouchard, Sherbrooke) Environment Gestational Diabetes Large Data Child’s epigenome (p ≈ 450k) Phenotype Obesity measures 4
  11. 11. Differential Correlation between environments (a) Gestational diabetes affected pregnancy (b) Controls 5
  12. 12. formal statement of initial problem • n: number of subjects 6
  13. 13. formal statement of initial problem • n: number of subjects • p: number of predictor variables 6
  14. 14. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) 6
  15. 15. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype 6
  16. 16. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread effect on X and can modify the relation between X and Y 6
  17. 17. formal statement of initial problem • n: number of subjects • p: number of predictor variables • Xn×p: high dimensional data set (p >> n) • Yn×1: phenotype • En×1: environmental factor that has widespread effect on X and can modify the relation between X and Y Objective • Which elements of X that are associated with Y , depend on E? 6
  18. 18. Methods
  19. 19. ECLUST - our proposed method: 3 phases Original Data
  20. 20. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
  21. 21. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1
  22. 22. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation
  23. 23. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1
  24. 24. ECLUST - our proposed method: 3 phases Original Data E = 0 1) Gene Similarity E = 1 2) Cluster Representation n × 1 n × 1 3) Penalized Regression Yn×1∼ + ×E 7
  25. 25. the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data. - Sir R. A. Fisher, 1922 7
  26. 26. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main effects + α1E (X1E) + · · · + αpE (XpE) interactions 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
  27. 27. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main effects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
  28. 28. Model g(µ) =β0 + β1X1 + · · · + βpXp + βE E main effects + α1E (X1E) + · · · + αpE (XpE) interactions Reparametrization1 : αjE = γjE βj βE . Strong heredity principle2 : ˆαjE = 0 ⇒ ˆβj = 0 and ˆβE = 0 1Choi et al. 2010, JASA 2Chipman 1996, Canadian Journal of Statistics 8
  29. 29. Strong Heredity Model with Penalization arg min β0,β,γ 1 2 Y − g(µ) 2 + λβ (w1β1 + · · · + wqβq + wE βE ) + λγ (w1E γ1E + · · · + wqE γqE ) wj = 1 ˆβj , wjE = ˆβj ˆβE ˆαjE 9
  30. 30. Results
  31. 31. Simulation Study: Jaccard Index and test set MSE 10
  32. 32. Open source software • Software implementation in R: http://sahirbhatnagar.com/eclust/ • Allows user specified interaction terms • Automatically determines the optimal tuning parameters through cross validation • Can also be applied to genetic data 11
  33. 33. Conclusions
  34. 34. Conclusions and Contributions • Large system-wide changes are observed in many environments 12
  35. 35. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data 12
  36. 36. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. 12
  37. 37. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations 12
  38. 38. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model 12
  39. 39. Conclusions and Contributions • Large system-wide changes are observed in many environments • This assumption can possibly be exploited to aid analysis of large data • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor. • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations • Also, we develop and implement a strong heredity framework within the penalized model • R software: http://sahirbhatnagar.com/eclust/ 12
  40. 40. Limitations • There must be a high-dimensional signature of the exposure 13
  41. 41. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised 13
  42. 42. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised • Two tuning parameters 13
  43. 43. Limitations • There must be a high-dimensional signature of the exposure • Clustering is unsupervised • Two tuning parameters • Need more samples . . . Got data? (Poster 67) 13
  44. 44. acknowledgements • Dr. Celia Greenwood • Dr. Blanchette and Dr. Yang • Dr. Luigi Bouchard, Andr´e Anne Houde • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest • Greg Voisin, Dr. Forgetta, Dr. Klein • Mothers and children from the study 14

×