
An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures


Presentation at IGES 2016 in Toronto. Full paper available at https://doi.org/10.1101/102475

Published in: Science

  1. 1. An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures. Sahir Rai Bhatnagar, PhD Candidate. Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood. Biostatistics, McGill University. Preprint available at sahirbhatnagar.com
  2. 2. Simulated Data ≠ Real Data
  3. 3. Simple Rule 11: Simulated Data ≠ Real Data
  4. 4. Motivation
  5. 5. One predictor variable at a time (diagram labels: Predictor Variable, Phenotype)
  6. 6. One predictor variable at a time (diagram labels: Predictor Variable, Phenotype; Test 1, Test 2, Test 3, Test 4, Test 5)
  7. 7. A network based view (diagram labels: Predictor Variable, Phenotype)
  8. 8. A network based view (diagram labels: Predictor Variable, Phenotype)
  9. 9. A network based view (diagram labels: Predictor Variable, Phenotype; Test 1)
  10. 10. System level changes due to environment (diagram labels: Predictor Variable, Phenotype, Environment, A, B)
  11. 11. System level changes due to environment (diagram labels: Predictor Variable, Phenotype, Environment, A, B; Test 1)
  12. 12. Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, USherbrooke). Environment: Gestational Diabetes. Large Data: Child’s epigenome (p ≈ 450k). Phenotype: Obesity measures
  13. 13. Differential Correlation between environments: (a) Gestational diabetes affected pregnancy (b) Controls
  14. 14. NIH MRI brain study. Environment: Age. Large Data: Cortical Thickness (p ≈ 80k). Phenotype: Intelligence
  15. 15. Goals of this study. Objectives: (i) Determine whether clustering that incorporates known covariate or exposure information can improve prediction models
  16. 16. Goals of this study. Objectives: (i) Determine whether clustering that incorporates known covariate or exposure information can improve prediction models (ii) Determine whether the resulting clusters provide an easier route to interpretation
  17. 17. Methods
  18. 18. ECLUST - our proposed method: 2 steps (diagram labels: Original Data)
  19. 19. ECLUST - our proposed method: 2 steps (diagram labels: Original Data, E = 0, E = 1; 1a) Gene Similarity)
  20. 20. ECLUST - our proposed method: 2 steps (diagram labels: Original Data, E = 0, E = 1; 1a) Gene Similarity)
  21. 21. ECLUST - our proposed method: 2 steps (diagram labels: Original Data, E = 0, E = 1; 1a) Gene Similarity; 1b) Cluster Representation)
  22. 22. ECLUST - our proposed method: 2 steps (diagram labels: Original Data, E = 0, E = 1; 1a) Gene Similarity; 1b) Cluster Representation, n × 1, n × 1)
  23. 23. ECLUST - our proposed method: 2 steps (diagram labels: Original Data, E = 0, E = 1; 1a) Gene Similarity; 1b) Cluster Representation, n × 1, n × 1; 2) Penalized Regression: Y (n × 1) ∼ [cluster representations] + [cluster representations] × E, where the bracketed terms stand for graphics on the slide)
  24. 24. "the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data." - Sir R. A. Fisher, 1922
  25. 25. Step 1a: Method to detect gene clusters. (i) Hierarchical clustering (average linkage) with TOM [1] scoring dissimilarity [2]: |TOM_{E=1} − TOM_{E=0}| (ii) Number of clusters chosen using the dynamicTreeCut algorithm [3]. Footnotes: [1] Ravasz et al., Science (2002); [2] Klein Oros et al., Frontiers in Genetics (2016); [3] Langfelder and Zhang, Bioinformatics (2008)
  26. 26. Step 1b: Cluster Representation. (i) Average [4] (ii) 1st Principal Component [5]. Footnotes: [4] Hastie et al., Genome Biology (2001), Park et al., Biostatistics (2007); [5] Kendall, A Course in Multivariate Analysis (1957)
  27. 27. Step 2: Variable Selection. (i) Linear effects: Lasso, Elastic Net [6] (ii) Non-linear effects: MARS [7]. Footnotes: [6] Tibshirani, JRSSB (1996), Zou and Hastie, JRSSB (2005); [7] Friedman, Annals of Statistics (1991)
  28. 28. Simulation Study
  29. 29. Simulated TOM by Exposure Status: (a) TOM(X_{E=1}) (b) TOM(X_{E=0})
  30. 30. Difference of TOMs: (a) |TOM(X_{E=1}) − TOM(X_{E=0})|
  31. 31. TOM based on all subjects: (a) TOM(X_{all})
  32. 32. Real Data Analysis
  33. 33. Gestational Diabetes: Prediction Performance
  34. 34. Gestational Diabetes: Interpretation of Clusters with IPA • Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity
  35. 35. Gestational Diabetes: Interpretation of Clusters with IPA • Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity • Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids
  36. 36. Gestational Diabetes: Interpretation of Clusters with IPA • Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity • Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids • Physiological System Development and Function: (i) Behavior and neurodevelopment – associated with obesity (ii) Embryonic and organ development – GD associated with macrosomia
  37. 37. NIHPD: Age
  38. 38. NIHPD: Income
  39. 39. Final Remarks
  40. 40. Discussion and Contributions • Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression)
  41. 41. Discussion and Contributions • Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression) • Environment dependent clustering can improve prediction performance in high dimensional settings (n << p)
  42. 42. Discussion and Contributions • Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression) • Environment dependent clustering can improve prediction performance in high dimensional settings (n << p) • Clusters can be interpreted but require much more expert knowledge
  43. 43. Discussion and Contributions • Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression) • Environment dependent clustering can improve prediction performance in high dimensional settings (n << p) • Clusters can be interpreted but require much more expert knowledge • Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k)
  44. 44. Discussion and Contributions • Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression) • Environment dependent clustering can improve prediction performance in high dimensional settings (n << p) • Clusters can be interpreted but require much more expert knowledge • Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k) • Software implementation in R: sahirbhatnagar.com
  45. 45. Limitations • There must be a high-dimensional signature of the exposure
  46. 46. Limitations • There must be a high-dimensional signature of the exposure • Covariance estimation
  47. 47. Limitations • There must be a high-dimensional signature of the exposure • Covariance estimation • Currently limited to binary environment
  48. 48. Limitations • There must be a high-dimensional signature of the exposure • Covariance estimation • Currently limited to binary environment • Interpretation can be difficult
  49. 49. Acknowledgements • Dr. Celia Greenwood • Dr. Blanchette and Dr. Yang • Dr. Luigi Bouchard, André Anne Houde • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest • Greg Voisin, Dr. Forgetta, Dr. Klein • Mothers and children from the study
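
Step 1a (slide 25) can be sketched in code. What follows is a minimal pure-Python illustration, not the authors' R implementation. It builds one TOM per exposure group, takes the absolute difference, and runs average-linkage clustering on that matrix, reading the slide literally. The unsigned adjacency, the exponent `power=6`, every function name, and the fixed `n_clusters` cut (a simple stand-in for dynamicTreeCut) are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def adjacency(features, power=6):
    """Unsigned soft-threshold adjacency |cor|^power with a zero diagonal.

    `features` holds one list of sample values per gene. The exponent is an
    assumed tuning value, not one taken from the slides.
    """
    p = len(features)
    adj = [[0.0] * p for _ in range(p)]
    for i, j in combinations(range(p), 2):
        adj[i][j] = adj[j][i] = abs(pearson(features[i], features[j])) ** power
    return adj


def tom(adj):
    """Topological overlap: (shared + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    p = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    t = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for i, j in combinations(range(p), 2):
        shared = sum(adj[i][u] * adj[u][j] for u in range(p))
        t[i][j] = t[j][i] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return t


def abs_difference(t1, t0):
    """Element-wise |TOM_{E=1} - TOM_{E=0}|."""
    p = len(t1)
    return [[abs(t1[i][j] - t0[i][j]) for j in range(p)] for i in range(p)]


def average_linkage(dissim, n_clusters):
    """Naive agglomerative clustering with average linkage.

    Merging stops at a fixed number of clusters; this replaces the
    dynamicTreeCut step named on the slide with a simpler assumption.
    """
    clusters = [[i] for i in range(len(dissim))]
    while len(clusters) > n_clusters:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            total = sum(dissim[i][j] for i in clusters[x] for j in clusters[y])
            dist = total / (len(clusters[x]) * len(clusters[y]))
            if best is None or dist < best[0]:
                best = (dist, x, y)
        _, x, y = best
        clusters[x] += clusters.pop(y)  # y > x, so index x stays valid
    return clusters


# Toy data: 4 genes measured on 6 subjects in each exposure group.
exposed = [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [6, 5, 4, 3, 2, 1], [1, 3, 2, 5, 4, 6]]
unexposed = [[1, 2, 3, 4, 5, 6], [3, 1, 2, 6, 4, 5], [2, 6, 1, 5, 3, 4], [5, 1, 6, 2, 4, 3]]
diff = abs_difference(tom(adjacency(exposed)), tom(adjacency(unexposed)))
print(average_linkage(diff, n_clusters=2))
```

The quadratic loops are only workable for toy sizes. At the p ≈ 10k scale mentioned on the slides, a vectorised implementation would be needed.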
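
Step 1b (slide 26) reduces each cluster to a single n × 1 summary, either the average or the 1st principal component. The sketch below shows both in pure Python. The function names are hypothetical, and the power iteration used to find the leading component is an assumed convenience, not necessarily how the authors' R software computes it.

```python
from math import sqrt


def cluster_average(rows, cluster):
    """Per-subject mean of the features in `cluster`.

    `rows` holds one list of feature values per subject; `cluster` is a list
    of column indices belonging to one cluster.
    """
    return [sum(r[j] for j in cluster) / len(cluster) for r in rows]


def cluster_first_pc(rows, cluster, n_iter=200):
    """Per-subject score on the first principal component of the cluster.

    The leading eigenvector of the sample covariance matrix is found by
    power iteration. Its sign is arbitrary, as for any principal component.
    """
    n, q = len(rows), len(cluster)
    means = [sum(r[j] for r in rows) / n for j in cluster]
    centered = [[r[j] - m for j, m in zip(cluster, means)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(q)]
           for a in range(q)]
    # Asymmetric start vector, so it is unlikely to be orthogonal to the answer.
    v = [1.0 / (a + 1) for a in range(q)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(q)) for a in range(q)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate case: no variance along v
            break
        v = [x / norm for x in w]
    return [sum(c[a] * v[a] for a in range(q)) for c in centered]


# Toy data: 4 subjects, 3 features; features 0 and 1 form one cluster.
rows = [[1.0, 2.0, 9.0], [2.0, 4.0, 7.0], [3.0, 6.0, 8.0], [4.0, 8.0, 6.0]]
print(cluster_average(rows, [0, 1]))
print(cluster_first_pc(rows, [0, 1]))
```

Either summary yields one value per subject for each cluster, which is the n × 1 vector shown in the method diagram.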
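
Step 2 (slide 27) regresses the phenotype on the cluster summaries and their interactions with the exposure E, using a penalized method. The following sketch covers only the lasso case, under stated assumptions: the design layout (summaries, then E, then summaries × E), the function names, and the plain coordinate-descent solver are illustrative choices, not the authors' implementation. Elastic Net and MARS are not shown, and a real analysis would normally use an established solver with cross-validated tuning.

```python
def interaction_design(cluster_reps, exposure):
    """Design rows: cluster summaries, then E, then each summary times E."""
    return [list(r) + [e] + [v * e for v in r]
            for r, e in zip(cluster_reps, exposure)]


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso(X, y, lam, n_sweeps=200):
    """Minimise (1/2n)||y - b0 - Xb||^2 + lam * ||b||_1 by coordinate descent.

    Columns are centred internally so the intercept is left unpenalised.
    Returns (intercept, coefficients) on the original column scale.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(r[j] for r in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[r[j] - col_means[j] for j in range(p)] for r in X]
    resid = [v - y_mean for v in y]
    col_ss = [sum(r[j] ** 2 for r in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:  # constant column carries no information
                continue
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta


# Toy data: one cluster summary, binary exposure, effect only through the interaction.
reps = [[-1.0], [1.0], [-2.0], [2.0], [-1.0], [1.0], [-2.0], [2.0]]
E = [0, 0, 0, 0, 1, 1, 1, 1]
X = interaction_design(reps, E)
y = [1.0 + 3.0 * row[2] for row in X]
print(lasso(X, y, lam=0.5))
```

Placing the interaction columns next to the main effects lets the penalty decide, cluster by cluster, whether the exposure modifies that cluster's association with the phenotype.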
