
- 1. Methods for High Dimensional Interactions Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood Ludmer Center – May 19, 2016
- 2. Underlying objective of this talk
- 3. Motivation
- 5. one predictor variable at a time [Diagram: each Predictor Variable is linked to the Phenotype by its own test, Test 1 through Test 5]
- 8. a network based view [Diagram: the Predictor Variables, viewed as a network, are linked to the Phenotype by a single test, Test 1]
- 10. system level changes due to environment [Diagram: Predictor Variable networks under Environment A and Environment B, linked to the Phenotype by a single test, Test 1]
- 11. Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
  - Environment: Gestational Diabetes
  - Large Data: Child’s epigenome (p ≈ 450k)
  - Phenotype: Obesity measures
- 12. Differential Correlation between environments [Figure panels: (a) Gestational diabetes affected pregnancy (b) Controls]
- 13. Gene Expression: COPD patients [Figure panels: (a) Gene Exp.: Never Smokers (b) Gene Exp.: Current Smokers (c) Correlations: Never Smokers (d) Correlations: Current Smokers]
- 14. Imaging Data: Topological properties and Age
- 15. Correlations differ between Age groups
- 16. NIH MRI brain study
  - Environment: Age
  - Large Data: Cortical Thickness (p ≈ 80k)
  - Phenotype: Intelligence
- 17. Differential Networking
- 23. formal statement of initial problem
  - $n$: number of subjects
  - $p$: number of predictor variables
  - $X_{n \times p}$: high dimensional data set (p >> n)
  - $Y_{n \times 1}$: phenotype
  - $E_{n \times 1}$: environmental factor that has widespread effect on $X$ and can modify the relation between $X$ and $Y$
  - Objective: Which elements of $X$ that are associated with $Y$ depend on $E$?
- 30. conceptual model [Diagram: Environment (Maternal care, Age, Diet) with levels E = 0 and E = 1; Large Data (p >> n): Gene Expression, DNA Methylation, Brain Imaging, shown for each level of E; Phenotype (Behavioral development, IQ scores, Death). Successive builds label the links "epidemiological study" and "(epi)genetic/imaging associations".]
- 34. Is this mediation analysis?
  - No
  - We are not making any causal claims, i.e. claims about the direction of the arrows
  - Such analysis requires many untestable assumptions, which are not well understood for high-dimensional (HD) data
- 35. Methods
- 38. analysis strategies
  - Single-Marker or Single Variable Tests
    - marginal correlations (univariate p-value)
    - multiple testing adjustment
  - Multivariate Regression Approaches Including Penalization Methods
    - LASSO (convex penalty with one tuning parameter)
    - MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters)
    - Group level penalization (group LASSO, SCAD and MCP)
  - Clustering Together with Regression
    - cluster features based on Euclidean distance, correlation, connectivity
    - regression with group level summary (PCA, average)
- 44. ECLUST - our proposed method: 3 phases [Diagram: Original Data, split by E = 0 and E = 1 → 1) Gene Similarity → 2) Cluster Representation (each cluster summarized as an n × 1 vector) → 3) Penalized Regression of $Y_{n \times 1}$ on the cluster representations and their interactions with E]
- 45. "the objective of statistical methods is the reduction of data. A quantity of data ... is to be replaced by relatively few quantities which shall adequately represent ... the relevant information contained in the original data." (Sir R. A. Fisher, 1922)
- 48. Underlying model
  $$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon \qquad (1)$$
  $$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E) \qquad (2)$$
  - $U$: unobserved latent variable
  - $X$: observed data which is a function of $U$
  - $\Sigma_E$: environment sensitive correlation matrix
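- 49. ECLUST - our proposed method: 3 phases [Diagram: Original Data, split by E = 0 and E = 1 → 1) Gene Similarity → 2) Cluster Representation (each cluster summarized as an n × 1 vector) → 3) Penalized Regression of $Y_{n \times 1}$ on the cluster representations and their interactions with E]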
- 50. advantages and disadvantages

  | General Approach | Advantages | Disadvantages |
  |---|---|---|
  | Single-Marker | simple, easy to implement | multiple testing burden, power, interpretability |
  | Penalization | multivariate, variable selection, sparsity, efficient optimization algorithms | poor sensitivity with correlated data, ignores structure in design matrix, interpretability |
  | Environment Cluster with Regression | multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability | difficult to identify relevant clusters, clustering is unsupervised |
- 51. Methods to detect gene clusters (Table 1: Methods to detect gene clusters)

  | General Approach | Formula |
  |---|---|
  | Correlation | Pearson, Spearman, biweight midcorrelation |
  | Correlation Scoring | $\lvert \rho_{E=1} - \rho_{E=0} \rvert$ |
  | Weighted Correlation Scoring | $c \, \lvert \rho_{E=1} - \rho_{E=0} \rvert$ |
  | Fisher’s Z Transformation | $\dfrac{\lvert z_{ij0} - z_{ij1} \rvert}{\sqrt{1/(n_0 - 3) + 1/(n_1 - 3)}}$ |
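
The correlation scoring and Fisher's Z rows of Table 1 can be sketched in a few lines of plain Python. This is only an illustration, not the code of the R package: the function names are hypothetical, Pearson correlation is assumed as the underlying measure, and $z_{ij0}$, $z_{ij1}$ are taken to be the Fisher-transformed (atanh) correlations between features i and j among the $n_0$ unexposed and $n_1$ exposed subjects.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_score(xi0, xj0, xi1, xj1):
    """|rho_{E=1} - rho_{E=0}| for one pair of features (i, j)."""
    return abs(pearson(xi1, xj1) - pearson(xi0, xj0))


def fisher_z_score(xi0, xj0, xi1, xj1):
    """|z_ij0 - z_ij1| / sqrt(1/(n0 - 3) + 1/(n1 - 3)).

    Assumes |correlation| < 1 and more than 3 subjects in each group.
    """
    z0 = math.atanh(pearson(xi0, xj0))  # Fisher transform, E = 0
    z1 = math.atanh(pearson(xi1, xj1))  # Fisher transform, E = 1
    n0, n1 = len(xi0), len(xi1)
    return abs(z0 - z1) / math.sqrt(1 / (n0 - 3) + 1 / (n1 - 3))


# Toy data: features i and j are positively correlated among the
# unexposed (E = 0) and negatively correlated among the exposed (E = 1).
xi = [1, 2, 3, 4, 5, 6]
xj_unexposed = [2, 1, 4, 3, 6, 5]
xj_exposed = [5, 6, 3, 4, 1, 2]
print(correlation_score(xi, xj_unexposed, xi, xj_exposed))
print(fisher_z_score(xi, xj_unexposed, xi, xj_exposed))
```

Computing such a score for every pair of features gives a p × p matrix of environment-dependent similarities, which is the kind of input a clustering step would take.
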
- 52. Cluster Representation (Table 2: Methods to create cluster representations)

  | General Approach | Type |
  |---|---|
  | Unsupervised | average; K principal components |
  | Supervised | partial least squares |
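
As a rough sketch of the simplest entry in Table 2, the "average" representation replaces each cluster by the per-subject mean of its member features, giving one n × 1 vector per cluster. The function name and the data layout (a list of subject rows plus a mapping from a cluster label to column indices) are assumptions made for this example.

```python
def cluster_average(rows, clusters):
    """Summarize each cluster by the per-subject average of its features.

    rows:     n subjects, each a list of p feature values
    clusters: dict mapping a cluster label to the column indices it contains
    returns:  dict mapping each label to a list of n averages
    """
    return {
        label: [sum(row[j] for j in idx) / len(idx) for row in rows]
        for label, idx in clusters.items()
    }


# Two subjects, three features; features 0 and 1 form cluster "A".
rows = [[1.0, 3.0, 10.0],
        [2.0, 4.0, 20.0]]
reps = cluster_average(rows, {"A": [0, 1], "B": [2]})
print(reps)
```

The resulting vectors, together with their products with E, would then serve as the predictors in the penalized regression phase.
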
- 53. Simulation Studies
- 54. Simulation Study 1 [Figure panels: (a) $\mathrm{Corr}(X_{E=0})$ (b) $\mathrm{Corr}(X_{E=1})$ (c) $\lvert \mathrm{Corr}(X_{E=1}) - \mathrm{Corr}(X_{E=0}) \rvert$ (d) $\mathrm{Corr}(X_{\text{all}})$]
- 55. Results: Jaccard Index and test set MSE
- 56. Simulation Study 2
- 57. TOM based on all subjects [Figure: $\mathrm{TOM}(X_{\text{all}})$]
- 58. TOM based on unexposed subjects [Figure: $\mathrm{TOM}(X_{E=0})$]
- 59. TOM based on exposed subjects [Figure: $\mathrm{TOM}(X_{E=1})$]
- 60. Difference of TOMs [Figure: $\lvert \mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0}) \rvert$]
- 61. Results: Test set MSE
- 62. Strong Heredity Models
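- 63. Model
  $$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$
  - $g(\cdot)$ is a known link function
  - $\mu = \mathrm{E}[Y \mid X, E, \beta, \alpha]$
  - $\beta = (\beta_1, \beta_2, \ldots, \beta_p, \beta_E) \in \mathbb{R}^{p+1}$
  - $\alpha = (\alpha_{1E}, \ldots, \alpha_{pE}) \in \mathbb{R}^{p}$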
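- 64. Variable Selection
  $$\arg\min_{\beta_0, \beta, \alpha} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda \left( \lVert \beta \rVert_1 + \lVert \alpha \rVert_1 \right)$$
  - $\lVert Y - g(\mu) \rVert^2 = \sum_i (y_i - g(\mu_i))^2$
  - $\lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert$
  - $\lVert \alpha \rVert_1 = \sum_j \lvert \alpha_j \rvert$
  - $\lambda \geq 0$: tuning parameter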
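
To make the penalized objective concrete, here is a minimal sketch that fits the main-effects-plus-interactions model with an L1 penalty by cyclic coordinate descent. Several things are assumptions for the sake of the example: the identity link, centered data with no intercept, the helper names, and coordinate descent itself, which is a standard lasso solver and not necessarily what the authors' software uses.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def interaction_design(x_cols, e):
    """Columns [X_1..X_p, E, X_1*E..X_p*E] of the interaction model."""
    inter = [[x * ei for x, ei in zip(col, e)] for col in x_cols]
    return [list(c) for c in x_cols] + [list(e)] + inter


def lasso_objective(cols, y, coef, lam):
    """0.5 * residual sum of squares + lam * L1 norm of the coefficients."""
    fitted = [sum(c[i] * b for c, b in zip(cols, coef)) for i in range(len(y))]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 0.5 * rss + lam * sum(abs(b) for b in coef)


def lasso_cd(cols, y, lam, n_iter=200):
    """Cyclic coordinate descent for the objective above (no intercept)."""
    coef = [0.0] * len(cols)
    resid = list(y)  # residuals for the current coefficients
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            norm2 = sum(c * c for c in col)
            if norm2 == 0.0:
                continue
            # Inner product of column j with the partial residual
            # (residual with column j's own contribution added back).
            rho = sum(c * (r + c * coef[j]) for c, r in zip(col, resid))
            new = soft_threshold(rho, lam) / norm2
            delta = new - coef[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                coef[j] = new
    return coef


# Toy example: one feature, an environment coded +1/-1, and y = 2*X + X*E.
cols = interaction_design([[1, -1, 1, -1]], [1, 1, -1, -1])
y = [3, -3, 1, -1]
print(lasso_cd(cols, y, lam=2.0))  # coefficients ordered as [X, E, X*E]
```

In this toy design the three columns are orthogonal, so each coefficient is simply the soft-thresholded inner product with y divided by the squared column norm, and the E column, which carries no signal, is set to zero.
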
- 67. Why Strong Heredity?
  - Statistical Power: large main effects are more likely to lead to detectable interactions than small ones
  - Interpretability: Assuming a model with interaction only is generally not biologically plausible
  - Practical Sparsity: $X_1, E, X_1 \cdot E$ vs. $X_1, E, X_2 \cdot E$
- 70. Model
  $$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$
  - Reparametrization [1]: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$
  - Strong heredity principle [2]: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
  - [1] Choi et al. 2010, JASA
  - [2] Chipman 1996, Canadian Journal of Statistics
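- 71. Strong Heredity Model with Penalization
  $$\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda_\beta \left( w_1 \lvert \beta_1 \rvert + \cdots + w_q \lvert \beta_q \rvert + w_E \lvert \beta_E \rvert \right) + \lambda_\gamma \left( w_{1E} \lvert \gamma_{1E} \rvert + \cdots + w_{qE} \lvert \gamma_{qE} \rvert \right)$$
  $$w_j = \left\lvert \frac{1}{\hat{\beta}_j} \right\rvert, \qquad w_{jE} = \left\lvert \frac{\hat{\beta}_j \hat{\beta}_E}{\hat{\alpha}_{jE}} \right\rvert$$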
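
The effect of the reparametrization $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$ can be checked with a few lines of Python: because each interaction coefficient is a product, it can be nonzero only when both of its main effects are nonzero. The function names below are hypothetical and the snippet illustrates only this product structure, not the fitting algorithm.

```python
def interaction_coefficients(beta, beta_e, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for each feature j."""
    return [g * b * beta_e for g, b in zip(gamma, beta)]


def satisfies_strong_heredity(beta, beta_e, alpha):
    """True if every nonzero interaction has both main effects nonzero."""
    return all(a == 0 or (b != 0 and beta_e != 0) for a, b in zip(alpha, beta))


beta = [1.5, 0.0, -2.0]   # main effects of X_1, X_2, X_3
beta_e = 0.5              # main effect of E
gamma = [2.0, 3.0, 0.0]   # interaction parameters

alpha = interaction_coefficients(beta, beta_e, gamma)
print(alpha)  # X_2 has no main effect, so its interaction is zero
print(satisfies_strong_heredity(beta, beta_e, alpha))
```

Any choice of $\gamma$ gives coefficients that pass the check, whereas an unconstrained $\alpha$ such as `[0.5, 0.7, 0.0]` with the same `beta` would violate it, since the second interaction is nonzero while the main effect of $X_2$ is zero.
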
- 72. Open source software
  - Software implementation in R: http://sahirbhatnagar.com/eclust/
  - Allows user specified interaction terms
  - Automatically determines the optimal tuning parameters through cross validation
  - Can also be applied to genetic data (SNPs)
- 73. Feature Screening and Non-linear associations
- 77. The most popular way of feature screening: How to fit statistical models when you have over 100,000 features?
  - Marginal correlations, t-tests
    - for each feature, calculate the correlation between X and Y
    - keep all features with correlation greater than some threshold
    - However, this procedure assumes a linear relationship between X and Y
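
A minimal sketch of marginal correlation screening, assuming Pearson correlation and a threshold on its absolute value (the function names and the threshold are illustrative choices). The toy data also shows the limitation mentioned on the slide: a feature that determines Y exactly through a quadratic relationship has zero linear correlation and is screened out.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def marginal_screen(features, y, threshold):
    """Indices of features whose |correlation with y| exceeds the threshold."""
    return [j for j, x in enumerate(features) if abs(pearson(x, y)) > threshold]


y = [4, 1, 0, 1, 4]
features = [
    [-2, -1, 0, 1, 2],            # y = x**2 exactly, but correlation is 0
    [4.1, 0.9, 0.2, 1.1, 3.8],    # nearly linear in y
    [1, 0, 1, 0, 1],              # weakly related to y
]
print(marginal_screen(features, y, threshold=0.6))
```

Only the second feature survives the screen here, even though the first one is a perfect (non-linear) predictor of Y.
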
- 78. Non-linear feature screening: Kolmogorov-Smirnov Test. Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic
  $$\hat{K}_j = \sup_x \left\lvert \hat{F}_j(x \mid Y = 1) - \hat{F}_j(x \mid Y = 0) \right\rvert \qquad (3)$$
  Figure 8: Depiction of KS statistic
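
The KS statistic in (3) can be computed directly from the two class-conditional empirical distribution functions. The sketch below assumes a binary outcome coded 0/1 and uses hypothetical function names; since empirical CDFs only change at observed values, the supremum is evaluated over the observed values of the feature.

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)


def ks_statistic(feature, labels):
    """sup_x |F_hat(x | Y = 1) - F_hat(x | Y = 0)| for one feature."""
    x1 = [v for v, lab in zip(feature, labels) if lab == 1]
    x0 = [v for v, lab in zip(feature, labels) if lab == 0]
    return max(abs(ecdf(x1, t) - ecdf(x0, t)) for t in feature)


# Y = 1 when the feature is far from zero: a non-linear association.
feature = [-3, -2, -1, 0, 1, 2, 3]
labels = [1, 1, 0, 0, 0, 1, 1]
print(ks_statistic(feature, labels))
```

For screening, this statistic would be computed for every feature j and the features with the largest values kept. Unlike a correlation, it compares whole distributions, so a symmetric non-linear pattern like the one in this toy example still produces a clearly nonzero value.
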
- 79. Non-linear Interaction Models: After feature screening, we can fit non-linear relationships between X and Y
  $$Y_i = \beta_0 + f(X_{ij}) + f(X_{ij}, E_i) + \varepsilon_i \qquad (4)$$
- 80. Conclusions
- 86. Conclusions and Contributions
  - Large system-wide changes are observed in many environments
  - These system-wide changes can potentially be exploited to aid analysis of large data
  - We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor
  - Dimension reduction is achieved by leveraging the environmental-class-conditional correlations
  - We also develop and implement a strong heredity framework within the penalized model
  - R software: http://sahirbhatnagar.com/eclust/
- 89. Limitations
  - There must be a high-dimensional signature of the exposure
  - Clustering is unsupervised
  - Two tuning parameters
- 90. What type of data is required to use these methods
- 91. ECLUST method
  1. environmental exposure (currently only binary)
  2. a high dimensional dataset that can be affected by the exposure
  3. a single phenotype (continuous or binary)
  4. there must be a high-dimensional signature of the exposure
- 92. Strong Heredity and Non-linear Models
  1. a single phenotype (continuous or binary)
  2. environment variable (continuous or binary)
  3. any number of predictor variables
- 93. Check out our Lab’s Software! http://greenwoodlab.github.io/software/
- 94. acknowledgements
  - Dr. Celia Greenwood
  - Dr. Blanchette and Dr. Yang
  - Dr. Luigi Bouchard, André Anne Houde
  - Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
  - Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet
  - Greg Voisin, Vince Forgetta, Kathleen Klein
  - Mothers and children from the study