Methods for High Dimensional Interactions
Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Ludmer Center – May 19, 2016
Underlying objective of this talk
Motivation
one predictor variable at a time
Figure: each predictor variable is related to the phenotype through its own separate test (Test 1, Test 2, Test 3, Test 4, Test 5).
a network based view
Figure: predictor variables viewed as a correlated network related to the phenotype, assessed with a single test (Test 1).
system level changes due to environment
Figure: the environment shifts the predictor-variable network between two states (A and B), and the network is related to the phenotype (Test 1).
Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
• Environment: Gestational Diabetes
• Large Data: Child's epigenome (p ≈ 450k)
• Phenotype: Obesity measures
Differential Correlation between environments
Figure: (a) Gestational diabetes affected pregnancy, (b) Controls
Gene Expression: COPD patients
Figure: (a) Gene Exp.: Never Smokers, (b) Gene Exp.: Current Smokers, (c) Correlations: Never Smokers, (d) Correlations: Current Smokers
Imaging Data: Topological properties and Age
Correlations differ between Age groups
NIH MRI brain study
• Environment: Age
• Large Data: Cortical Thickness (p ≈ 80k)
• Phenotype: Intelligence
Differential Networking
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• Xn×p: high dimensional data set (p >> n)
• Yn×1: phenotype
• En×1: environmental factor that has a widespread effect on X and can modify the relation between X and Y
Objective
• Which elements of X that are associated with Y depend on E?
conceptual model
• Environment (Maternal care, Age, Diet): two exposure groups, E = 0 and E = 1
• Large Data (p >> n): Gene Expression, DNA Methylation, Brain Imaging, measured in both exposure groups
• Phenotype: Behavioral development, IQ scores, Death
• Diagram links: the Environment-Phenotype relation corresponds to the classical epidemiological study, while the Large Data-Phenotype relations are the (epi)genetic/imaging associations
Is this mediation analysis?
• No
• We are not making any causal claims, i.e. about the direction of the arrows
• There are many untestable assumptions required for such an analysis → not well understood for HD data
Methods
analysis strategies
Single-Marker or Single Variable Tests
• marginal correlations (univariate p-value)
• multiple testing adjustment
Multivariate Regression Approaches Including Penalization Methods
• LASSO (convex penalty with one tuning parameter)
• MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters)
• Group level penalization (group LASSO, SCAD and MCP)
Clustering Together with Regression
• cluster features based on euclidean distance, correlation, connectivity
• regression with group level summary (PCA, average)
ECLUST - our proposed method: 3 phases
Starting from the original data:
1) Gene Similarity: computed separately in the E = 0 and E = 1 groups
2) Cluster Representation: each cluster is summarised by an n × 1 variable
3) Penalized Regression: Yn×1 ∼ cluster representations + cluster representations × E
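The eclust R package (linked later in the deck) implements this pipeline; the block below is only a minimal, self-contained sketch of the same three phases on simulated data. The similarity measure (absolute difference of correlations), the clustering method, the number of clusters, and the use of glmnet are illustrative assumptions, not necessarily the package's defaults.

```r
set.seed(1)
n <- 100; p <- 50
E <- rbinom(n, 1, 0.5)                      # binary environment
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("gene", 1:p)))

## 1) Gene similarity, computed separately in each environment
d0 <- cor(X[E == 0, ])
d1 <- cor(X[E == 1, ])
diff_sim <- abs(d1 - d0)                    # environment-sensitive similarity

## 2) Cluster genes on the differential similarity and summarise each
##    cluster by its first principal component (an n x 1 representation)
hc <- hclust(as.dist(1 - diff_sim), method = "average")
clusters <- cutree(hc, k = 5)
reps <- sapply(sort(unique(clusters)), function(k) {
  prcomp(X[, clusters == k, drop = FALSE], scale. = TRUE)$x[, 1]
})
colnames(reps) <- paste0("cluster", seq_len(ncol(reps)))

## 3) Penalized regression on cluster representations and their E interactions
##    (lasso via glmnet; the response is simulated only to make this runnable)
library(glmnet)
y <- rnorm(n)
design <- cbind(reps, E = E, reps * E)
colnames(design) <- c(colnames(reps), "E", paste0(colnames(reps), ":E"))
fit <- cv.glmnet(design, y, alpha = 1)
coef(fit, s = "lambda.min")
```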
"... the objective of statistical methods is the reduction of data. A quantity of data ... is to be replaced by relatively few quantities which shall adequately represent ... the relevant information contained in the original data."
- Sir R. A. Fisher, 1922
Underlying model
Y = β0 + β1U + β2U · E + ε   (1)
X ∼ F(α0 + α1U, ΣE)   (2)
• U: unobserved latent variable
• X: observed data, which is a function of U
• ΣE: environment-sensitive correlation matrix
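A minimal simulation of this generative model may help fix ideas. The dimensions, the effect sizes, the choice of Gaussian noise for F, and the shared noise component used to induce an environment-sensitive ΣE are all assumptions made only for this sketch.

```r
set.seed(123)
n <- 200; p <- 30
E <- rbinom(n, 1, 0.5)                 # environment
U <- rnorm(n)                          # unobserved latent variable

## X | U ~ F(alpha0 + alpha1 * U, Sigma_E): Gaussian noise with a shared
## component whose loading is larger when E = 1, so correlations differ by E
alpha0 <- 0; alpha1 <- 1
common <- rnorm(n)
share <- ifelse(E == 1, 0.8, 0.2)
X <- sapply(1:p, function(j) alpha0 + alpha1 * U + share * common + rnorm(n, sd = 0.5))

## Y depends on the latent U and its interaction with E
beta0 <- 0; beta1 <- 1; beta2 <- 2
Y <- beta0 + beta1 * U + beta2 * U * E + rnorm(n)

## environment-class-conditional correlations of the observed X differ
mean(abs(cor(X[E == 0, ])[upper.tri(diag(p))]))
mean(abs(cor(X[E == 1, ])[upper.tri(diag(p))]))
```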
advantages and disadvantages
• Single-Marker. Advantages: simple, easy to implement. Disadvantages: multiple testing burden, power, interpretability.
• Penalization. Advantages: multivariate, variable selection, sparsity, efficient optimization algorithms. Disadvantages: poor sensitivity with correlated data, ignores structure in design matrix, interpretability.
• Environment Cluster with Regression. Advantages: multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability. Disadvantages: difficult to identify relevant clusters, clustering is unsupervised.
Methods to detect gene clusters
Table 1: Methods to detect gene clusters
• Correlation: Pearson, Spearman, biweight midcorrelation
• Correlation Scoring: |ρE=1 − ρE=0|
• Weighted Correlation Scoring: c|ρE=1 − ρE=0|
• Fisher's Z Transformation: |zij0 − zij1| / √(1/(n0 − 3) + 1/(n1 − 3))
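As an illustration, the Fisher's Z statistic in the last row can be computed for every pair of features in a few lines of R. The data below are simulated, and the threshold used to call a pair differentially correlated is left unspecified.

```r
set.seed(1)
n0 <- 60; n1 <- 60; p <- 10
X0 <- matrix(rnorm(n0 * p), n0, p)   # unexposed subjects (E = 0)
X1 <- matrix(rnorm(n1 * p), n1, p)   # exposed subjects   (E = 1)

fisher_z <- function(r) 0.5 * log((1 + r) / (1 - r))   # Fisher's Z transform

z0 <- fisher_z(cor(X0))
z1 <- fisher_z(cor(X1))

## |z_ij0 - z_ij1| / sqrt(1/(n0 - 3) + 1/(n1 - 3)), for every feature pair
z_stat <- abs(z0 - z1) / sqrt(1 / (n0 - 3) + 1 / (n1 - 3))
diag(z_stat) <- 0
z_stat[1:5, 1:5]
```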
Cluster Representation
Table 2: Methods to create cluster representations
• Unsupervised: average, K principal components
• Supervised: partial least squares
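A sketch of the two unsupervised summaries (cluster average and first principal component), given a vector of cluster labels; the helper cluster_reps below is hypothetical, and the supervised partial least squares option would additionally use Y.

```r
## X: n x p matrix; clusters: length-p vector of cluster labels
cluster_reps <- function(X, clusters) {
  ids <- sort(unique(clusters))
  avg <- sapply(ids, function(k) rowMeans(X[, clusters == k, drop = FALSE]))
  pc1 <- sapply(ids, function(k)
    prcomp(X[, clusters == k, drop = FALSE], scale. = TRUE)$x[, 1])
  list(average = avg, pc1 = pc1)   # each is an n x (number of clusters) matrix
}

## example with random data and random labels
set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)
reps <- cluster_reps(X, clusters = sample(1:4, 20, replace = TRUE))
str(reps)
```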
Simulation Studies
Simulation Study 1
Figure: (a) Corr(XE=0), (b) Corr(XE=1), (c) |Corr(XE=1) − Corr(XE=0)|, (d) Corr(Xall)
Results: Jaccard Index and test set MSE
Simulation Study 2
TOM based on all subjects. Figure: TOM(Xall)
TOM based on unexposed subjects. Figure: TOM(XE=0)
TOM based on exposed subjects. Figure: TOM(XE=1)
Difference of TOMs. Figure: |TOM(XE=1) − TOM(XE=0)|
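For reference, topological overlap matrices (TOM) like those in the preceding figures can be computed with the WGCNA package. In this sketch the random data, the subject split, and the soft-thresholding power are illustrative assumptions.

```r
library(WGCNA)
set.seed(1)
n <- 100; p <- 50
E <- rbinom(n, 1, 0.5)
X <- matrix(rnorm(n * p), n, p)

tom_for <- function(dat, power = 6) {
  adj <- adjacency(dat, power = power)   # unsigned adjacency by WGCNA default
  TOMsimilarity(adj)
}

tom_all  <- tom_for(X)            # TOM(Xall)
tom_e0   <- tom_for(X[E == 0, ])  # TOM(XE=0)
tom_e1   <- tom_for(X[E == 1, ])  # TOM(XE=1)
tom_diff <- abs(tom_e1 - tom_e0)  # |TOM(XE=1) - TOM(XE=0)|
```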
Results: Test set MSE
Strong Heredity Models
Model
g(µ) = β0 + β1X1 + · · · + βpXp + βE E   (main effects)
       + α1E(X1E) + · · · + αpE(XpE)   (interactions)
• g(·) is a known link function
• µ = E[Y | X, E, β, α]
• β = (β1, β2, . . . , βp, βE) ∈ R^(p+1)
• α = (α1E, . . . , αpE) ∈ R^p
Variable Selection
arg min over β0, β, α of (1/2) ‖Y − g(µ)‖² + λ (‖β‖₁ + ‖α‖₁)
• ‖Y − g(µ)‖² = Σi (yi − g(µi))²
• ‖β‖₁ = Σj |βj|
• ‖α‖₁ = Σj |αj|
• λ ≥ 0: tuning parameter
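This lasso-type problem can be fit with off-the-shelf software such as glmnet by building a design matrix that contains the main effects, E, and the X·E interaction columns. The sketch below uses simulated data and, being a plain lasso, does not yet impose the heredity constraint discussed next.

```r
library(glmnet)
set.seed(1)
n <- 150; p <- 40
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
E <- rbinom(n, 1, 0.5)
y <- X[, 1] + 0.5 * E + 1.5 * X[, 1] * E + rnorm(n)

design <- cbind(X, E = E, X * E)
colnames(design) <- c(colnames(X), "E", paste0(colnames(X), ":E"))

cvfit <- cv.glmnet(design, y, alpha = 1)   # lambda chosen by cross-validation
est <- coef(cvfit, s = "lambda.min")
est[est[, 1] != 0, , drop = FALSE]         # selected main effects and interactions
```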
Why Strong Heredity?
• Statistical Power: large main effects are more likely to lead to detectable interactions than small ones
• Interpretability: assuming a model with interaction only is generally not biologically plausible
• Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E
Model
g(µ) = β0 + β1X1 + · · · + βpXp + βE E   (main effects)
       + α1E(X1E) + · · · + αpE(XpE)   (interactions)
Reparametrization [1]: αjE = γjE βj βE
Strong heredity principle [2]: α̂jE ≠ 0 ⇒ β̂j ≠ 0 and β̂E ≠ 0
[1] Choi et al. 2010, JASA
[2] Chipman 1996, Canadian Journal of Statistics
Strong Heredity Model with Penalization
arg min over β0, β, γ of (1/2) ‖Y − g(µ)‖² + λβ (w1|β1| + · · · + wq|βq| + wE|βE|) + λγ (w1E|γ1E| + · · · + wqE|γqE|)
with adaptive weights wj = 1/|β̂j| and wjE = |β̂j β̂E| / |α̂jE|
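One rough way to see what the adaptive weights do: obtain initial estimates (ridge here), form wj = 1/|β̂j| and wjE = |β̂j β̂E|/|α̂jE|, and pass them to glmnet's penalty.factor. This is only an approximation for illustration; it does not carry out the γ reparametrization or guarantee strong heredity, which is what the eclust software is for.

```r
library(glmnet)
set.seed(1)
n <- 150; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
E <- rbinom(n, 1, 0.5)
y <- X[, 1] + E + 2 * X[, 1] * E + rnorm(n)

design <- cbind(X, E = E, X * E)
colnames(design) <- c(colnames(X), "E", paste0(colnames(X), ":E"))

## initial (ridge) estimates, used only to build the adaptive weights
init <- coef(cv.glmnet(design, y, alpha = 0), s = "lambda.min")[-1, 1]
b_main <- init[1:p]; b_E <- init["E"]; a_int <- init[(p + 2):(2 * p + 1)]

w_main <- 1 / abs(c(b_main, b_E))          # w_j = 1 / |beta_hat_j|
w_int  <- abs(b_main * b_E) / abs(a_int)   # w_jE = |beta_hat_j * beta_hat_E| / |alpha_hat_jE|

fit <- cv.glmnet(design, y, alpha = 1, penalty.factor = c(w_main, w_int))
coef(fit, s = "lambda.min")
```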
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through cross validation
• Can also be applied to genetic data (SNPs)
Feature Screening and Non-linear associations
The most popular way of feature screening
How do you fit statistical models when you have over 100,000 features?
Marginal correlations, t-tests
• for each feature, calculate the correlation between X and Y
• keep all features with correlation greater than some threshold
• however, this procedure assumes a linear relationship between X and Y
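A sketch of this screening rule; the threshold of 0.2 is an arbitrary illustration, and ranking features and keeping a fixed top k is an equally common variant.

```r
set.seed(1)
n <- 100; p <- 10000
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

## marginal (univariate) correlation of every feature with the outcome
r <- as.vector(cor(X, y))

keep <- which(abs(r) > 0.2)          # keep features above an absolute-correlation threshold
length(keep)
X_screened <- X[, keep, drop = FALSE]
```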
Non-linear feature screening: Kolmogorov-Smirnov Test
Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic
K̂j = sup_x |F̂j(x | Y = 1) − F̂j(x | Y = 0)|   (3)
Figure 8: Depiction of KS statistic
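Per-feature KS screening can be done with base R's ks.test; the simulated non-linear signal in feature 1 below is there only to show that the statistic picks up a class difference that a linear correlation screen could miss.

```r
set.seed(1)
n <- 200; p <- 500
y <- rbinom(n, 1, 0.5)                         # binary outcome
X <- matrix(rnorm(n * p), n, p)
X[y == 1, 1] <- X[y == 1, 1]^2                 # feature 1: non-linear class difference

## K_hat_j = sup_x | F_hat_j(x | Y = 1) - F_hat_j(x | Y = 0) |
ks_stat <- apply(X, 2, function(xj)
  unname(ks.test(xj[y == 1], xj[y == 0])$statistic))

head(order(ks_stat, decreasing = TRUE))        # features ranked by KS statistic
```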
Non-linear Interaction Models
After feature screening, we can fit non-linear relationships between X and Y:
Yi = β0 + f(Xij) + f(Xij, Ei) + εi   (4)
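Model (4) can be fit, for example, with mgcv; the "by"-smooth construction below is one of several ways to encode f(X, E) and is an assumption of this sketch rather than the exact model used in the talk.

```r
library(mgcv)
set.seed(1)
n <- 300
E <- as.ordered(rbinom(n, 1, 0.5))   # ordered factor: s(x, by = E) is a difference smooth
x <- runif(n, -2, 2)
y <- sin(x) + (E == "1") * x^2 + rnorm(n, sd = 0.3)

## f(X) plus an environment-specific deviation smooth for the interaction
fit <- gam(y ~ E + s(x) + s(x, by = E), method = "REML")
summary(fit)
```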
Conclusions
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large data
• We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor
• Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
• We also develop and implement a strong heredity framework within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
What type of data is required to use these methods
ECLUST method
1. environmental exposure (currently only binary)
2. a high dimensional dataset that can be affected by the exposure
3. a single phenotype (continuous or binary)
4. there must be a high-dimensional signature of the exposure
Strong Heredity and Non-linear Models
1. a single phenotype (continuous or binary)
2. environment variable (continuous or binary)
3. any number of predictor variables
Check out our Lab's Software!
http://greenwoodlab.github.io/software/
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André-Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet
• Greg Voisin, Vince Forgetta, Kathleen Klein
• Mothers and children from the study
