Methods for High Dimensional Interactions

Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Ludmer Center – May 19, 2016
Underlying objective of this talk
Motivation
one predictor variable at a time
[Diagram: Predictor Variable and Phenotype; each predictor is tested separately (Test 1, Test 2, Test 3, Test 4, Test 5)]
a network based view
[Diagram: Predictor Variable and Phenotype; a single test (Test 1)]
system level changes due to environment
[Diagram: Predictor Variable, Phenotype and Environment (A, B); a single test (Test 1)]
Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
• Environment: Gestational Diabetes
• Large Data: Child’s epigenome (p ≈ 450k)
• Phenotype: Obesity measures
Differential Correlation between environments
[Figure panels: (a) Gestational diabetes affected pregnancy, (b) Controls]
Gene Expression: COPD patients
[Figure panels: (a) Gene Exp.: Never Smokers, (b) Gene Exp.: Current Smokers, (c) Correlations: Never Smokers, (d) Correlations: Current Smokers]
Imaging Data: Topological properties and Age
Correlations differ between Age groups
NIH MRI brain study
• Environment: Age
• Large Data: Cortical Thickness (p ≈ 80k)
• Phenotype: Intelligence
Differential Networking
formal statement of initial problem
• $n$: number of subjects
• $p$: number of predictor variables
• $X_{n \times p}$: high dimensional data set ($p \gg n$)
• $Y_{n \times 1}$: phenotype
• $E_{n \times 1}$: environmental factor that has a widespread effect on $X$ and can modify the relation between $X$ and $Y$
Objective
• Which elements of $X$ that are associated with $Y$ depend on $E$?
conceptual model
• Environment (Maternal care, Age, Diet): E = 0, E = 1
• Large Data (p >> n): Gene Expression, DNA Methylation, Brain Imaging
• Phenotype: Behavioral development, IQ scores, Death
• Arrow labels: epidemiological study; (epi)genetic/imaging associations
Is this mediation analysis?
• No
• We are not making any causal claims, i.e. about the direction of the arrows
• There are many untestable assumptions required for such an analysis → not well understood for high-dimensional (HD) data
Methods
analysis strategies
Single-Marker or Single Variable Tests
• marginal correlations (univariate p-value)
• multiple testing adjustment
Multivariate Regression Approaches Including Penalization Methods
• LASSO (convex penalty with one tuning parameter)
• MCP, SCAD (non-convex penalties with two tuning parameters)
• Dantzig selector (convex, one tuning parameter)
• Group level penalization (group LASSO, SCAD and MCP)
Clustering Together with Regression
• cluster features based on Euclidean distance, correlation, connectivity
• regression with group level summary (PCA, average)
ECLUST - our proposed method: 3 phases
Original Data
1) Gene Similarity (E = 0 and E = 1)
2) Cluster Representation (n × 1 per cluster)
3) Penalized Regression: $Y_{n \times 1} \sim$ [cluster representations] + [cluster representations] × E
"the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data."
- Sir R. A. Fisher, 1922
Underlying model
$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon$ (1)
$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E)$ (2)
• $U$: unobserved latent variable
• $X$: observed data which is a function of $U$
• $\Sigma_E$: environment sensitive correlation matrix
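To make the role of β2 in equation (1) concrete, here is a minimal simulation sketch. It is not the authors' simulation code: the function names, the standard normal choice for U and ε, the binary E and the coefficient values are all assumptions made for illustration. It only shows that the slope of Y on U is β1 when E = 0 and β1 + β2 when E = 1.

```python
import random

def simulate_phenotype(n, beta0, beta1, beta2, noise_sd, seed=0):
    """Draw (U, E, Y) triples from Y = beta0 + beta1*U + beta2*U*E + eps."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)         # latent U; standard normal is an assumption
        e = rng.randint(0, 1)           # binary environment E
        eps = rng.gauss(0.0, noise_sd)  # noise term
        rows.append((u, e, beta0 + beta1 * u + beta2 * u * e + eps))
    return rows

def ols_slope(pairs):
    """Least-squares slope of y on u for a list of (u, y) pairs."""
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((u - mean_u) * (y - mean_y) for u, y in pairs)
    sxx = sum((u - mean_u) ** 2 for u, _ in pairs)
    return sxy / sxx

# Illustrative values only: beta1 = 0.5, beta2 = 2.0
rows = simulate_phenotype(n=2000, beta0=1.0, beta1=0.5, beta2=2.0, noise_sd=1.0)
slope_e0 = ols_slope([(u, y) for u, e, y in rows if e == 0])
slope_e1 = ols_slope([(u, y) for u, e, y in rows if e == 1])
print(f"slope when E=0: {slope_e0:.2f}, slope when E=1: {slope_e1:.2f}")
```

In real data U is unobserved, so this sketch only illustrates the interaction term; it does not show how ECLUST recovers it.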
advantages and disadvantages
General Approach | Advantages | Disadvantages
Single-Marker | simple, easy to implement | multiple testing burden, power, interpretability
Penalization | multivariate, variable selection, sparsity, efficient optimization algorithms | poor sensitivity with correlated data, ignores structure in design matrix, interpretability
Environment Cluster with Regression | multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability | difficult to identify relevant clusters, clustering is unsupervised
Methods to detect gene clusters
Table 1: Methods to detect gene clusters
General Approach | Formula
Correlation | Pearson, Spearman, biweight midcorrelation
Correlation Scoring | $\lvert \rho_{E=1} - \rho_{E=0} \rvert$
Weighted Correlation Scoring | $c \lvert \rho_{E=1} - \rho_{E=0} \rvert$
Fisher’s Z Transformation | $\dfrac{\lvert z_{ij0} - z_{ij1} \rvert}{\sqrt{1/(n_0 - 3) + 1/(n_1 - 3)}}$
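The correlation scoring and Fisher's Z rows of Table 1 can be sketched for a single pair of features (i, j) as follows. This is an illustrative sketch, not the eclust package's code: the function names are hypothetical, and z is taken to be the usual Fisher transformation z = atanh(ρ), which the slide does not spell out.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_score(rho_e0, rho_e1):
    """Correlation scoring: |rho_{E=1} - rho_{E=0}|."""
    return abs(rho_e1 - rho_e0)

def fisher_z_score(rho_e0, n0, rho_e1, n1):
    """|z_ij0 - z_ij1| / sqrt(1/(n0 - 3) + 1/(n1 - 3)), with z = atanh(rho).

    Requires |rho| < 1 and n0, n1 > 3.
    """
    z0 = math.atanh(rho_e0)
    z1 = math.atanh(rho_e1)
    return abs(z0 - z1) / math.sqrt(1.0 / (n0 - 3) + 1.0 / (n1 - 3))

# Illustrative numbers: correlation 0.1 among 50 unexposed, 0.7 among 50 exposed
print(correlation_score(0.1, 0.7))
print(fisher_z_score(0.1, 50, 0.7, 50))
```

Unlike the raw difference, the Fisher's Z score accounts for the two group sizes, so the same correlation difference scores higher when it is estimated from more subjects.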
Cluster Representation
Table 2: Methods to create cluster representations
General Approach | Type
Unsupervised | average; K principal components
Supervised | partial least squares
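As a small illustration of the "average" row of Table 2, the sketch below reduces each cluster of features to a single n × 1 summary by averaging its columns within each subject. The function name and the toy data are hypothetical, and the cluster memberships are assumed to be given already by the clustering step.

```python
def cluster_averages(X, clusters):
    """Summarize each cluster by the row-wise mean of its columns.

    X: list of n rows (subjects), each a list of p feature values.
    clusters: dict mapping a cluster label to the column indices it contains.
    Returns a dict mapping each label to a list of n values (an n x 1 summary).
    """
    return {
        label: [sum(row[j] for j in cols) / len(cols) for row in X]
        for label, cols in clusters.items()
    }

# Toy data: 3 subjects, 3 features; features 0 and 1 form one cluster
X = [
    [1.0, 3.0, 10.0],
    [2.0, 4.0, 20.0],
    [3.0, 5.0, 30.0],
]
clusters = {"cluster_A": [0, 1], "cluster_B": [2]}
reps = cluster_averages(X, clusters)
print(reps)
```

The resulting summaries, rather than the original p features, would then enter the penalized regression.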
Simulation Studies
Simulation Study 1
[Figure panels: (a) Corr(X_{E=0}), (b) Corr(X_{E=1}), (c) |Corr(X_{E=1}) − Corr(X_{E=0})|, (d) Corr(X_{all})]
Results: Jaccard Index and test set MSE
Simulation Study 2
TOM based on all subjects
[Figure: (a) TOM(X_{all})]
TOM based on unexposed subjects
[Figure: (a) TOM(X_{E=0})]
TOM based on exposed subjects
[Figure: (a) TOM(X_{E=1})]
Difference of TOMs
[Figure: (a) |TOM(X_{E=1}) − TOM(X_{E=0})|]
Results: Test set MSE
Strong Heredity Models
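Model
$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
• $g(\cdot)$ is a known link function
• $\mu = \mathrm{E}[Y \mid X, E, \beta, \alpha]$
• $\beta = (\beta_1, \beta_2, \ldots, \beta_p, \beta_E) \in \mathbb{R}^{p+1}$
• $\alpha = (\alpha_{1E}, \ldots, \alpha_{pE}) \in \mathbb{R}^{p}$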
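Variable Selection
$\arg\min_{\beta_0, \beta, \alpha} \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda \left( \lVert \beta \rVert_1 + \lVert \alpha \rVert_1 \right)$
• $\lVert Y - g(\mu) \rVert^2 = \sum_i (y_i - g(\mu_i))^2$
• $\lVert \beta \rVert_1 = \sum_j |\beta_j|$
• $\lVert \alpha \rVert_1 = \sum_j |\alpha_j|$
• $\lambda \geq 0$: tuning parameter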
Why Strong Heredity?
• Statistical Power: large main effects are more likely to lead to detectable interactions than small ones
• Interpretability: assuming a model with interactions only is generally not biologically plausible
• Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E (the first model involves only two measured variables, X1 and E; the second needs three, X1, X2 and E)
Model
$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
Reparametrization [1]: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$
Strong heredity principle [2]: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
[1] Choi et al. 2010, JASA
[2] Chipman 1996, Canadian Journal of Statistics
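A short sketch of why the reparametrization enforces strong heredity: because the interaction coefficient is a product that includes both main effects, it is forced to zero whenever either main effect is zero. The function names and numeric values below are illustrative assumptions, not part of the authors' software.

```python
def interaction_coefficient(gamma_jE, beta_j, beta_E):
    """Reparametrized interaction: alpha_jE = gamma_jE * beta_j * beta_E."""
    return gamma_jE * beta_j * beta_E

def satisfies_strong_heredity(alpha, beta, beta_E):
    """True if every non-zero alpha[j] has a non-zero beta[j] and a non-zero beta_E."""
    return all(a == 0 or (b != 0 and beta_E != 0) for a, b in zip(alpha, beta))

beta = [0.8, 0.0, -1.2]   # main effects of X1, X2, X3 (illustrative values)
gamma = [0.5, 3.0, 0.0]   # gamma_jE for each predictor (illustrative values)
beta_E = 2.0              # main effect of the environment

alpha = [interaction_coefficient(g, b, beta_E) for g, b in zip(gamma, beta)]
print(alpha)  # the X2 interaction is zero because beta_2 is zero, whatever gamma_2E is
print(satisfies_strong_heredity(alpha, beta, beta_E))
```

Any coefficient vector built this way passes the check, whereas a freely chosen α (for example a non-zero α with β_j = 0) need not.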
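Strong Heredity Model with Penalization
$\arg\min_{\beta_0, \beta, \gamma} \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda_\beta \left( w_1 |\beta_1| + \cdots + w_q |\beta_q| + w_E |\beta_E| \right) + \lambda_\gamma \left( w_{1E} |\gamma_{1E}| + \cdots + w_{qE} |\gamma_{qE}| \right)$
$w_j = \left| \dfrac{1}{\hat{\beta}_j} \right|, \quad w_{jE} = \left| \dfrac{\hat{\beta}_j \hat{\beta}_E}{\hat{\alpha}_{jE}} \right|$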
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through cross validation
• Can also be applied to genetic data (SNPs)
Feature Screening and
Non-linear associations
The most popular way of feature screening
How to fit statistical models when you have over 100,000 features?
Marginal correlations, t-tests
• for each feature in X, calculate its correlation with Y
• keep all features whose absolute correlation is greater than some threshold
• However, this procedure assumes a linear relationship between X and Y
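The screening steps above, and their blind spot for non-linear relationships, can be sketched as follows. The function names, the toy data and the threshold of 0.5 are assumptions chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_by_correlation(features, y, threshold):
    """Keep the names of features whose |correlation| with y exceeds threshold."""
    return [
        name for name, values in features.items()
        if abs(pearson(values, y)) > threshold
    ]

y = [4.0, 1.0, 0.0, 1.0, 4.0]
features = {
    "linear": [8.0, 2.0, 0.0, 2.0, 8.0],        # proportional to y
    "quadratic": [-2.0, -1.0, 0.0, 1.0, 2.0],   # y is exactly the square of this feature
}
kept = screen_by_correlation(features, y, threshold=0.5)
print(kept)
```

Here the "quadratic" feature determines y completely, yet its marginal correlation with y is zero, so correlation screening discards it.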
Non-linear feature screening: Kolmogorov-Smirnov Test
Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic
$\hat{K}_j = \sup_x \lvert \hat{F}_j(x \mid Y = 1) - \hat{F}_j(x \mid Y = 0) \rvert$ (3)
Figure 8: Depiction of KS statistic
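Statistic (3) for one feature and a binary Y can be computed directly from the two empirical distribution functions. The sketch below is a plain-Python illustration with hypothetical function names and toy data, not the implementation used by Mai & Zou or by the authors.

```python
def ks_statistic(sample_a, sample_b):
    """sup_x |F_a(x) - F_b(x)| for the empirical CDFs of two samples.

    Both empirical CDFs are step functions, so the supremum is attained
    at one of the observed values.
    """
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

def ks_screening_statistic(x, y):
    """K_j for one feature x (list of values) and binary labels y (0 or 1)."""
    x_y1 = [v for v, label in zip(x, y) if label == 1]
    x_y0 = [v for v, label in zip(x, y) if label == 0]
    return ks_statistic(x_y1, x_y0)

# Toy feature: same mean in both classes, but much larger spread when Y = 1
x = [-3.0, -2.0, 2.0, 3.0, -0.5, -0.2, 0.2, 0.5]
y = [1, 1, 1, 1, 0, 0, 0, 0]
print(ks_screening_statistic(x, y))
```

In this toy example the two class means are identical, so a comparison of means sees no difference, while the KS statistic picks up the difference in distribution.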
Non-linear Interaction Models
After feature screening, we can fit non-linear relationships between X and Y
$Y_i = \beta_0 + f(X_{ij}) + f(X_{ij}, E_i) + \varepsilon_i$ (4)
Conclusions
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• This assumption can possibly be exploited to aid analysis of large data
• We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor.
• Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
• Also, we develop and implement a strong heredity framework within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
What type of data is required to use these methods
ECLUST method
1. environmental exposure (currently only binary)
2. a high dimensional dataset that can be affected by the exposure
3. a single phenotype (continuous or binary)
4. There must be a high-dimensional signature of the exposure
Strong Heredity and Non-linear Models
1. a single phenotype (continuous or binary)
2. environment variable (continuous or binary)
3. any number of predictor variables
Check out our Lab’s Software!
http://greenwoodlab.github.io/software/
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet
• Greg Voisin, Vince Forgetta, Kathleen Klein
• Mothers and children from the study
1 of 94

Recommended

Logit model testing and interpretation by
Logit model testing and interpretationLogit model testing and interpretation
Logit model testing and interpretationFelipe Affonso
1.3K views12 slides
Propensity Score Methods for Comparative Effectiveness Research with Multiple... by
Propensity Score Methods for Comparative Effectiveness Research with Multiple...Propensity Score Methods for Comparative Effectiveness Research with Multiple...
Propensity Score Methods for Comparative Effectiveness Research with Multiple...Kazuki Yoshida
1.2K views51 slides
ENAR 2018 Matching Weights to Simultaneously Compare Three Treatment Groups: ... by
ENAR 2018 Matching Weights to Simultaneously Compare Three Treatment Groups: ...ENAR 2018 Matching Weights to Simultaneously Compare Three Treatment Groups: ...
ENAR 2018 Matching Weights to Simultaneously Compare Three Treatment Groups: ...Kazuki Yoshida
511 views40 slides
2008 JSM - Meta Study Data vs Patient Data by
2008 JSM - Meta Study Data vs Patient Data2008 JSM - Meta Study Data vs Patient Data
2008 JSM - Meta Study Data vs Patient DataTerry Liao
325 views27 slides
Introduction tocausalinference april02_2020 by
Introduction tocausalinference april02_2020Introduction tocausalinference april02_2020
Introduction tocausalinference april02_2020Viswanath Gangavaram
27 views21 slides

More Related Content

Similar to Methods for High Dimensional Interactions

Combining co-expression and co-location for gene network inference in porcine... by
Combining co-expression and co-location for gene network inference in porcine...Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...tuxette
246 views60 slides
High Dimensional Biological Data Analysis and Visualization by
High Dimensional Biological Data Analysis and VisualizationHigh Dimensional Biological Data Analysis and Visualization
High Dimensional Biological Data Analysis and VisualizationDmitry Grapov
22.5K views30 slides
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var... by
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Spatially Informed Var...The Statistical and Applied Mathematical Sciences Institute
97 views29 slides
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics by
Foundations of Statistics in Ecology and Evolution. 8. Bayesian StatisticsFoundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian StatisticsAndres Lopez-Sepulcre
425 views51 slides
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme... by
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...The Statistical and Applied Mathematical Sciences Institute
162 views30 slides
DOE Project ANOVA Analysis Diet Type by
DOE Project ANOVA Analysis Diet TypeDOE Project ANOVA Analysis Diet Type
DOE Project ANOVA Analysis Diet Typevidit jain
108 views32 slides

Similar to Methods for High Dimensional Interactions(20)

Combining co-expression and co-location for gene network inference in porcine... by tuxette
Combining co-expression and co-location for gene network inference in porcine...Combining co-expression and co-location for gene network inference in porcine...
Combining co-expression and co-location for gene network inference in porcine...
tuxette246 views
High Dimensional Biological Data Analysis and Visualization by Dmitry Grapov
High Dimensional Biological Data Analysis and VisualizationHigh Dimensional Biological Data Analysis and Visualization
High Dimensional Biological Data Analysis and Visualization
Dmitry Grapov22.5K views
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics by Andres Lopez-Sepulcre
Foundations of Statistics in Ecology and Evolution. 8. Bayesian StatisticsFoundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
DOE Project ANOVA Analysis Diet Type by vidit jain
DOE Project ANOVA Analysis Diet TypeDOE Project ANOVA Analysis Diet Type
DOE Project ANOVA Analysis Diet Type
vidit jain108 views
Genomic selection in Livestock by ILRI
Genomic  selection in LivestockGenomic  selection in Livestock
Genomic selection in Livestock
ILRI797 views
Integration of biological annotations using hierarchical modeling by USC
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modeling
USC310 views
Multivariate Analysis and Visualization of Proteomic Data by UC Davis
Multivariate Analysis and Visualization of Proteomic DataMultivariate Analysis and Visualization of Proteomic Data
Multivariate Analysis and Visualization of Proteomic Data
UC Davis2.9K views
a brief introduction to epistasis detection by Hyun-hwan Jeong
a brief introduction to epistasis detectiona brief introduction to epistasis detection
a brief introduction to epistasis detection
Hyun-hwan Jeong1.4K views
Repurposing predictive tools for causal research by Galit Shmueli
Repurposing predictive tools for causal researchRepurposing predictive tools for causal research
Repurposing predictive tools for causal research
Galit Shmueli492 views
Constraints and Global Optimization for Gene Prediction Overlap Resolution by Christian Have
Constraints and Global Optimization for Gene Prediction Overlap ResolutionConstraints and Global Optimization for Gene Prediction Overlap Resolution
Constraints and Global Optimization for Gene Prediction Overlap Resolution
Christian Have630 views
Split Criterions for Variable Selection Using Decision Trees by NTNU
Split Criterions for Variable Selection Using Decision TreesSplit Criterions for Variable Selection Using Decision Trees
Split Criterions for Variable Selection Using Decision Trees
NTNU713 views
Thesis seminar by gvesom
Thesis seminarThesis seminar
Thesis seminar
gvesom451 views
Subgroup identification for precision medicine. a comparative review of 13 me... by SuciAidaDahhar
Subgroup identification for precision medicine. a comparative review of 13 me...Subgroup identification for precision medicine. a comparative review of 13 me...
Subgroup identification for precision medicine. a comparative review of 13 me...
SuciAidaDahhar23 views
Basic Concepts of Experimental Design & Standard Design ( Statistics ) by Hasnat Israq
Basic Concepts of Experimental Design & Standard Design ( Statistics )Basic Concepts of Experimental Design & Standard Design ( Statistics )
Basic Concepts of Experimental Design & Standard Design ( Statistics )
Hasnat Israq8.3K views
Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati... by Kazuki Yoshida
Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati...Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati...
Matching Weights to Simultaneously Compare Three Treatment Groups: a Simulati...
Kazuki Yoshida2K views
Exact Data Reduction for Big Data by Jieping Ye by BigMine
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping Ye
BigMine2.9K views

More from sahirbhatnagar

An introduction to knitr and R Markdown by
An introduction to knitr and R MarkdownAn introduction to knitr and R Markdown
An introduction to knitr and R Markdownsahirbhatnagar
2.7K views40 slides
Atelier r-gerad by
Atelier r-geradAtelier r-gerad
Atelier r-geradsahirbhatnagar
1.9K views153 slides
Reproducible Research: An Introduction to knitr by
Reproducible Research: An Introduction to knitrReproducible Research: An Introduction to knitr
Reproducible Research: An Introduction to knitrsahirbhatnagar
839 views43 slides
Analysis of DNA methylation and Gene expression to predict childhood obesity by
Analysis of DNA methylation and Gene expression to predict childhood obesityAnalysis of DNA methylation and Gene expression to predict childhood obesity
Analysis of DNA methylation and Gene expression to predict childhood obesitysahirbhatnagar
926 views33 slides
Estimation and Accuracy after Model Selection by
Estimation and Accuracy after Model SelectionEstimation and Accuracy after Model Selection
Estimation and Accuracy after Model Selectionsahirbhatnagar
1.1K views58 slides
Absolute risk estimation in a case cohort study of prostate cancer by
Absolute risk estimation in a case cohort study of prostate cancerAbsolute risk estimation in a case cohort study of prostate cancer
Absolute risk estimation in a case cohort study of prostate cancersahirbhatnagar
1.1K views74 slides

More from sahirbhatnagar(10)

An introduction to knitr and R Markdown by sahirbhatnagar
An introduction to knitr and R MarkdownAn introduction to knitr and R Markdown
An introduction to knitr and R Markdown
sahirbhatnagar2.7K views
Reproducible Research: An Introduction to knitr by sahirbhatnagar
Reproducible Research: An Introduction to knitrReproducible Research: An Introduction to knitr
Reproducible Research: An Introduction to knitr
sahirbhatnagar839 views
Analysis of DNA methylation and Gene expression to predict childhood obesity by sahirbhatnagar
Analysis of DNA methylation and Gene expression to predict childhood obesityAnalysis of DNA methylation and Gene expression to predict childhood obesity
Analysis of DNA methylation and Gene expression to predict childhood obesity
sahirbhatnagar926 views
Estimation and Accuracy after Model Selection by sahirbhatnagar
Estimation and Accuracy after Model SelectionEstimation and Accuracy after Model Selection
Estimation and Accuracy after Model Selection
sahirbhatnagar1.1K views
Absolute risk estimation in a case cohort study of prostate cancer by sahirbhatnagar
Absolute risk estimation in a case cohort study of prostate cancerAbsolute risk estimation in a case cohort study of prostate cancer
Absolute risk estimation in a case cohort study of prostate cancer
sahirbhatnagar1.1K views
Factors influencing participation in cancer screening by sahirbhatnagar
Factors influencing participation in cancer screeningFactors influencing participation in cancer screening
Factors influencing participation in cancer screening
sahirbhatnagar485 views
Methylation and Expression data integration by sahirbhatnagar
Methylation and Expression data integrationMethylation and Expression data integration
Methylation and Expression data integration
sahirbhatnagar643 views

Recently uploaded

Open Access Publishing in Astrophysics by
Open Access Publishing in AstrophysicsOpen Access Publishing in Astrophysics
Open Access Publishing in AstrophysicsPeter Coles
1.2K views26 slides
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ... by
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...ILRI
5 views6 slides
TF-FAIR.pdf by
TF-FAIR.pdfTF-FAIR.pdf
TF-FAIR.pdfDirk Roorda
6 views120 slides
POSTER IV LAWCN_ROVER_IUE.pdf by
POSTER IV LAWCN_ROVER_IUE.pdfPOSTER IV LAWCN_ROVER_IUE.pdf
POSTER IV LAWCN_ROVER_IUE.pdfSOCIEDAD JULIO GARAVITO
11 views1 slide
Exploring the nature and synchronicity of early cluster formation in the Larg... by
Exploring the nature and synchronicity of early cluster formation in the Larg...Exploring the nature and synchronicity of early cluster formation in the Larg...
Exploring the nature and synchronicity of early cluster formation in the Larg...Sérgio Sacani
1.2K views12 slides
application of genetic engineering 2.pptx by
application of genetic engineering 2.pptxapplication of genetic engineering 2.pptx
application of genetic engineering 2.pptxSankSurezz
14 views12 slides

Recently uploaded(20)

Open Access Publishing in Astrophysics by Peter Coles
Open Access Publishing in AstrophysicsOpen Access Publishing in Astrophysics
Open Access Publishing in Astrophysics
Peter Coles1.2K views
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ... by ILRI
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
ILRI5 views
Exploring the nature and synchronicity of early cluster formation in the Larg... by Sérgio Sacani
Exploring the nature and synchronicity of early cluster formation in the Larg...Exploring the nature and synchronicity of early cluster formation in the Larg...
Exploring the nature and synchronicity of early cluster formation in the Larg...
Sérgio Sacani1.2K views
application of genetic engineering 2.pptx by SankSurezz
application of genetic engineering 2.pptxapplication of genetic engineering 2.pptx
application of genetic engineering 2.pptx
SankSurezz14 views
Light Pollution for LVIS students by CWBarthlmew
Light Pollution for LVIS studentsLight Pollution for LVIS students
Light Pollution for LVIS students
CWBarthlmew12 views
Factors affecting fluorescence and phosphorescence.pptx by SamarthGiri1
Factors affecting fluorescence and phosphorescence.pptxFactors affecting fluorescence and phosphorescence.pptx
Factors affecting fluorescence and phosphorescence.pptx
SamarthGiri17 views
Discovery of therapeutic agents targeting PKLR for NAFLD using drug repositio... by Trustlife
Discovery of therapeutic agents targeting PKLR for NAFLD using drug repositio...Discovery of therapeutic agents targeting PKLR for NAFLD using drug repositio...
Discovery of therapeutic agents targeting PKLR for NAFLD using drug repositio...
Trustlife142 views
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F... by SwagatBehera9
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...
Effect of Integrated Nutrient Management on Growth and Yield of Solanaceous F...
SwagatBehera95 views
ELECTRON TRANSPORT CHAIN by DEEKSHA RANI
ELECTRON TRANSPORT CHAINELECTRON TRANSPORT CHAIN
ELECTRON TRANSPORT CHAIN
DEEKSHA RANI10 views
A giant thin stellar stream in the Coma Galaxy Cluster by Sérgio Sacani
A giant thin stellar stream in the Coma Galaxy ClusterA giant thin stellar stream in the Coma Galaxy Cluster
A giant thin stellar stream in the Coma Galaxy Cluster
Sérgio Sacani18 views
Applications of Large Language Models in Materials Discovery and Design by Anubhav Jain
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
Anubhav Jain13 views
Experimental animal Guinea pigs.pptx by Mansee Arya
Experimental animal Guinea pigs.pptxExperimental animal Guinea pigs.pptx
Experimental animal Guinea pigs.pptx
Mansee Arya38 views

Methods for High Dimensional Interactions

  • 1. Methods for High Dimensional Interactions Sahir Rai Bhatnagar, PhD Candidate – McGill Biostatistics Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood Ludmer Center – May 19, 2016
  • 4. one predictor variable at a time Predictor Variable Phenotype
  • 5. one predictor variable at a time Predictor Variable Phenotype Test 1 Test 2 Test 3 Test 4 Test 5 2
  • 6. a network based view Predictor Variable Phenotype
  • 7. a network based view Predictor Variable Phenotype
  • 8. a network based view Predictor Variable Phenotype Test 1 3
  • 9. system level changes due to environment Predictor Variable PhenotypeEnvironment A B
  • 10. system level changes due to environment Predictor Variable PhenotypeEnvironment A B Test 1 4
  • 11. Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke). Environment: Gestational Diabetes. Large Data: Child’s epigenome (p ≈ 450k). Phenotype: Obesity measures.
  • 12. Differential Correlation between environments [figure panels: (a) Gestational diabetes affected pregnancy, (b) Controls]
  • 13. Gene Expression: COPD patients [figure panels: (a) Gene Exp.: Never Smokers, (b) Gene Exp.: Current Smokers, (c) Correlations: Never Smokers, (d) Correlations: Current Smokers]
  • 14. Imaging Data: Topological properties and Age [figure]
  • 16. NIH MRI brain study. Environment: Age. Large Data: Cortical Thickness (p ≈ 80k). Phenotype: Intelligence.
  • 23. formal statement of initial problem
      • n: number of subjects
      • p: number of predictor variables
      • $X_{n \times p}$: high dimensional data set (p >> n)
      • $Y_{n \times 1}$: phenotype
      • $E_{n \times 1}$: environmental factor that has widespread effect on X and can modify the relation between X and Y
      Objective: Which elements of X that are associated with Y depend on E?
  • 30. conceptual model [diagram]
      • Environment (Maternal care, Age, Diet), with levels E = 0 and E = 1
      • Large Data (p >> n): Gene Expression, DNA Methylation, Brain Imaging, shown once for each level of E
      • Phenotype (Behavioral development, IQ scores, Death)
      • Link labels shown across the slide builds: epidemiological study; (epi)genetic/imaging associations
  • 34. Is this mediation analysis?
      • No
      • We are not making any causal claims, i.e. claims about the direction of the arrows
      • Such an analysis requires many untestable assumptions, which are not well understood for high-dimensional (HD) data
  • 38. analysis strategies
      • Single-Marker or Single Variable Tests
          • marginal correlations (univariate p-value)
          • multiple testing adjustment
      • Multivariate Regression Approaches Including Penalization Methods
          • LASSO (convex penalty with one tuning parameter)
          • MCP, SCAD, Dantzig selector (non-convex penalty with two tuning parameters)
          • Group level penalization (group LASSO, SCAD and MCP)
      • Clustering Together with Regression
          • cluster features based on euclidean distance, correlation, connectivity
          • regression with group level summary (PCA, average)
  • 44. ECLUST - our proposed method: 3 phases [diagram: Original Data, split into E = 0 and E = 1 → 1) Gene Similarity → 2) Cluster Representation (n × 1 summaries) → 3) Penalized Regression of $Y_{n \times 1}$ on the cluster summaries and their interactions with E]
  • 45. "the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data." - Sir R. A. Fisher, 1922
  • 48. Underlying model
      $Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon$  (1)
      $X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E)$  (2)
      • U: unobserved latent variable
      • X: observed data which is a function of U
      • $\Sigma_E$: environment sensitive correlation matrix
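To make equations (1) and (2) concrete, here is a small simulation sketch. It is not taken from the talk: it assumes F is Gaussian, E is binary, and $\Sigma_E$ is an equicorrelation matrix whose correlation depends on E. The function name `simulate` and all default parameter values are hypothetical.

```python
import math
import random


def simulate(n=200, p=5, beta=(0.0, 1.0, 2.0), alpha=(0.0, 1.0),
             rho=(0.1, 0.8), seed=1):
    """Draw (X, Y, E) from a toy version of equations (1) and (2).

    Assumptions (not from the slides): F is Gaussian, E is binary, and
    Sigma_E is equicorrelated with correlation rho[E].
    """
    rng = random.Random(seed)
    X, Y, E = [], [], []
    for _ in range(n):
        e = rng.randint(0, 1)          # binary environment
        u = rng.gauss(0.0, 1.0)        # unobserved latent variable U
        shared = rng.gauss(0.0, 1.0)   # noise factor shared by all features
        r = rho[e]
        # Each feature has mean alpha0 + alpha1 * U; given U, any two
        # features have unit variance and correlation r.
        row = [alpha[0] + alpha[1] * u
               + math.sqrt(r) * shared
               + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0)
               for _ in range(p)]
        # Equation (1): the effect of U on Y is modified by E.
        y = beta[0] + beta[1] * u + beta[2] * u * e + rng.gauss(0.0, 1.0)
        X.append(row)
        Y.append(y)
        E.append(e)
    return X, Y, E
```

With these hypothetical defaults, the features are more strongly correlated among exposed subjects (E = 1) than among unexposed ones.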
  • 50. advantages and disadvantages (by general approach)
      • Single-Marker
          • Advantages: simple, easy to implement
          • Disadvantages: multiple testing burden, power, interpretability
      • Penalization
          • Advantages: multivariate, variable selection, sparsity, efficient optimization algorithms
          • Disadvantages: poor sensitivity with correlated data, ignores structure in design matrix, interpretability
      • Environment Cluster with Regression
          • Advantages: multivariate, flexible implementation, group structure, takes advantage of correlation, interpretability
          • Disadvantages: difficult to identify relevant clusters, clustering is unsupervised
  • 51. Methods to detect gene clusters (Table 1; General Approach: Formula)
      • Correlation: pearson, spearman, biweight midcorrelation
      • Correlation Scoring: $|\rho_{E=1} - \rho_{E=0}|$
      • Weighted Correlation Scoring: $c\,|\rho_{E=1} - \rho_{E=0}|$
      • Fisher’s Z Transformation: $\dfrac{|z_{ij0} - z_{ij1}|}{\sqrt{1/(n_0 - 3) + 1/(n_1 - 3)}}$
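The scores in Table 1 can be sketched for a single pair of features (i, j) as follows. This illustration assumes Pearson correlation and assumes that $z_{ij0}$ and $z_{ij1}$ are the Fisher z-transforms (atanh) of the correlations in the E = 0 and E = 1 groups. The function names are hypothetical and this is not the eclust implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_score(r0, r1):
    """Correlation Scoring: |rho_{E=1} - rho_{E=0}|."""
    return abs(r1 - r0)


def fisher_z_score(r0, n0, r1, n1):
    """Fisher's Z score for the difference between two correlations.

    r0, r1: correlations of one feature pair in the E = 0 and E = 1 groups.
    n0, n1: group sizes (each must exceed 3; |r| must be below 1).
    """
    z0, z1 = math.atanh(r0), math.atanh(r1)
    return abs(z0 - z1) / math.sqrt(1.0 / (n0 - 3) + 1.0 / (n1 - 3))
```

Computing the score for every pair (i, j) gives a p × p matrix that a clustering algorithm can use as a similarity or dissimilarity measure.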
  • 52. Cluster Representation (Table 2: Methods to create cluster representations; General Approach: Type)
      • Unsupervised: average; K principal components
      • Supervised: partial least squares
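The two unsupervised summaries in Table 2 can be illustrated with a short sketch. It assumes the data are stored as a list of rows and that a cluster is a list of column indices. It computes only the first principal component (K = 1), using plain power iteration. All names are hypothetical.

```python
import math


def cluster_average(X, cluster):
    """Row-wise mean of the columns in `cluster` (an n x 1 summary)."""
    return [sum(row[j] for j in cluster) / len(cluster) for row in X]


def first_pc_scores(X, cluster, iters=200):
    """Scores on the first principal component of the cluster's columns."""
    n, k = len(X), len(cluster)
    means = [sum(row[j] for row in X) / n for j in cluster]
    # Centre each column of the cluster.
    Z = [[row[j] - m for j, m in zip(cluster, means)] for row in X]
    # k x k covariance matrix (the 1/n scaling does not affect eigenvectors).
    C = [[sum(z[a] * z[b] for z in Z) / n for b in range(k)]
         for a in range(k)]
    # Power iteration from an arbitrary non-constant start vector; it can
    # fail in the rare case where the start is orthogonal to the leading
    # eigenvector.
    v = [1.0 + a for a in range(k)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:          # all columns constant: no variance to explain
            break
        v = [x / norm for x in w]
    return [sum(z[a] * v[a] for a in range(k)) for z in Z]
```

Either function turns a cluster of correlated features into one n × 1 vector that can enter the regression phase in place of the individual features.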
  • 54. Simulation Study 1 [figure panels: (a) $\mathrm{Corr}(X_{E=0})$, (b) $\mathrm{Corr}(X_{E=1})$, (c) $|\mathrm{Corr}(X_{E=1}) - \mathrm{Corr}(X_{E=0})|$, (d) $\mathrm{Corr}(X_{all})$]
  • 55. Results: Jaccard Index and test set MSE [figure]
  • 57. TOM based on all subjects [figure: $\mathrm{TOM}(X_{all})$]
  • 58. TOM based on unexposed subjects [figure: $\mathrm{TOM}(X_{E=0})$]
  • 59. TOM based on exposed subjects [figure: $\mathrm{TOM}(X_{E=1})$]
  • 60. Difference of TOMs [figure: $|\mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0})|$]
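  • 63. Model
      $g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
      • $g(\cdot)$ is a known link function
      • $\mu = \mathrm{E}[Y \mid X, E, \beta, \alpha]$
      • $\beta = (\beta_1, \beta_2, \ldots, \beta_p, \beta_E) \in \mathbb{R}^{p+1}$
      • $\alpha = (\alpha_{1E}, \ldots, \alpha_{pE}) \in \mathbb{R}^{p}$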
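  • 64. Variable Selection
      $\arg\min_{\beta_0, \beta, \alpha} \; \frac{1}{2} \|Y - g(\mu)\|^2 + \lambda \left( \|\beta\|_1 + \|\alpha\|_1 \right)$
      • $\|Y - g(\mu)\|^2 = \sum_i (y_i - g(\mu_i))^2$
      • $\|\beta\|_1 = \sum_j |\beta_j|$
      • $\|\alpha\|_1 = \sum_j |\alpha_j|$
      • $\lambda \geq 0$: tuning parameter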
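A minimal sketch of this lasso objective, assuming an identity link so that $g(\mu)$ is the linear predictor. The solver is plain cyclic coordinate descent with soft-thresholding, chosen here only for illustration (the slides do not state which algorithm is used). The function names are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def design_with_interactions(X, E):
    """Columns X_1..X_p, E, X_1*E..X_p*E (main effects, then interactions)."""
    return [list(row) + [e] + [x * e for x in row] for row, e in zip(X, E)]


def lasso_cd(D, y, lam, iters=200):
    """Minimise 0.5 * sum((y - b0 - D b)^2) + lam * sum(|b_j|).

    The intercept b0 is not penalised; every other coefficient (main
    effects and interactions alike) shares the single penalty lam.
    """
    n, m = len(D), len(D[0])
    b0, b = sum(y) / n, [0.0] * m
    resid = [yi - b0 for yi in y]
    col_ss = [sum(row[j] ** 2 for row in D) for j in range(m)]
    for _ in range(iters):
        # Intercept update: absorb the mean of the current residuals.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(m):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the partial residual.
            rho = sum(row[j] * r for row, r in zip(D, resid)) + col_ss[j] * b[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            if new != b[j]:
                delta = new - b[j]
                resid = [r - row[j] * delta for row, r in zip(D, resid)]
                b[j] = new
    return b0, b
```

Because one λ penalises all coefficients in the same way, nothing in this formulation prevents an interaction from being selected without its main effects.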
  • 67. Why Strong Heredity?
      • Statistical Power: large main effects are more likely to lead to detectable interactions than small ones
      • Interpretability: Assuming a model with interaction only is generally not biologically plausible
      • Practical Sparsity: X1, E, X1 · E vs. X1, E, X2 · E (the first model requires measuring only X1 and E; the second also requires X2)
  • 70. Model
      $g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
      Reparametrization [1]: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$
      Strong heredity principle [2]: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
      [1] Choi et al. 2010, JASA
      [2] Chipman 1996, Canadian Journal of Statistics
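The reparametrization enforces strong heredity by construction: a product of three factors is non-zero only if every factor is non-zero. A small sketch with hypothetical function names:

```python
def interaction_coefficients(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return [g * b * beta_E for g, b in zip(gamma, beta)]


def satisfies_strong_heredity(beta, beta_E, alpha):
    """True if every non-zero interaction has both main effects non-zero."""
    return all(a == 0 or (b != 0 and beta_E != 0)
               for a, b in zip(alpha, beta))
```

Any α produced by `interaction_coefficients` passes the check. A freely parametrized α, as in the plain lasso formulation, may not.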
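  • 71. Strong Heredity Model with Penalization
      $\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2} \|Y - g(\mu)\|^2 + \lambda_\beta \left( w_1 |\beta_1| + \cdots + w_q |\beta_q| + w_E |\beta_E| \right) + \lambda_\gamma \left( w_{1E} |\gamma_{1E}| + \cdots + w_{qE} |\gamma_{qE}| \right)$
      $w_j = \left| \dfrac{1}{\hat{\beta}_j} \right|, \qquad w_{jE} = \left| \dfrac{\hat{\beta}_j \hat{\beta}_E}{\hat{\alpha}_{jE}} \right|$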
  • 72. Open source software
      • Software implementation in R: http://sahirbhatnagar.com/eclust/
      • Allows user specified interaction terms
      • Automatically determines the optimal tuning parameters through cross validation
      • Can also be applied to genetic data (SNPs)
  • 77. The most popular way of feature screening
      How to fit statistical models when you have over 100,000 features?
      Marginal correlations, t-tests
      • for each feature, calculate the correlation between that feature and Y
      • keep all features with correlation greater than some threshold
      • However, this procedure assumes a linear relationship between X and Y
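The screening steps above can be sketched as follows. This sketch assumes Pearson correlation and thresholds its absolute value (the slide does not say whether the sign is ignored). The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def marginal_screen(X, y, threshold):
    """Indices of features whose |correlation with y| exceeds threshold."""
    p = len(X[0])
    keep = []
    for j in range(p):
        column = [row[j] for row in X]
        if abs(pearson(column, y)) > threshold:
            keep.append(j)
    return keep
```

For example, if y equals the square of a feature that is symmetric around zero, that feature determines y exactly yet has zero correlation with it, so it is screened out. This is the linearity limitation noted in the last bullet.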
  • 78. Non-linear feature screening: Kolmogorov-Smirnov Test
      Mai & Zou (2012) proposed using the Kolmogorov-Smirnov (KS) test statistic
      $\hat{K}_j = \sup_x \left| \hat{F}_j(x \mid Y = 1) - \hat{F}_j(x \mid Y = 0) \right|$  (3)
      Figure 8: Depiction of KS statistic
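Equation (3) can be computed directly from the two class-conditional empirical distribution functions. The sketch below assumes a binary outcome coded 0/1. It evaluates the difference only at the observed values, which is sufficient because both empirical CDFs are step functions that change only at data points. The function names are hypothetical.

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def ks_statistic(feature, y):
    """Two-sample KS statistic for one feature, split by the 0/1 outcome y."""
    g1 = [v for v, label in zip(feature, y) if label == 1]
    g0 = [v for v, label in zip(feature, y) if label == 0]
    if not g1 or not g0:
        raise ValueError("both outcome classes must be present")
    # The supremum over x is attained at one of the observed values.
    return max(abs(ecdf(g1, x) - ecdf(g0, x)) for x in feature)
```

Screening then keeps the features with the largest values of the statistic, without assuming a linear relationship between the feature and the outcome.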
  • 79. Non-linear Interaction Models
      After feature screening, we can fit non-linear relationships between X and Y
      $Y_i = \beta_0 + f(X_{ij}) + f(X_{ij}, E_i) + \varepsilon_i$  (4)
  • 86. Conclusions and Contributions
      • Large system-wide changes are observed in many environments
      • This observation can possibly be exploited to aid analysis of large data
      • We develop and implement a multivariate penalization procedure for predicting a continuous or binary disease outcome while detecting interactions between high dimensional data (p >> n) and an environmental factor
      • Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
      • Also, we develop and implement a strong heredity framework within the penalized model
      • R software: http://sahirbhatnagar.com/eclust/
  • 89. Limitations
      • There must be a high-dimensional signature of the exposure
      • Clustering is unsupervised
      • Two tuning parameters
  • 90. What type of data is required to use these methods
  • 91. ECLUST method
      1. environmental exposure (currently only binary)
      2. a high dimensional dataset that can be affected by the exposure
      3. a single phenotype (continuous or binary)
      4. there must be a high-dimensional signature of the exposure
  • 92. Strong Heredity and Non-linear Models
      1. a single phenotype (continuous or binary)
      2. environment variable (continuous or binary)
      3. any number of predictor variables
  • 93. Check out our Lab’s Software! http://greenwoodlab.github.io/software/
  • 94. acknowledgements
      • Dr. Celia Greenwood
      • Dr. Blanchette and Dr. Yang
      • Dr. Luigi Bouchard, André Anne Houde
      • Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
      • Maxime Turgeon, Kevin McGregor, Lauren Mokry, Marie Forest, Pablo Ginestet
      • Greg Voisin, Vince Forgetta, Kathleen Klein
      • Mothers and children from the study