1. Integrative causality analysis of genetic, epigenetic, and
transcriptomic data in a large cohort
Rosemary McCloskey and Sara Mostafavi
rmcclosk.math@gmail.com
http://slideshare.net/rmcclosk/omics-integration
March 27, 2015
R. McCloskey & S. Mostafavi () Omics data integration March 27, 2015 1 / 12
4. Motivation
genetic, epigenetic, and transcriptomic data provide snapshots of
cellular processes
usually one data type is studied at a time, in relation to a phenotype
or disease
[Figure: genotype (GATTACA), gene expression, methylation, and histone acetylation, linked by a question mark]
how do these data fit together?
7. The data
large cohort designed to study cognitive decline and Alzheimer’s disease
genotype, gene expression, DNA methylation, and histone acetylation (ChIP-seq) data
392 individuals with all four data types were used for this analysis
[Figure: Venn diagram of individuals with expression, methylation, acetylation, and genotype data; 392 have all four]
10. Quantitative trait loci (QTLs)
a QTL is a genetic locus correlated with a phenotype
we are interested in QTLs for gene expression (eQTLs), histone acetylation (aceQTLs), and methylation (meQTLs)
QTLs provide a tool to study interactions between other molecular phenotypes
[Figure: three panels plotting expression, acetylation, and methylation against genotype (0, 1, 2)]
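To make "correlated with a phenotype" concrete, here is a minimal sketch of Spearman's ρ (the statistic named on the Identifying QTLs slide) between a genotype and a molecular phenotype. The toy genotype and expression values are hypothetical, and the helper names `average_ranks` and `spearman_rho` are illustrative rather than taken from the software used in the talk.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: minor-allele counts and an expression level per individual.
genotype = [0, 0, 1, 1, 2, 2]
expression = [0.1, 0.2, 0.5, 0.6, 1.1, 1.3]
rho = spearman_rho(genotype, expression)
print(f"Spearman's rho = {rho:.3f}")
```

Because genotype takes only three values, ties are unavoidable, which is why the ranking step averages tied ranks instead of assuming distinct values.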
14. Identifying QTLs
↓ SNPs in 200 kb window, Spearman’s ρ
↓ Holm-Bonferroni correction, best SNP per feature
↓ FDR correction
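The two correction steps can be sketched as follows, assuming the per-SNP p-values (for example from Spearman's ρ tests) are already available. Benjamini-Hochberg is used here as a stand-in for the FDR step; the Software slide lists qvalue, so the actual analysis may have used Storey's q-values instead. The function names, the `fdr=0.05` threshold, and the toy p-values are all hypothetical.

```python
def holm_adjust(pvals):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1).
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for step, i in enumerate(order):
        rank = m - step  # rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def call_qtls(pvals_by_feature, fdr=0.05):
    """pvals_by_feature maps feature -> {snp: p-value} for SNPs in its window."""
    best = {}
    for feature, snp_pvals in pvals_by_feature.items():
        snps = list(snp_pvals)
        # Step 1: correct within the feature, then keep its best SNP.
        adjusted = holm_adjust([snp_pvals[s] for s in snps])
        k = min(range(len(snps)), key=lambda j: adjusted[j])
        best[feature] = (snps[k], adjusted[k])
    # Step 2: control the FDR across features on the best-SNP p-values.
    features = list(best)
    qvals = bh_adjust([best[f][1] for f in features])
    return {f: best[f][0] for f, q in zip(features, qvals) if q <= fdr}

# Hypothetical toy input: two genes, a handful of SNPs each.
toy = {
    "geneA": {"rs1": 1e-6, "rs2": 0.2, "rs3": 0.5},
    "geneB": {"rs4": 0.3, "rs5": 0.6},
}
qtls = call_qtls(toy)
print(qtls)
```

Correcting within each feature first means features with many SNPs in their window are not favoured simply for having more tests.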
18. Removing Principal Components
technical, environmental, and biological covariates can swamp QTL effects
correct by removing principal components
number of peaks with a QTL plateaus at 10 PCs, while the numbers of genes and CpGs with a QTL continue to increase
for this analysis, we removed 10 PCs from all data
[Figure: number of genes, peaks, and CpGs with a QTL against number of PCs removed (0 to 20)]
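One common way to remove principal components is to subtract the top components of a singular value decomposition from the centred data matrix. The sketch below assumes a samples × features layout and uses NumPy; the function name `remove_top_pcs` and the simulated confounder are hypothetical, and the slides do not say which implementation was actually used.

```python
import numpy as np

def remove_top_pcs(data, n_pcs):
    """Return data (samples x features), centred, with its top n_pcs PCs removed."""
    centered = data - data.mean(axis=0)
    # Thin SVD: rows of vt are principal axes, u * s holds the sample scores.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    top = (u[:, :n_pcs] * s[:n_pcs]) @ vt[:n_pcs]
    return centered - top

# Hypothetical data: one strong hidden covariate shared by all features, plus noise.
rng = np.random.default_rng(0)
confounder = 5.0 * rng.normal(size=(50, 1)) @ rng.normal(size=(1, 20))
data = confounder + rng.normal(size=(50, 20))

residual = remove_top_pcs(data, n_pcs=1)
print(f"variance before: {data.var():.2f}, after: {residual.var():.2f}")
```

In the setting of the slides, `n_pcs` would be 10, applied separately to each molecular data matrix before the QTL scan.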
20. Identifying multi-QTLs
By intersecting QTL sets, we found 240 (gene, CpG, peak) triples that shared the same QTL
[Figure: Venn diagram of eQTL, meQTL, and aceQTL sets; region counts 2984, 1799, 50981, 127, 240, 1604, 2129, with 240 shared by all three]
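The intersection step can be sketched as below, assuming each QTL set is stored as a mapping from a feature to its best SNP and that "the same QTL" means an identical SNP. Both are assumptions, and all identifiers and toy entries are hypothetical.

```python
from collections import defaultdict
from itertools import product

def features_by_snp(best_snp):
    """Invert a feature -> best SNP mapping into SNP -> list of features."""
    grouped = defaultdict(list)
    for feature, snp in best_snp.items():
        grouped[snp].append(feature)
    return grouped

def multi_qtl_triples(eqtl, meqtl, aceqtl):
    """List (snp, gene, cpg, peak) tuples whose three features share one SNP."""
    genes, cpgs, peaks = (features_by_snp(d) for d in (eqtl, meqtl, aceqtl))
    shared = set(genes) & set(cpgs) & set(peaks)
    triples = []
    for snp in sorted(shared):
        for gene, cpg, peak in product(genes[snp], cpgs[snp], peaks[snp]):
            triples.append((snp, gene, cpg, peak))
    return triples

# Hypothetical toy QTL sets.
eqtl = {"geneA": "rs1", "geneB": "rs2"}
meqtl = {"cg01": "rs1", "cg02": "rs1", "cg03": "rs3"}
aceqtl = {"peak7": "rs1", "peak9": "rs2"}

triples = multi_qtl_triples(eqtl, meqtl, aceqtl)
print(triples)
```

A SNP linked to several CpGs yields several triples, so the number of triples can exceed the number of shared SNPs.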
We also assessed QTL overlap using the π0 approach
[Figure: 3 × 3 heatmap of pairwise overlap between eQTLs, aceQTLs, and meQTLs; values as extracted: 100 %, 46 %, 14 %, 31 %, 100 %, 11 %, 83 %, 84 %, 100 %]
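A minimal sketch of a π0-style overlap estimate, assuming the approach is Storey's estimator of the null proportion applied to the p-values that QTLs from one data type obtain in another, with overlap reported as π1 = 1 − π0. The fixed threshold `lam=0.5`, the function name, and the toy p-values are hypothetical; the qvalue package by default smooths over a grid of thresholds rather than using a single one.

```python
def pi1_estimate(pvals, lam=0.5):
    """Estimate the fraction of non-null tests from p-values via Storey's pi0."""
    m = len(pvals)
    # Under the null, p-values are uniform, so about m * pi0 * (1 - lam) exceed lam.
    pi0 = sum(p > lam for p in pvals) / (m * (1.0 - lam))
    return 1.0 - min(1.0, pi0)

# Hypothetical p-values of ten eQTL SNPs when retested against another data type.
replication_pvals = [0.001] * 8 + [0.6, 0.9]
overlap = pi1_estimate(replication_pvals)
print(f"estimated overlap (pi1) = {overlap:.2f}")
```

Unlike the set intersection, this estimate does not depend on a significance cutoff in the second data type, so it can detect sharing that thresholding misses.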
24. Bayesian networks
Bayesian networks are directed graphical models whose directed
edges are interpreted as causal relationships
We use conditional Gaussian networks
Score = likelihood of data given network
[Figure: two-node network, temperature → precipitation]
temp ∼ N(0, 1), precip | temp ∼ N(temp, 1)
Observed: temp = 0.7, precip = 0.5
Score = Pr(N(0, 1) = 0.7) × Pr(N(0.7, 1) = 0.5)
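The worked example above, which evaluates an N(0, 1) density at 0.7 and an N(0.7, 1) density at 0.5, can be reproduced in a few lines. The comparison against a network with no edge is a hypothetical addition for illustration, and `normal_pdf` is an illustrative helper rather than part of deal or CGBayesNets.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

temp, precip = 0.7, 0.5  # the single observation from the slide

# Network temperature -> precipitation: precip is centred on the observed temp.
score_edge = normal_pdf(temp, 0.0) * normal_pdf(precip, temp)

# Hypothetical alternative with no edge: both variables are N(0, 1).
score_no_edge = normal_pdf(temp, 0.0) * normal_pdf(precip, 0.0)

print(f"with edge: {score_edge:.4f}, without edge: {score_no_edge:.4f}")
```

With real data the score is the product of such terms over all individuals, usually accumulated as a sum of log densities to avoid underflow.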
28. Networks for QTLs
Used the deal and CGBayesNets packages to construct one Bayesian network
for each multi-QTL by exhaustive search
With deal, edges into genotype were blacklisted
The most common network structure was independence, which accounted for
42% of deal networks and 29% of CGBayesNets networks
[Figure: network over genotype, expression, acetylation, and methylation]
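Exhaustive search is feasible here because four nodes admit only a few hundred directed acyclic graphs. The sketch below enumerates the candidate structures, with and without a blacklist on edges into genotype; it does not score them and does not reproduce the deal or CGBayesNets interfaces. The function names are hypothetical.

```python
from itertools import combinations, permutations

NODES = ("genotype", "expression", "acetylation", "methylation")

def is_acyclic(edges):
    """Kahn-style check: repeatedly peel off nodes that have no parents."""
    remaining = set(NODES)
    edges = set(edges)
    while remaining:
        sources = {n for n in remaining if not any(dst == n for _, dst in edges)}
        if not sources:
            return False  # every remaining node has a parent, so there is a cycle
        remaining -= sources
        edges = {(src, dst) for src, dst in edges if src not in sources}
    return True

def candidate_dags(blacklisted_targets=()):
    """Enumerate all DAGs over NODES with no edge into a blacklisted target."""
    allowed = [(s, d) for s, d in permutations(NODES, 2) if d not in blacklisted_targets]
    dags = []
    for size in range(len(allowed) + 1):
        for edges in combinations(allowed, size):
            if is_acyclic(edges):
                dags.append(edges)
    return dags

all_dags = candidate_dags()
restricted = candidate_dags(blacklisted_targets=("genotype",))
print(len(all_dags), len(restricted))
```

Blacklisting edges into genotype encodes the fact that molecular phenotypes cannot alter the germline genotype, and it shrinks the search space from 543 structures to 200.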
29. Future Work
Expand the number of multi-QTLs
More than just the best SNP per feature
Identify overlapping QTLs intelligently
More rigorous criterion for number of PCs to remove
Try other packages for network learning (HyPhy)
Are QTLs enriched among SNPs identified in GWAS?
Correlations with phenotype (cognitive decline etc.)
30. Thank you!
Harvard / Broad
Philip L. De Jager
Lori Chibnik
Jishu Xu
Charles White
Cristin McCabe
Towfique Raj
Rush
David A Bennett
Chris Gaiteri
Lei Yu
Bioinformatics Training Program
All the students
Sharon Ruschkowski
31. Software
QTL analysis: Matrix eQTL, qvalue
Bayesian networks: deal, CGBayesNets
Slides: beamer, TikZ, tikzDevice
Plots: pheatmap, ggplot2, VennDiagram
Colour Scheme: solarized