
Analysis of DNA methylation and Gene expression to predict childhood obesity


Recent advances in genomic technologies have made it feasible to measure, on the same individual, multiple types of genomic activity such as genotypes, gene expression, DNA copy number, methylation and microRNA expression. However, in order to benefit from the increasing amounts of heterogeneous data and to obtain a more complete view of genomic functions, there is a great need for statistical and computationally efficient methods that allow us to combine this information in an intelligent way. Challenges with prediction models in this setting arise from the high-dimensional, non-linear nature of the data, the large number of measurements compared to the few samples for whom they are collected, and the presence of complex interactions between the different types of data. Methods such as sparse regression, hierarchical clustering and principal component analysis can each address one of these challenges, but cannot address all of them simultaneously. Kernel methods, which use matrices measuring the similarity between pairs of individuals, offer a powerful way of simultaneously addressing these challenges without significantly increasing the computational burden. In this work, we investigate the benefits and challenges that arise from using kernel methods in the context of integrating DNA methylation, gene expression and phenotypic data in a sample of mother-child pairs from a prospective birth cohort. The goal of this study is to identify epigenetic marks observed at birth that help predict childhood obesity.
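
To make the idea of a similarity matrix between individuals concrete, here is a minimal sketch that builds one kernel per data type and averages them. The Gaussian (RBF) kernel, the bandwidth default, the equal weighting and the simulated matrices are all assumptions made for illustration; the abstract does not say which kernels or weights were used.

```python
import numpy as np

def rbf_kernel(M, gamma=None):
    """n x n Gaussian similarity between the rows (individuals) of M."""
    if gamma is None:
        gamma = 1.0 / M.shape[1]  # assumed default bandwidth: 1 / number of features
    sq_norms = np.sum(M ** 2, axis=1)
    # Squared Euclidean distance between every pair of rows.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (M @ M.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
n = 45                                     # individuals
expression = rng.normal(size=(n, 500))     # simulated stand-in for expression probes
methylation = rng.normal(size=(n, 2000))   # simulated stand-in for methylation probes

K_expr = rbf_kernel(expression)
K_meth = rbf_kernel(methylation)

# Equal-weight average of the two kernels (an assumed, simple way to integrate them).
K_combined = 0.5 * (K_expr + K_meth)
print(K_combined.shape)
```

Whatever the number of probes, each kernel is only n x n, which is why the computational burden stays tied to the sample size rather than to the number of features.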

Published in: Science

  1. Analysis of DNA Methylation and Gene Expression data in Placenta tissue to predict childhood obesity: An Integrative Approach. Bhatnagar SR (1,2), Houde A (4,5), Voisin G (2), Bouchard L (4,5), Greenwood CMT (1,2,3). (1) Department of Epidemiology, Biostatistics and Occupational Health, McGill University; (2) Lady Davis Institute, Jewish General Hospital, Montréal, QC; (3) Departments of Oncology and Human Genetics, McGill University; (4) Department of Biochemistry, Université de Sherbrooke, QC; (5) ECOGENE-21 and Lipid Clinic, Chicoutimi Hospital, QC. sahirbhatnagar.com/talks. Poster Session B, # 56
  2. Motivating Question # 1. sahirbhatnagar.com Data Integration CHSGM 2015 2 / 25
  3. Motivation. 1 in 4 adult Canadians and 1 in 10 children are clinically obese. Events during pregnancy are suspected to play a role in childhood obesity, but we don't know about the mechanisms involved. Children born to women who had a gestational diabetes mellitus-affected pregnancy are more likely to be overweight and obese. Evidence suggests epigenetic factors are an important piece of the puzzle.
  4. Motivating Question # 2
  5. Motivating Question. [Figure: axes "sample size" (ticks 25, 50) and "genomic data"; label: Gene Expression]
  6. Motivating Question. [Figure: same axes; labels: Gene Expression and DNA Methylation, each shown twice]
  7. Motivating Question. [Figure: same as slide 6, with "??" and "?" added]
  8. The Data
  9. The Data. [Diagram: Expression (HT-12 v4, p = 46,889); Methylation (Illumina 450k, p = 375,561); Gestational Diabetes (n = 45, GD = 29); Placenta (n = 45); timeline from "at birth" to "age 5"]
  10. The Data. [Diagram: same as slide 9, with the label X added]
  11. The Data. [Diagram: same as slide 10, with "7 Fat Measures" and "Child (n = 23, GD = 16)" added]
  12. The Data. [Diagram: same as slide 11, with the label Y added]
  13. The Data. [Diagram: same as slide 12, with "?" added]
  14. Summarizing Expression, Methylation and Gestational Diabetes Phenotype in Placenta Tissue
  15. Sparse Canonical Correlation Analysis (sCCA). CCA requires calculation of $(X^T X)^{-1}$ and $(Y^T Y)^{-1}$. When $p + q \gg n$, these matrices are singular. sCCA applies an $L_1$ penalty to the canonical vectors to obtain sparse solutions (Witten et al., 2009; Parkhomenko et al., 2009). Assumes $X^T X = I_p$, $Y^T Y = I_q$. $\text{maximize}_{u,v} \; u^T X^T Y v$ subject to $\|u\|_2^2 \le 1$, $\|v\|_2^2 \le 1$ and $P_1(u) \le \lambda_1$, $P_2(v) \le \lambda_2$
  16. Supervised Sparse CCA. Main idea: (1) the features that are most associated with the outcome Q are identified to form the reduced matrices X and Y; (2) sCCA is performed on X and Y.
  17. Importance of Gestational Diabetes Phenotype. [Figure: two panels, "Gestational Diabetes Status Used in Sparse CCA" and "Gestational Diabetes Status Not Used"; axes: # non-0 expression probes and # non-0 methylation probes; scale: Correlation, 0.88 to 0.98]
  18. GO Stat Analysis for Enrichment. Enrichment analysis based on the non-zero vector of the 1st component from the Supervised sCCA analysis. Genes associated with inflammatory processes.
      Table: Top list of enriched GO terms
      GOBPID   FDR        OR   E.Count  Count  Size  Term
      0002376  < 10^-14   2.1  131.6    227    2178  immune system process
      0006955  < 10^-13   2.3  78.7     153    1303  immune response
      0002252  < 10^-9    2.7  34.1     80     565   immune effector process
      0045087  < 10^-8    2.3  49.0     99     811   innate immune response
      0002682  < 10^-8    2.1  66.56    122    1102  regulation of immune system process
      0002684  < 10^-8    2.4  40.1     84     664   positive regulation of immune system process
      0006952  < 10^-8    1.9  84.5     144    1399  defense response
      0050776  < 10^-8    2.3  44.5     90     738   regulation of immune response
      0050778  < 10^-7    2.6  28.5     65     473   positive regulation of immune response
      0006950  < 10^-7    1.6  196.8    271    3258  response to stress
  19. Summarizing Bodyfat Measures
  20. Cluster 6 Bodyfat measures in 2 groups. [Figure: heatmap of Zscore BMI, percent fat, subscapularis, bicep, tricep and iliacus, one row per child; color key: Value, −2 to 2]
  21. Circle of Correlations. [Figure: Variables factor map (PCA); Dim 1 (50.68%) against Dim 2 (15.41%); variables: Zscore BMI, percent fat, tricep, bicep, subscapularis, iliacus]
  22. Combining Both Data
  23. Regression via Elastic Net. [Diagram: X at birth: Expression (HT-12 v4, p = 46,889), Methylation (Illumina 450k, p = 375,561), Gestational Diabetes (n = 45, GD = 29), Placenta (n = 45); Y at age 5: 7 Fat Measures, Child (n = 23, GD = 16); "?" between X and Y]
  24. 1st PC as Summary Bodyfat Measure. [Figure: LOOCV mean squared error by data used to predict the 1st PC of bodyfat measures. data.type: Canonical Variables; Expr+Methy non 0 CCA factors; Expr non 0 CCA factors; Methy non 0 CCA factors; Expr+Methy Filter; Expr Filter low means; Methy Filter low var; Expr+Methy Filter low+t.test; Expr Filter low+t.test; Methy Filter low+t.test; Expr+Methy Filter t.test; Expr Filter t.test; Methy Filter t.test]
  25. Ward Clustering Groups. [Figure: LOOCV misclassification error (0.0 to 0.5) by data used to predict Ward clustering groups. data.type: same 13 categories as on slide 24]
  26. Neuropeptide Y Receptor (NPY1R). From OMIM: Neuropeptide Y is one of the most abundant neuropeptides in the mammalian nervous system. It exhibits a diverse range of important physiologic activities, including effects on food intake. Its receptors have been identified in a variety of tissues, including placenta (Herzog et al., 1992).
  27. Motivating Question #2: My Answer. [Figure: axes "sample size" (ticks 25, 50) and "genomic data"; labels: Gene Expression and DNA Methylation, each shown twice]
  28. Big Data
  29. Big Data; Data Integration
  30. Big Data; Data Integration; Machine Learning
  31. Small n Data
  32. Acknowledgements. Celia Greenwood and Mathieu Blanchette; Greg Voisin, Andrée-Anne Houde, Luigi Bouchard; all the mothers and children that took part in this study; you.
  33. References
      Principal component analysis plots and beamer template. URL http://gastonsanchez.com/.
      Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correlation analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1–34, 2009.
      Daniela M Witten and Robert J Tibshirani. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1):1–27, 2009.
      Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, kxp008, 2009.
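
The sCCA criterion on slide 15 can be sketched with alternating soft-thresholding updates in the spirit of the penalized matrix decomposition of Witten et al. (2009). This is a simplified illustration, not the authors' code: instead of searching for the threshold that meets the bounds λ1 and λ2, it shrinks by a fixed fraction of the largest entry (`shrink_u`, `shrink_v`, both hypothetical tuning parameters), and the data are simulated.

```python
import numpy as np

def shrink(a, frac):
    """Soft-threshold a by frac * max|a|, then rescale to unit L2 norm."""
    delta = frac * np.max(np.abs(a))
    s = np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else s

def sparse_cca(X, Y, shrink_u, shrink_v, n_iter=100, tol=1e-8):
    """First pair of sparse canonical vectors, treating X^T X and Y^T Y as identity."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    C = X.T @ Y                                      # p x q cross-product matrix
    v = np.linalg.svd(C, full_matrices=False)[2][0]  # leading right singular vector
    u = np.zeros(C.shape[0])
    for _ in range(n_iter):
        u_new = shrink(C @ v, shrink_u)
        v_new = shrink(C.T @ u_new, shrink_v)
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v

# Simulated data: only the first 3 columns of X and of Y share a latent signal z.
rng = np.random.default_rng(1)
n, p, q = 40, 30, 20
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
X[:, :3] += 3 * z[:, None]
Y[:, :3] += 3 * z[:, None]

u, v = sparse_cca(X, Y, shrink_u=0.5, shrink_v=0.5)
print(np.flatnonzero(u), np.flatnonzero(v))
```

Because the updates only need the cross-product `X.T @ Y`, no inverse of a singular matrix is ever formed, which is the point of the identity assumption on slide 15.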
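
Step 1 of the supervised sparse CCA idea on slide 16 (keep the features most associated with the outcome Q) might look like the sketch below. The slides do not say how association is measured; a Welch t-statistic for a binary outcome is assumed here, and `keep` is a hypothetical tuning parameter. The data are simulated, with group sizes borrowed from slide 9 (n = 45, GD = 29).

```python
import numpy as np

def welch_t(M, q):
    """Two-sample Welch t-statistic for every column of M, split by binary q."""
    a, b = M[q == 1], M[q == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def screen_features(M, q, keep):
    """Indices of the `keep` columns with the largest |t|, in column order."""
    t = np.abs(welch_t(M, q))
    return np.sort(np.argsort(t)[::-1][:keep])

rng = np.random.default_rng(2)
q = np.array([1] * 29 + [0] * 16)        # 29 with gestational diabetes, 16 without
M = rng.normal(size=(45, 1000))          # simulated probe matrix
M[q == 1, :5] += 2.0                     # first 5 probes differ between groups

kept = screen_features(M, q, keep=20)
M_reduced = M[:, kept]                   # reduced matrix passed on to sCCA (step 2)
print(kept)
```

The same screening would be applied separately to the expression and methylation matrices before running sCCA on the two reduced matrices.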
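
Slides 20 and 25 refer to splitting the children into 2 groups by Ward clustering of the bodyfat measures. A minimal sketch of Ward's minimum-variance agglomeration is shown below, written directly in NumPy so it is self-contained; the simulated measurements and the helper name `ward_groups` are assumptions for illustration.

```python
import numpy as np

def ward_groups(X, n_groups=2):
    """Agglomerative clustering that always merges the pair of clusters
    giving the smallest increase in within-cluster sum of squares (Ward)."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = X[clusters[a]].mean(axis=0)
                cb = X[clusters[b]].mean(axis=0)
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * np.sum((ca - cb) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for g, members in enumerate(clusters):
        labels[members] = g
    return labels

# Simulated stand-in: 23 children x 6 standardized bodyfat measures,
# with the first 12 children shifted upward on every measure.
rng = np.random.default_rng(3)
fat = rng.normal(size=(23, 6))
fat[:12] += 3.0

groups = ward_groups(fat, n_groups=2)
print(groups)
```

The quadratic search over cluster pairs is slow in general but harmless at n = 23.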
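
Slides 21 and 24 use the first principal component of the bodyfat measures as a one-number summary per child. The sketch below computes it from the SVD of the standardized measures and reports the proportion of variance per component. The data are simulated from a single common factor, so the printed proportions will not match the 50.68% and 15.41% on slide 21.

```python
import numpy as np

def first_pc(M):
    """Scores and loadings of the 1st principal component of standardized M,
    plus the proportion of variance explained by every component."""
    Z = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    scores = Z @ Vt[0]
    return scores, Vt[0], explained

# Simulated stand-in: 23 children, 6 measures driven by one latent "adiposity" factor.
rng = np.random.default_rng(4)
factor = rng.normal(size=(23, 1))
measures = factor + 0.5 * rng.normal(size=(23, 6))

pc1, loadings, explained = first_pc(measures)
print(np.round(explained, 3))
```

The sign of a principal component is arbitrary, so `pc1` may need flipping so that larger values mean more body fat.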
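
Slide 23 names elastic net regression as the tool linking X to Y. As a rough illustration, the sketch below fits an elastic net by cyclic coordinate descent, assuming the glmnet-style objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right)$. The parametrization, the fixed number of sweeps and the simulated data are assumptions; the slides do not give the software or tuning values used.

```python
import numpy as np

def soft_threshold(a, delta):
    return np.sign(a) * max(abs(a) - delta, 0.0)

def elastic_net(X, y, lam, alpha, n_sweeps=200):
    """Cyclic coordinate descent for the elastic net with an unpenalized intercept."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    col_sq = np.sum(Xc ** 2, axis=0) / n
    beta = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of feature j with the residual that excludes its own fit.
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            if new != beta[j]:
                resid -= Xc[:, j] * (new - beta[j])
                beta[j] = new
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

# Simulated p >> n problem: only the first 3 of 200 features matter.
rng = np.random.default_rng(5)
n, p = 45, 200
X_sim = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 2.0]
y_sim = X_sim @ beta_true + 0.5 * rng.normal(size=n)

intercept, beta_hat = elastic_net(X_sim, y_sim, lam=0.5, alpha=0.9)
print(np.count_nonzero(beta_hat), np.round(beta_hat[:3], 2))
```

The L1 part of the penalty sets most coefficients exactly to zero, while the L2 part stabilizes the fit when features are correlated, which is common among neighbouring probes.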
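
The comparisons on slides 24 and 25 are scored by leave-one-out cross-validation (LOOCV). A generic sketch of the mean-squared-error version follows. Ordinary least squares is used only as a placeholder model so the example stays short; the slides use elastic net, and the simulated predictors and response are assumptions.

```python
import numpy as np

def loocv_mse(X, y, fit, predict):
    """Refit on n - 1 samples, predict the held-out one, average squared errors."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        model = fit(X[train], y[train])
        errors[i] = (y[i] - predict(model, X[i:i + 1])[0]) ** 2
    return errors.mean()

def fit_ols(X, y):
    """Placeholder model: least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Simulated stand-in: 23 children, 3 summary predictors, one continuous response.
rng = np.random.default_rng(6)
X_cv = rng.normal(size=(23, 3))
y_cv = X_cv @ np.array([1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=23)

mse = loocv_mse(X_cv, y_cv, fit_ols, predict_ols)
print(round(float(mse), 4))
```

Any model can be plugged in through `fit` and `predict`; for the misclassification version on slide 25, the squared error would be replaced by an indicator of a wrong group label.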
