sahirbhatnagar

Apr. 19, 2015

Recent advances in genomic technologies have made it feasible to measure, on the same individual, multiple types of genomic activity such as genotypes, gene expression, DNA copy number, methylation and microRNA expression. However, in order to benefit from the increasing amounts of heterogeneous data and to obtain a more complete view of genomic functions, there is a great need for statistical and computationally efficient methods that allow us to combine this information in an intelligent way. Challenges with prediction models in this setting arise from the high-dimensional, non-linear nature of the data, the large number of measurements compared to the few samples on whom they are collected, and the presence of complex interactions between the different types of data. Methods such as sparse regression, hierarchical clustering and principal component analysis can each address one of these challenges, but cannot address all of them simultaneously. Kernel methods, which use matrices measuring the similarity between two individuals, offer a powerful way of simultaneously addressing these challenges without significantly increasing the computational burden. In this work, we investigate the benefits and challenges that arise from using kernel methods in the context of integrating DNA methylation, gene expression and phenotypic data in a sample of mother-child pairs from a prospective birth cohort. The goal of this study is to identify epigenetic marks observed at birth that help predict childhood obesity.

- Analysis of DNA Methylation and Gene Expression Data in Placenta Tissue to Predict Childhood Obesity: An Integrative Approach
  - Bhatnagar SR (1,2), Houde A (4,5), Voisin G (2), Bouchard L (4,5), Greenwood CMT (1,2,3)
  - (1) Department of Epidemiology, Biostatistics and Occupational Health, McGill University
  - (2) Lady Davis Institute, Jewish General Hospital, Montréal, QC
  - (3) Departments of Oncology and Human Genetics, McGill University
  - (4) Department of Biochemistry, Université de Sherbrooke, QC
  - (5) ECOGENE-21 and Lipid Clinic, Chicoutimi Hospital, QC
  - sahirbhatnagar.com/talks
  - Poster Session B, #56
- Motivating Question #1 (sahirbhatnagar.com, Data Integration, CHSGM 2015)
- Motivation
  - 1 in 4 adult Canadians and 1 in 10 children are clinically obese.
  - Events during pregnancy are suspected to play a role in childhood obesity, but the mechanisms involved are not known.
  - Children born to women who had a pregnancy affected by gestational diabetes mellitus are more likely to be overweight and obese.
  - Evidence suggests that epigenetic factors are an important piece of the puzzle.
- Motivating Question #2 [Figure: sample size (25 or 50) against genomic data collected (gene expression only, or DNA methylation and gene expression), ending with a question mark]
- The Data
  - At birth, placenta (n = 45): X
    - Expression: HT-12 v4, p = 46,889
    - Methylation: Illumina 450k, p = 375,561
    - Gestational diabetes: n = 45, GD = 29
  - At age 5, child (n = 23, GD = 16): Y
    - 7 fat measures
  - Open question marked "?" on the slide: the link from X to Y
- Summarizing Expression, Methylation and Gestational Diabetes Phenotype in Placenta Tissue
- Sparse Canonical Correlation Analysis (sCCA)
  - CCA requires calculation of $(X^T X)^{-1}$ and $(Y^T Y)^{-1}$.
  - When $p + q \gg n$, these matrices are singular.
  - sCCA applies an L1 penalty to the canonical vectors to obtain sparse solutions (Witten et al., 2009; Parkhomenko et al., 2009).
  - Assumes $X^T X = I_p$ and $Y^T Y = I_q$:

  $$\underset{u,\,v}{\text{maximize}} \;\; u^T X^T Y v \quad \text{subject to} \quad \|u\|_2^2 \le 1,\;\; \|v\|_2^2 \le 1,\;\; P_1(u) \le \lambda_1,\;\; P_2(v) \le \lambda_2$$
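The optimization above can be sketched for a single pair of canonical vectors by alternating soft-thresholded updates of u and v, in the spirit of the penalized matrix decomposition of Witten et al. (2009). This is a minimal sketch, not the implementation used in the talk: the function names and the `frac_u` / `frac_v` parameters are hypothetical, and thresholding at a fixed fraction of the largest entry is a simplification that stands in for searching for the threshold that satisfies the L1 constraints $P_1(u) \le \lambda_1$ and $P_2(v) \le \lambda_2$.

```python
import numpy as np


def soft_threshold(a, delta):
    """Elementwise soft-thresholding operator S(a, delta)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)


def sparse_direction(a, frac):
    """Soft-threshold at frac * max|a| (an assumed simplification),
    then rescale to unit L2 norm."""
    s = soft_threshold(a, frac * np.max(np.abs(a)))
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else s


def sparse_cca_rank1(X, Y, frac_u=0.5, frac_v=0.5, n_iter=100, tol=1e-8):
    """First pair of sparse canonical vectors (u, v) for X (n x p), Y (n x q).

    Treats X^T X and Y^T Y as identity, as on the slide, so only the
    cross-product matrix X^T Y is needed.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    K = X.T @ Y  # p x q; only practical for moderate p and q

    # Start from the leading right singular vector of K.
    v = np.linalg.svd(K, full_matrices=False)[2][0]
    u = np.zeros(K.shape[0])

    for _ in range(n_iter):
        u_new = sparse_direction(K @ v, frac_u)
        v_new = sparse_direction(K.T @ u_new, frac_v)
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v
```

Larger values of `frac_u` and `frac_v` zero out more entries, which plays the role of tightening $\lambda_1$ and $\lambda_2$. For matrices as wide as those in this study, forming `K` explicitly would be too large, so a practical version would compute `X.T @ (Y @ v)` and `Y.T @ (X @ u)` instead.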
- Supervised Sparse CCA. Main idea:
  1. The features that are most associated with the outcome Q are identified to form the reduced matrices X and Y.
  2. sCCA is performed on X and Y.
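Step 1 can be illustrated with a simple screening rule. The sketch below assumes a binary outcome Q (for example gestational diabetes status) and ranks features by the absolute Welch t-statistic; the slides do not say which association measure was used, so the statistic, the function names and the `keep` parameter are all assumptions made for illustration.

```python
import numpy as np


def welch_t(M, labels):
    """Welch t-statistic of every column of M (n x p) between labels 1 and 0."""
    a, b = M[labels == 1], M[labels == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se


def screen_features(M, labels, keep):
    """Indices (sorted) of the `keep` columns most associated with the outcome."""
    strength = np.abs(welch_t(M, labels))
    top = np.argsort(strength)[::-1][:keep]
    return np.sort(top)
```

The reduced matrices for step 2 would then be `X[:, screen_features(X, Q, keep_x)]` and `Y[:, screen_features(Y, Q, keep_y)]`, with sCCA run on those.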
- Importance of Gestational Diabetes Phenotype [Figure: two panels, "Gestational Diabetes Status Used in Sparse CCA" and "Gestational Diabetes Status Not Used"; axes: # non-0 expression probes and # non-0 methylation probes; scale: correlation, 0.88 to 0.98]
- GO Stat Analysis for Enrichment
  - Enrichment analysis based on the non-zero vector of the 1st component from the supervised sCCA analysis
  - Genes associated with inflammatory processes

  Table: Top list of enriched GO terms

  | GOBPID | FDR | OR | E.Count | Count | Size | Term |
  |---|---|---|---|---|---|---|
  | 0002376 | < 10^-14 | 2.1 | 131.6 | 227 | 2178 | immune system process |
  | 0006955 | < 10^-13 | 2.3 | 78.7 | 153 | 1303 | immune response |
  | 0002252 | < 10^-9 | 2.7 | 34.1 | 80 | 565 | immune effector process |
  | 0045087 | < 10^-8 | 2.3 | 49.0 | 99 | 811 | innate immune response |
  | 0002682 | < 10^-8 | 2.1 | 66.56 | 122 | 1102 | regulation of immune system process |
  | 0002684 | < 10^-8 | 2.4 | 40.1 | 84 | 664 | positive regulation of immune system process |
  | 0006952 | < 10^-8 | 1.9 | 84.5 | 144 | 1399 | defense response |
  | 0050776 | < 10^-8 | 2.3 | 44.5 | 90 | 738 | regulation of immune response |
  | 0050778 | < 10^-7 | 2.6 | 28.5 | 65 | 473 | positive regulation of immune response |
  | 0006950 | < 10^-7 | 1.6 | 196.8 | 271 | 3258 | response to stress |
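The columns of the table can be read through a 2x2 over-representation calculation. The sketch below assumes a plain hypergeometric test; the slides do not state which test or gene universe was used, and the function name and the numbers in the usage note are hypothetical. One check that does come from the table itself: E.Count divided by Size is about 0.06 in every row, which is what `expected = selected * category_size / universe` implies when the ratio of selected genes to universe is fixed.

```python
from math import comb


def enrichment(universe, category_size, selected, overlap):
    """Expected count, odds ratio and upper-tail hypergeometric p-value.

    universe:      number of genes considered
    category_size: genes annotated to the GO term ("Size")
    selected:      genes in the list being tested
    overlap:       selected genes annotated to the term ("Count")
    """
    expected = selected * category_size / universe  # "E.Count"

    # 2x2 table: selected or not, against in category or not.
    a = overlap
    b = selected - overlap
    c = category_size - overlap
    d = universe - category_size - selected + overlap
    odds_ratio = (a * d) / (b * c)  # undefined if b or c is zero

    # P(overlap or more) when drawing `selected` genes without replacement.
    upper = min(category_size, selected)
    tail = sum(
        comb(category_size, i) * comb(universe - category_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    p_value = tail / comb(universe, selected)
    return expected, odds_ratio, p_value
```

For instance, `enrichment(1000, 100, 50, 12)` describes a made-up term of 100 genes in a universe of 1000, with 12 of 50 selected genes annotated to it, against an expected count of 5.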
- Summarizing Bodyfat Measures
- Cluster 6 Bodyfat measures in 2 groups [Figure: clustered heatmap of 23 children; measures: Zscore BMI, percent fat, subscapularis, bicep, tricep, iliacus; color key: value from −2 to 2]
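Grouping children into two clusters from their bodyfat measures can be sketched with Ward's agglomerative criterion, which a later slide names as the clustering method. This is a naive illustration under assumptions: the measures are standardized per column first (suggested by the −2 to 2 color key, but not stated), and the function name is hypothetical.

```python
import numpy as np


def ward_groups(M, n_groups=2):
    """Cut a Ward agglomerative clustering of the rows of M into n_groups.

    At every step the two clusters whose merge gives the smallest increase
    in within-cluster sum of squares are joined.
    """
    Z = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)  # assumed z-scoring
    clusters = [[i] for i in range(len(Z))]

    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                A, B = Z[clusters[a]], Z[clusters[b]]
                gap = A.mean(axis=0) - B.mean(axis=0)
                cost = len(A) * len(B) / (len(A) + len(B)) * np.sum(gap ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = np.empty(len(Z), dtype=int)
    for group, members in enumerate(clusters):
        labels[members] = group
    return labels
```

The triple loop is fine for a few dozen children; a library routine would be the better choice for larger samples.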
- Circle of Correlations [Figure: variables factor map (PCA); Dim 1 (50.68%), Dim 2 (15.41%); variables: Zscore BMI, percent fat, tricep, bicep, subscapularis, iliacus]
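The quantities behind a circle of correlations can be sketched as follows: a PCA of the standardized measures, the share of variance carried by each dimension, and the correlation of each original variable with each component, which gives the variable coordinates on the map. The function name is hypothetical, and standardizing each measure before the PCA is an assumption.

```python
import numpy as np


def pca_summary(M):
    """PCA of the standardized columns of M (n x p).

    Returns
    scores:     n x k component scores (column 0 is the 1st PC)
    explained:  share of total variance carried by each component
    var_coords: p x k correlations between variables and components
    """
    n = M.shape[0]
    Z = (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)

    eigvals = s ** 2 / (n - 1)            # eigenvalues of the correlation matrix
    explained = eigvals / eigvals.sum()
    scores = U * s                        # same as Z @ Vt.T
    var_coords = Vt.T * np.sqrt(eigvals)  # corr(variable j, component k)
    return scores, explained, var_coords
```

The first two columns of `var_coords` are the points plotted inside the unit circle, and `scores[:, 0]` is the kind of single summary used when the 1st PC stands in for all bodyfat measures.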
- Combining Both Data
- Regression via Elastic Net [Figure repeated from "The Data": placenta measurements at birth (X; expression, methylation, gestational diabetes; n = 45, GD = 29) and the child's 7 fat measures at age 5 (Y; n = 23, GD = 16), joined by a question mark]
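An elastic net fit evaluated by leave-one-out cross-validation (LOOCV), the error measure reported on the following slides, can be sketched with plain coordinate descent. This is an illustration under assumptions rather than the software used in the talk: the objective is taken to be $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\left[\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right]$, and the function names and default values are hypothetical.

```python
import numpy as np


def elastic_net(X, y, lam, alpha=0.5, n_iter=200, tol=1e-8):
    """Coordinate descent for the elastic net; returns (intercept, beta)."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    resid = yc.copy()  # always equals yc - Xc @ beta

    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Covariance of feature j with the partial residual.
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            shrunk = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0)
            new = shrunk / (col_sq[j] + lam * (1 - alpha))
            if new != beta[j]:
                resid -= Xc[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return y_mean - x_mean @ beta, beta


def loocv_mse(X, y, lam, alpha=0.5):
    """Leave-one-out mean squared prediction error."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        b0, b = elastic_net(X[keep], y[keep], lam, alpha)
        errors.append((y[i] - (b0 + X[i] @ b)) ** 2)
    return float(np.mean(errors))
```

With only 23 children having outcome data, LOOCV refits the model 23 times, which stays cheap even for this naive implementation as long as the feature set has been reduced beforehand.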
- 1st PC as Summary Bodyfat Measure [Figure: LOOCV mean squared error (axis 1 to 4) by data used to predict the 1st PC of bodyfat measures. Legend (data.type): Canonical Variables; Expr+Methy non 0 CCA factors; Expr non 0 CCA factors; Methy non 0 CCA factors; Expr+Methy Filter; Expr Filter low means; Methy Filter low var; Expr+Methy Filter low+t.test; Expr Filter low+t.test; Methy Filter low+t.test; Expr+Methy Filter t.test; Expr Filter t.test; Methy Filter t.test. Number pairs printed on the plot, as extracted: (3, 8), (32, 14294), (102, 1443), (187, 12853), (124, 375563), (36, 37513), (81, 338052), (197, 75115), (30, 7505), (188, 67612), (196, 84503), (37, 9380), (202, 75125)]
- Ward Clustering Groups [Figure: LOOCV misclassification error (axis 0.0 to 0.5) by data used to predict Ward clustering groups. Legend (data.type): Canonical Variables; Expr+Methy non 0 CCA factors; Expr non 0 CCA factors; Methy non 0 CCA factors; Expr+Methy Filter; Expr Filter low means; Methy Filter low var; Expr+Methy Filter low+t.test; Expr Filter low+t.test; Methy Filter low+t.test; Expr+Methy Filter t.test; Expr Filter t.test; Methy Filter t.test. Number pairs printed on the plot, as extracted: (1, 8), (22, 14294), (1, 1443), (20, 12853), (331, 375563), (1, 37513), (54, 338052), (6, 75115), (1, 7505), (6, 67612), (7, 84503), (1, 9380), (30, 75125)]
- Neuropeptide Y Receptor (NPY1R). From OMIM:
  - Neuropeptide Y is one of the most abundant neuropeptides in the mammalian nervous system.
  - It exhibits a diverse range of important physiologic activities, including effects on food intake.
  - Its receptors have been identified in a variety of tissues, including placenta (Herzog et al., 1992).
- Motivating Question #2: My Answer [Figure: sample size (25 or 50) against genomic data collected (gene expression only, or DNA methylation and gene expression)]
- Big Data, Data Integration, Machine Learning
- Small n Data
- Acknowledgements
  - Celia Greenwood and Mathieu Blanchette
  - Greg Voisin, Andrée-Anne Houde, Luigi Bouchard
  - All the mothers and children that took part in this study
  - You
- References
  - Principal component analysis plots and beamer template. URL http://gastonsanchez.com/.
  - Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correlation analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1–34, 2009.
  - Daniela M Witten and Robert J Tibshirani. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1):1–27, 2009.
  - Daniela M Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, page kxp008, 2009.