
Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated with High-Density Lipoprotein Cholesterol in Two Asian Cohorts


Summary of paper "Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated with High-Density Lipoprotein Cholesterol in Two Asian Cohorts",
Silver M, Chen P, Li R, Cheng C-Y, Wong T-Y, et al.
In PLOS Genetics, 2013

Published in: Data & Analytics


  1. Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated with High-Density Lipoprotein Cholesterol in Two Asian Cohorts
     Silver M, Chen P, Li R, Cheng C-Y, Wong T-Y, et al. In PLOS Genetics, 2013
  2. Introduction
     • Genes do not act in isolation, but interact in complex networks or pathways
     • Rather than univariate approaches, a joint modelling approach is proposed: a dual-level, sparse regression model
       • can simultaneously identify pathways and genes for pathway selection
     • Pathways-driven gene selection in a search for pathways and genes associated with variation in high-density lipoprotein cholesterol
  3. Sparse group lasso model
     • N individuals, P SNPs, (N x P) genotype matrix X, L pathways
     • Assumptions
       • All P SNPs may be mapped to L groups or pathways
       • Pathways are disjoint or non-overlapping
     • Slide figure and equation labels: causal SNPs, causal pathways, pathway level constraint, SNP level constraint
     • 𝛼 controls how the sparsity constraint is distributed between the two penalties
     • 𝜆 controls the degree of sparsity in 𝛽
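The slide's equation did not survive extraction. As a minimal sketch, the two penalties can be written out under the commonly used sparse group lasso parameterisation, in which (1 - 𝛼)𝜆 scales the pathway-level (group) penalty and 𝛼𝜆 scales the SNP-level (lasso) penalty. That assignment, the function name `sgl_objective`, and the optional pathway `weights` argument are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

def sgl_objective(X, y, beta, groups, lam, alpha, weights=None):
    """Sparse group lasso objective (assumed standard form).

    groups  : list of index arrays, one per pathway (disjoint here)
    lam     : overall degree of sparsity
    alpha   : split between SNP-level (alpha) and pathway-level (1 - alpha) penalty
    weights : optional per-pathway weights (hypothetical argument, defaults to 1)
    """
    if weights is None:
        weights = np.ones(len(groups))
    residual = y - X @ beta
    loss = 0.5 * residual @ residual
    # Pathway level constraint: weighted sum of group-wise Euclidean norms.
    group_penalty = sum(w * np.linalg.norm(beta[idx])
                        for w, idx in zip(weights, groups))
    # SNP level constraint: ordinary lasso penalty on all coefficients.
    snp_penalty = np.abs(beta).sum()
    return loss + (1 - alpha) * lam * group_penalty + alpha * lam * snp_penalty
```

Under this assumed form, 𝛼 = 1 reduces to the lasso and 𝛼 = 0 to the group lasso.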
  4. SGL model estimation
     • To estimate 𝛽^SGL, use a block, or group-wise, coordinate gradient descent (BCGD) algorithm
       • Select a pathway 𝑙
       • Select SNP 𝑗 in selected pathway 𝑙
     • Pathway, SNP partial residuals
       • Regress out the current estimated effects of all other pathways and SNPs
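The pathway-level step can be sketched as follows. It computes the pathway partial residual and applies the usual sparse group lasso check for whether a pathway enters the model. The check assumes the standard parameterisation in which 𝛼𝜆 is the SNP-level threshold and (1 - 𝛼)𝜆 the pathway-level threshold. All function names are hypothetical, and this illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pathway_partial_residual(X, y, beta, idx):
    """Regress out the current estimated effects of all SNPs outside pathway idx."""
    others = np.ones(X.shape[1], dtype=bool)
    others[idx] = False
    return y - X[:, others] @ beta[others]

def pathway_is_selected(X, y, beta, idx, lam, alpha, weight=1.0):
    """Assumed pathway selection check: the soft-thresholded correlation of the
    pathway's SNPs with the partial residual must exceed the pathway threshold."""
    r = pathway_partial_residual(X, y, beta, idx)
    score = np.linalg.norm(soft_threshold(X[:, idx].T @ r, alpha * lam))
    return bool(score > (1 - alpha) * lam * weight)
```

Only pathways that pass this check need the inner loop over SNPs 𝑗. For all other pathways the whole coefficient block stays at zero.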
  5. SGL simulation study 1
     • Hypothesis
       • causal SNPs are enriched in a given pathway
       • pathway-driven SNP selection using SGL will outperform simple lasso selection
     • Randomly select 5 causal SNPs, either from a single pathway or from all 2500 SNPs (without pathway information)
  6. The problem of overlapping pathways
     • Genes and SNPs may map to multiple pathways
       • The optimization is no longer separable into groups (pathways)
       • Pathways can no longer be selected independently
     • By duplicating SNP predictors, SNPs belonging to more than one pathway can enter the model separately
     • SNPs are selected in each pathway whose joint SNP effects pass a pathway selection threshold, irrespective of overlaps between pathways
     • Pathways are treated as independent
       • they do not compete in the model estimation process
     • Slide figure label: partially overlapping causal SNPs
  7. The problem of overlapping pathways
     • Each pathway is regressed against the phenotype vector y
     • Only coordinate gradient descent within the selected pathway is required (SGL-CGD)
     • Under the independence assumption, the estimation of each 𝛽_𝑙* does not depend on the other estimates 𝛽_𝑘*
     • Need only record the set of selected SNPs in each selected pathway
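The SNP duplication step can be illustrated with a short sketch that expands the genotype matrix so that every pathway receives its own copy of each SNP it contains. The function name `expand_design` and its return values are hypothetical choices for this example.

```python
import numpy as np

def expand_design(X, pathways):
    """Duplicate SNP columns so that overlapping pathways become disjoint groups.

    pathways : list of lists of column indices into X (overlaps allowed)
    Returns the expanded matrix, the disjoint group index arrays into the
    expanded matrix, and the original column index of each expanded column.
    """
    cols = np.concatenate([np.asarray(p, dtype=int) for p in pathways])
    starts = np.cumsum([0] + [len(p) for p in pathways[:-1]])
    groups = [np.arange(s, s + len(p)) for s, p in zip(starts, pathways)]
    return X[:, cols], groups, cols
```

With the expanded matrix, each pathway block `X_expanded[:, groups[l]]` can be regressed against y on its own, and `cols` maps selected columns back to the original SNPs.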
  8. SGL simulation study 2
     Figure 5. SGL Simulation Study with overlapping pathways
     Table 1. Mean number of pathways and SNPs selected by each model at each effect size, γ, across 2000 MC simulations
     • SNPs are mapped to 50 overlapping pathways, each containing 30 SNPs
     • Each pathway overlaps any adjacent pathway by 10 SNPs
     • The number of selected pathways or SNPs increases with decreasing effect size, as the number of pathways close to the selection threshold set
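One way to build a pathway layout matching this description is sketched below. It assumes that SNPs are arranged linearly and that each pathway starts a fixed number of SNPs after the previous one. The slides do not state how the overlaps are laid out, so this arrangement and the function name are assumptions.

```python
def overlapping_pathways(n_pathways=50, size=30, overlap=10):
    """Return a list of SNP index lists, one per pathway.

    Each pathway holds `size` consecutive SNPs and shares `overlap` SNPs
    with the next pathway (assumed linear layout).
    """
    stride = size - overlap
    return [list(range(l * stride, l * stride + size)) for l in range(n_pathways)]
```

Under this assumed layout, only neighbouring pathways share SNPs, and the total number of distinct SNPs follows from the layout rather than from the slides.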
  9. SGL simulation study 2
     • Pathway and SNP selection power and false positive rates (FPR) at MC simulation z
     • SGL-CGD consistently outperforms SGL, both in terms of pathway selection sensitivity and control of false positives
     • SGL-BCGD typically has a higher FPR than SGL-CGD, since more SNPs are selected from non-causal pathways
     • SGL-CGD is more often able to select both causal pathways, and to select additional causal SNPs that are missed by SGL
     Figure 6. SGL-CGD vs SGL-BCGD performance
  10. Pathway and SNP selection bias
     • Biasing factors
       • pathway size, varying patterns of SNP-SNP correlations, and gene sizes
     • An adaptive weight-tuning strategy to reduce selection bias
       • tunes the pathway weight vector 𝑤 so that each pathway has an equal chance of being selected
  11. Ranking variables
     • A resampling strategy
       • calculate pathway, gene and SNP selection frequencies by repeatedly fitting the model over B subsamples of the data, at fixed values for 𝛼 and 𝜆
       • exploit knowledge of finite sample variability obtained by subsampling, to achieve better estimates of a variable's importance
       • can rank pathways, genes and SNPs in order of their strength of association with the phenotype
     • Pathways or SNPs and genes are ranked in order of their selection probabilities
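The resampling strategy can be sketched generically. The example below estimates selection frequencies by fitting an arbitrary selection routine on B random subsamples. The `select` callable stands in for the SGL fit at fixed 𝛼 and 𝜆. The subsample fraction of one half and all names are assumptions made for illustration.

```python
import numpy as np

def selection_frequencies(X, y, select, n_subsamples=100, fraction=0.5, seed=0):
    """Estimate per-variable selection frequencies over random subsamples.

    select       : callable (X_sub, y_sub) -> iterable of selected column indices
                   (hypothetical placeholder for a model fit at fixed alpha, lambda)
    n_subsamples : number of subsamples B
    fraction     : share of individuals drawn without replacement (assumed 0.5)
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(1, int(fraction * n))
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        rows = rng.choice(n, size=m, replace=False)
        selected = np.asarray(list(select(X[rows], y[rows])), dtype=int)
        counts[selected] += 1  # each variable counts at most once per subsample
    return counts / n_subsamples
```

Sorting variables by the returned frequencies, for example with `np.argsort(-freq)`, gives a ranking of the kind described above. Pathway and gene frequencies can be obtained the same way by counting a pathway or gene as selected whenever any of its SNPs is selected.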
  12. Simulation study 3
     • Evaluate ranking strategies
     • Use real genotype and pathways data
       • genome-wide SNP dataset ‘SP2’
       • KEGG pathways database
     • SNP ranking
       • TP: selected SNPs that tag at least one causal SNP
       • FP: selected SNPs which do not tag any causal SNP
     • Gene ranking
       • TP: selected causal genes (genes that map to a true causal SNP)
       • FP: selected non-causal genes
     • Compared with SNP and gene rankings using a univariate, regression-based quantitative trait test (QTT)
     • K: the number of causal SNPs
     • GV, TV: proportion of trait variance
  13. Simulation study 3
     • TPR: the proportion of subsamples in which the correct causal pathway is selected
     Figure 7. A–F: SNP and gene ranking performance for the six different scenarios
  14. Pathway mapping
     • Genes are mapped to pathways using information on gene-gene interactions.
     • Many SNPs and genes do not map to any known pathway.
     • Genes and SNPs may map to more than one pathway.
     • Many SNPs cannot be mapped to a pathway since they do not map to a mapped gene.
     • Available SNPs: 492,639 SNPs (SP2), 515,503 SNPs (SiMES)
     • Genes: GRCH36/hg18, 21,004 genes
     • SNP to gene mapping: 239,757 SNPs (SP2), 251,089 SNPs (SiMES) mapped to 18,845 genes (SP2), 18,919 genes (SiMES) within 10 kbp
     • Pathways: KEGG, 185 pathways containing 5,267 distinct genes
     • SNP to pathway mapping: 75,389 SNPs (SP2), 78,933 SNPs (SiMES) mapped to 4,734 genes (SP2), 4,751 genes (SiMES) and 185 pathways
  15. Results
     • Pathways-driven SNP selection on the SP2 and SiMES datasets separately using SGL
     • Combine this with the subsampling procedure to highlight pathways and genes associated with variation
     • Compare results from both datasets
  16. Pathway and SNP selection results
     • Compare the resulting pathway and SNP selection frequency distributions with null distributions
     • A greater number of selected SNPs contributes to an increase in the number of selected pathways
     • The number of SNPs may affect the resulting pathway and SNP rankings
     • Optimal 𝛼 = ?
     Table 5. Separate combinations of regularisation parameters, 𝜆 and 𝛼, used for analysis of the SP2 dataset.
     • Slide labels: pathway level constraint, SNP level constraint
  17. Pathway and SNP selection results
     Figure 11. Empirical and null pathway selection frequency distributions for all 185 KEGG pathways with the SP2 dataset
     Figure 12. Empirical and null SNP selection frequency distributions with the SP2 dataset
     Figure 14. Empirical and null pathway (top) and SNP (bottom) selection frequency distributions for the SiMES dataset
     • Slide annotations, in extracted order: 𝛼 = 0.85; 𝛼 = 0.95; clearer separation of empirical and null distributions; biased empirical pathway and SNP selection frequency distributions; 𝛼 = 0.95
  18. Pathway and SNP selection results
     Figure 13. SP2 dataset: scatter plots comparing empirical and null selection frequencies presented in Figures 11 and 12
     Figure 15. SiMES dataset: scatter plots comparing empirical and null pathway (left) and SNP (right) selection frequencies presented in Figure 14
  19. Pathway and SNP selection results
     • Increased correlation between empirical and null selection frequency distributions at the lower 𝛼 increases bias in the empirical results
     • The selection of too many SNPs will add noise and bias
     Table 6. SP2 dataset: Pearson correlation coefficients (r) and p-values for the data plotted in Figure 13
     Table 9. SiMES dataset: Pearson correlation coefficients (r) and p-values for the data plotted in Figure 15
  20. Top 30 pathways and genes
     Table 7. SP2 dataset: Top 30 pathways, ranked by pathway selection frequency, 𝜋^path.
     Table 8. SP2 and SiMES datasets: Top 30 genes ranked by gene selection frequency, 𝜋^gene.
  21. Top 30 pathways
     Table 10. SiMES dataset: Top 30 pathways, ranked by pathway selection frequency, 𝜋^path.
  22. Comparison of ranked pathway and gene lists
     • Pathway rankings
       • closest agreement when k = 25
     Figure 16. Comparison of top-k SP2 and SiMES pathway rankings. Normalized Canberra distance (left), FDR q-values (right)
     Table 11. Consensus set of important pathways, Ψ^path_25, for SP2 and SiMES datasets with k = 25.
  23. Comparison of ranked pathway and gene lists
     • Gene rankings
       • closest agreement when k = 244
     Figure 17. Comparison of top-k SP2 and SiMES gene rankings, for k = 1,…,500. Normalized Canberra distance (left), FDR q-values (right)
     Table 13. Top 30 consensus genes ordered by their average rank, 𝜓^gene_244
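To give a feel for how two top-k rankings can be compared, the sketch below computes an unnormalized Canberra distance between two ranked lists. Any item outside a list's top k is assigned rank k + 1. The normalization used for Figures 16 and 17 is not described in the slides and is omitted here. The function name and the rank-capping convention are assumptions.

```python
def topk_canberra(ranking_a, ranking_b, k):
    """Unnormalized Canberra distance between the top-k parts of two rankings.

    ranking_a, ranking_b : sequences of item identifiers, best first
    Items missing from a list's top k are given rank k + 1 (assumed convention).
    """
    def top_ranks(ranking):
        return {item: pos for pos, item in enumerate(ranking[:k], start=1)}

    ranks_a, ranks_b = top_ranks(ranking_a), top_ranks(ranking_b)
    total = 0.0
    for item in set(ranks_a) | set(ranks_b):
        a = ranks_a.get(item, k + 1)
        b = ranks_b.get(item, k + 1)
        total += abs(a - b) / (a + b)
    return total
```

Smaller values indicate closer agreement. Evaluating the distance over a range of k, after a suitable normalization, is one way to look for the k at which two rankings agree most closely.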
  24. Discussion
     • A method for the detection of pathways and genes associated with a quantitative trait
       • uses a sparse regression model, the sparse group lasso, that enforces sparsity at the pathway and SNP level
       • identifies important pathways and also maximizes the power to detect causal SNPs
     • Simulation studies
       • SGL has greater SNP selection power than lasso
       • a modified SGL-CGD estimation algorithm that treats pathways as independent may offer greater sensitivity for the detection of causal SNPs and pathways
       • combines with a weight-tuning algorithm to reduce selection bias
       • a resampling technique is designed to provide a robust measure of variable importance
  25. Thank you
     Q & A
