Partitioning Heritability using GWAS Summary Statistics with LD Score Regression
1. Partitioning heritability by functional
annotation using summary statistics
Hilary Finucane
MIT Department of Mathematics
HSPH Department of Epidemiology
October 21, 2014
2. Acknowledgements
• Brendan Bulik-Sullivan
• Alkes Price
• Ben Neale
• Alexander Gusev
• Nick Patterson
• Po-Ru Loh
• Gosia Trynka
• Han Xu
• Verneri Anttila
• Yakir Reshef
• Chongzhi Zang
• Stephan Ripke
• Schizophrenia Working Group of the PGC
• Shaun Purcell
• Mark Daly
• Eli Stahl
• Soumya Raychaudhuri
• Sara Lindstrom
3. Partitioning heritability by functional
annotation is an important goal
• Learn about genetic architecture of disease
– Where does the heritability lie?
• Learn about disease biology
– What are the relevant cell types?
• Learn about the functional annotations
– Which functional annotations show the highest
enrichments?
• Downstream applications
– Fine mapping
– Risk prediction
– GWAS priors
Maurano et al. 2012 Science
Trynka et al. 2013 Nat Genet
Pickrell 2014 AJHG
4. What is partitioned heritability?
• Our model is $Y = \sum_j X_j \beta_j + \epsilon$, where
• Y is an individual’s phenotype,
• Xj is an individual’s genotype at the j-th SNP
(normalized to mean 0 and variance 1),
• βj is the effect of SNP j, and
• ε is noise and random environmental effects.
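As a rough illustration of the model on this slide, the sketch below simulates a phenotype as a sum of normalized genotypes times per-SNP effects, plus noise. The sample size, SNP count, allele frequency, target heritability, and the absence of LD between SNPs are all assumptions made for the toy example, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2000, 200   # individuals and SNPs (illustrative sizes, not from the talk)
h2 = 0.5           # assumed share of phenotypic variance due to the SNPs

# Assumption: independent SNPs (no LD), allele frequency 0.3, coded 0/1/2.
G = rng.binomial(2, 0.3, size=(N, M)).astype(float)

# X_j: genotype at SNP j, normalized to mean 0 and variance 1.
X = (G - G.mean(axis=0)) / G.std(axis=0)

# beta_j: effect of SNP j, drawn so that the expected sum of beta_j^2 is h2.
beta = rng.normal(0.0, np.sqrt(h2 / M), size=M)

# epsilon: noise and random environmental effects.
eps = rng.normal(0.0, np.sqrt(1.0 - h2), size=N)

# Phenotype: Y = sum_j X_j * beta_j + epsilon.
Y = X @ beta + eps
```

With real genotypes the columns of X are correlated (LD), which is the reason the later slides bring in LD Scores.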
6. What is partitioned heritability?
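• Our model is $Y = \sum_j X_j \beta_j + \epsilon$.
• We define heritability as $h^2 = \sum_j \beta_j^2$
and the heritability of a category $C$ as $h^2(C) = \sum_{j \in C} \beta_j^2$.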
7. Partitioning heritability
using variance components has yielded
many insights
• 31% of schizophrenia SNP-heritability lies in CNS+
gene regions spanning 20% of the genome [1].
• 28% of Tourette syndrome SNP-heritability and
29% of OCD SNP-heritability lie in parietal lobe
eQTLs spanning 5% of the genome [2].
• 79% of SNP-heritability, averaged across WTCCC
and WTCCC2 traits, lies in DHS regions spanning
16% of the genome [3].
[1] Lee et al. 2012 Nat Genet
[2] Davis et al. 2013 PLoS Genet
[3] Gusev et al. in press AJHG
8. A method for partitioning heritability
from summary statistics is needed
• Variance components methods are intractable
at very large sample sizes.
• Large meta-analyses contain a lot of information.
• Lots of publicly available summary statistics
allow us to compare many phenotypes and
many annotations to get a big picture.
10. Our method partitions heritability
from summary statistics
• Input:
– Sample size and p-value for every SNP tested in a
large GWAS of a quantitative or case-control trait
– LD information from a reference panel like 1000G
– Genome annotation of interest
– Other genome annotations to include in the
model.
• Output:
– Estimated proportion of heritability that falls
within the annotation of interest.
– Enrichment = (% of heritability) / (% of SNPs)
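The enrichment ratio defined above is simple arithmetic. The helper below is a minimal sketch with a hypothetical function name; it applies the definition to the figure quoted in the Conclusions slide (conserved regions: 2.6% of SNPs, an estimated 30% of heritability).

```python
def enrichment(prop_h2: float, prop_snps: float) -> float:
    """Enrichment = (proportion of heritability) / (proportion of SNPs)."""
    if not 0.0 < prop_snps <= 1.0:
        raise ValueError("prop_snps must be in (0, 1]")
    return prop_h2 / prop_snps

# Conserved regions, using the numbers from the Conclusions slide.
conserved = enrichment(prop_h2=0.30, prop_snps=0.026)
print(f"Conserved enrichment: {conserved:.1f}x")
```

An enrichment of 1 means a category explains exactly its share of heritability; values above 1 indicate enrichment and values below 1 indicate depletion.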
11. Outline
• Description of method
• Validation on simulated data
• Results on real data
14. LD is important for summary statistics-based
methods
• Some SNPs have a lot of LD
to other SNPs in the same
category.
• Some SNPs have a lot of LD
to SNPs in other categories.
• Some SNPs do not have a lot
of LD to other SNPs.
Our solution: LD Score Regression.
See Bulik-Sullivan et al., bioRxiv (under revision, Nat
Genet) and ASHG 2014 poster 1787T (Bulik-Sullivan).
15. LD Score Regression: basic intuition
• Polygenicity causes more chi-square statistic inflation
in high LD regions than in low LD regions.
[Figure: chi-square statistics in a high LD region and in a low LD region. Mean chi-square is high in the high LD region and low in the low LD region.]
16. Multivariate LD Score Regression: basic
intuition
• Enriched category: high chi-square for SNPs with lots of LD to the category, low chi-square for SNPs with little LD to it. BIG difference between lots of LD vs little LD to the category.
• Depleted category: low chi-square in both cases. SMALL difference between lots of LD vs little LD to the category.
19. Multivariate LD Score regression
allows us to partition SNP heritability
• Multivariate LD Score of a SNP with respect to a
category: the sum, over all SNPs in the category, of
r^2 with that SNP.
• Derivations based on a polygenic model give:
• Easily extends to overlapping categories.
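The equation on this slide did not survive extraction. A form consistent with the LD Score definition above and with the published LD Score regression model is sketched below. The notation is assumed for this note rather than taken from the slide: $N$ is the GWAS sample size, $\ell(j, C)$ is the LD Score of SNP $j$ with respect to category $C$, $\tau_C$ is the per-SNP heritability contributed by category $C$, and $M(C)$ is the number of SNPs in $C$.

```latex
\ell(j, C) = \sum_{k \in C} r^2_{jk},
\qquad
\mathbb{E}\left[\chi^2_j\right] = N \sum_{C} \tau_C \, \ell(j, C) + 1,
\qquad
h^2(C) = M(C) \, \tau_C \quad \text{(disjoint categories)}
```

When categories overlap, a SNP's expected squared effect would be the sum of $\tau_C$ over the categories containing it, and $h^2(C)$ is then obtained by summing that quantity over the SNPs in $C$. The constant 1 can be replaced by a fitted intercept to absorb confounding.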
20. Multivariate LD Score regression
allows us to partition SNP heritability
To estimate partitioned heritability:
• Estimate LD Scores from a reference panel.
• Regress chi-square statistics on LD Scores.
• The slopes give the partitioned heritability.
• For best results, use many categories!
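The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: it uses ordinary least squares where published LD Score regression software uses regression weights and jackknife standard errors, it assumes disjoint categories, and the r^2 matrix is a made-up stand-in for a reference panel. The function names are hypothetical.

```python
import numpy as np

def ld_scores(r2, annot):
    """Multivariate LD Score l(j, C): sum over SNPs k in category C of r^2(j, k).

    r2    : (M, M) array of squared correlations between SNPs
    annot : (M, C) 0/1 array, annot[k, c] = 1 if SNP k is in category c
    """
    return r2 @ annot

def partition_h2(chisq, ld, n, snps_per_cat):
    """Ordinary least squares of chi-square statistics on LD Scores.

    Assumes E[chi2_j] = intercept + n * sum_C tau_C * l(j, C) and disjoint
    categories, so that h2(C) = tau_C * (number of SNPs in C).
    """
    design = np.column_stack([np.ones(len(chisq)), ld])
    coef, *_ = np.linalg.lstsq(design, chisq, rcond=None)
    tau = coef[1:] / n            # per-SNP heritability in each category
    return tau * snps_per_cat

# Toy check: build noise-free chi-square statistics from known tau values
# and confirm that the regression recovers the partitioned heritability.
rng = np.random.default_rng(1)
m, n = 50, 1000                   # SNPs and GWAS sample size (illustrative)
a = rng.uniform(0.0, 0.2, size=(m, m))
r2 = (a + a.T) / 2                # symmetric made-up r^2 values
np.fill_diagonal(r2, 1.0)         # each SNP has r^2 = 1 with itself

annot = np.zeros((m, 2))
annot[:20, 0] = 1                 # hypothetical category 0: 20 SNPs
annot[20:, 1] = 1                 # hypothetical category 1: 30 SNPs

ld = ld_scores(r2, annot)
tau_true = np.array([0.004, 0.001])
chisq = 1.0 + n * ld @ tau_true   # expected chi-square, no sampling noise
h2_hat = partition_h2(chisq, ld, n, annot.sum(axis=0))
```

In this noise-free toy case `h2_hat` matches `tau_true` times the category sizes (0.08 and 0.03) up to floating-point error. With real summary statistics the estimates are noisy, which is why standard errors matter in the simulations that follow.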
21. Outline
• Description of method
• Validation on simulated data
• Results on real data
24. Multivariate LD Score regression works
in simulations
Scenario | True h2(DHS) | REML (2 cat) | LD Score | LD Score categories
Null simulations | 0.092 | 0.089 (0.006) | 0.086 (0.012) | 27
DHS 3x enriched | 0.276 | 0.281 (0.006) | 0.278 (0.013) | 27
FANTOM5 Enhancer* causal | 0.379 | 0.531 (0.007) | 0.361 (0.015) | 27
FANTOM5 Enhancer* causal, excluded from the model | 0.379 | 0.531 (0.007) | 0.318 (0.014) | 26
• Standard errors are over 100 simulations.
• Simulated quantitative phenotype with h2 = 0.5.
• M = 110,444, N = 2,713
* Andersson et al. 2014 Nature
25. Outline
• Description of method
• Validation on simulated data
• Results on real data
26. Datasets analyzed
Phenotype | Citation | Sample size
Schizophrenia | SCZ working grp of the PGC, 2014 Nature | 70,100
Bipolar Disorder | Bip working grp of the PGC, 2011 Nat Genet | 16,731
Rheumatoid Arthritis* | Okada et al., 2014 Nature | 38,242
Crohn’s Disease* | Jostins et al., 2012 Nature | 20,883
Ulcerative Colitis* | Jostins et al., 2012 Nature | 27,432
Height | Wood et al., 2014 Nat Genet | 253,280
BMI | Speliotes et al., 2010 Nat Genet | 123,865
Coronary Artery Disease | Schunkert et al., 2011 Nat Genet | 86,995
College (yes/no) | Rietveld et al., 2013 Science | 126,559
Type 2 Diabetes | Morris et al., 2012 Nat Genet | 69,033
*HLA locus excluded from all analyses for autoimmune traits
27. Annotations used
Mark | Source/reference
Coding, 3’ UTR, 5’ UTR, Promoter, Intron | UCSC; Gusev et al., in press AJHG
Digital Genomic Footprint, TFBS | ENCODE; Gusev et al., in press AJHG
CTCF binding site, Promoter Flanking, Repressed, Transcribed, TSS, Enhancer, Weak Enhancer | ENCODE; Hoffman et al., 2012 Nucleic Acids Research
DHS, fetal DHS, H3K4me1, H3K4me3, H3K9ac | Trynka et al., 2013 Nat Genet*
Conserved | Lindblad-Toh et al., 2011 Nature
FANTOM5 Enhancer | Andersson et al., 2014 Nature
lincRNAs | Cabili et al., 2011 Genes Dev
DHS and DHS promoter | Maurano et al., 2012 Science
H3K27ac | Roadmap; PGC2 2014 Nature
*Post-processed from ENCODE and Roadmap data by S. Raychaudhuri and X. Liu labs
28. Coding, Intergenic, Enhancer, H3K4me3, and DHS
enrichments in six phenotypes
(Bars indicate 95% confidence intervals)
29. Coding, Intergenic, Enhancer, H3K4me3, DHS, and
Conserved enrichments in six phenotypes
(Bars indicate 95% confidence intervals)
*Lindblad-Toh et al., 2011 Nature
30. Coding, Intergenic, Enhancer, H3K4me3, DHS, and
FANTOM5 Enhancer enrichments in six phenotypes
(Bars indicate 95% confidence intervals)
*Andersson et al., 2014 Nature
31. Cell-type specific H3K27ac enrichments
inform trait biology
• We group 56 cell types into 7 basic categories.
• For each trait (10 traits)
– For each category (7 categories)
• We assess the significance of improvement to
the model from adding that category.
33. Conclusions
• Many annotations are enriched in many
phenotypes.
• Conserved regions, 2.6% of SNPs, are
estimated to explain 30% of heritability on
average.
• FANTOM5 Enhancers are extremely enriched
in autoimmune traits.
• H3K27ac cell-type enrichment matches and
extends our understanding of disease biology.
Editor's Notes
For a GWAS of a common complex trait, most of the heritability, and so most of the information, lies in the majority of SNPs that do not reach statistical significance. Partitioning heritability is a way to leverage all of the SNPs, instead of just the statistically significant SNPs, to answer questions about genetic architecture, about the biology of disease, and about functional annotations.
Note: this extends to case-control traits under a liability threshold model.
Note: equivalent to other definitions under certain assumptions.
Partitioning heritability is traditionally done with a variance components method such as REML as implemented in GCTA, and this approach has yielded many insights. I’d like to highlight the recent result of Gusev et al. that non-coding DHS regions comprising 16% of the genome explain an estimated 79% of heritability on average across 11 traits.
We need a method for partitioning heritability from summary statistics for two reasons. First, many of our largest datasets are meta-analyses for which no one has the genotype data required for a variance components approach. Second, even when we do have all of the genotypes, variance components methods are intractable, especially for more than a very few components. As an added benefit of the computational ease, we can analyze many phenotypes and many annotations and look for higher-level patterns.