Jie Zheng at #ICG12: PhenoSpD: an atlas of phenotypic correlations and a multiple testing correction for the human phenome

PhenoSpD: an integrated toolkit for phenotypic
correlation estimation and multiple testing
correction using GWAS summary statistics
Jie Zheng
The 12th International Conference on Genomics
27th Oct 2017

An invitation from GigaScience and BGI

Phenome-wide association study (PheWAS)
• PheWAS tests many phenotypes for association with one or more genetic variants.
• PheWAS is now commonplace, e.g.
• MR-PheWAS. Millard et al, Sci Rep, 2015
• Haycock et al, JAMA Oncology, 2017
It is likely that longer telomeres increase risk for several cancers but reduce risk for some non-neoplastic diseases, including cardiovascular diseases.

Post-GWAS era: a database of harmonized GWAS summary data at the MRC Integrative Epidemiology Unit in Bristol

The network of post GWAS analysis software
[Diagram nodes: Centralized Database, PhenoSpD, MR-Base, LD Hub]

LD Hub for LD Score Regression
Univariate analysis: SNP heritability
[Plot: chi-square statistic against LD score]
Bivariate analysis: genetic correlations

Scope of LD Hub
LD Hub
• Database: 233 publicly available GWAS traits
• Test Center: on-the-fly LD score regression analysis pipeline
• Lookup Center: lookup of existing LD score regression results
• GWAShare Center: summary data sharing and user contribution

MR-Base for Mendelian randomization
[Diagram nodes: SNPs, Trait 1, Confounders, Trait 2]
Trait 1 = risk factor (exposure)
Trait 2 = disease (outcome)

Two sample Mendelian randomization

Scope of MR-Base
MR-Base
• SNP lookups
• 12 two-sample MR methodologies
• MR-Base R package
• Database: ~2000 GWAS (1100 with full data)

PhenoSpD: why do we need it?
• Molecular phenotypes such as metabolites are highly correlated.
• This makes multiple testing correction difficult: Bonferroni correction is overly conservative for correlated traits.
• When individual-level phenotype data are available, the phenotypic correlation matrix can be calculated easily.
• In practice, however, individual-level phenotype data are usually not available.
• In MR-Base / LD Hub, we only have GWAS summary statistics.
• We need a way to correct for multiple testing using summary statistics alone.
Wurtz et al, J Am Coll Cardiol. 2013

PhenoSpD: how it works
1. Harmonize GWAS summary statistics
2. Estimate phenotypic correlation matrix
using metaCCA / LD score regression
3. Apply Spectral decomposition (SpD) to
estimate the equivalent number of
independent variables in the
phenotypic correlation matrix
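
The slide does not spell out what harmonization involves. One common ingredient, assumed here for illustration, is making sure every GWAS reports its beta for the same effect allele. The sketch below shows that single step; the function name and the example alleles are hypothetical, and strand flips and palindromic SNPs are deliberately ignored.

```python
def harmonise_beta(ref_effect, ref_other, effect, other, beta):
    """Express `beta` relative to the reference effect allele.

    Returns the beta unchanged if the alleles already match the reference,
    the negated beta if effect and other alleles are swapped, and None if
    the allele pair does not match the reference at all.
    (Assumption: both studies report alleles on the same strand.)
    """
    if (effect, other) == (ref_effect, ref_other):
        return beta
    if (effect, other) == (ref_other, ref_effect):
        return -beta  # same SNP, opposite coding: flip the sign
    return None  # incompatible alleles: drop this SNP


# Hypothetical SNP: the reference GWAS codes A as the effect allele,
# the second GWAS codes G as the effect allele.
aligned = harmonise_beta("A", "G", "G", "A", 0.2)
dropped = harmonise_beta("A", "G", "C", "T", 0.2)
print(aligned, dropped)
```

SNPs that come back as None would be excluded before the correlation step, so that both traits contribute betas for exactly the same set of SNPs.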

MetaCCA
• Summary statistics-based multivariate association testing using canonical correlation analysis (Cichonska et al, Bioinformatics 2016)
• As a by-product, it provides a way to estimate the phenotypic correlation matrix Σ_YY: each entry equals the Pearson correlation between the regression coefficients (betas) of two GWASs
• This assumes that both traits were measured in the same samples
• Note: 1000 Genomes is not the best reference for estimating the LD matrix between SNPs. See Benner et al, AJHG 2017, and LDstore
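
The correlation-of-betas idea can be sketched in a few lines. This is a minimal illustration, assuming the betas for each trait are already harmonized and listed for the same SNPs in the same order; the function names and the example betas are hypothetical, and this is not metaCCA's own interface.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def phenotypic_correlation_matrix(betas_by_trait):
    """Matrix whose (i, j) entry is the Pearson r between the betas of traits i and j."""
    traits = list(betas_by_trait)
    return [
        [1.0 if a == b else pearson(betas_by_trait[a], betas_by_trait[b]) for b in traits]
        for a in traits
    ]


# Hypothetical betas for five SNPs in two GWASs run on the same samples.
betas = {
    "trait1": [0.10, -0.05, 0.02, 0.00, -0.08],
    "trait2": [0.09, -0.04, 0.01, 0.01, -0.07],
}
matrix = phenotypic_correlation_matrix(betas)
for row in matrix:
    print(row)
```

In practice the correlation would be taken over a very large number of SNPs, most of them not associated with either trait, rather than the handful shown here.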

LD score regression
• Method to estimate SNP heritability and genetic correlations (Bulik-Sullivan et al, Nat Genet 2014, 2015)
• It also provides a way to estimate the phenotypic correlation between two traits: the intercept term of the bivariate LD score regression.
• Compared with metaCCA, it accounts for sample overlap automatically
• Both genetic and phenotypic correlation matrices can be found in LD Hub
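
For orientation, the bivariate LD score regression equation from Bulik-Sullivan et al (2015) can be restated as below; it is not shown on the slide. Here z_{1j} and z_{2j} are the z-scores of SNP j for the two traits, N_1 and N_2 the two sample sizes, N_s the number of overlapping samples, M the number of SNPs, ℓ_j the LD score of SNP j, ρ_g the genetic covariance and ρ the phenotypic correlation among the overlapping samples.

```latex
\mathbb{E}\left[ z_{1j} z_{2j} \right]
  = \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j
  + \frac{\rho\,N_s}{\sqrt{N_1 N_2}}
```

The second term is the intercept. When both GWASs use exactly the same samples (N_s = N_1 = N_2) it equals ρ; with partial overlap it is scaled down by N_s / √(N_1 N_2).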

SNPSpD and MatSpD
• SNPSpD: A simple correction for multiple testing for SNPs in LD using
spectral decomposition (SpD). Nyholt 2004 AJHG
• MatSpD (matrix SpD): estimates the equivalent number of independent variables in a correlation (r) matrix
• The same method can be used to estimate the number of
independent variables in a phenotypic correlation matrix
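
A minimal sketch of the effective-number calculation, assuming Nyholt's 2004 formula Meff = 1 + (M − 1)(1 − Var(λ)/M), where λ are the eigenvalues of the M × M correlation matrix and Var is the sample variance. To stay dependency-free, the sketch uses the identity that the sum of squared eigenvalues of a symmetric matrix equals the sum of its squared entries (and that the eigenvalues of a correlation matrix average 1), so no eigen-solver is needed; an eigen-solver would give the same variance. The function name and example matrices are hypothetical, and other SpD-based estimators are not covered.

```python
def nyholt_meff(corr):
    """Effective number of independent variables (Nyholt 2004) for a correlation matrix.

    `corr` is a square, symmetric list of lists with ones on the diagonal.
    """
    m = len(corr)
    if any(len(row) != m for row in corr):
        raise ValueError("correlation matrix must be square")
    if m == 1:
        return 1.0
    # Sum of squared entries == sum of squared eigenvalues (symmetric matrix).
    sum_sq = sum(r * r for row in corr for r in row)
    # Eigenvalues of a correlation matrix sum to m, so their mean is 1 and
    # the sample variance is (sum(lambda^2) - m) / (m - 1).
    var_lambda = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_lambda / m)


# Three uncorrelated traits count as three independent tests.
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Three perfectly correlated traits count as one.
identical = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
# Two traits with r = 0.5 fall in between.
partial = [[1.0, 0.5], [0.5, 1.0]]

for name, mat in [("independent", independent), ("identical", identical), ("partial", partial)]:
    print(name, nyholt_meff(mat))
```

The two boundary cases are a quick sanity check: the estimate moves from M for uncorrelated traits down to 1 for perfectly correlated ones.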

Simulation
• How accurate is the phenotypic correlation estimation using GWAS results?
• Do any parameters strongly affect the estimate?

Model N_ind_A N_ind_B N_overlap Overlap_% N_SNPs N_EnvF N_simu y1_y2_A_obs y1_y2_B_obs Mean_y1_y2_est SD_y1_y2_est Deviation_obs_est (%)
sample size 1 300 300 150 50% 1000 100 100 -0.70 -0.70 -0.46 0.56 34.1%
sample size 2 500 500 250 50% 1000 100 100 -0.71 -0.70 -0.47 0.56 33.0%
sample size 3 1000 1000 500 50% 1000 100 100 -0.70 -0.70 -0.47 0.54 33.3%
sample size 4 3000 3000 1500 50% 1000 100 100 -0.70 -0.70 -0.46 0.54 33.6%
sample size 5 5000 5000 2500 50% 1000 100 100 -0.70 -0.70 -0.47 0.54 33.2%
sample size 6 10000 10000 5000 50% 1000 100 100 -0.71 -0.71 -0.47 0.54 33.9%
sample overlap 1 5000 5000 1000 10% 1000 100 100 -0.70 -0.70 -0.13 0.39 82.1%
sample overlap 2 5000 5000 2000 20% 1000 100 100 -0.70 -0.70 -0.23 0.47 67.2%
sample overlap 3 5000 5000 3000 30% 1000 100 100 -0.71 -0.71 -0.33 0.47 54.0%
sample overlap 4 5000 5000 4000 40% 1000 100 100 -0.71 -0.71 -0.40 0.51 43.2%
sample overlap 5 5000 5000 5000 50% 1000 100 100 -0.71 -0.71 -0.47 0.53 33.3%
sample overlap 6 5000 5000 6000 60% 1000 100 100 -0.71 -0.71 -0.53 0.59 25.2%
sample overlap 7 5000 5000 7000 70% 1000 100 100 -0.70 -0.70 -0.58 0.57 17.5%
sample overlap 8 5000 5000 8000 80% 1000 100 100 -0.70 -0.70 -0.62 0.65 11.4%
sample overlap 9 5000 5000 9000 90% 1000 100 100 -0.71 -0.71 -0.67 0.67 5.8%
unbalance sample 1 5000 5000 9000 90% 1000 100 100 -0.71 -0.71 -0.67 0.67 5.8%
number of SNPs 1 5000 5000 2500 50% 10 100 100 -0.70 -0.70 -0.44 0.73 38.1%
number of SNPs 2 5000 5000 2500 50% 100 100 100 -0.70 -0.70 -0.48 0.53 34.1%
number of SNPs 3 5000 5000 2500 50% 500 100 100 -0.70 -0.70 -0.47 0.53 34.3%
number of SNPs 4 5000 5000 2500 50% 1000 100 100 -0.71 -0.70 -0.47 0.56 33.5%
number of SNPs 5 5000 5000 2500 50% 5000 100 100 -0.70 -0.70 -0.47 0.55 33.6%
number of SNPs 6 5000 5000 2500 50% 10000 100 100 -0.71 -0.71 -0.47 0.59 33.7%

Accuracy tests using real data
• Shin et al provided the observed phenotypic correlation matrix for 452 metabolites, which can be used as a test dataset.
• We compared these observed phenotypic correlations with the phenotypic correlations estimated by PhenoSpD.
• The estimated phenotypic correlations agree well with the observed phenotypic correlations.
• The exceptions are traits with limited sample size (and therefore limited sample overlap).

Growing importance of PhenoSpD
• PhenoSpD is particularly useful for multiple GWASs from the same samples, e.g. complex molecular traits such as metabolites and cytokines
• It can also be applied to all traits in MR-Base / LD Hub by splitting the traits into groups, e.g. the traits in the GIANT consortium are very likely to be correlated, and most of them come from the same samples

Real case application in MR-Base and LD Hub
Consortium / First author Category N_traits N_SNPs N_correlations N_independent_traits
Kettunen Blood metabolites 123 9826292 7503 44.9
Shin Metabolites 451 2482345 101475 324.4
Roederer Immune system phenotypes 151 1585187 11325 94.2
CARDIOGRAM 2 335391 1 1
TRICL 4 335391 6 3
TAG 4 1449634 6 3.98
SSGAC 7 1449634 21 6
PGC 4 335391 6 3.644
Leptin 2 1449634 1 1
MAGIC 16 1449634 120 11.098
IIBDGC 3 335391 3 2
Hrgene 8 1449634 28 7
HaemGen 6 1449634 15 5
GPC 6 1449634 15 5
GLGC 4 1449634 6 3
GIANT 15 1449634 105 10.1097
GEFOS 3 1449634 3 3
CKDGen 9 335391 36 8
EGG 4 1449634 6 4
GIS 2 2029112 1 1
GUGC 2 2449580 1 1
ENIGMA 7 7237736 21 6
UK Biobank 5 9440243 9 5
Others 24 / / 24
All 862 / 120713 577.3317
Number of independent traits in MR-Base
Consortium / First author Category N_traits N_SNPs N_correlations N_independent_traits
All traits All traits 221 / 24310 134.1167
Number of independent traits in LD Hub

Growing importance of PhenoSpD
• There is great potential to apply PhenoSpD to multiple traits in large-scale biobanks and cohorts such as UK Biobank, China Kadoorie Biobank and the HUNT study (all traits in one sample)

UK Biobank release from Ben Neale’s group
• RAPID GWAS OF THOUSANDS OF PHENOTYPES FOR 337,000 SAMPLES IN THE UK BIOBANK
(http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank)
• GWAS summary statistics from 337,000 European samples are available for over 2,400 human traits. Anyone can access and download the results.
• About 600 of these traits are heritable, making them the most valuable data

PhenoSpD application
• Assess the potential causal relationship between genetic variation, DNA methylation and 139
complex traits.
• PhenoSpD: 139 outcomes → 62 independent outcomes
Hypothesis free MR of DNA methylation on 139 human traits
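
The practical effect of going from 139 outcomes to 62 independent outcomes is a less stringent per-test significance threshold. The arithmetic can be sketched as follows, assuming a family-wise level of 0.05, which the slide does not state; the function name is hypothetical.

```python
def corrected_threshold(alpha, n_tests):
    """Per-test p-value threshold: family-wise alpha divided by the number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise significance level

bonferroni = corrected_threshold(ALPHA, 139)  # one test per outcome
phenospd = corrected_threshold(ALPHA, 62)     # one test per independent outcome

print(f"Bonferroni over 139 outcomes: {bonferroni:.2e}")
print(f"PhenoSpD over 62 independent outcomes: {phenospd:.2e}")
```

Because 62 is less than half of 139, the PhenoSpD threshold is more than twice as permissive as plain Bonferroni while still accounting for the correlation between outcomes.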

Links for PhenoSpD
• PhenoSpD Paper is on bioRxiv:
https://www.biorxiv.org/content/early/2017/07/25/148627
• R scripts of PhenoSpD can be found on MRC-IEU github:
https://github.com/MRCIEU/PhenoSpD
• LD Hub: http://ldsc.broadinstitute.org/ldhub/
• MR-Base: www.mrbase.org

Acknowledgements
• LD Hub team: Jie Zheng, David M Evans, Benjamin Neale
• MR-Base team: Gibran Hemani, Jie Zheng, George Davey Smith, Tom Gaunt, Philip Haycock
• PhenoSpD team: Jie Zheng, Tom Richardson, Louise Millard, Gibran Hemani, Chris Raistrick, Bjarni Vilhjalmsson, Philip Haycock, Tom Gaunt

Q & A
Thank you!
Questions welcomed
