Bioinformatics for discovery:
Introduction to GWAS and EWAS
BMI 701: Introduction to Biomedical Informatics

12/1/2015
chirag@hms.harvard.edu

@chiragjp

www.chiragjpgroup.org
Chirag J Patel
P = G + E
Type 2 Diabetes

Cancer

Alzheimer’s

Gene expression
Phenotype Genome
Variants
Environment
Infectious agents

Nutrients

Pollutants

Drugs
Complex traits are a function of genes and
environment...
We are great at G investigation!
over 2000 Genome-wide Association Studies (GWAS)

https://www.ebi.ac.uk/gwas/
G
>2,000 traits/diseases

>15,000 SNPs

>16,000 SNP-trait associations
https://www.ebi.ac.uk/gwas/
Dissecting G in P:
What is a Genome-wide Association Study?
Hypothesis-free “search engine” for genetic variants 

associated with a complex trait or disease 

in unrelated populations
[Figure: one 2×2 table per SNP, from SNP(A)/SNP(a) through SNP(Z)/SNP(z), genome-wide. Columns: the two alleles of the SNP; rows: diseased, non-diseased.]
The road to GWAS...
A new paradigm of GWAS for discovery of G in P:
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:
http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-current day
High-throughput variant
assay
< $99 for ~1M variants
Measurement tools
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
The Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a [...]
Nature, Vol 447, 7 June 2007, doi:10.1038/nature05911
Comprehensive, high-throughput analyses
GWAS
Number of raw publications with subject of “GWAS”
[Figure: number of publications per year, 1994 to 2013 (y-axis 0 to 3000). PubMed MeSH terms: human + GWAS. Annotated milestones:]
Risch + Merikangas
linkage vs. association
human genome sequenced
GWAS
age-related macular degeneration
mega-meta-GWAS
WTCCC
GWAS is relevant today (even with NGS around the corner)
Why execute GWAS?
Geneticists have made substantial progress in identifying the genetic basis of many human diseases, at least those with conspicuous determinants. These successes include Huntington's disease, Alzheimer's disease, and some forms of breast cancer. However, the detection of genetic factors for complex diseases, such as schizophrenia, bipolar disorder, and diabetes, has been far more complicated. There have been numerous reports of genes or loci that might underlie these disorders, but few of these findings have been replicated. The modest nature of the gene effects for these disorders likely explains the contradictory and inconclusive claims about their identification. Despite the small effects of such genes, the magnitude of their attributable risk (the proportion of people affected due to them) may be large because they are quite frequent in the population, making them of public health significance.
Has the genetic study of complex disorders reached its limits? The persistent lack of replicability of these reports of linkage between various loci and complex diseases might imply that it has. We argue below that [...]
[...] linkage analysis we have chosen for this argument is a popular current paradigm in which pairs of siblings, both with the disease, are examined for sharing of alleles at multiple sites in the genome defined by genetic markers. The more often the affected siblings share the same allele at a particular site, the more likely the site is close to the disease gene. Using the formulas in (1), we calculate the expected proportion Y of alleles shared by a pair of affected siblings for the best possible case, that is, a closely linked marker locus (recombination fraction θ = 0) that is fully informative (heterozygosity = 1) (2), as
$$Y = \frac{1 + w}{2 + w}, \quad \text{where } w = \frac{pq(\gamma - 1)^2}{(p\gamma + q)^2}$$
If there is no linkage of a marker at a particular site to the disease, the siblings would be expected to share alleles 50% of the time; that is, Y would equal 0.5. Values of Y for various values of p and γ are given in the third column of the table. For an allele of moderate frequency (p is 0.1 to 0.5) that con[...]
[The right-hand column of this excerpt is cropped on the slide and cannot be recovered. The legible fragments mention linkage analysis, "about 2 or less", "more than ~2500", the transmission/disequilibrium test, and association tests in pairs of affected siblings.]
The Future of Genetic Studies of
Complex Human Diseases
Neil Risch and Kathleen Merikangas
Science, 1996
A new paradigm is needed for discovery!
How does a GWAS work?
Single nucleotide polymorphisms (SNPs):
How many SNPs are in the human genome?
>3,000,000,000 bases in human genome
SNPs appear about once every ~1,000 bases
~3,000,000 SNPs
40-60% have minor allele frequency <5%

GWAS focus on frequency >5%
HapMap Consortium, 2010
Can’t measure everything:
Tag SNPs and Linkage Disequilibrium (LD)
LD = co-occurrence of SNPs in a contiguous region
Bush and Moore, 2012
The phenomenon of LD makes GWAS possible:
How and why?: Indirect association
[Figure 3. Indirect Association. Genotyped SNPs often lie in a region of high linkage [...] will be statistically associated with disease as a surrogate for the disease SNP through [...]
doi:10.1371/journal.pcbi.1002822.g003]
Bush and Moore, 2012
LD blocks
Can’t measure everything:
Tag SNPs and Linkage Disequilibrium
Tag SNPs are common proxies for other SNPs

500K - 1M per chip
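How strongly a tag SNP stands in for a neighbouring SNP is commonly summarised with the LD statistics D and r². The sketch below computes both from haplotype counts for two biallelic SNPs. The function name and the example counts are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic SNPs from the four haplotype counts.

    n_AB is the number of haplotypes carrying allele A at SNP 1 and
    allele B at SNP 2, and so on for the other three combinations.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n              # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n      # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / n      # frequency of allele B at SNP 2
    D = p_AB - p_A * p_B         # deviation from independence
    r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, r2


# Hypothetical counts: alleles always travel together (a perfect tag) ...
perfect = ld_stats(50, 0, 0, 50)
# ... versus alleles that are independent of each other (a useless tag).
independent = ld_stats(25, 25, 25, 25)
print(perfect, independent)
```

An r² near 1 means that genotyping one SNP captures nearly all of the information in the other, which is why a chip with 500K to 1M tag SNPs can cover many more common variants.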
[Excerpt, Sladek et al, 2007: "[...] identified significant associations for seven SNPs representing four new T2DM loci (Table 1). In all cases, the strongest association for the MAX statistic (see Methods) was obtained with the additive model."]
[Figure: −log10[P] for genotyped SNPs plotted above the LD structure at four loci. Panels a to d: SLC30A8, IDE, EXT2, ALX4.]
LD block
2 alleles are correlated because they are inherited
together
Sladek et al, 2007
image: www.lifa-core.de/
Digitizing SNPs:
e.g., Illumina Infinium Array
image: illumina.com
Assessing Thousands of Factors Simultaneously:
Data-driven search for differences in SNP frequencies
~100,000 - ~1,000,000 association tests
disease cases
healthy controls
GCAGGTACATG...GGTA...
GCAGGTACACG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACACG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACACG...GGTA...
disease cases
GCAGGTACATG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACATG...GGTA...
healthy controls
Associating One SNP with Disease
Case-Control Study Design
SNP (A/a) → ? → Disease
                          A    a
diseased (cases)
non-diseased (controls)
Associating One SNP with Disease
What is an “Odds Ratio”?
SNP (A/a) → ? → Disease
                          A    a
diseased (cases)          c    d
non-diseased (controls)   x    y
Chi-squared test
Odds Ratio a vs A:
Odds of disease with allele a
vs.
Odds of disease with allele A
1: equal odds (no difference)

>1: increased odds (increased risk)

<1: decreased odds (decreased risk)
Associating One SNP with Disease
Calculating the Odds Ratio
SNP (A/a) → ? → Disease
                          A    a
diseased (cases)          c    d
non-diseased (controls)   x    y
Chi-squared test

Odds with allele a: [d/(d+y)]/[y/(d+y)] = d/y
Odds with allele A: [c/(c+x)]/[x/(c+x)] = c/x
Odds Ratio a vs A:
$$\mathrm{OR} = \frac{d/y}{c/x} = \frac{d/c}{y/x} = \frac{dx}{cy}$$
How would you interpret an OR of 2?
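The 2×2 table above can be turned into an odds ratio and a chi-squared test in a few lines. This is a minimal sketch using the slide's cell labels (c, d, x, y); the example counts are hypothetical, and the test shown is the Pearson chi-squared statistic with 1 degree of freedom and no continuity correction, which is one common choice rather than the only one.

```python
import math


def odds_ratio(c, d, x, y):
    """Odds ratio of allele a vs. allele A: (d/y) / (c/x) = d*x / (c*y)."""
    return (d * x) / (c * y)


def chi_squared_2x2(c, d, x, y):
    """Pearson chi-squared statistic (1 df) and its p-value for a 2x2 table."""
    n = c + d + x + y
    stat = n * (c * y - d * x) ** 2 / ((c + d) * (x + y) * (c + x) * (d + y))
    # For 1 df, P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical allele counts: cases carry A=300, a=200; controls A=350, a=150.
or_a_vs_A = odds_ratio(c=300, d=200, x=350, y=150)
stat, p_value = chi_squared_2x2(c=300, d=200, x=350, y=150)
print(or_a_vs_A, stat, p_value)
```

With these made-up counts the odds of disease are about 1.56 times higher with allele a than with allele A. An OR of 2 would read the same way: the odds of disease are doubled.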
Associating One SNP with Disease
Cohort Study Design
SNP (A/a) → ? → Disease
vs.
SNP (A/a) → Non-diseased
•Direct measure of risk vs. odds ratio

•Need to wait!
•If incidence is low, N needs to be large!
Cox survival regression

Relative Risk
Models to associate genotypes with disease
Examples for a case-control study
Disease (ND=4): AA, AA, Aa, Aa
Non-diseased (NC=4): aa, aa, Aa, Aa
Allele counts:
               A    a
diseased       6    2
non-diseased   2    6
OR A (vs a)

OR a (vs A)
Genotype counts:
               AA   Aa   aa
diseased
non-diseased
Models to associate genotypes with disease
Genotypic Test (“2 or 1 df test”)
Diseased (ND=4): AA, AA, Aa, Aa
Non-diseased (NC=4): aa, aa, Aa, Aa
               AA   Aa   aa
diseased       2    2    0
non-diseased   0    2    2
OR AA (vs. Aa)

aa (vs. Aa)
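The toy case-control example can be tabulated programmatically. The sketch below assumes one assignment of genotypes to groups that is consistent with the allele counts on the slide (6 A and 2 a among the diseased, 2 A and 6 a among the non-diseased); the exact grouping is a reconstruction, not something the slide states explicitly.

```python
from collections import Counter

# Assumed grouping, chosen to reproduce the slide's allele counts.
diseased = ["AA", "AA", "Aa", "Aa"]
non_diseased = ["aa", "aa", "Aa", "Aa"]


def allele_counts(genotypes):
    """Count individual alleles (two per person) for the allelic test."""
    return Counter("".join(genotypes))


def genotype_counts(genotypes):
    """Count whole genotypes for the genotypic test."""
    return Counter(genotypes)


case_alleles = allele_counts(diseased)
control_alleles = allele_counts(non_diseased)

# Allelic odds ratios, written as cross-products of the 2x2 table.
or_A_vs_a = (case_alleles["A"] * control_alleles["a"]) / (
    case_alleles["a"] * control_alleles["A"]
)
or_a_vs_A = 1 / or_A_vs_a

case_genotypes = genotype_counts(diseased)
control_genotypes = genotype_counts(non_diseased)
print(or_A_vs_a, or_a_vs_A, case_genotypes, control_genotypes)
```

Note that the genotypic table for this toy data contains zero cells (no aa among the diseased, no AA among the non-diseased), so the genotype-specific odds ratios are not finite here; real studies have far larger counts.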
Associating One SNP with Quantitative Trait
(e.g., height, weight, cholesterol)
[Figure: box plots of a quantitative trait (height) by genotype. SNP rs1234: GG, GC, CC. SNP rs123456: CC, CT, TT.]
Associating One SNP with Quantitative Trait
Linear Regression and Additive Risk Model
y=ɑ+βx+ε
[Figure: height by genotype for SNP rs123456, CC (0), CT (1), TT (2), with the fitted line; intercept ɑ and slope β marked.]
height = ɑ + βx
x = 0 if individual is CC
x = 1 if individual is CT
x = 2 if individual is TT
T = risk allele
ɑ: expected height for genotype CC (x = 0)
β: change in height for 1 risk allele
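The additive model can be fitted with ordinary least squares after coding each genotype as a count of risk alleles. This is a minimal sketch with hypothetical heights for six individuals; the helper name and data are illustrative, and a real analysis would also report a standard error and p-value for β and usually adjust for covariates.

```python
# Additive coding for SNP rs123456, with T as the risk allele.
GENOTYPE_TO_DOSAGE = {"CC": 0, "CT": 1, "TT": 2}


def fit_additive_model(dosages, trait):
    """Least-squares fit of trait = alpha + beta * dosage; returns (alpha, beta)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    beta = sxy / sxx                 # change in trait per risk allele
    alpha = mean_y - beta * mean_x   # expected trait for genotype CC
    return alpha, beta


# Hypothetical data, for illustration only.
genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]
heights = [60.0, 62.0, 65.0, 67.0, 70.0, 72.0]

dosages = [GENOTYPE_TO_DOSAGE[g] for g in genotypes]
alpha, beta = fit_additive_model(dosages, heights)
print(alpha, beta)
```

In a GWAS this fit is repeated once per SNP, and the p-value for β = 0 is what ends up on the Manhattan plot.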
Prototypical “Manhattan plot” to visualize
associations
Science, 2007
~100,000 - ~1,000,000 association tests
[Figure: Manhattan plot. x-axis: chromosome (1 to 22, X); y-axis: −log10(P), 0 to 15. Source: Nature, Vol 447, 7 June 2007.]
AA Aa aa
diseased
non-
diseased
[Excerpts visible around the figure: "[...] with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with [...]"; "[...] potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias [...]"]
[Figure: seven stacked Manhattan plots. x-axis: chromosome (1 to 22, X); y-axis: −log10(P), 0 to 15. Panels: Bipolar disorder, Coronary artery disease, Crohn’s disease, Hypertension, Rheumatoid arthritis, Type 1 diabetes, Type 2 diabetes.]
Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases −log10 of the trend test P value for quality-control-positive SNPs, excluding [...] Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10⁻⁵ highlighted in green. All panels are truncated at [...]
Type I Error:
False Positives!
what is a p-value?
probability of a result at least as extreme as the one observed if there is no difference (H0 is true)
Many tests: some will be significant (low p-value) by chance!
100 tests at a significance level of 0.05...
how many would be significant by chance?
Bonferroni “correction”:

Divide the 0.05 significance level by the number of tests
e.g., 1M SNPs: 0.05/(1×10⁶) = 5×10⁻⁸
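The two calculations on this slide are short enough to check directly. The sketch below computes the expected number of false positives when every null hypothesis is true, and the Bonferroni-corrected per-test threshold; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests with p < alpha when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


false_hits_100 = expected_false_positives(100)        # about 5 of 100 tests
gwas_threshold = bonferroni_threshold(1_000_000)      # about 5e-8 for 1M SNPs
print(false_hits_100, gwas_threshold)
```

Bonferroni treats the tests as independent, so with SNPs correlated through LD the corrected threshold is conservative.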
QQplot:
Distribution of observed p-values vs. H0 p-values
[Figure: histogram of runif(10000), p-values under H0 (random uniform distribution). x-axis: p-value, 0.0 to 1.0; y-axis: Frequency.]
[Figure: histogram of gwas$P.value, p-values of GWAS in Total Cholesterol (Global Lipids Consortium, 2012). x-axis: p-value, 0.0 to 1.0; y-axis: Frequency.]
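A QQ plot pairs each observed p-value with the p-value expected at the same rank under H0, where p-values are uniform. The sketch below builds those pairs on the −log10 scale and also computes the genomic inflation factor λGC, the ratio of the median observed 1-df chi-squared statistic to its expected median under H0 (about 0.455). The plotting positions (i + 0.5)/n are one common convention, and the simulated null p-values stand in for real GWAS output; both are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-squared distribution with 1 df


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p), matched by rank."""
    n = len(p_values)
    observed = sorted(p_values)
    expected = [(i + 0.5) / n for i in range(n)]  # uniform quantiles under H0
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


def genomic_inflation(p_values):
    """lambda_GC: median observed 1-df chi-squared statistic over its null median."""
    z = NormalDist()
    chi2 = [z.inv_cdf(p / 2) ** 2 for p in p_values]  # two-sided p -> chi-squared
    return median(chi2) / CHI2_1DF_MEDIAN


# Simulated p-values under H0, analogous to runif(10000); values lie in (0, 1].
rng = random.Random(0)
null_p = [1 - rng.random() for _ in range(10_000)]

points = qq_points(null_p)
lam_null = genomic_inflation(null_p)
print(lam_null)
```

Under H0 the points fall along the diagonal and λGC is close to 1. Points lifting off the diagonal only in the extreme tail suggest true associations, while a λGC well above 1 across the whole distribution suggests systematic bias such as population stratification.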
Which diseases show evidence of association?
Examining the QQplot of test statistics in WTCCC
[Excerpts visible around the figure: "[...] present study cannot provide conclusive exclusion of any given gene. This is the consequence of several factors including: less-than-complete coverage of common variation genome-wide on the Affymetrix chip; poor coverage (by design) of rare variants, including many structural variants (thereby reducing power to detect rare, penetrant, alleles) (ref. 25); difficulties with defining the full genomic extent of the gene of interest; and, despite the sample size, relatively low power to detect, at levels of [...]"; "[...] already allow us, for selected diseases, to highlight pathways and mechanisms of particular interest. Naturally, extensive resequencing and fine-mapping work, followed by functional studies will be required before such inferences can be translated into robust statements about the molecular and physiological mechanisms involved. We turn now to a discussion of the main findings for each disease, focusing here only on the most significant and interesting results [...]"]
[Figure: seven quantile-quantile plots. x-axis: expected chi-squared value; y-axis: observed test statistic. Panels: BD, CAD, CD, HT, RA, T1D, T2D.]
Figure 3 | Quantile-quantile plots for seven genome-wide scans. For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency >1% and missing data rate <1%. SNPs that [...] 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and [...]
Observational associations do not equal causation...
Ice Cream ↔ Drowning
Confounding bias
What is a confounder?
Summer!
?
A confounder is correlated with both the “risk” factor and the disease,

leading to invalid inference.

Common source of bias in observational studies (e.g., case-control,
cohort, etc)
SNP Disease
Population Stratification:
A source of possible confounding in GWAS
race/ethnicity
?
Ancestry correlated with allele frequency and disease

GWAS are done on specific populations separately.

(most have been done in populations of European ancestry)
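Population stratification can create an association where none exists inside any single population. The sketch below uses two hypothetical populations in which the allele has no effect on disease (odds ratio of exactly 1 in each), but which differ in both allele frequency and disease prevalence. All counts are invented for illustration.

```python
def odds_ratio(case_a, case_A, control_a, control_A):
    """Odds ratio of allele a vs. allele A from allele counts."""
    return (case_a * control_A) / (case_A * control_a)


# Population 1: allele a is rare (20%) and the disease is rare (10%).
pop1 = dict(case_a=20, case_A=80, control_a=180, control_A=720)
# Population 2: allele a is common (80%) and the disease is common (50%).
pop2 = dict(case_a=400, case_A=100, control_a=400, control_A=100)

# Pooling the two populations ignores ancestry entirely.
pooled = {key: pop1[key] + pop2[key] for key in pop1}

or_pop1 = odds_ratio(**pop1)
or_pop2 = odds_ratio(**pop2)
or_pooled = odds_ratio(**pooled)
print(or_pop1, or_pop2, or_pooled)
```

The pooled odds ratio is above 3 even though the stratum-specific odds ratios are both 1, which is why analyses are run within ancestry groups or adjusted for ancestry.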
FTO Diabetes
Mediation
SNPs indicative of a mediator factor?
Example: FTO and Type 2 Diabetes
Body Mass
?
Association between FTO and Type 2 Diabetes via BMI?
... or does FTO have an independent role in Type 2 Diabetes...?
FTO Body Mass
PLINK:
(Standard) Whole Genome Analysis Software
http://pngu.mgh.harvard.edu/~purcell/plink/
•cited >9000 times since 2007

•allele frequency

•linkage disequilibrium (LD)

•data manipulation/filtering

•association: allelic, genotypic models

•chi-square

•logistic

•linear
Examples: 

GWASs in Type 2 Diabetes
Type 2 Diabetes Mellitus:
A complex, multifactorial disease
•Insulin production vs. use

•beta-cell function

•insulin sensitivity (BMI)

•Moves glucose from blood into
cells

•Complications arise due to
glucose in blood, hyperglycemia
•diagnosed by blood glucose
levels

CDC,
family history: 25%
body weight, diet, lifestyle, age
ARTICLES
A genome-wide association study identifies novel risk loci for type 2 diabetes
Robert Sladek1,2,4, Ghislain Rocheleau1*, Johan Rung4*, Christian Dina5*, Lishuang Shen1, David Serre1, Philippe Boutin5, Daniel Vincent4, Alexandre Belisle4, Samy Hadjadj6, Beverley Balkau7, Barbara Heude7, Guillaume Charpentier8, Thomas J. Hudson4,9, Alexandre Montpetit4, Alexey V. Pshezhetsky10, Marc Prentki10,11, Barry I. Posner2,12, David J. Balding13, David Meyre5, Constantin Polychronakos1,3 & Philippe Froguel5,14
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of which were hitherto unknown. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935 single-nucleotide polymorphisms in a French case–control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell development or function (IDE–KIF11–HHEX and EXT2–ALX4). These associations explain a substantial portion of disease risk and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.
The rapidly increasing prevalence of type 2 diabetes mellitus (T2DM) is thought to be due to environmental factors, such as increased availability of food and decreased opportunity and motivation for physical activity, acting on genetically susceptible individuals. The heritability of T2DM is one of the best established among common diseases and, consequently, genetic risk factors for T2DM have been the subject of intense research (ref. 1). Although the genetic causes of many monogenic forms of diabetes (maturity onset diabetes in the young, neonatal mitochondrial and other syndromic types of diabetes mellitus) have been elucidated, few variants leading to common T2DM have been clearly identified and individually confer only a small risk (odds ratio ≈ 1.1–1.25) of developing T2DM (ref. 1). Linkage studies have reported many T2DM-linked chromosomal regions and have identified putative, causative genetic variants in CAPN10 (ref. 2), ENPP1 (ref. 3), HNF4A (refs 4, 5) and ACDC (also called ADIPOQ) (ref. 6). In parallel, candidate-gene studies have reported many T2DM-associated loci, with coding variants in the nuclear receptor PPARG (P12A) (ref. 7) and the potassium channel KCNJ11 (E23K) (ref. 8) being among the very few that have been convincingly replicated. The strongest known (odds ratio ≈ 1.7) T2DM association (ref. 9) was recently mapped to the transcription factor TCF7L2 and has been consistently replicated in multiple populations (refs 10–20).
Subjects and study design
The recent availability of high-density genotyping arrays, which combine the power of association studies with the systematic nature of a genome-wide search, led us to undertake a two-stage, genome-wide association study to identify additional T2DM susceptibility loci (Supplementary Fig. 1). In the first stage of this study, we obtained genotypes for 392,935 single-nucleotide polymorphisms (SNPs) in 1,363 T2DM cases and controls (Supplementary Table 1). In order to enrich for risk alleles (ref. 21), the diabetic subjects studied in stage 1 were selected to have at least one affected first degree relative and age at onset under 45 yr (excluding patients with maturity onset diabetes in the young). Furthermore, in order to decrease phenotypic heterogeneity and to enrich for variants determining insulin resistance and β-cell dysfunction through mechanisms other than severe obesity, we initially studied diabetic patients with a body mass index (BMI) <30 kg m⁻². Control subjects were selected to have fasting blood glucose <5.7 mmol l⁻¹ in DESIR, a large prospective cohort for the study of insulin resistance in French subjects (ref. 22).
Genotypes for each study subject were obtained using two platforms: Illumina Infinium Human1 BeadArrays, which assay 109,365 SNPs chosen using a gene-centred design; and Human Hap300 BeadArrays, which assay 317,503 SNPs chosen to tag haplotype blocks identified by the Phase I HapMap (ref. 23). Of the 409,927 markers that passed quality control (Supplementary Tables 2 and 3), genotypes were obtained for an average of 99.2% (Human1) and 99.4% (Hap300) of markers for each subject with a reproducibility of >99.9% (both platforms). Forty-three subjects were removed from analysis because of evidence of intercontinental admixture (Supplementary Fig. 3) and an additional four because their genotype-determined gender disagreed with clinical records. In total, T2DM association was tested for 100,764 (Human1) and 309,163 (Hap300) SNPs representing 392,935 unique loci (Fig. 1). Because of unequal male/female ratios in our cases and controls, we analysed the 12,666 sex-chromosome SNPs separately for each gender.
*These authors contributed equally to this work.
1 Departments of Human Genetics, 2 Medicine and 3 Pediatrics, Faculty of Medicine, McGill University, Montreal H3H 1P3, Canada.
4 McGill University and Genome Quebec Innovation Centre, Montreal H3A 1A4, Canada.
5 CNRS 8090-Institute of Biology, Pasteur Institute, Lille 59019 Cedex, France.
6 Endocrinology and Diabetology, University Hospital, Poitiers 86021 Cedex, France.
7 INSERM U780-IFR69, Villejuif 94807, France.
8 Endocrinology-Diabetology Unit, Corbeil-Essonnes Hospital, Corbeil-Essonnes 91100, France.
9 Ontario Institute for Cancer Research, Toronto M5G 1L7, Canada.
10 Montreal Diabetes Research Center, Montreal H2L 4M1, Canada.
11 Molecular Nutrition Unit and the Department of Nutrition, University of Montreal and the Centre Hospitalier de l’Université de Montréal, Montreal H3C 3J7, Canada.
12 Polypeptide Hormone Laboratory and Department of Anatomy and Cell Biology, Montreal H3A 2B2, Canada.
13 Department of Epidemiology & Public Health, Imperial College, St Mary’s Campus, Norfolk Place, London W2 1PG, UK.
14 Section of Genomic Medicine, Imperial College London W12 0NN, and Hammersmith Hospital, Du Cane Road, London W12 0HS, UK.
Nature, 2/2007
Genome-Wide Association Analysis
Identifies Loci for Type 2 Diabetes
and Triglyceride Levels
Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University,
and Novartis Institutes for BioMedical Research*†
New strategies for prevention and treatment of type 2 diabetes (T2D) require improved insight into
disease etiology. We analyzed 386,731 common single-nucleotide polymorphisms (SNPs) in 1464
patients with T2D and 1467 matched controls, each characterized for measures of glucose
metabolism, lipids, obesity, and blood pressure. With collaborators (FUSION and WTCCC/UKT2D),
we identified and confirmed three loci associated with T2D—in a noncoding region near CDKN2A
and CDKN2B, in an intron of IGF2BP2, and an intron of CDKAL1—and replicated associations near
HHEX and in SLC30A8 found by a recent whole-genome association study. We identified and
confirmed association of a SNP in an intron of glucokinase regulatory protein (GCKR) with serum
triglycerides. The discovery of associated variants in unsuspected genes and outside coding regions
illustrates the ability of genome-wide association studies to provide potentially important clues to
the pathogenesis of common diseases.
T
ype 2 diabetes, obesity, and cardiovascular
risk factors are caused by a combination
of genetic susceptibility, environment, be-
havior, and chance. Whole-genome association
studies (WGAS) offer a new approach to gene
discovery unbiased with regard to presumed
functions or locations of causal variants. This
approach is based on Fisher’s theory for additive
effects at common alleles (1); human heterozy-
to purifying selection, and has been made pos-
sible by genomic advances such as the human
genome sequence, SNP and HapMap databases,
and genotyping arrays (3).
We studied 1464 patients with T2D and
1467 controls from Finland and Sweden, each
characterized for 18 clinical traits: anthropomet-
ric measures, glucose tolerance and insulin se-
cretion, lipids and apolipoproteins, and blood
applying stringent quality-control filters, high-
quality genotypes for 386,731 common SNPs
were obtained (4). To extend the set of putative
causal alleles tested for association, we devel-
oped 284,968 additional multimarker (haplo-
type) tests based on these SNP genotypes (5, 6).
The 671,699 allelic tests capture (correlation co-
efficient r2
≥ 0.8) 78% of common SNPs in
HapMap CEU (3).
Each SNP and haplotype test was assessed
for association to T2D and each of 18 traits with
the software package PLINK (http://pngu.mgh.
harvard.edu/purcell/plink/). For T2D, a weighted
meta-analysis was used to combine results for
the population-based and family-based subsam-
ples (4). For quantitative traits, multivariable
linear or logistic regression with or without co-
variates was performed (4). Association results
for each SNP, haplotype test, and phenotype are
available (www.broad.mit.edu/diabetes/).
In genome-wide analysis involving hundreds
of thousands of statistical tests, modest levels of
bias imposed on the null distribution can over-
whelm a small number of true results. We used
three strategies to search for evidence of sys-
tematic bias from unrecognized population struc-
ture, the analytical approach, and genotyping
artifacts (7, 8). First, we examined the distribu-
tion of P-values in the population-based sam-
ple, observing a close match to that expected
for a null distribution (genomic inflation factor
λGC = 1.05 for T2D). Second, we calculated
A Genome-Wide Association Study of
Type 2 Diabetes in Finns Detects
Multiple Susceptibility Variants
Laura J. Scott,1 Karen L. Mohlke,2 Lori L. Bonnycastle,3 Cristen J. Willer,1 Yun Li,1 William L. Duren,1 Michael R. Erdos,3 Heather M. Stringham,1 Peter S. Chines,3 Anne U. Jackson,1 Ludmila Prokunina-Olsson,3 Chia-Jen Ding,1 Amy J. Swift,3 Narisu Narisu,3 Tianle Hu,1 Randall Pruim,4 Rui Xiao,1 Xiao-Yi Li,1 Karen N. Conneely,1 Nancy L. Riebow,3 Andrew G. Sprau,3 Maurine Tong,3 Peggy P. White,1 Kurt N. Hetrick,5 Michael W. Barnhart,5 Craig W. Bark,5 Janet L. Goldstein,5 Lee Watkins,5 Fang Xiang,1 Jouko Saramies,6 Thomas A. Buchanan,7 Richard M. Watanabe,8,9 Timo T. Valle,10 Leena Kinnunen,10,11 Gonçalo R. Abecasis,1 Elizabeth W. Pugh,5 Kimberly F. Doheny,5 Richard N. Bergman,9 Jaakko Tuomilehto,10,11,12 Francis S. Collins,3* Michael Boehnke1*
Identifying the genetic variants that increase the risk of type 2 diabetes (T2D) in humans has
been a formidable challenge. Adopting a genome-wide association strategy, we genotyped 1161
Finnish T2D cases and 1174 Finnish normal glucose-tolerant (NGT) controls with >315,000
single-nucleotide polymorphisms (SNPs) and imputed genotypes for an additional >2 million
autosomal SNPs. We carried out association analysis with these SNPs to identify genetic variants
that predispose to T2D, compared our T2D association results with the results of two similar studies,
and genotyped 80 SNPs in an additional 1215 Finnish T2D cases and 1258 Finnish NGT controls.
We identify T2D-associated variants in an intergenic region of chromosome 11p12, contribute
to the identification of T2D-associated variants near the genes IGF2BP2 and CDKAL1 and the
Science, 6/2007
Study design: Richa Saxena1–6 and Valeriya Lyssenko7 (Team Leaders), Peter Almgren,7 Paul I. W. de Bakker,1–6 Noël P. Burtt,1 Jose C. Florez,1–6 Hong Chen,8 Joanne Meyer,8 Joel N. Hirschhorn,1,6,9–11 Mark J. Daly,1–3,5 Thomas E. Hughes,8 Leif Groop,7,12 David Altshuler1–6 (Chair)
Clinical characterization and phenotypes: Valeriya Lyssenko7 and Richa Saxena1–6 (Team Leaders), Peter Almgren,7 Kristin Ardlie,1 Kristina Bengtsson Boström,13 Noël P. Burtt,1 Hong Chen,8 Jose C. Florez,1–6 Bo Isomaa,14,15 Sekar Kathiresan,1,3,5 Guillaume Lettre,1,6,9–11 Ulf Lindblad,16 Helen N. Lyon,1,6,9–11 Olle Melander,7 Christopher Newton-Cheh,1–3,5 Peter Nilsson,17 Marju Orho-Melander,7 Lennart Råstam,16 Elizabeth K. Speliotes,1,3,6,9–11 Marja-Riitta Taskinen,12 Tiinamaija Tuomi,12,15 Benjamin F. Voight,1–3,5 David Altshuler,1–6 Joel N. Hirschhorn,1,6,9–11 Thomas E. Hughes,8 Leif Groop7,12 (Chair)
DNA sample QC and diabetes replication genotyping: Candace Guiducci1 and Valeriya Lyssenko7 (Team Leaders), Anna Berglund,7 Joyce Carlson,18 Lauren Gianniny,1 Rachel Hackett,1 Liselotte Hall,18 Johan Holmkvist,7 Esa Laurila,7 Marju Orho-Melander,7 Marketa Sjögren,7 Maria Sterner,18 Aarti Surti,1 Margareta Svensson,7 Malin Svensson,7 Ryan Tewhey,1 Noël P. Burtt1 (Chair)
Whole genome scan genotyping: Brendan Blumenstiel1 (Team Leader), Melissa Parkin,1 Matthew DeFelice,1 Candace Guiducci,1 Ryan Tewhey,1 Rachel Barry,1 Wendy Brodeur,1 Noël P. Burtt,1 Jody Camarata,1 Nancy Chia,1 Mary Fava,1 John Gibbons,1 Bob Handsaker,1 Claire Healy,1 Kieu Nguyen,1 Casey Gates,1 Carrie Sougnez,1 Diane Gage,1 Marcia Nizzari,1 David Altshuler,1–6 Stacey B. Gabriel1 (Chair)
GCKR replication genotyping and analysis (Malmö Diet and Cancer Study): Sekar Kathiresan1,3,5 (Team Leader), Candace Guiducci,1 Aarti Surti,1 Noël P. Burtt,1 Olle Melander,7 Marju Orho-Melander7 (Chair)
Statistical analysis: Benjamin F. Voight1–3,5 and Paul I. W. de Bakker1–6 (Team Leaders), Richa Saxena,1–6 Valeriya Lyssenko,7 Peter Almgren,7 Noël P. Burtt,1 Hong Chen,8 Gung-Wei Chirn,8 Qicheng Ma,8 Hemang Parikh,7 Delwood Richardson,8 Darrell Ricke,8 Jeffrey J. Roix,8 Leif Groop,7,12 Shaun Purcell,1,2 David Altshuler,1–6 Mark J. Daly1–3,5 (Chair)
1 Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA 02142, USA.
2 Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA.
3 Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, USA.
4 Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA.
5 Department of Medicine, Harvard Medical School, Boston, MA 02115, USA.
6 Department of Genetics, Harvard Medical School, Boston, MA 02115, USA.
7 Department of Clinical Sciences, Diabetes and Endocrinology Research Unit, University Hospital Malmö, Lund University, Malmö, Sweden.
8 Diabetes and Metabolism Disease Area, Novartis Institutes for BioMedical Research, 100 Technology Square, Cambridge, MA 02139, USA.
9 Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA.
10 Division of Endocrinology, Children’s Hospital, Boston, MA 02115, USA.
11 Division of Genetics, Children’s Hospital, Boston, MA 02115, USA.
12 Department of Medicine, Helsinki University Hospital, University of Helsinki, Helsinki, Finland.
13 Skaraborg Institute, Skövde, Sweden.
14 Malmska Municipal Health Center and Hospital, Jakobstad, Finland.
15 Folkhälsan Research Center, Helsinki, Finland.
16 Department of Clinical Sciences, Community Medicine Research Unit, University Hospital Malmö, Lund University, Malmö, Sweden.
17 Department of Clinical Sciences, Medicine Research Unit, University Hospital Malmö, Lund University, Malmö, Sweden.
18 Clinical Chemistry, University Hospital Malmö, Lund University, Malmö, Sweden.
19 Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02115, USA.
Supporting Online Material
www.sciencemag.org/cgi/content/full/1142358/DC1
Materials and Methods
Figs. S1 and S2
Tables S1 to S6
References
9 March 2007; accepted 20 April 2007
Published online 26 April 2007;
10.1126/science.1142358
Include this information when citing this paper.
Replication of Genome-Wide
Association Signals in UK Samples
Reveals Risk Loci for Type 2 Diabetes
Eleftheria Zeggini,1,2* Michael N. Weedon,3,4* Cecilia M. Lindgren,1,2* Timothy M. Frayling,3,4* Katherine S. Elliott,2 Hana Lango,3,4 Nicholas J. Timpson,2,5 John R. B. Perry,3,4 Nigel W. Rayner,1,2 Rachel M. Freathy,3,4 Jeffrey C. Barrett,2 Beverley Shields,4 Andrew P. Morris,2 Sian Ellard,4,6 Christopher J. Groves,1 Lorna W. Harries,4 Jonathan L. Marchini,7 Katharine R. Owen,1 Beatrice Knight,4 Lon R. Cardon,2 Mark Walker,8 Graham A. Hitman,9 Andrew D. Morris,10 Alex S. F. Doney,10 The Wellcome Trust Case Control Consortium (WTCCC),† Mark I. McCarthy,1,2‡§ Andrew T. Hattersley3,4‡
The molecular mechanisms involved in the development of type 2 diabetes are poorly
understood. Starting from genome-wide genotype data for 1924 diabetic cases and 2938
population controls generated by the Wellcome Trust Case Control Consortium, we set out to detect
replicated diabetes association signals through analysis of 3757 additional cases and 5346 controls
and by integration of our findings with equivalent data from other international consortia. We
detected diabetes susceptibility loci in and around the genes CDKAL1, CDKN2A/CDKN2B, and
IGF2BP2 and confirmed the recently described associations at HHEX/IDE and SLC30A8. Our findings
provide insight into the genetic architecture of type 2 diabetes, emphasizing the contribution of
Here, we describe how integration of data
from the WTCCC scan and our own replication
studies with similar information generated by the
Diabetes Genetics Initiative (DGI) (6) and the
Finland–United States Investigation of NIDDM
Genetics (FUSION) (7) has identified several
additional susceptibility variants for T2D.
In the WTCCC study, analysis of 490,032
autosomal SNPs in 16,179 samples yielded
459,448 SNPs that passed initial quality control
(5). We considered only the 393,453 autosomal
SNPs with minor allele frequency (MAF) ex-
ceeding 1% in both cases and controls and no
extreme departure from Hardy-Weinberg equilibrium
(P < 10⁻⁴ in cases or controls) (8). This
T2D-specific data set shows no evidence of sub-
stantial confounding from population substruc-
ture and genotyping biases (8).
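The SNP filter described in this paragraph (MAF above 1% and no extreme Hardy-Weinberg departure, P < 10⁻⁴, in cases or controls) can be sketched as below. This is a minimal illustration, not the WTCCC pipeline: the function names and the genotype counts are hypothetical, and a 1-df chi-square goodness-of-fit test is used as an assumed stand-in for whatever HWE test the authors ran (exact tests are often preferred for rare alleles).

```python
import math

def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)

def hwe_p_value(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(case_counts, control_counts, min_maf=0.01, hwe_cutoff=1e-4):
    """Keep a SNP only if both groups pass the MAF and HWE thresholds."""
    for counts in (case_counts, control_counts):
        if maf(*counts) <= min_maf:
            return False
        if hwe_p_value(*counts) < hwe_cutoff:
            return False
    return True

# Hypothetical genotype counts (AA, AB, BB) for three SNPs.
print(passes_qc((250, 500, 250), (260, 480, 260)))  # common SNP, near HWE
print(passes_qc((400, 200, 400), (250, 500, 250)))  # heterozygote deficit in cases
print(passes_qc((0, 10, 990), (0, 12, 988)))        # MAF below 1%
```

A strong HWE departure in controls usually points to a genotyping artifact, which is why it is applied as a quality filter before any association test.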
To distinguish true associations from those
reflecting fluctuations under the null or residual
errors arising from aberrant allele calling, we first
submitted putative signals from the WTCCC study
to additional quality control, including cluster-
plot visualization and validation genotyping on
REPORTS
ARTICLES
A genome-wide association study
identifies novel risk loci for type 2 diabetes
Robert Sladek1,2,4, Ghislain Rocheleau1*, Johan Rung4*, Christian Dina5*, Lishuang Shen1, David Serre1, Philippe Boutin5, Daniel Vincent4, Alexandre Belisle4, Samy Hadjadj6, Beverley Balkau7, Barbara Heude7, Guillaume Charpentier8, Thomas J. Hudson4,9, Alexandre Montpetit4, Alexey V. Pshezhetsky10, Marc Prentki10,11, Barry I. Posner2,12, David J. Balding13, David Meyre5, Constantin Polychronakos1,3 & Philippe Froguel5,14
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of
which were hitherto unknown. A systematic search for these variants was recently made possible by the development of
high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935
single-nucleotide polymorphisms in a French case–control cohort. Markers with the most significant difference in genotype
frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified
four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2
gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in
insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell
development or function (IDE–KIF11–HHEX and EXT2–ALX4). These associations explain a substantial portion of disease risk
and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.
The rapidly increasing prevalence of type 2 diabetes mellitus (T2DM) is
thought to be due to environmental factors, such as increased availabil-
ity of food and decreased opportunity and motivation for physical
activity, acting on genetically susceptible individuals. The heritability
of T2DM is one of the best established among common diseases and,
consequently, genetic risk factors for T2DM have been the subject of
intense research¹. Although the genetic causes of many monogenic
forms of diabetes (maturity onset diabetes in the young, neonatal mitochondrial
and other syndromic types of diabetes mellitus) have been
elucidated, few variants leading to common T2DM have been clearly
identified and individually confer only a small risk (odds ratio ≈ 1.1–1.25)
of developing T2DM¹. Linkage studies have reported many
T2DM-linked chromosomal regions and have identified putative, cau-
sative genetic variants in CAPN10 (ref. 2), ENPP1 (ref. 3), HNF4A (refs
genotypes for 392,935 single-nucleotide polymorphisms (SNPs) in
1,363 T2DM cases and controls (Supplementary Table 1). In order to
enrich for risk alleles²¹, the diabetic subjects studied in stage 1 were
selected to have at least one affected first degree relative and age at
onset under 45 yr (excluding patients with maturity onset diabetes in
the young). Furthermore, in order to decrease phenotypic heterogeneity
and to enrich for variants determining insulin resistance and
β-cell dysfunction through mechanisms other than severe obesity, we
initially studied diabetic patients with a body mass index (BMI)
< 30 kg m⁻². Control subjects were selected to have fasting blood
glucose < 5.7 mmol l⁻¹ in DESIR, a large prospective cohort for the
study of insulin resistance in French subjects²².
Genotypes for each study subject were obtained using two plat-
Sladek, 2007
How many SNPs (p-value?)
European-based; N ~ 1000
cases: high fasting blood glucose/non-obese
controls: non-obese
Human Hap300 chip, showing no T2DM association in stage 1
(P > 0.01) and separated by at least 100 kb. Using the first principal
component as a covariate for ancestry differences between cases and
controls, we tested for association between rs932206 and disease
status. Our result suggests that this apparent association is largely
BMI on the association between marker and disease, as it is asymp-
totically equivalent to the Armitage trend test used to detect asso-
ciation in stages 1 and 2. None of the associations (Supplementary
Table 7) was substantially changed by considering the effects of these
covariates.
[Figure 1: one panel per chromosome (1–22 and X); x axis: chromosome position from pter; y axis: −log10[pMAX], ticks at 1, 3, 5 (5, 10, 15 on the rescaled panel)]
Figure 1 | Graphical summary of stage 1 association results. T2DM
association was determined for SNPs on the Human1 and Hap300 chips. The
x axis represents the chromosome position from pter; the y axis shows
−log10[pMAX], the P-value obtained by the MAX statistic, for each SNP
(note the different scale on the y axis of the chromosome 10 plot). SNPs that
passed the cutoff for a fast-tracked second stage are highlighted in red.
Nature ©2007 Nature Publishing Group
Sladek, 2007
Identification of four novel T2DM loci
Our fast-track stage 2 genotyping confirmed the reported association
for rs7903146 (TCF7L2) on chromosome 10, and in addition iden-
tified significant associations for seven SNPs representing four new
T2DM loci (Table 1). In all cases, the strongest association for the
MAX statistic (see Methods) was obtained with the additive model.
The most significant of these corresponds to rs13266634, a non-
synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkage
disequilibrium block on chromosome 8, containing only the 3′ end
of this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressed
solely in the secretory vesicles of β-cells and is thus implicated in the
final stages of insulin biosynthesis, which involve co-crystallization
Table 1 | Confirmed association results
SNP | Chromosome | Position (nucleotides) | Risk allele | Major allele | MAF (case) | MAF (ctrl) | Odds ratio (het) | Odds ratio (hom) | PAR | λs | Stage 2 pMAX | Stage 2 pMAX (perm) | Stage 1 pMAX | Stage 1 pMAX (perm) | Nearest gene
rs7903146 | 10 | 114,748,339 | T | C | 0.406 | 0.293 | 1.65 ± 0.19 | 2.77 ± 0.50 | 0.28 | 1.0546 | 1.5 × 10⁻³⁴ | <1.0 × 10⁻⁷ | 3.2 × 10⁻¹⁷ | <3.3 × 10⁻¹⁰ | TCF7L2
rs13266634 | 8 | 118,253,964 | C | C | 0.254 | 0.301 | 1.18 ± 0.25 | 1.53 ± 0.31 | 0.24 | 1.0089 | 6.1 × 10⁻⁸ | 5.0 × 10⁻⁷ | 2.1 × 10⁻⁵ | 1.8 × 10⁻⁵ | SLC30A8
rs1111875 | 10 | 94,452,862 | G | G | 0.358 | 0.402 | 1.19 ± 0.19 | 1.44 ± 0.24 | 0.19 | 1.0069 | 3.0 × 10⁻⁶ | 7.4 × 10⁻⁶ | 9.1 × 10⁻⁶ | 7.3 × 10⁻⁶ | HHEX
rs7923837 | 10 | 94,471,897 | G | G | 0.335 | 0.377 | 1.22 ± 0.21 | 1.45 ± 0.25 | 0.20 | 1.0065 | 7.5 × 10⁻⁶ | 2.2 × 10⁻⁵ | 3.4 × 10⁻⁶ | 2.5 × 10⁻⁶ | HHEX
rs7480010 | 11 | 42,203,294 | G | A | 0.336 | 0.301 | 1.14 ± 0.13 | 1.40 ± 0.25 | 0.08 | 1.0041 | 1.1 × 10⁻⁴ | 2.9 × 10⁻⁴ | 1.5 × 10⁻⁵ | 1.2 × 10⁻⁵ | LOC387761
rs3740878 | 11 | 44,214,378 | A | A | 0.240 | 0.272 | 1.26 ± 0.29 | 1.46 ± 0.33 | 0.24 | 1.0046 | 1.2 × 10⁻⁴ | 2.8 × 10⁻⁴ | 1.8 × 10⁻⁵ | 1.3 × 10⁻⁵ | EXT2
rs11037909 | 11 | 44,212,190 | T | T | 0.240 | 0.271 | 1.27 ± 0.30 | 1.47 ± 0.33 | 0.25 | 1.0045 | 1.8 × 10⁻⁴ | 4.5 × 10⁻⁴ | 1.8 × 10⁻⁵ | 1.3 × 10⁻⁵ | EXT2
rs1113132 | 11 | 44,209,979 | C | C | 0.237 | 0.267 | 1.15 ± 0.27 | 1.36 ± 0.31 | 0.19 | 1.0044 | 3.3 × 10⁻⁴ | 8.1 × 10⁻⁴ | 3.7 × 10⁻⁵ | 2.9 × 10⁻⁵ | EXT2
Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allele
frequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (λs) were estimated using
stage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higher
frequency in controls; pMAX, P-value of the MAX statistic from the χ² distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX and
pMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.
[Figure 2, panels a and b: regional plots; y axis: −log10[P], ticks at 0, 2, 4; gene labels: SLC30A8, IDE, KIF11, HHEX]
NATURE | Vol 445 | 22 February 2007 ARTICLES
Sladek, 2007
[Cropped detail of Figure 1: −log10[pMAX], the P-value obtained by the MAX statistic, for each SNP, by chromosome]
How would you interpret the p-values?
Odds ratios?
Confirmed 8 SNPs with N ~ 1000
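One way to approach the two questions on this slide is to recompute an odds ratio and a multiple-testing threshold by hand. The sketch below uses the case and control allele frequencies reported for rs7903146 (TCF7L2) in Table 1 and the 392,935 SNPs tested in stage 1. The sample sizes are hypothetical, and the allelic chi-square test is a simpler stand-in for the MAX statistic used in the paper, so the p-value it prints is illustrative only.

```python
import math

# Allele frequencies reported for rs7903146 (TCF7L2) in Table 1.
FREQ_CASE, FREQ_CTRL = 0.406, 0.293
# Hypothetical sample sizes, chosen only for illustration.
N_CASE, N_CTRL = 1000, 1000
# Number of SNPs tested in stage 1.
N_SNPS = 392_935

def allelic_odds_ratio(freq_case, freq_ctrl):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    return (freq_case / (1 - freq_case)) / (freq_ctrl / (1 - freq_ctrl))

def allelic_chi2(freq_case, freq_ctrl, n_case, n_ctrl):
    """1-df chi-square test on the 2x2 table of allele counts."""
    a = 2 * n_case * freq_case        # risk alleles in cases
    b = 2 * n_case * (1 - freq_case)  # other alleles in cases
    c = 2 * n_ctrl * freq_ctrl        # risk alleles in controls
    d = 2 * n_ctrl * (1 - freq_ctrl)  # other alleles in controls
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 degree of freedom
    return chi2, p

odds_ratio = allelic_odds_ratio(FREQ_CASE, FREQ_CTRL)
chi2, p_value = allelic_chi2(FREQ_CASE, FREQ_CTRL, N_CASE, N_CTRL)
bonferroni = 0.05 / N_SNPS  # per-SNP threshold for a 5% family-wise error rate

print(f"allelic odds ratio  = {odds_ratio:.2f}")
print(f"chi-square (1 df)   = {chi2:.1f}, p = {p_value:.1e}")
print(f"Bonferroni cutoff   = {bonferroni:.1e}")
print("passes cutoff" if p_value < bonferroni else "does not pass cutoff")
```

An odds ratio above 1 means the allele is more common in cases than in controls. The p-value is the probability of a statistic at least this extreme when there is no association, which is why it has to be judged against a cutoff that accounts for the hundreds of thousands of SNPs tested.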
Scaling up discovery by combining populations:
meta-analyses
g the Diabetes Genetics
nvestigation of NIDDM
nd (iv) the Framingham
omponent studies (n =
ry Table 1 online.
aring, the four consortia
n 10 and 20 SNPs promi-
their individual, interim,
mentary Table 2 online).
oci with consistent effects
dies. Two of these repre-
6PC2 and GCK. In addi-
nerated evidence for an
NPs around the MTNR1B
rs1387153, P = 2.2 ×
10⁻¹¹; DFS: rs10830963,
5.8 × 10⁻⁴, for the most
ch analysis). The associa-
d on formal meta-analysis
r exclusion of individuals
= 1.1 × 10⁻⁵⁷; rs4607517
NR1B), P = 3.2 × 10⁻⁵⁰;
pplementary Table 3 and
ent efforts to harmonize
(including the additional
data from the WTCCC, DGI and FUSION scans)10 (Supplementary
Note). We found strong evidence that the minor G allele of
rs10830963 was associated with increased risk of T2D (odds ratio =
1.09 (1.05–1.12), P = 3.3 × 10⁻⁷; Fig. 2 and Supplementary Table 6
online). The possibility that the fasting glucose association might
[Forest plot: odds ratio per study; x axis from 0.722 to 1.39, null at 1]
Study ID | OR (95% CI) | Weight (%)
DGI | 1.12 (0.96, 1.30) | 4.61
FUSION | 1.20 (1.03, 1.39) | 4.89
WTCCC | 1.07 (0.95, 1.20) | 8.03
deCODE | 1.14 (1.03, 1.27) | 9.58
KORA | 1.00 (0.84, 1.19) | 3.53
Rotterdam | 1.17 (1.04, 1.30) | 8.75
CCC | 1.07 (0.88, 1.31) | 2.69
ADDITION/ELY | 1.16 (1.02, 1.33) | 6.04
Norfolk | 1.00 (0.90, 1.10) | 10.56
UKT2DGC | 1.03 (0.96, 1.10) | 23.18
OxGN/58BC | 0.91 (0.75, 1.10) | 2.85
FUSION Stage 2 | 1.15 (1.02, 1.30) | 7.41
METSIM | 1.16 (1.03, 1.30) | 7.90
Overall (I² = 26.6%, P = 0.176) | 1.09 (1.05, 1.12) | 100.00
Meta-analysis P value = 3.3 × 10⁻⁷
Figure 2 Association of rs10830963 with type 2 diabetes (T2D) in 13 case-control studies.
VOLUME 41 | NUMBER 1 | JANUARY 2009 NATURE GENETICS
Meta-analysis of SNP rs10830963:
Combining findings from multiple cohorts
Prokopenko, 2009
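The pooled estimate in the forest plot can be approximated with a fixed-effect, inverse-variance meta-analysis. The sketch below is an assumption-laden illustration rather than the authors' procedure: the per-study odds ratios and 95% confidence intervals are the rounded values read off the plot, paired with study names in the order listed, and standard errors are back-calculated from the interval widths. Because of that rounding, the p-value it prints will not match the published one exactly.

```python
import math

# (study, OR, CI low, CI high), read off the forest plot in listed order.
STUDIES = [
    ("DGI", 1.12, 0.96, 1.30), ("FUSION", 1.20, 1.03, 1.39),
    ("WTCCC", 1.07, 0.95, 1.20), ("deCODE", 1.14, 1.03, 1.27),
    ("KORA", 1.00, 0.84, 1.19), ("Rotterdam", 1.17, 1.04, 1.30),
    ("CCC", 1.07, 0.88, 1.31), ("ADDITION/ELY", 1.16, 1.02, 1.33),
    ("Norfolk", 1.00, 0.90, 1.10), ("UKT2DGC", 1.03, 0.96, 1.10),
    ("OxGN/58BC", 0.91, 0.75, 1.10), ("FUSION Stage 2", 1.15, 1.02, 1.30),
    ("METSIM", 1.16, 1.03, 1.30),
]

def fixed_effect_meta(studies, z_crit=1.96):
    """Inverse-variance fixed-effect pooling on the log odds ratio scale."""
    betas, weights = [], []
    for _name, odds_ratio, low, high in studies:
        se = (math.log(high) - math.log(low)) / (2 * z_crit)  # SE from 95% CI
        betas.append(math.log(odds_ratio))
        weights.append(1 / se ** 2)
    w_sum = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se_pooled = math.sqrt(1 / w_sum)
    z = pooled / se_pooled
    # Cochran's Q and I-squared describe between-study heterogeneity.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(studies) - 1
    return {
        "or": math.exp(pooled),
        "ci_low": math.exp(pooled - z_crit * se_pooled),
        "ci_high": math.exp(pooled + z_crit * se_pooled),
        "p": math.erfc(abs(z) / math.sqrt(2)),  # two-sided normal p-value
        "i2": max(0.0, (q - df) / q) if q > 0 else 0.0,
    }

result = fixed_effect_meta(STUDIES)
print(f"pooled OR = {result['or']:.2f} "
      f"({result['ci_low']:.2f}, {result['ci_high']:.2f})")
print(f"p = {result['p']:.1e}, I^2 = {100 * result['i2']:.1f}%")
```

Each study is weighted by the inverse of its variance, so the largest cohort dominates the pooled estimate. No single study here is individually convincing, yet the combined interval excludes 1.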
ARTICLES
By combining genome-wide association data from 8,130 individuals with type 2 diabetes (T2D) and 38,987 controls of
European descent and following up previously unidentified meta-analysis signals in a further 34,412 cases and 59,925 controls,
we identified 12 new T2D association signals with combined P < 5 × 10⁻⁸. These include a second independent signal at the
KCNQ1 locus; the first report, to our knowledge, of an X-chromosomal association (near DUSP9); and a further instance of
overlap between loci implicated in monogenic and multifactorial forms of diabetes (at HNF1A). The identified loci affect both
beta-cell function and insulin action, and, overall, T2D association signals show evidence of enrichment for genes involved in
cell cycle regulation. We also show that a high proportion of T2D susceptibility loci harbor independent association signals
influencing apparently unrelated complex traits.
Type 2 diabetes (T2D) is characterized by insulin resistance and
deficient beta-cell function1. The escalating prevalence of T2D and
the limitations of currently available preventative and therapeutic
options highlight the need for a more complete understanding of
T2D pathogenesis. To date, approximately 25 genome-wide significant
common variant associations with T2D have been described, mostly
through genome-wide association (GWA) analyses2–13. The identities
of the variants and genes mediating the susceptibility effects at most
of these signals have yet to be established, and the known variants
account for less than 10% of the overall estimated genetic contribution
to T2D predisposition. Although some of the unexplained heritability
will reflect variants poorly captured by existing GWA platforms, we
reasoned that an expanded meta-analysis of existing GWA data would
the inverse-variance method (Online Methods, Fig. 1, Supplementary
Tables 1 and 2 and Supplementary Note). We observed only modest
genomic control inflation (λgc = 1.07), suggesting that the observed
results were not due to population stratification. After removing SNPs
within established T2D loci (Supplementary Table 3), the result-
ing quantile-quantile plot was consistent with a modest excess of
disease associations of relatively small effect (Supplementary Note).
Weak evidence for association at HLA variants strongly associated
with autoimmune forms of diabetes (Supplementary Table 3 and
Supplementary Note) suggested some case admixture involving
subjects with type 1 diabetes or latent autoimmune diabetes of adult-
hood; however, failure to detect T2D associations at other non-HLA
type 1 diabetes susceptibility loci (for example, INS, PTPN22 and
Twelve type 2 diabetes susceptibility loci identified
through large-scale association analysis
Voight, 2010
Meta-analyses for T2D:
N>40K and 90K identifies >30 loci among 2,400,000 SNPs
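The genomic control inflation factor quoted in these papers (λ of about 1.05 to 1.07) is commonly computed as the median observed 1-df chi-square statistic divided by its expected median under the null, about 0.455. The sketch below shows that calculation on simulated p-values. The function name, the seed, and the simulated data are all hypothetical, and this is an assumed textbook definition rather than the exact procedure used in any of the studies above.

```python
import random
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(p_values):
    """Lambda: median observed 1-df chi-square over its null expectation."""
    normal = NormalDist()
    # Convert each two-sided p-value (0 < p <= 1) back to a chi-square statistic.
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Simulated null scan: uniform p-values in (0, 1], so lambda should be near 1.
rng = random.Random(701)
null_p = [1 - rng.random() for _ in range(20_000)]
lambda_null = genomic_inflation(null_p)

# Same scan with every test statistic inflated by 7%, as stratification might do.
normal = NormalDist()
inflated_p = [2 * normal.cdf(-abs(normal.inv_cdf(p / 2)) * 1.07 ** 0.5)
              for p in null_p]
lambda_inflated = genomic_inflation(inflated_p)

print(f"lambda (null)     = {lambda_null:.3f}")
print(f"lambda (inflated) = {lambda_inflated:.3f}")
```

The median is used because a handful of true associations barely move it, so λ reflects genome-wide bias such as population stratification and not real signal. Dividing every chi-square statistic by λ is the usual genomic control correction.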
ARTICLES
13 autosomal loci exceeded the threshold for genome-wide significance
(P ranging from 2.8 × 10⁻⁸ to 1.4 × 10⁻²²) with allele-specific odds
(r² < 0.05), and conditional analyses (see below) establish these SNPs
as independent (Fig. 2 and Supplementary Table 4). Further analysis
[Figure 1: two stacked Manhattan plots; x axis: chromosome (1–22, X); y axis: −log10(P), ticks 0–50 (unconditional analysis) and 0–10 (conditional analysis)]
Legend: locus established previously; locus identified by current study; locus not confirmed by current study; suggestive statistical association (P < 1 × 10⁻⁵); association in identified or established region (P < 1 × 10⁻⁴).
Labelled loci: BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, PPARγ.
Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-analysis.
Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those
taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and
should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously
established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered
conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10⁻⁵), whereas
secondary signals close to already confirmed T2D loci are shown in purple (P < 10⁻⁴).
Meta-analyses for T2D:
N>40K and 90K identifies >30 loci among 2,400,000 SNPs
[Figure: regional association plot, "SLC30A8 Region"; index SNP: rs3802177; left y axis: −log10(P-value), 0–10; right y axis: recombination rate (cM/Mb), 0–100; neighboring gene label: PGCP]
●●
●
●
●●
●
●
●
●●●●● ●●
●
●
●
●
●
●
●
●●●●
●
●●●
●
● ●●
●
●
●●
●
●
●
●● ●
●●
●●● ●
● ●
●
●●●
●●
●
●●
●
●
●
●
●
● ●●
●
●
● ● ●
●
●
●
●●
●
●
●
● ● ●●
●
● ● ●
●
●
●●●●
● ●
●
● ●
●
●
● ●● ● ●● ●
●
●
●
●
●
●●
●
●
●
● ●
●● ●●●●
●●
●
●
●● ●
●
●●
● ●
●
●
●
●
●● ●●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●●
●●
●
●
●
rs3802177 stage 1
● r^2: 0.8 − 1.0
● r^2: 0.6 − 0.8
● r^2: 0.4 − 0.6
● r^2: 0.2 − 0.4
● r^2: 0.0 − 0.2
● r^2 missing
<− TRPS1
<− EIF3H
UTP23 −>
<− RAD21
LOC441376 −>
SLC30A8 −>
MED30 −>
<− EXT1
<− SAMD12
<− TNFRSF11
COLEC1
117 118 119 120
Position on chromosome 8 (Mb)
CDKN2A/B Region
[Figure: regional association plot, rs10965250 (stage 1). y-axes: −log10(P−value), 0 to 10, and recombination rate (cM/Mb), 0 to 100. x-axis: position on chromosome 9 (Mb), 21 to 24. Points are colored by LD with the lead SNP (r^2 bins: 0.8 − 1.0, 0.6 − 0.8, 0.4 − 0.6, 0.2 − 0.4, 0.0 − 0.2, missing). Genes labeled: MLLT3, KIAA1797, PTPLAD2, IFNB1, IFNW1, IFNA21, IFNA4, IFNA7, IFNA13, MTAP, CDKN2A, CDKN2B, DMRTA1, ELAVL2.]
CDC123/CAMK1D Region
[Figure: regional association plot, rs12779790 (stage 1). y-axes: −log10(P−value) and recombination rate. Points are colored by LD with the lead SNP (r^2).]
HHEX/IDE Region
[Figure: regional association plot, rs5015480 (stage 1). y-axes: −log10(P−value) and recombination rate. Points are colored by LD with the lead SNP (r^2).]
Not in a gene... In a gene...
~90% of GWAS hits are non-coding!
Stamatoyannopoulos, Science 2012
Systematic Localization of Common
Disease-Associated Variation in
Regulatory DNA
Matthew T. Maurano,1* Richard Humbert,1* Eric Rynes,1* Robert E. Thurman,1 Eric Haugen,1 Hao Wang,1 Alex P. Reynolds,1 Richard Sandstrom,1 Hongzhu Qu,1,2 Jennifer Brody,3 Anthony Shafer,1 Fidencio Neri,1 Kristen Lee,1 Tanya Kutyavin,1 Sandra Stehling-Sun,1 Audra K. Johnson,1 Theresa K. Canfield,1 Erika Giste,1 Morgan Diegel,1 Daniel Bates,1 R. Scott Hansen,4 Shane Neph,1 Peter J. Sabo,1 Shelly Heimfeld,5 Antony Raubitschek,6 Steven Ziegler,6 Chris Cotsapas,7,8 Nona Sotoodehnia,3,9 Ian Glass,10 Shamil R. Sunyaev,11 Rajinder Kaul,4 John A. Stamatoyannopoulos1,12†
Genome-wide association studies have identified many noncoding variants associated with common
diseases and traits. We show that these variants are concentrated in regulatory DNA marked by
deoxyribonuclease I (DNase I) hypersensitive sites (DHSs). Eighty-eight percent of such DHSs are active
during fetal development and are enriched in variants associated with gestational exposure–related
phenotypes. We identified distant gene targets for hundreds of variant-containing DHSs that may explain
phenotype associations. Disease-associated variants systematically perturb transcription factor recognition
sequences, frequently alter allelic chromatin states, and form regulatory networks. We also demonstrated
tissue-selective enrichment of more weakly disease-associated variants within DHSs and the de novo
identification of pathogenic cell types for Crohn’s disease, multiple sclerosis, and an electrocardiogram
trait, without prior knowledge of physiological mechanisms. Our results suggest pervasive involvement of
regulatory DNA variation in common human disease and provide pathogenic insights into diverse disorders.
Disease- and trait-associated genetic variants
are rapidly being identified with genome-
wide association studies (GWAS) and re-
lated strategies (1). To date, hundreds of GWAS
have been conducted, spanning diverse diseases
and quantitative phenotypes (2) (fig. S1A). How-
ever, the majority (~93%) of disease- and trait-
associated variants emerging from these studies
lie within noncoding sequence (fig. S1B), com-
plicating their functional evaluation. Several lines
of evidence suggest the involvement of a propor-
tion of such variants in transcriptional regulatory
mechanisms, including modulation of promoter
and enhancer elements (3–6) and enrichment with-
in expression quantitative trait loci (eQTL) (3, 7, 8).
Human regulatory DNA encompasses a vari-
ety of cis-regulatory elements within which the co-
operative binding of transcription factors creates
focal alterations in chromatin structure. Deoxy-
ribonuclease I (DNase I) hypersensitive sites (DHSs)
are sensitive and precise markers of this actuated
regulatory DNA, and DNase I mapping has been
instrumental in the discovery and census of hu-
man cis-regulatory elements (9). We performed
DNase I mapping genome-wide (10) in 349 cell
and tissue samples, including 85 cell types studied
under the ENCODE Project (10) and 264 sam-
ples studied under the Roadmap Epigenomics
Program (11). These encompass several classes
nome. In total, we identified 3,899,693 distinct
DHS positions along the genome (collectively
spanning 42.2%), each of which was detected in
one or more cell or tissue types (median = 5).
Disease- and trait-associated variants are
concentrated in regulatory DNA. We examined
the distribution of 5654 noncoding genome-wide
significant associations [5134 unique single-
nucleotide polymorphisms (SNPs); fig. S1 and
table S2] for 207 diseases and 447 quantitative
traits (2) with the deep genome-scale maps of
regulatory DNA marked by DHSs. This revealed
a collective 40% enrichment of GWAS SNPs in
DHSs (fig. S1C, P < 10^−55, binomial, compared to
the distribution of HapMap SNPs). Fully 76.6%
of all noncoding GWAS SNPs either lie within a
DHS (57.1%, 2931 SNPs) or are in complete
linkage disequilibrium (LD) with SNPs in a near-
by DHS (19.5%, 999 SNPs) (Fig. 1A) (12). To con-
firm this enrichment, we sampled variants from
the 1000 Genomes Project (13) with the same ge-
nomic feature localization (intronic versus inter-
genic), distance from the nearest transcriptional
start site, and allele frequency in individuals of
European ancestry. We confirmed significant enrichment both for SNPs within DHSs (P < 10^−59, simulation) and also including variants in complete LD (r^2 = 1) with SNPs in DHSs (P < 10^−37, simulation) (fig. S2).
In total, 47.5% of GWAS SNPs fall within
gene bodies (fig. S1B); however, only 10.9% of
intronic GWAS SNPs within DHSs are in strong
LD (r^2 ≥ 0.8) with a coding SNP, indicating that
the vast majority of noncoding genic variants
are not simply tagging coding sequence. Analo-
gously, only 16.3% of GWAS variants within
coding sequences are in strong LD with variants in
DHSs. SNPs on widely used genotyping arrays
(e.g., Affymetrix) were modestly enriched with-
in DHSs (fig. S2), possibly due to selection of
SNPs with robust experimental performance in
genotyping assays. However, we found no evi-
dence for sequence composition bias (table S3).
To further examine the enrichment of GWAS
SNPs in regulatory DNA, we systematically clas-
sified all noncoding GWAS SNPs by the quality
1Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA. 2Laboratory of Disease Genomics
RESEARCH ARTICLE
There have been few, if any, similar bursts of discovery in the
history of medical research.
David Hunter and Peter Kraft, NEJM, 2007
Common claims discussed with regard to GWAS:
Despite issues, yielded many discoveries vs. cost
to a doubling of the number of associated variants discov-
ered. The proportion of genetic variation explained by
significantly associated SNPs is usually low (typically less
than 10%) for many complex traits, but for diseases such
as CD and multiple sclerosis (MS [MIM 126200]), and for
quantitative traits such as height and lipid traits, between
Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10^−8 are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r^2 > 0.8 estimated from the entire HapMap samples are excluded.
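The p < 5 × 10^−8 cutoff in the caption above is the conventional genome-wide significance threshold. One common way to motivate it, sketched here as an assumption rather than as the catalog's own derivation, is a Bonferroni correction of a 0.05 family-wise error rate over roughly one million independent common-variant tests. The function name and the test count are illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumption: ~1,000,000 effectively independent tests across the genome.
genome_wide_threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"{genome_wide_threshold:.1e}")
```

The same function gives a much looser threshold for a study testing only a few hundred factors, which is why the number of tests must be stated alongside any "significant" finding.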
~500,000 SNP chips × ~$500/chip = $250M
Five years of GWAS Discovery (Visscher, 2012)
$250M / ~2000 loci = $125K/locus
Candidate genes: >$250M!
100 NIH R01s

Fighter jet

Hadron Collider: $9B
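The back-of-the-envelope GWAS cost estimate above can be reproduced in a few lines. This is only a check of the slide's arithmetic using its own approximate inputs; the variable names are illustrative.

```python
# Approximate inputs taken from the slide.
n_chips = 500_000          # SNP chips run across all GWAS
cost_per_chip = 500        # US dollars per chip
n_loci = 2_000             # associated loci discovered

total_cost = n_chips * cost_per_chip      # total spend in dollars
cost_per_locus = total_cost / n_loci      # dollars per discovered locus

print(f"total: ${total_cost / 1e6:.0f}M, per locus: ${cost_per_locus / 1e3:.0f}K")
```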
P = G + E
Phenotype (P): Type 2 Diabetes, Cancer, Alzheimer’s, Gene expression
Genome (G): Variants
Environment (E): Infectious agents, Nutrients, Pollutants, Drugs
Complex traits are a function of genes and
environment...
Nothing comparable to elucidate E influence!
We lack high-throughput methods
and data to discover new E in P…
E: ???
A similar paradigm for discovery should exist

for E!
Why?
$$\sigma^2_P = \sigma^2_G + \sigma^2_E$$
$$H^2 = \frac{\sigma^2_G}{\sigma^2_P}$$
Heritability (H2) is the proportion of phenotypic variance
attributed to genetic variance in a population
Indicator of the proportion of phenotypic
differences attributed to G.
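As a minimal sketch of the definition above, heritability can be computed directly from the two variance components, assuming the simple additive decomposition of phenotypic variance into a genetic and an environmental part. The function name and the example values are illustrative, not estimates for any real trait.

```python
def heritability(var_g: float, var_e: float) -> float:
    """H2 = var(G) / var(P), with var(P) = var(G) + var(E)."""
    var_p = var_g + var_e
    if var_p <= 0:
        raise ValueError("total phenotypic variance must be positive")
    return var_g / var_p


# Hypothetical variance components: one quarter genetic, three quarters environmental.
h2_example = heritability(var_g=0.25, var_e=0.75)
print(h2_example)
```

Under this decomposition, whatever fraction is not attributed to G is attributed to E, which is the motivation for the rest of the lecture.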
Height is an example of a heritable trait:

Francis Galton shows how it's done (1887)
“mid-height of 205 parents
described 60% of variability of 928
offspring”
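A Galton-style analysis can be sketched as a least-squares regression of offspring height on mid-parent height; in quantitative genetics the slope of that regression is commonly read as an estimate of (narrow-sense) heritability. The heights below are made-up illustrative numbers, not Galton's data, and the helper name is hypothetical.

```python
def regression_slope(x: list[float], y: list[float]) -> float:
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical heights in inches: offspring regress toward the mean of 68.
midparent = [64.0, 66.0, 68.0, 70.0, 72.0]
offspring = [65.6, 66.8, 68.0, 69.2, 70.4]

slope = regression_slope(midparent, offspring)
print(round(slope, 3))
```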
Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Polycystic ovary syndrome
Attention deficit hyperactivity disorder
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Bone mineral density
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Rheumatoid arthritis
Crohn's disease
Migraine
Thyroid cancer
Autism
Blood pressure, diastolic
Body mass index
Depression
Coronary artery disease
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Blood pressure, systolic
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
[Figure: bar chart of heritability, Var(G)/Var(Phenotype), on a 0 to 100 axis for the traits listed above. Source: SNPedia.com]
G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
Type 2 Diabetes (25%)
Heart Disease (25-30%)
Autism (50%???)
$\sigma^2_E$: Exposome!
Despite a century of research on complex traits in humans, the
relative importance and specific nature of the influences of
genes and environment on human traits remain controversial.
We report a meta-analysis of twin correlations and reported
variance components for 17,804 traits from 2,748 publications
including 14,558,903 partly dependent twin pairs, virtually
all published twin studies of complex traits. Estimates of
heritability cluster strongly within functional domains,
and across all traits the reported heritability is 49%. For a
majority (69%) of traits, the observed twin correlations are
consistent with a simple and parsimonious model where twin
resemblance is solely due to additive genetic variation. The
data are inconsistent with substantial influences from shared
environment or non-additive genetic variation. This study
provides the most comprehensive analysis of the causes of
individual differences in human traits thus far and will guide
future gene-mapping efforts. All the results can be visualized
using the MaTCH webtool.
Specifically, the partitioning of observed variability into underlying
genetic and environmental sources and the relative importance of
additive and non-additive genetic variation are continually debated1–5.
Recent results from large-scale genome-wide association studies
(GWAS) show that many genetic variants contribute to the variation
in complex traits and that effect sizes are typically small6,7. However,
the sum of the variance explained by the detected variants is much
smaller than the reported heritability of the trait4,6–10. This ‘missing
heritability’ has led some investigators to conclude that non-additive
variation must be important4,11. Although the presence of gene-gene
interaction has been demonstrated empirically5,12–17, little is known
about its relative contribution to observed variation18.
In this study, our aim is twofold. First, we analyze empirical esti-
mates of the relative contributions of genes and environment for
virtually all human traits investigated in the past 50 years. Second, we
assess empirical evidence for the presence and relative importance of
non-additive genetic influences on all human traits studied. We rely
on classical twin studies, as the twin design has been used widely
to disentangle the relative contributions of genes and environment,
across a variety of human traits. The classical twin design is based
on contrasting the trait resemblance of monozygotic and dizygotic
twin pairs. Monozygotic twins are genetically identical, and dizygotic
twins are genetically full siblings. We show that, for a majority of traits
(69%), the observed statistics are consistent with a simple and parsi-
monious model where the observed variation is solely due to additive
genetic variation. The data are inconsistent with a substantial influence
from shared environment or non-additive genetic variation. We also
show that estimates of heritability cluster strongly within functional
domains, and across all traits the reported heritability is 49%. Our
results are based on a meta-analysis of twin correlations and reported
variance components for 17,804 traits from 2,748 publications includ-
ing 14,558,903 partly dependent twin pairs, virtually all twin studies of
complex traits published between 1958 and 2012. This study provides
the most comprehensive analysis of the causes of individual differences
in human traits thus far and will guide future gene-mapping efforts. All
Meta-analysis of the heritability of human traits based on
fifty years of twin studies
Tinca J C Polderman1,10, Beben Benyamin2,10, Christiaan A de Leeuw1,3, Patrick F Sullivan4–6,
Arjen van Bochoven7, Peter M Visscher2,8,11 & Danielle Posthuma1,9,11
1Department of Complex Trait Genetics, VU University, Center for Neurogenomics
and Cognitive Research, Amsterdam, the Netherlands. 2Queensland Brain
Institute, University of Queensland, Brisbane, Queensland, Australia. 3Institute
for Computing and Information Sciences, Radboud University Nijmegen,
Nijmegen, the Netherlands. 4Center for Psychiatric Genomics, Department
of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA.
5Department of Psychiatry, University of North Carolina, Chapel Hill, North
Carolina, USA. 6Department of Medical Epidemiology and Biostatistics,
Karolinska Institutet, Stockholm, Sweden. 7Faculty of Sciences, VU University,
Insight into the nature of observed variation in human traits is impor-
tant in medicine, psychology, social sciences and evolutionary biology.
It has gained new relevance with both the ability to map genes for
human traits and the availability of large, collaborative data sets to do
so on an extensive and comprehensive scale. Individual differences in
human traits have been studied for more than a century, yet the causes
of variation in human traits remain uncertain and controversial.
Nature Genetics, 2015
17,804 traits of the phenome
2,748 publications

14,558,903 twin pairs
Average H2 (genome): 0.49
Exposome may play an equal role.
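The classical twin design quoted above contrasts monozygotic (MZ) and dizygotic (DZ) twin correlations. A minimal sketch of that contrast uses Falconer's formula, H2 ≈ 2(r_MZ − r_DZ), and the check that r_MZ is about twice r_DZ under a purely additive model. This is an illustration, not the meta-analysis pipeline of the paper; the correlations and the tolerance are hypothetical.

```python
def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate of heritability from twin correlations."""
    return 2.0 * (r_mz - r_dz)


def consistent_with_additive(r_mz: float, r_dz: float, tol: float = 0.05) -> bool:
    """Under a purely additive genetic model, r_MZ is expected to be about 2 * r_DZ."""
    return abs(r_mz - 2.0 * r_dz) <= tol


# Hypothetical twin correlations for one trait.
h2_twins = falconer_h2(r_mz=0.6, r_dz=0.3)
print(round(h2_twins, 2), consistent_with_additive(0.6, 0.3))
```

A DZ correlation well above half the MZ correlation would instead point toward shared environment, which the simple additive check flags as inconsistent.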
Explaining the other 50%:
A new data-driven paradigm for robust discovery of E
via EWAS and the exposome
what to measure? how to measure?
PERSPECTIVES
[Figure: the exposome.
External environment: RADIATION, DIET, POLLUTION, INFECTIONS, DRUGS, LIFE-STYLE, STRESS
Internal chemical environment: Xenobiotics, Inflammation, Preexisting disease, Lipid peroxidation, Oxidative stress, Gut flora
Exposome chemical classes: Reactive electrophiles, Metals, Endocrine disrupters, Immune modulators, Receptor-binding proteins]
itical entity for disease eti-
ogy (7). Recent discussion
as focused on whether and
ow to implement this vision
8). Although fully charac-
rizing human exposomes
daunting, strategies can be
eveloped for getting “snap-
hots” of critical portions of
person’s exposome during
ifferent stages of life. At
ne extreme is a “bottom-up”
rategy in which all chemi-
als in each external source
f a subject’s exposome are
easured at each time point.
lthough this approach would
ave the advantage of relat-
g important exposures to
e air, water, or diet, it would
quire enormous effort and
ould miss essential compo-
ents of the internal chemi-
al environment due to such
actors as gender, obesity,
flammation, and stress. By
ontrast, a “top-down” strat-
gy would measure all chem-
als (or products of their
ownstream processing or
ffects, so-called read-outs
r signatures) in a subject’s
ood. This would require
nly a single blood specimen
each time point and would relate directly ruptors and can be measured through serum
some (telomere) length in
peripheral blood mono-
nuclear cells responded
to chronic psychological
stress, possibly mediated
by the production of reac-
tive oxygen species (15).
Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.
With successful characterization of both
Characterizing the exposome. The exposome represents
the combined exposures from all sources that reach the
internal chemical environment. Toxicologically important
classes of exposome chemicals are shown. Signatures and
biomarkers can detect these agents in blood or serum.
“A more comprehensive view of
environmental exposure is
needed ... to discover major
causes of diseases...”
how to analyze in relation to health?
Wild, 2005

Rappaport and Smith, 2010, 2011

Buck-Louis and Sundaram 2012

Miller and Jones, 2014

Patel CJ and Ioannidis JPA, 2014
We still cannot “query” the environment like the genome...
Connecting Environmental Exposure with Disease:
Missing the “System” of Exposures?
E+ E-
diseased
non-diseased
?
Exposed to many things, but do not assess the multiplicity.
Fragmented literature of associations.
Challenge to discover E associated with disease.
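The 2×2 layout on this slide (exposed or unexposed, diseased or non-diseased) is the basic unit of an association test. A minimal sketch, with hypothetical counts, computes the odds ratio and the Pearson chi-square statistic for one exposure; the cell naming (a, b, c, d) is an assumption made for the example.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for a 2x2 table.

    a: exposed & diseased      b: unexposed & diseased
    c: exposed & non-diseased  d: unexposed & non-diseased
    """
    return (a * d) / (b * c)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts for a single exposure E.
table = (30, 70, 20, 80)
or_e = odds_ratio(*table)
chi2_e = chi_square_2x2(*table)
print(round(or_e, 3), round(chi2_e, 3))
```

Testing one exposure at a time this way is exactly what yields a fragmented literature; the point of the slide is that the same test should be run systematically across many exposures.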
e modelling
oblem is akin to – but less well
sed and more poorly understood than –
e testing. For example, consider the use
r regression to adjust the risk levels of
atments to the same background level
There can be many covariates, and
t of covariates can be in or out of the
With ten covariates, there are over 1000
models. Consider a maze as a metaphor
elling (Figure 3). The red line traces the
path out of the maze. The path through
ze looks simple, once it is known.
ways in the literature for dealing with model
selection, so we propose a new, composite
2. Publication bias
is general recognition that a paper
much better chance of acceptance if
hing new is found. This means that, for
ation, the claim in the paper has to
sed on a p-value less than 0.05. From
g’s point of view5
, this is quality by
tion. The journals are placing heavy
ce on a statistical test rather than
nation of the methods and steps that
o a conclusion. As to having a p-value
han 0.05, some might be tempted to
the system10
through multiple testing,
ple modelling or unfair treatment of
or some combination of the three that
to a small p-value. Researchers can be
creative in devising a plausible story to
statistical finding.
2 The data cleaning team creates a
modelling data set and a holdout set and
P < 0.05
Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are
included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific
term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one
can work towards a suitably small p-value. © ktsdesign – Fotolia
A maze of associations is one way to a fragmented
literature and Vibration of Effects
Young, 2011
univariate
sex
sex & age
sex & race
sex & race & age
JCE, 2015
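The adjustment sets listed above (univariate, sex, sex & age, and so on) are a few of the many possible model specifications. A short sketch enumerates every subset of a covariate list, which is where the "over 1000 models with ten covariates" figure comes from (2^10 = 1024). The covariate names in the second example are placeholders.

```python
from itertools import combinations


def adjustment_sets(covariates: list[str]):
    """Yield every subset of covariates, from the empty (univariate) model to the full model."""
    for size in range(len(covariates) + 1):
        for subset in combinations(covariates, size):
            yield subset


# Three covariates from the slide give 2**3 = 8 candidate models.
small = list(adjustment_sets(["sex", "age", "race"]))

# Ten placeholder covariates give 2**10 = 1024 candidate models.
ten = list(adjustment_sets([f"covariate_{i}" for i in range(10)]))
print(len(small), len(ten))
```

Refitting the same exposure-outcome association under each of these sets and looking at the spread of estimates is one way to quantify the vibration of effects.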
Example of fragmentation:
Is everything we eat associated with cancer?
Schoenfeld and Ioannidis, AJCN (2012)
50 random ingredients from
Boston Cooking School
Cookbook
Any associated with cancer?
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studie
outliers are not shown (effect estimates >10).
Of 50, 40 were studied in relation to cancer risk
Weak statistical evidence:

non-replicated

inconsistent effects

non-standardized
[Figure: genome-wide association (Manhattan) plots, panels a and b; axes: −log10(P) (0 to 15) and test statistic, by chromosome (1 to 22, X). Source: Nature, Vol 447, 7 June 2007]
Environment-Wide Association Studies (EWAS):
A GWAS-like study for the environment
What specific environmental “loci” are associated with disease?
Environmental Category
Vitamins: β-carotene
Metals: lead
Organophosphate Pesticides
Hydrocarbons: 2-hydroxyfluorene [factor]
case
control
... but there is no “microarray” for environmental exposure...
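The GWAS-like logic of an EWAS can be sketched as: test every measured exposure against the disease, then correct for the number of tests. The example below applies the Benjamini-Hochberg false discovery rate procedure to a dictionary of per-exposure p-values. The choice of correction, the exposure names, and the p-values are all assumptions made for illustration, not results from any study.

```python
def benjamini_hochberg(p_values: dict[str, float], q: float = 0.05) -> list[str]:
    """Return the exposures declared significant at FDR level q (step-up procedure)."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Keep the largest rank whose p-value falls under its BH threshold.
        if p <= q * rank / m:
            cutoff_rank = rank
    return [name for name, _ in ranked[:cutoff_rank]]


# Hypothetical p-values, one per environmental factor tested.
ewas_p = {
    "exposure_a": 0.001,
    "exposure_b": 0.012,
    "exposure_c": 0.040,
    "exposure_d": 0.600,
}
hits = benjamini_hochberg(ewas_p, q=0.05)
print(hits)
```

As in GWAS, findings that survive correction would still need replication in an independent cohort before being treated as discoveries.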
Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey (NHANES)
since the 1960s

now biennial: 1999 onwards

10,000 participants per survey

The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.
Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.
All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.
Survey Operations
Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).
An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.
In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.
NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.
Uses of the Data
Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.
Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey. NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.
• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision
1 http://www.cdc.gov/nchs/nhanes.htm
>250 exposures (serum + urine)

GWAS chip

>85 quantitative clinical traits
(e.g., serum glucose, lipids, BMI)

Death index linkage (cause of
death)
Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey
Nutrients and Vitamins

vitamin D, carotenes
Infectious Agents

hepatitis, HIV, Staph. aureus
Plastics and consumables

phthalates, bisphenol A
Physical Activity

steps
Pesticides and pollutants

atrazine; cadmium; hydrocarbons
Drugs

statins; aspirin
EWAS Approach for Discovery
bisphenol A
PCB199
β-carotene
cotinine
...
for each:
Environmental factors:
log transformed & z-standardized
reference groups “negative”
p-value(βfactor)
bisphenol A 0.8
PCB199 0.1
β-carotene 0.01
cotinine 0.03
... ...
Significance tests (p-values):
zfactor
disease
βfactor
Regression:
adjusted for other risk factors
age, sex, race, socioeconomic status, ...
Training Survey or Cohort:
Classify diseased/non-diseased participants:
E.g.: Diabetics and Non-diabetics
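The per-factor loop described above (log transform, z-standardize, regress disease status on the factor, read off a p-value) can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: it fits an unadjusted logistic regression by Newton-Raphson, so the adjustment covariates (age, sex, race, socioeconomic status) and NHANES survey weights are omitted, and the exposure names and values are made-up toy data.

```python
import math
import statistics

def log_z(values):
    """Log-transform, then z-standardize (mean 0, SD 1)."""
    logs = [math.log(v) for v in values]
    mu = statistics.fmean(logs)
    sd = statistics.stdev(logs)
    return [(x - mu) / sd for x in logs]

def logistic_beta(z, y, iters=25):
    """Fit logit P(y=1) = b0 + b1*z by Newton-Raphson.
    Returns (b1, standard error of b1); the SE uses the information
    matrix from the final iteration."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * zi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * zi     # score for the slope
            h00 += w
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b1, math.sqrt(h00 / det)

def ewas_scan(exposures, disease):
    """For each factor: odds ratio per 1 SD and two-sided Wald p-value."""
    results = {}
    for name, values in exposures.items():
        b1, se = logistic_beta(log_z(values), disease)
        p = math.erfc(abs(b1 / se) / math.sqrt(2.0))
        results[name] = (math.exp(b1), p)
    return results

# Toy data: 1 = diseased, 0 = non-diseased; exposure values are invented.
disease = [0, 0, 1, 0, 1, 0, 1, 1]
exposures = {
    "exposure_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "exposure_b": [3, 1, 2, 4, 1, 3, 2, 4],
}
for name, (odds_ratio, p) in ewas_scan(exposures, disease).items():
    print(name, round(odds_ratio, 2), round(p, 3))
```

Because every factor is on the same z scale, the odds ratios are comparable across factors, which is what makes a single ranked list of p-values meaningful.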
EWAS Approach for Discovery
Validation Survey or Cohort: p-value < 0.05 in test survey?
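The validation step can be sketched as a simple filter: keep a factor only if it passes the threshold in the training survey and also reaches p < 0.05 in the validation survey. The training p-values below are the ones shown on the slide; the validation p-values and the function name are hypothetical, and a factor with no validation result is treated as not replicated.

```python
def replicated(train_p, valid_p, alpha_train, alpha_valid=0.05):
    """Factors passing the training threshold that also reach
    p < alpha_valid in the validation survey."""
    return sorted(
        name for name, p in train_p.items()
        if p <= alpha_train and valid_p.get(name, 1.0) < alpha_valid
    )

train = {"bisphenol A": 0.8, "PCB199": 0.1, "β-carotene": 0.01, "cotinine": 0.03}
valid = {"β-carotene": 0.02, "cotinine": 0.30}   # made-up validation values
print(replicated(train, valid, alpha_train=0.05))  # ['β-carotene']
```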
False Discovery Rate Estimation:
The expected rate of false positives
FDR(α) = (# false positives ≤ α) / (# findings ≤ α)
# false positives (α)?
“Shuffle” (permute) disease and non-diseased
participants
cases controls
Re-run EWAS
Repeat many times
FDR (p-value)
bisphenol A 1
PCB199 0.4
β-carotene 0.1
cotinine 0.2
... ...
FDR(0.05) = (50 false positives ≤ 0.05) / (100 findings ≤ 0.05) = 0.5
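The permutation recipe above (shuffle the disease labels, re-run the EWAS, repeat, then divide the average number of null hits by the number of real findings) can be sketched as below. The function names are hypothetical, and `scan` stands for any caller-supplied routine that takes a list of case/control labels and returns one p-value per factor.

```python
import random

def count_hits(pvalues, alpha):
    """Number of p-values at or below the threshold."""
    return sum(1 for p in pvalues if p <= alpha)

def permuted_runs(scan, labels, n_perm, seed=0):
    """Shuffle case/control labels n_perm times and re-run `scan`
    on each shuffled copy; returns one list of p-values per run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        runs.append(scan(shuffled))
    return runs

def estimate_fdr(observed_p, null_runs, alpha):
    """Average number of null hits at alpha divided by the number
    of observed findings at alpha (capped at 1)."""
    findings = count_hits(observed_p, alpha)
    if findings == 0:
        return 0.0
    false_pos = sum(count_hits(run, alpha) for run in null_runs) / len(null_runs)
    return min(1.0, false_pos / findings)

# The slide's arithmetic: 100 findings, 50 expected false positives.
observed = [0.01] * 100 + [0.50] * 150
null_runs = [[0.01] * 50 + [0.50] * 200 for _ in range(3)]
print(estimate_fdr(observed, null_runs, alpha=0.05))  # 0.5
```

Shuffling labels keeps the correlation structure among exposures intact, which is one reason a permutation-based estimate may be preferred here over a formula that assumes independent tests.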
Which exposome factors are associated with Type 2 Diabetes?
PLoS ONE, 2010
Novel Findings:

heptachlor
epoxide

γ-tocopherol
Known
Associations:

β-carotene

vitamin D

Interesting Patterns:

pesticides, PCBs
[Figure: EWAS Manhattan plot for Type 2 Diabetes. y-axis: −log10(pvalue), ticks 0, 1, 2. x-axis: exposure categories (acrylamide; allergen test; bacterial infection; cotinine; dialkyl; dioxins; furans (dibenzofuran); heavy metals; hydrocarbons; latex; nutrients: carotenoid, minerals, vitamins A, B, C, D, E; PCBs; perchlorate; pesticides: atrazine, chlorophenol, organochlorine, organophosphate, pyrethroid; phenols; phthalates; phytoestrogens; polybrominated ethers; polyfluorochemicals; viral infection; volatile compounds). Cohort markers: 1999-2000, 2001-2002, 2003-2004, 2005-2006.]
FDR(α<0.02) ~ 10%
“replicated” factors
Fasting Blood Glucose ≥ 126 mg/dL?
BMI, SES, ethnicity, age, sex
OR: Δ 1SD of exposure
N=500-2000 per cohort
Heptachlor Epoxide
OR=3.2, 1.8
PCB170
OR=4.5,2.3
γ-tocopherol (vitamin E)
OR=1.8,1.6
β-carotene
OR=0.6,0.6
What model is used to test for
association?
Compare vs. GWAS?
EWAS in Type 2 Diabetes:
Searching >250 exposures for associations with 

FBG > 125 mg/dL
Exposome factors associated with serum lipids?

Triglycerides, LDL-Cholesterol, HDL-Cholesterol
EWAS on Serum Lipid Levels:
Triglycerides, LDL-Cholesterol, HDL-Cholesterol
Risk factors for coronary heart disease (CHD)

Targets for intervention (ie, statins)

Influenced by smoking, physical activity, diet,
genetics1
1. Teslovich et al. Nature (2010) 

2. Grundy et al. ATVB (2004)

3. Gotto et al. JACC (2004)
LDL-C Δ1%: 1% increased risk for CHD2

HDL-C Δ1%: 2% decreased risk for CHD3

Triglycerides: higher risk for CHD
EWAS in HDL-C:
17 Validated Factors
cohort markers: 1999-2000, 2001-2002, 2003-2004, 2005-2006
FDR < 5%
carotenes
cotinine
heavy metals
organochlorine pesticides
hydrocarbons
vitamins (A, B, C, D, E)
minerals
log10(HDL-C)
adjusted for BMI, SES, ethnicity, age, age², sex
N=1000-3000
IJE 2012.
EWAS in Triglycerides and LDL-C
22 factors
organochlorine pesticides

polychlorinated biphenyls

carotenoids

vitamin E

vitamin A
8 factors
carotenoids

vitamin E

vitamin A
IJE 2012.
Effect Sizes For Validated Factors:
HDL-C
% change in HDL-C per Δ 1 SD in exposure

17 validated factors
survey | N | P-value | FDR | Effect (mg/dL)
pollutants nutrient factors
IJE 2012.
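When the outcome is modelled as log10(HDL-C), as the earlier slide indicates, a coefficient per 1 SD of exposure translates to a percent change of (10^β − 1) × 100. A minimal sketch of that conversion, assuming a hypothetical coefficient and a hypothetical baseline HDL-C level for the mg/dL figure:

```python
def percent_change(beta_log10):
    """Percent change in the outcome for a 1 SD increase in exposure,
    given a coefficient fitted on the log10(outcome) scale."""
    return (10 ** beta_log10 - 1) * 100

def absolute_change(beta_log10, baseline):
    """Same effect expressed in outcome units (e.g. mg/dL) at a
    chosen baseline level."""
    return baseline * (10 ** beta_log10 - 1)

beta = 0.02          # hypothetical coefficient, not a reported result
print(round(percent_change(beta), 1))         # 4.7
print(round(absolute_change(beta, 50.0), 1))  # 2.4 (at a 50 mg/dL baseline)
```

Expressing EWAS effects in mg/dL puts them on the same footing as the per-allele effect sizes that lipid GWAS tables report.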
How do effect sizes compare between GWAS and EWAS?
Table 1 | Meta-analysis of plasma lipid concentrations in >100,000 individuals of European descent.
Locus | Chr | Lead SNP | Lead trait | Other traits | Alleles/MAF | Effect size | P | eQTL / CAD | Ethnic
LDLRAP1 | 1 | rs12027135 | TC | LDL | T/A/0.45 | −1.22 | 4 × 10^−11 | Y | +++?
PABPC4 | 1 | rs4660293 | HDL | | A/G/0.23 | −0.48 | 4 × 10^−10 | Y | ++++
PCSK9 | 1 | rs2479409 | LDL | TC | A/G/0.30 | +2.01 | 2 × 10^−28 | | ++++
ANGPTL3 | 1 | rs2131925 | TG | TC, LDL | T/G/0.32 | −4.94 | 9 × 10^−43 | Y | ++++
EVI5 | 1 | rs7515577 | TC | | A/C/0.21 | −1.18 | 3 × 10^−8 | | +++?
SORT1 | 1 | rs629301 | LDL | TC | T/G/0.22 | −5.65 | 1 × 10^−170 | Y Y | ++++
ZNF648 | 1 | rs1689800 | HDL | | A/G/0.35 | −0.47 | 3 × 10^−10 | | +++−
MOSC1 | 1 | rs2642442 | TC | LDL | T/C/0.32 | −1.39 | 6 × 10^−13 | | +++?
GALNT2 | 1 | rs4846914 | HDL | TG | A/G/0.40 | −0.61 | 4 × 10^−21 | | ++++
IRF2BP2 | 1 | rs514230 | TC | LDL | T/A/0.48 | −1.36 | 5 × 10^−14 | | +++?
APOB | 2 | rs1367117 | LDL | TC | G/A/0.30 | +4.05 | 4 × 10^−114 | | ++++
 | | rs1042034 | TG | HDL | T/C/0.22 | −5.99 | 1 × 10^−45 | | +−++
GCKR | 2 | rs1260326 | TG | TC | C/T/0.41 | +8.76 | 6 × 10^−133 | Y | ++++
ABCG5/8 | 2 | rs4299376 | LDL | TC | T/G/0.30 | +2.75 | 2 × 10^−47 | | ++++
RAB3GAP1 | 2 | rs7570971 | TC | | C/A/0.34 | +1.25 | 2 × 10^−8 | | +−??
COBLL1 | 2 | rs10195252 | TG | | T/C/0.40 | −2.01 | 2 × 10^−10 | Y | ++++
 | | rs12328675 | HDL | | T/C/0.13 | +0.68 | 3 × 10^−10 | | ++?+
IRS1 | 2 | rs2972146 | HDL | TG | T/G/0.37 | +0.46 | 3 × 10^−9 | Y Y | ++++
RAF1 | 3 | rs2290159 | TC | | G/C/0.22 | −1.42 | 4 × 10^−9 | | +++?
MSL2L1 | 3 | rs645040 | TG | | T/G/0.22 | −2.22 | 3 × 10^−8 | | ++−+
KLHL8 | 4 | rs442177 | TG | | T/G/0.41 | −2.25 | 9 × 10^−12 | | ++++
SLC39A8 | 4 | rs13107325 | HDL | | C/T/0.07 | −0.84 | 7 × 10^−11 | Y | +−?−
ARL15 | 5 | rs6450176 | HDL | | G/A/0.26 | −0.49 | 5 × 10^−8 | | −??+
MAP3K1 | 5 | rs9686661 | TG | | C/T/0.20 | +2.57 | 1 × 10^−10 | | ++++
HMGCR | 5 | rs12916 | TC | LDL | T/C/0.39 | +2.84 | 9 × 10^−47 | | +++?
TIMD4 | 5 | rs6882076 | TC | LDL, TG | C/T/0.35 | −1.98 | 7 × 10^−28 | | +++?
MYLIP | 6 | rs3757354 | LDL | TC | C/T/0.22 | −1.43 | 1 × 10^−11 | | +−−+
HFE | 6 | rs1800562 | LDL | TC | G/A/0.06 | −2.22 | 6 × 10^−10 | | ++?+
HLA | 6 | rs3177928 | TC | LDL | G/A/0.16 | +2.31 | 4 × 10^−19 | Y | +++?
 | | rs2247056 | TG | | C/T/0.25 | −2.99 | 2 × 10^−15 | | +++−
C6orf106 | 6 | rs2814944 | HDL | | G/A/0.16 | −0.49 | 4 × 10^−9 | Y | +++−
 | | rs2814982 | TC | | C/T/0.11 | −1.86 | 5 × 10^−11 | Y | −−+?
FRK | 6 | rs9488822 | TC | LDL | A/T/0.35 | −1.18 | 2 × 10^−10 | Y | +++?
CITED2 | 6 | rs605066 | HDL | | T/C/0.42 | −0.39 | 3 × 10^−8 | | ++−+
LPA | 6 | rs1564348 | LDL | TC | T/C/0.17 | −0.56 | 2 × 10^−17 | Y | ++?+
 | | rs1084651 | HDL | | G/A/0.16 | +1.95 | 3 × 10^−8 | | ++?+
DNAH11 | 7 | rs12670798 | TC | LDL | T/C/0.23 | +1.43 | 9 × 10^−10 | | +++?
NPC1L1 | 7 | rs2072183 | TC | LDL | G/C/0.25 | +2.01 | 3 × 10^−11 | | +−+?
TYW1B | 7 | rs13238203 | TG | | C/T/0.04 | −7.91 | 1 × 10^−9 | | +???
MLXIPL | 7 | rs17145738 | TG | HDL | C/T/0.12 | −9.32 | 6 × 10^−58 | Y | ++++
KLF14 | 7 | rs4731702 | HDL | | C/T/0.48 | +0.59 | 1 × 10^−15 | Y | ++++
PPP1R3B | 8 | rs9987289 | HDL | TC, LDL | G/A/0.09 | −1.21 | 6 × 10^−25 | Y | ++++
PINX1 | 8 | rs11776767 | TG | | G/C/0.37 | +2.01 | 1 × 10^−8 | | −+++
NAT2 | 8 | rs1495741 | TG | TC | A/G/0.22 | +2.85 | 5 × 10^−14 | Y | −+++
LPL | 8 | rs12678919 | TG | HDL | A/G/0.12 | −13.64 | 2 × 10^−115 | Y | ++++
CYP7A1 | 8 | rs2081687 | TC | LDL | C/T/0.35 | +1.23 | 2 × 10^−12 | | +++?
TRPS1 | 8 | rs2293889 | HDL | | G/T/0.41 | −0.44 | 6 × 10^−11 | | ++++
 | | rs2737229 | TC | | A/C/0.30 | −1.11 | 2 × 10^−8 | | ++−?
TRIB1 | 8 | rs2954029 | TG | TC, LDL, HDL | A/T/0.47 | −5.64 | 3 × 10^−55 | Y | ++++
PLEC1 | 8 | rs11136341 | LDL | TC | A/G/0.40 | +1.40 | 4 × 10^−13 | | ++++
TTC39B | 9 | rs581080 | HDL | TC | C/G/0.18 | −0.65 | 3 × 10^−12 | | +−++
ARTICLES NATURE|Vol 466|5 August 2010
survey | N | P-value | FDR | Effect (mg/dL)
Teslovich, 2010
GWAS EWAS
GWAS EWAS
How do effect sizes compare between GWAS and EWAS?
EWAS uncovers persistent pollutants
in people with Type 2 Diabetes, Higher Lipids:
How are these factors linked with these diseases?
•organochlorine pesticides

•polychlorinated biphenyls

•dibenzofurans

•dioxins
•found all over the world

•persist in food chain

Porta et al, Environ Int 2008
•arteriosclerosis, 

•T2D/insulin resistance

Porta et al, Lancet, 2006

Lee et al, Diabetes Care, 2006

Lee et al, Diabetologia, 2007

Everett et al, Environ Res, 2010

Lind et al, EHP, 2011

(Korea, Japan, Europe)
capacitors

adhesives
Studying the Elusive Environment in Large Scale
It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment.¹ Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime.¹ Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3).²,³ The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al⁴ screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
There are notable hurdles in analyzing “big” environ- [...]
[...] the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the [...]
VIEWPOINT
Chirag J. Patel, PhD
Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts.
John P. A. Ioannidis, MD, DSc
Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
Opinion
JAMA, 2014

JECH, 2014
•longitudinal/linkable data & biorepositories
How can we study the elusive environment in larger scale for biomedical
discovery?
High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through [...]
[...] US federally funded gene expression experiment data be [...] ited in public repositories such as the Gene Expression Omni[...] repository has been instrumental in development of techno[...] measurement of gene expression, data standardization, an[...]
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004
A Serum cotinine: 37 total correlations
B Serum total mercury: 42 total correlations
C Serum cadmium: 68 total correlations
D Serum trans-β-carotene: 68 total correlations
Edge legend: negative correlation, positive correlation
Node groups: infectious agents; pollutants; nutrients and vitamins; demographic attributes
Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (strong[...] The size of each node is proportional to the number of edges for a node, a[...] thickness of each edge indicates the magnitude of the correlation.
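The construction behind the correlation globes in the figure caption (pairwise correlations among exposures, with an edge drawn when the absolute value exceeds 0.2) can be sketched as below. This is an illustration only: Pearson correlation is an assumption here, the function names are hypothetical, and the four short series are invented rather than NHANES measurements.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_edges(exposures, threshold=0.2):
    """All pairs whose |correlation| exceeds the threshold,
    returned as (name_a, name_b, r) tuples."""
    edges = []
    for a, b in combinations(sorted(exposures), 2):
        r = pearson(exposures[a], exposures[b])
        if abs(r) > threshold:
            edges.append((a, b, r))
    return edges

def degree(edges, node):
    """Number of edges touching a node (the node size in the globe)."""
    return sum(1 for a, b, _ in edges if node in (a, b))

# Invented toy series: w is uncorrelated with the other three.
toy = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "w": [1, -1, -1, 1],
}
edges = correlation_edges(toy)
print(len(edges), degree(edges, "x"), degree(edges, "w"))  # 3 2 0
```

A node with many edges is a factor whose association with disease is hard to separate from its correlates, which is the point the globes make about intervening on a single exposure.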
•data mining and informatics to tackle complexity
what causes what?
confounding
•evaluate new ‘omics technologies
high-throughput, non-targeted metabolomics
There is no “microarray” for E...
http://grants.nih.gov/grants/guide/rfa-files/RFA-ES-15-010.html
NIH National Institute of Environmental Health Sciences: $34M in FY 2015:

new technologies for ascertaining the exposome in children
E Laboratory
E Laboratory
E Laboratory
E Laboratory
E Data Center
•Data repository

•Analytic ecosystem

•Data standards
Exposome Laboratory Network
JAMA, 2014

JECH, 2014
•longitudinal/linkable data & biorepositories
Possibilities of discovery with the exposome:
How do we proceed?
• data mining and informatics to tackle complexity
  what causes what?
  confounding
• evaluate new ‘omics technologies
  metabolomics
758,000 individuals
>400 studies
>>1B datapoints (genotypes and phenotypes)

controlled-access (by application)
Accelerating discoveries with publicly-accessible, population-scale data:

a dbGaP for environmental exposures?
with Paul Avillach, Michael McDuffie, Jeremy Easton-Marks, 

Cartik Saravanamuthu and the BD2K PIC-SURE team
40K participants

>1000 indicators of exposure

Data and API available now

http://nhanes.hms.harvard.edu
BD2K Patient-Centered Information Commons
NHANES exposome browser
Studying the Elusive Environment in Large Scale
It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment.1 Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime.1 Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3).2,3 The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al4 screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
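The "adjust for multiplicity systematically" step can be sketched in a few lines. This is a minimal illustration using the Benjamini-Hochberg procedure, which is one common way to control the false discovery rate; the excerpt does not say which procedure the cited studies used, and the exposure names and p-values below are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical screen: one p-value per exposure, all of them reported.
pvals = {"exposure_1": 0.0001, "exposure_2": 0.004, "exposure_3": 0.04,
         "exposure_4": 0.2, "exposure_5": 0.8}
names = list(pvals)
qvals = dict(zip(names, benjamini_hochberg([pvals[n] for n in names])))
discoveries = [n for n in names if qvals[n] < 0.05]
for n in names:
    print(n, pvals[n], round(qvals[n], 4))
```

Every probed association keeps its q-value in the output, which mirrors the point about reporting all associations instead of only the most significant ones.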
There are notable hurdles in analyzing “big” environ-
the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the
JAMA, 2014

JECH, 2014
• longitudinal/linkable data & biorepositories
Possibilities of discovery with the exposome:
How do we proceed?
Complexity of exposome-phenome associations:
Many more potential biases vs. GWAS
Reverse causality:
Could the disease “lead” to
exposure?
γ-tocopherol
?
tocopherol (vitamin E) supplements for CHD individuals?
low HDL
Confounding bias:
Ice cream and drowning deaths

Mercury and HDL-C
fish consumption
mercury
confounders
high HDL
??
Independence of association:

Web of exposure of the exposome?
β-carotene hydrocarbons
γ-tocopherol
ρ
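The confounding example can be made concrete with simulated data. The sketch below assumes a toy model in which fish consumption raises both mercury and HDL-C while mercury has no effect of its own on HDL-C; all numbers are simulated, and the variable names are illustrative only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(0)
n = 5000
fish = [rng.gauss(0, 1) for _ in range(n)]       # confounder
mercury = [f + rng.gauss(0, 1) for f in fish]    # driven by fish only
hdl = [f + rng.gauss(0, 1) for f in fish]        # driven by fish only

crude = pearson(mercury, hdl)                    # close to 0.5 by construction
adjusted = partial_corr(mercury, hdl, fish)      # close to 0 once fish is held fixed
print(f"crude r = {crude:.2f}, adjusted for fish r = {adjusted:.2f}")
```

The crude correlation looks like a mercury-HDL association; conditioning on the confounder removes it. Real adjustment also depends on measuring the confounder, which is not guaranteed in observational data.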
Longitudinal Study:
“Gold Standard” for Validation
• exposure changing through time
• reverse causality bias
• compute disease risk
[Diagram: exposure, then disease (?), along a time axis; disease risk shown from low to high]
EWAS to search for

exposures and behaviors associated with all-cause mortality.
NHANES: 1999-2004
National Death Index linked mortality

246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths) 

~5.5 years of followup
Cox proportional hazards

baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)

~2.8 years of followup
p < 0.05
IJE, 2013
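The two-stage design above (false discovery rate < 5% in the first cohort, then p < 0.05 in the second) can be sketched as a simple filter. The record layout, the exposure names, and all numbers below are hypothetical, and the requirement that the effect point in the same direction in both cohorts is an added assumption not stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class Assoc:
    exposure: str
    log_hr_discovery: float    # Cox coefficient in the discovery cohort
    q_discovery: float         # FDR-adjusted p-value in the discovery cohort
    log_hr_replication: float  # Cox coefficient in the replication cohort
    p_replication: float       # nominal p-value in the replication cohort


def validated(assocs, fdr=0.05, alpha=0.05):
    """Keep exposures passing FDR in discovery and p < alpha in replication."""
    keep = []
    for a in assocs:
        # Assumption: a replicated signal should have the same sign in both cohorts.
        same_direction = a.log_hr_discovery * a.log_hr_replication > 0
        if a.q_discovery < fdr and a.p_replication < alpha and same_direction:
            keep.append(a.exposure)
    return keep


# Hypothetical results for five exposures.
results = [
    Assoc("exposure_1", 0.40, 0.001, 0.35, 0.01),
    Assoc("exposure_2", -0.30, 0.01, -0.25, 0.03),
    Assoc("exposure_3", 0.20, 0.04, 0.10, 0.30),   # fails replication
    Assoc("exposure_4", 0.25, 0.02, -0.20, 0.01),  # direction flips
    Assoc("exposure_5", 0.05, 0.60, 0.30, 0.001),  # never discovered
]
replicated = validated(results)
print(replicated)
```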
[Volcano plot: Adjusted Hazard Ratio (x-axis, 0.4 to 2.8) vs. -log10(pvalue) (y-axis, 0 to 8)]
Labeled points, first group (11): 1 Physical Activity; 2 Does anyone smoke in home?; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene
Labeled points, second group (69): 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic
All-cause mortality:

253 exposure/behavior associations in survival
age, sex, income, education, race/ethnicity, occupation [in red]
FDR < 5%
sociodemographics
replicated factor
IJE, 2013
EWAS (re)-identifies factors associated with all-cause mortality:

Volcano plot of 200 associations
Labeled sociodemographic factors: age (10 years); income (quintile 1); income (quintile 2); income (quintile 3); male; black
age, sex, income, education, race/ethnicity, occupation [in red]
Labeled exposures and behaviors: anyone smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*
*derived from METs per activity and categorized by Health.gov guidelines
R2 ~ 2%
Correlation Structure of the Exposome?
Analogy: “Linkage Disequilibrium”
Identification of four novel T2DM loci
Our fast-track stage 2 genotyping confirmed the reported association for rs7903146 (TCF7L2) on chromosome 10, and in addition identified significant associations for seven SNPs representing four new T2DM loci (Table 1). In all cases, the strongest association for the MAX statistic (see Methods) was obtained with the additive model. The most significant of these corresponds to rs13266634, a non-synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkage disequilibrium block on chromosome 8, containing only the 3′ end of this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressed solely in the secretory vesicles of β-cells and is thus implicated in the final stages of insulin biosynthesis, which involve co-crystallization
Table 1 | Confirmed association results
SNP | Chromosome | Position (nucleotides) | Risk allele | Major allele | MAF (case) | MAF (ctrl) | Odds ratio (het) | Odds ratio (hom) | PAR | λs | Stage 2 pMAX | Stage 2 pMAX (perm) | Stage 1 pMAX | Stage 1 pMAX (perm) | Nearest gene
rs7903146 | 10 | 114,748,339 | T | C | 0.406 | 0.293 | 1.65 ± 0.19 | 2.77 ± 0.50 | 0.28 | 1.0546 | 1.5 × 10^-34 | <1.0 × 10^-7 | 3.2 × 10^-17 | <3.3 × 10^-10 | TCF7L2
rs13266634 | 8 | 118,253,964 | C | C | 0.254 | 0.301 | 1.18 ± 0.25 | 1.53 ± 0.31 | 0.24 | 1.0089 | 6.1 × 10^-8 | 5.0 × 10^-7 | 2.1 × 10^-5 | 1.8 × 10^-5 | SLC30A8
rs1111875 | 10 | 94,452,862 | G | G | 0.358 | 0.402 | 1.19 ± 0.19 | 1.44 ± 0.24 | 0.19 | 1.0069 | 3.0 × 10^-6 | 7.4 × 10^-6 | 9.1 × 10^-6 | 7.3 × 10^-6 | HHEX
rs7923837 | 10 | 94,471,897 | G | G | 0.335 | 0.377 | 1.22 ± 0.21 | 1.45 ± 0.25 | 0.20 | 1.0065 | 7.5 × 10^-6 | 2.2 × 10^-5 | 3.4 × 10^-6 | 2.5 × 10^-6 | HHEX
rs7480010 | 11 | 42,203,294 | G | A | 0.336 | 0.301 | 1.14 ± 0.13 | 1.40 ± 0.25 | 0.08 | 1.0041 | 1.1 × 10^-4 | 2.9 × 10^-4 | 1.5 × 10^-5 | 1.2 × 10^-5 | LOC387761
rs3740878 | 11 | 44,214,378 | A | A | 0.240 | 0.272 | 1.26 ± 0.29 | 1.46 ± 0.33 | 0.24 | 1.0046 | 1.2 × 10^-4 | 2.8 × 10^-4 | 1.8 × 10^-5 | 1.3 × 10^-5 | EXT2
rs11037909 | 11 | 44,212,190 | T | T | 0.240 | 0.271 | 1.27 ± 0.30 | 1.47 ± 0.33 | 0.25 | 1.0045 | 1.8 × 10^-4 | 4.5 × 10^-4 | 1.8 × 10^-5 | 1.3 × 10^-5 | EXT2
rs1113132 | 11 | 44,209,979 | C | C | 0.237 | 0.267 | 1.15 ± 0.27 | 1.36 ± 0.31 | 0.19 | 1.0044 | 3.3 × 10^-4 | 8.1 × 10^-4 | 3.7 × 10^-5 | 2.9 × 10^-5 | EXT2
Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allele frequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (λs) were estimated using stage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higher frequency in controls; pMAX, P-value of the MAX statistic from the χ² distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX and pMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.
[Figure, panels a-d: –log10[P] (0 to 4) for individual SNPs across the associated regions; gene labels: SLC30A8; IDE, HHEX, KIF11; EXT2, ALX4; LOC387761]
NATURE|Vol 445|22 February 2007 ARTICLES
Sladek et al., Nature (2007)
Correlation between
occurrence of genetic loci

In GWAS, allows one to trace
to the “causal” locus.
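As a worked illustration of "correlation between occurrence of genetic loci", the sketch below computes the standard r² measure of linkage disequilibrium for two biallelic loci from haplotype counts. The function name and the counts are hypothetical; the formula itself is the textbook definition, with D = p(AB) − p(A)·p(B).

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic loci from counts of the four haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at locus 2
    d = p_AB - p_A * p_B          # disequilibrium coefficient D
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))


# Hypothetical haplotype counts
print(ld_r2(50, 0, 0, 50))    # alleles always travel together
print(ld_r2(25, 25, 25, 25))  # alleles occur independently
print(ld_r2(40, 10, 10, 40))  # partial correlation between the loci
```

A high r² between a genotyped SNP and an untyped variant is what lets a GWAS signal be traced toward the "causal" locus; the exposome has no equally well-characterized structure to lean on.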
Independence of association:

How to untangle “web” of
exposure?
β-carotene hydrocarbons
γ-tocopherol
ρ
Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ

Blue: negative ρ

thickness: |ρ|
permuted data to produce

“null ρ”

sought replication in > 1
cohort
Pac Symp Biocomput 2015

JECH 2015
for each pair of E:

Spearman ρ

(575 factors: 81,937 correlations)
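A minimal sketch of the per-pair computation: Spearman ρ for one pair of exposures, plus a permutation null obtained by shuffling one of the two variables. The data are hypothetical, and the exact permutation scheme used for the globes is not given on the slide, so the empirical p-value below is one plausible reading of "permuted data to produce null ρ".

```python
import math
import random


def ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Share of shuffled data sets whose |rho| reaches the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(spearman(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical levels of two exposures in eight participants
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
rho = spearman(x, y)
p_null = permutation_pvalue(x, y)
print(round(rho, 3), p_null)
```

Repeating this over every pair of exposures with overlapping measurements gives the edge list for a globe.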
Effective number of
variables:

500 (10% decrease)
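One way to turn a correlation matrix into an "effective number of variables" is sketched below. It uses the eigenvalue-variance estimator of Nyholt (2004), Meff = 1 + (M − 1)(1 − Var(λ)/M); this is an assumption for illustration, since the slide does not say which estimator produced the figure of 500. For a correlation matrix the eigenvalues sum to M and their squares sum to the sum of squared entries, so no eigendecomposition is needed.

```python
def effective_number(corr):
    """Nyholt-style effective number of variables from a correlation matrix."""
    m = len(corr)
    if m < 2:
        return float(m)
    # The sum of squared entries of a symmetric matrix equals the sum of
    # its squared eigenvalues.
    sum_sq = sum(r * r for row in corr for r in row)
    # Sample variance of the eigenvalues; their mean is 1 because trace = m.
    var_eig = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_eig / m)


independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
pair = [[1.0, 0.5], [0.5, 1.0]]
print(effective_number(independent))  # all three variables count
print(effective_number(redundant))    # three copies of one variable
print(effective_number(pair))         # somewhere in between
```

With weak correlations the effective number stays close to the raw count, which is consistent with the modest reduction quoted above.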
Estimating the LD of the exposome:
Diabetes vs. death have distinct globes (POPs vs. smoking?)...
Diabetes All-cause mortality
Pac Symp Biocomput. 2015
Browse these and 82 other phenotype-exposome globes!
http://www.chiragjpgroup.org/exposome_correlation
What nodes have the most connections?

(“hubs”)
sex, age, and income
ρ
What factor(s) is(are) correlated with many other exposures?
[Volcano plot: Effect Size per 1SD of income/poverty ratio (x-axis, -0.3 to 0.0) vs. -log10(pvalue) (y-axis, 10 to 30); legend: overall income/poverty ratio effects (per 1SD), validated results]
Labeled points: Pulse rate; Eosinophils number; Lymphocyte number; Monocyte; Segmented neutrophils number; Blood 2,5-Dimethylfuran; Cadmium; Lead; Cotinine; C-reactive protein; Floor, GFAAS; Protoporphyrin; Glycohemoglobin; Glucose, plasma; g-tocopherol; Hepatitis A Antibody; Homocysteine; Herpes I; Herpes II; Red cell distribution width; Alkaline phosphotase; Globulin; Glucose, serum; Gamma glutamyl transferase; Triglycerides; Blood Benzene; Blood 1,4-Dichlorobenzene; Blood Ethylbenzene; Blood Styrene; Blood Toluene; Blood m-/p-Xylene; White blood cell count; Mono-benzyl phthalate; 3-fluorene; 2-fluorene; 3-phenanthrene; 2-phenanthrene; 1-pyrene; Cadmium, urine; Albumin, urine; Lead, urine
Lower income associated with 43 of 330 (>13%) exposures
and biomarkers in the US population
Higher income: lower levels of biomarkers
AJE, 2015
(Another 23 associated with higher levels; 43 + 23 = 66 of 330 = 20% in total)
EWAS:
Possible to accelerate the pace of discovery of exposures
• generalizable, comprehensive,
transparent, and systematic study of
environment

• Created hypotheses for T2D, CVD, death,
and others 

• What is LD of the environment?

• Needles among needles

• Confounding, reverse causality...
[Plot: −log10(pvalue) (y-axis, 0 to 4) by exposure category (x-axis): acrylamide; allergen test; bacterial infection; cotinine; dialkyl; dioxins; furans dibenzofuran; heavy metals; hydrocarbons; latex; nutrients carotenoid; nutrients minerals; nutrients vitamin A; nutrients vitamin B; nutrients vitamin C; nutrients vitamin D; nutrients vitamin E; PCBs; perchlorate; pesticides atrazine; pesticides carbamate; pesticides chlorophenol; pesticides organochlorine; pesticides organophosphate; pesticides pyrethroid; phenols; phthalates; phytoestrogens; polybrominated ethers; polyfluorochemicals; viral infection; volatile compounds]
HDL-C: 1-10 mg/dL

T2D: ~2-3 OR

mortality: ~1.5-2 HR
Can exposure enable re-classification of phenotypes?
Committee on A Framework for Developing a
New Taxonomy of Disease
Board on Life Sciences
Division on Earth and Life Studies
NRC, National Academy of Sciences 2011
The use of multiple molecular parameters to
characterize disease [P] may lead to a more
accurate and fine-grained classification of
disease [P]…
“multiple molecular parameters” must include E!
An icon for “precision medicine”?:

Linnaeus: classification of phenotypes (P) for
treatment and prevention (18th century)
signs (signa), symptoms
essensia (essence of symptoms; e.g., inflammation)
causa (what caused the disease; e.g. pathogen)
Related diseases: common cause and treatment.
Class 5: MENTALES (mental
disturbances)
Order 1: IDEALIS (faulty judgment)

Order 2: IMAGINI (imagination disorder)

Order 3: PATHETICI (irregular desires)

L-5-3: CITTA (eat the inedible)

L-5-3: TARANTISMUS (dancing via tarantula
bite)
Cogn Behav Neurol 2012
Classification of phenotypes (P) and disease today via the
International Classification of Diseases
We are many phenotypes simultaneously:

Can we better categorize these P?
Body Measures

Body Mass Index

Height
Blood pressure & fitness

Systolic BP

Diastolic BP

Pulse rate

VO2 Max
Metabolic

Glucose

LDL-Cholesterol

Triglycerides
Inflammation

C-reactive protein

white blood cell count
Kidney function

Creatinine

Sodium

Uric Acid
Liver function

Aspartate aminotransferase

Gamma glutamyltransferase
Aging

Telomere length
EWAS-derived phenotype-exposure association map:
A 2-D view of phenotype-exposure associations for re-
classification
PCB170
Glucose
BMI
Height
Cholesterol
β-carotene
folate
http://bit.ly.com/pemap
Creation of a phenotype-exposure association map:
A 2-D view of 83 phenotype by 252 exposure associations
> 0
< 0
Association Size:
Clusters of exposures associated with clusters of phenotypes?
252 biomarkers of exposure × 83 clinical trait phenotypes 

NHANES 1999-2000, 2001-2002, 2005-2006

~21K regressions: replicated significant (FDR < 5%) in 2003-2004

adjusted by age, age², sex, race, income, chronic disease

Hugues Aschard, JP Ioannidis
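The "~21K regressions" figure follows from the grid size: 252 exposures × 83 phenotypes = 20,916 pairs. The sketch below loops over such a grid and records one standardized effect per pair. It is deliberately simplified: the regressions are unadjusted (the slide adjusts for age, age², sex, race, income, and chronic disease), and the tiny data set and helper names are hypothetical.

```python
from itertools import product
import statistics


def zscore(values):
    """Center and scale a variable to mean 0 and SD 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def slope_per_sd(exposure, phenotype):
    """Unadjusted OLS slope of z-scored phenotype on z-scored exposure."""
    x, y = zscore(exposure), zscore(phenotype)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def association_map(exposures, phenotypes):
    """One association size per (phenotype, exposure) pair."""
    return {(p, e): slope_per_sd(exposures[e], phenotypes[p])
            for p, e in product(phenotypes, exposures)}


print(252 * 83)  # number of pairs in the full map

# Hypothetical measurements on five participants
exposures = {"exposure_a": [1.0, 2.0, 3.0, 4.0, 5.0],
             "exposure_b": [5.0, 3.0, 4.0, 1.0, 2.0]}
phenotypes = {"phenotype_x": [3.0, 5.0, 7.0, 9.0, 11.0],    # rises with exposure_a
              "phenotype_y": [-1.0, -2.0, -3.0, -4.0, -5.0]}  # falls with exposure_a
pe_map = association_map(exposures, phenotypes)
for key, size in sorted(pe_map.items()):
    print(key, round(size, 2))
```

In a full analysis each cell would come from an adjusted model, be tested with FDR control, and be checked in the replication survey before it is drawn on the map.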
[Heatmap: 83 phenotypes by 252 exposures; color key and histogram: association size (Value) from -0.4 to 0.4, Count from 0 to 150]
http://bit.ly.com/pemap
EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E
Annotated clusters: nutrients; BMI, weight, BMD; metabolic; renal function; PCBs; metabolic; blood parameters; hydrocarbons
Toward a phenotype-exposure association map:
(Re)-categorizing phenotypes with E
[Dendrogram of phenotypes; axis: Distance, 7 to 0; leaf labels follow]
liver:Albumin
kidney:Bicarbonate
immunological:Basophils percent
immunological:Lymphocyte percent
immunological:Eosinophils percent
kidney:Phosphorus
liver:Total protein
liver:Aspartate aminotransferase AST
liver:Alanine aminotransferase ALT
body measures:Head Circumference
body measures:Recumbent Length
liver:Lactate dehydrogenase LDH
cancer:Prostate specific antigen ratio
cancer:PSA, free
blood:Transferrin saturation
liver:Total bilirubin
heart:Direct HDL-Cholesterol
immunological:Monocyte percent
bone:Head BMD
body measures:Standing Height
body measures:Upper Leg Length
bone:Total BMD
bone:Lumber Spine BMD
bone:Lumber Pelvis BMD
heart:Triglycerides
heart:LDL-cholesterol
heart:Total Cholesterol
blood:MCHC
blood:TIBC, Frozen Serum
blood:Hematocrit
blood:Hemoglobin
kidney:Potassium
blood:Mean cell hemoglobin
blood:Mean cell volume
kidney:Uric acid
kidney:Blood urea nitrogen
kidney:Total calcium
kidney:Creatinine
blood:Ferritin
blood:Red blood cell count
body measures:Weight
blood:Segmented neutrophils percent
body measures:Total Lean excl BMC
body measures:Trunk Lean excl BMC
body measures:Body Mass Index
body measures:Waist Circumference
body measures:Triceps Skinfold
body measures:Maximal Calf Circumference
body measures:Thigh Circumference
liver:Gamma glutamyl transferase
blood pressure:60 sec. pulse:
metabolic:Insulin
body measures:Total Fat
body measures:Trunk Fat
body measures:Subscapular Skinfold
blood pressure:mean systolic
immunological:C-reactive protein
liver:Globulin
immunological:Monocyte number
immunological:Segmented neutrophils number
immunological:Lymphocyte number
immunological:White blood cell count
immunological:Basophils number
immunological:Eosinophils number
blood:Mean platelet volume
heart:Homocysteine
nutrition:Methylmalonic acid
kidney:Osmolality
kidney:Chloride
kidney:Sodium
kidney:Albumin, urine
blood pressure:60 sec HR
cancer:PSA. total
blood:Platelet count SI
blood:Protoporphyrin
blood:Red cell distribution width
bone:Bone alkaline phosphotase
liver:Alkaline phosphotase
blood pressure:mean diastolic
metabolic:C-peptide: SI
metabolic:Glycohemoglobin
metabolic:Glucose, plasma
metabolic:Glucose, serum
inflammation
adiposity
kidney function
metabolic traits
[Same dendrogram, annotated: “bad” cholesterol vs. “good” cholesterol]
[Same dendrogram, annotated: height + BMD]
H² vs. σ²_E
[Figure: bar chart of R^2 * 100 (axis: 0 to 40) per phenotype, i.e., the percent of variability in each phenotype described by its identified exposures; phenotypes are listed from Triglycerides, Total Cholesterol, and LDL-cholesterol at the top to PSA. total and Basophils percent at the bottom.]
1 to 66 exposures identified for 81
phenotypes

Additive effect of E factors:

Describe < 20% of variability in P
(On average: 8%)
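One way to read "additive effect of E factors" is as the R² of a regression of a phenotype on several exposures at once. The sketch below is an illustration under that assumption, not the deck's actual analysis: the data are simulated, the effect sizes are made up, and `fit_ols` and `r_squared` are hypothetical helper names.

```python
import random

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    p = len(rows[0])
    # Augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def r_squared(X, y, beta):
    """Fraction of the variance in y described by the fitted additive model."""
    pred = [beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Simulated example: three standardized exposures with small, made-up effects.
random.seed(0)
n = 500
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [0.2 * x[0] - 0.15 * x[1] + 0.1 * x[2] + random.gauss(0, 1) for x in X]

beta = fit_ols(X, y)
r2 = r_squared(X, y, beta)
print(f"R^2 * 100 = {100 * r2:.1f}")
```

With small per-exposure effects like these, the combined R² stays modest, which is the pattern the slide reports.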
σ²_E?
Emerging technologies to ascertain exposome will enable
biomedical discovery
High-throughput E standards:

mitigate fragmented literature of associations
Confounding, reverse causality: 

how to handle at large dimension?
e.g., EWASs in complex disease through the life course
Enable more precise definitions of P
...but what about interaction between these factors?
Do a combination of genetic and environmental factors impart different
risk for disease than either alone?
P = G x E
Complex traits are a function of genes and environment...
Gene-Environment Interactions: the combination of G and E
confers a different risk than the variant or the factor alone
Find additional disease risk (variance)

Posit biological mechanisms
[Diagram: 2 × 2 layout of G+ / G- (NAT2 variant) by E+ / E- (smoke?), comparing cancer vs. non-cancer; example: Bladder Cancer]
Environmental Toxicology. 2012

Bioinformatics. 2012

Curr Op Env Health (in press)
Analytically complex

• How do you select which G and E to test???

• Need a lot of samples (power!)

Few studies exist that measure G & E together
Why not investigate genes and environment simultaneously:
Analytic complexity and large numbers of interactions!
G genetic variants and E exposures = G × E possible pairs
10 genetic variants × 10 exposures = 100 possible interactions
[Diagram: grid of genetic variants 1-10 (e.g., rs13266634 (SLC30A8), rs1801282 (PPARγ), rs7903146 (TCF7L2), ...) by exposures 1-10 (e.g., smoke?, vitamin E, radiation, ..., pesticide, vitamin C)]
Bioinformatics. 2012

Curr Op Env Health (in press)
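The growth in the number of tests can be made concrete with a few lines of code. This is a minimal sketch: the short variant and exposure lists are placeholders, and `interaction_pairs` and `n_interactions` are hypothetical helper names, not part of any cited method.

```python
from itertools import product

# Hypothetical short lists for illustration; a real scan would have many more.
variants = ["rs13266634", "rs7903146"]
exposures = ["smoking", "vitamin E", "radiation", "pesticide", "vitamin C"]

def interaction_pairs(variants, exposures):
    """Every (variant, exposure) pair that a GxE scan would have to test."""
    return list(product(variants, exposures))

def n_interactions(n_variants, n_exposures):
    """Number of pairwise GxE tests: G x E."""
    return n_variants * n_exposures

pairs = interaction_pairs(variants, exposures)
print(len(pairs), "pairs, first:", pairs[0])
print(n_interactions(10, 10), "tests for 10 variants and 10 exposures")
print(n_interactions(1_000_000, 100), "tests for ~1M variants and 100 exposures")
```

The last line is only arithmetic, but it shows why a genome-wide by exposome-wide scan needs a strategy for choosing which pairs to test.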
Combining EWAS and GWAS:
Select pairs by their main effects
[Figure: EWAS Manhattan plot; y-axis: −log10(pvalue), 0 to 2; x-axis: exposure categories (acrylamide, allergen test, bacterial infection, cotinine, dioxins, heavy metals, hydrocarbons, nutrients, PCBs, pesticides, phenols, phthalates, phytoestrogens, viral infection, volatile compounds, and others). Highlighted exposures: γ-tocopherol, β-carotene, heptachlor, PCB170.]
ex: PLOS ONE (2010)
+
[Figure: GWAS Manhattan plot of type 2 diabetes loci; y-axis: –log10(P), 0 to 50; legend: locus established previously, locus identified by current study, locus not confirmed by current study; conditional vs. unconditional analysis. Labeled loci include TCF7L2, SLC30A8, PPARG, KCNQ1, CDKAL1, FTO, and others. Highlighted SNPs: rs7903146 (TCF7L2), rs13266634 (SLC30A8), rs1801282 (PPARG).]
ex: Voight et al., Nature Genetics (2010)

WTCCC, Nature (2007)

Sladek et al., Nature (2007)
Human Genetics. 2013
Prototype G-EWAS Methodology
GxE in association to T2D
1.) Nyholt. AJHG 2004

2.) Bůžková et al. Annals of Human Genetics. 2010
4.4 17.8
5 EWAS factors: γ-tocopherol, cis-β-carotene, trans-β-carotene, PCB170, heptachlor
18 GWAS loci: rs10923931 (NOTCH2), rs7903146 (TCF7L2), rs13266634 (SLC30A8), rs7901695 (TCF7L2), rs2383208 (Unknown), rs1260326 (GCKR), rs780094 (GCKR), rs2237895 (KCNQ1), rs10811661 (Unknown), rs4712523 (CDKAL1), rs4607103 (Unknown), rs1111875 (Unknown), rs7578597 (THADA), rs4402960 (IGF2BP2), rs1801282 (PPARG), rs12779790 (Unknown), rs8050136 (FTO), rs864745 (JAZF1)
total: 90 (18 loci × 5 factors)

Logistic Regression
Diabetes defined as Fasting Blood Glucose ≥ 126 mg/dL
Adjusted for age, BMI, sex, race
Model (example pair): logit(diabetes) on z(γ-tocopherol), rs13266634 (coded as 0, 1, or 2 risk alleles), and their interaction

Bonferroni Correction
Number of Effective Tests (ref. 1) ≅ 80
α = 0.05/80 ≈ 0.0006

False Discovery Rate
Parametric Bootstrap of Null Model (ref. 2)
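The Bonferroni arithmetic on this slide can be checked directly. In the sketch below, the counts (18 loci, 5 factors, about 80 effective tests) come from the slide; the function names are hypothetical, and the effective number of tests is taken as given rather than re-derived.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests

def passes(p_value, threshold):
    """True if a p-value clears the corrected threshold."""
    return p_value < threshold

n_loci, n_factors = 18, 5
n_pairs = n_loci * n_factors   # all SNP-by-exposure pairs
n_effective = 80               # effective number of tests, as stated on the slide

alpha_all_pairs = bonferroni_alpha(0.05, n_pairs)
alpha_effective = bonferroni_alpha(0.05, n_effective)

print(f"pairs tested: {n_pairs}")
print(f"threshold using all pairs:       {alpha_all_pairs:.6f}")
print(f"threshold using effective tests: {alpha_effective:.6f}")
```

Because correlated tests are not independent, the effective number (80) is smaller than the raw count (90), so the threshold is slightly less strict.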
Per-risk allele OR for rs13266634 (SLC30A8) Stratified by E
Increase or decrease up to 30-40% vs. marginal effect!
Adjusted for race, sex, BMI, age
marginal OR=1.1

[Figure: forest plot; x-axis: Per risk allele OR, 0 to 2.5; values are OR (95% CI)]

rs13266634 (SLC30A8) × trans-β-carotene (p-value: 5e-05; N (cases): 1702 (164); FDR=2%)
  low (-1SD): 1.8 [1.3, 2.6]
  mean: 1.1 [0.79, 1.5]
  high (+1SD): 0.65 [0.4, 1.1]

rs13266634 (SLC30A8) × γ-tocopherol (p-value: 0.0094; N (cases): 2925 (274); FDR=18%)
  low (-1SD): 0.82 [0.52, 1.3]
  mean: 1.1 [0.87, 1.5]
  high (+1SD): 1.6 [1.3, 2]
Human Genetics. 2013
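Stratified per-allele odds ratios like these follow from a logistic model with a product term: if the log-odds include beta_G × (risk alleles) and beta_GxE × (risk alleles) × z(E), the per-allele OR at exposure level z is exp(beta_G + beta_GxE × z). The sketch below is illustrative only. The two coefficients are backed out from the rounded trans-β-carotene values on the slide (an assumption, not the fitted model), and `per_allele_or` is a hypothetical helper.

```python
import math

def per_allele_or(beta_g, beta_gxe, z):
    """Per-risk-allele odds ratio at exposure z-score z under a GxE product term."""
    return math.exp(beta_g + beta_gxe * z)

# Assumed coefficients, chosen so the model reproduces the rounded slide values
# at the mean (OR = 1.1) and at -1 SD (OR = 1.8) of trans-beta-carotene.
beta_g = math.log(1.1)
beta_gxe = math.log(1.1 / 1.8)

for label, z in [("low (-1SD)", -1), ("mean", 0), ("high (+1SD)", 1)]:
    print(f"{label:12s} OR = {per_allele_or(beta_g, beta_gxe, z):.2f}")
```

Under these assumptions the implied OR at +1 SD is about 0.67, close to the 0.65 shown on the slide; the small gap is expected because the inputs are rounded.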
It is possible to detect GxE by combining EWAS and GWAS
Detected interaction effect changes
between EWAS and GWAS factors

What is the biological mechanism of
interaction? 

Need to replicate these results in
diverse populations.

Re-capture GWAS “investment” by
considering prevalent E-factors?!
Possible to utilize the XWAS approach for general
purpose discovery…
PheWAS: dissecting the shared genetic
architecture (pleiotropy) of disease!
PheWAS:

Phenome-wide association study

Denny et al, Nature Biotech 2013
[Figure: PheWAS Manhattan plots (cropped panels from Denny et al.); y-axis: –log10(P); x-axis: phenotypes grouped by category (Infectious, Neoplastic, Psychiatric, Neurologic, Cardiovascular, Pulmonary, Digestive, Genitourinary, Dermatologic, Musculoskeletal, Injuries, Symptoms and signs, Hematopoietic, Endocrine and metabolic).]

Panel c: MI GWAS SNP rs4977574 (CDKN2BAS). Labeled phenotypes: Coronary atherosclerosis, Ischemic heart disease, Chronic ischemic heart disease, Angina pectoris, Occlusion & stenosis of precerebral arteries, Hemorrhoids, Intermediate coronary syndrome, Myocardial infarction, Polyneuropathy in diabetes, Type 2 diabetic nephropathy.

Panel d: RA GWAS SNP rs660895 (HLA-DRB1). Labeled phenotypes: Type 1 diabetic ketoacidosis, Type 1 diabetes, Type 2 diabetes, Arteritides, Giant cell arteritis, Conjunctivitis (infectious), Visual field defects, Viral pneumonia, Nasal polyps, Rheum. arthritis, Shock, Type 1 diabetes nephropathy, Polyneuropathy in diabetes, Type 1 diabetic neuropathy.

Text excerpt (cropped): "... 10−12), acute myocardial infarction (OR = ... abdominal aortic aneurysm (OR = 1.29, ... with prior publications, but also with other ... phenotypes such as unstable angina, ... Our study replicated the association between rheumatoid arthritis and rs660895 near HLA-DRB1 (Fig. 3d; OR = 1.56, P = 6.7 × 10−8). This SNP was also strongly associated with type 1 diabetes (OR = 1.44, P = 7.1 × 10−8) and potentially associated with inflammatory ..."

Figure caption (cropped): "... for four SNPs. Each panel represents 1,358 phenotypes ... a particular SNP, using logistic regression assuming an ... adjusted for age, sex, study site and the first three principal ... are grouped along the x axis by categorization within ... hierarchy. The upper red lines indicate P = 4.6 × 10−6 (FDR = 0.1 ... blue lines indicate P = 0.05; dashed lines are a ... correction (P = 0.05/1,358). Diamonds encircling phenotype ... NHGRI Catalog associations. (a) PheWAS associations for ... previously associated with hair and eye color, freckling and ... palsy. (b) PheWAS associations for rs2853676 in TERT, ... glioma. (c) PheWAS associations for rs4977574 near ... previously associated with myocardial infarction, and in ... (d) PheWAS associations for rs660895 near HLA-DRB1, ... rheumatoid arthritis. Results and plots for all SNPs ... study are available at http://phewascatalog.org/."
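The PheWAS caption mentions two kinds of thresholds: a Bonferroni line at 0.05/1,358 and a false discovery rate (FDR) line. The sketch below shows the Bonferroni arithmetic and a generic Benjamini-Hochberg step-up procedure. Treating the FDR line as Benjamini-Hochberg is an assumption (the cropped caption does not say which method was used), and the example p-values are made up apart from the two quoted in the excerpt.

```python
def benjamini_hochberg(pvalues, q=0.1):
    """Indices of p-values declared significant at FDR level q (step-up rule)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is declared significant.
    return sorted(order[:cutoff_rank])

n_phenotypes = 1358
bonferroni = 0.05 / n_phenotypes
print(f"Bonferroni threshold for {n_phenotypes} phenotypes: {bonferroni:.2e}")

# Two p-values from the excerpt (RA and type 1 diabetes) plus made-up fillers.
example = [6.7e-8, 7.1e-8, 0.002, 0.04, 0.3, 0.8]
print("pass Bonferroni:", [p for p in example if p < bonferroni])
print("pass BH at q=0.1 (indices):", benjamini_hochberg(example, q=0.1))
```

The step-up rule admits every p-value ranked below the last one that clears its rank-specific threshold, which is why FDR control is typically less conservative than Bonferroni.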
MWAS:
Medication-wide association study

Ryan, PB., CPT 2013
[Figure: MWAS plot for the outcome "OMOP acute myocardial infarction 1" in MarketScan CCAE (panel a); y-axis: p_full, log scale from 1.0E-001 to 1.0E-012; x-axis: drugs grouped by ATC class (atc1_concept_name, atc3_concept_name, rxnorm_concept_name); color by atc1_concept_name (e.g., ALIMENTARY TRACT AND METABOLISM, CARDIOVASCULAR SYSTEM, NERVOUS SYSTEM, RESPIRATORY SYSTEM); shape by GROUND_TRUTH; horizontal lines at P < 0.05 and at the Bonferroni adjustment. Source: www.nature.com/psp]
In conclusion:
on GWAS and EWAS
GWAS has been unparalleled in biological discovery...

... coupled with EWAS, will lead to precise and personal
medicine.
[Figure: EWAS Manhattan plot; y-axis: −log10(pvalue), 0 to 2; x-axis: exposure categories, acrylamide through volatile compounds]
... to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits, between 10% and 20% of genetic variance has been accounted for (Table 1). In comparison to the pre-GWAS era, the proportion of genetic variation accounted for by newly discovered variants that are segregating in the population is large.
It is clear that for most complex traits that have been investigated by GWAS, multiple identified loci have genome-wide statistical significance, and thus it is likely that there are (many) other loci that have not been identified because of a lack of statistical significance (false negatives). Recently, researchers have developed and applied methods to quantify the proportion of phenotypic varia-...
Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10−8 are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r² > 0.8 estimated from the entire HapMap samples are excluded.
Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size
(A) Selected quantitative traits.
(B) Selected diseases.
The coordinates are on the log scale. The complex traits were selected with the criteria that there were at least three GWAS papers published on each in journals with a 2010–2011 journal ...
Harvard HMS
Isaac Kohane

Susanne Churchill

Stan Shaw

Nathan Palmer

Jenn Grandfield

Sunny Alvear

Michal Preminger

Harvard Chan
Hugues Aschard

Francesca Dominici

Stanford
John Ioannidis

Atul Butte (UCSF)

U Queensland
Jian Yang

Peter Visscher

Cochrane
Belinda Burford
Chirag Lakhani
Adam Brown
Arjun Manrai
Erik Corona
Nam Pho
Chirag J Patel

chirag@hms.harvard.edu

@chiragjp

www.chiragjpgroup.org
CDC/NCHS
Ajay Yesupriya

Imperial
Ioanna Tzoulaki

Paul Elliott

Lund (Sweden)
Jan Sundquist

Kristina Sundquist
NIH Common Fund

Big Data to Knowledge
Thanks...

Intro to Biomedical Informatics 701

  • 1.
    Bioinformatics for discovery: Introductionto GWAS and EWAS BMI 701:Introduction to Biomedical Informatics 12/1/2015 chirag@hms.harvard.edu @chiragjp www.chiragjpgroup.org Chirag J Patel
  • 2.
    P = G+ EType 2 Diabetes Cancer Alzheimer’s Gene expression Phenotype Genome Variants Environment Infectious agents Nutrients Pollutants Drugs Complex traits are a function of genes and environment...
  • 3.
    We are greatat G investigation! over 2000 Genome-wide Association Studies (GWAS) https://www.ebi.ac.uk/gwas/ G
  • 4.
    >2,000 traits/diseases >15,000 SNPs >16,000SNP-trait associations https://www.ebi.ac.uk/gwas/
  • 5.
    Dissecting G inP: What is a Genome-wide Association Study? Hypothesis-free “search engine” for genetic variants associated with a complex trait or disease in unrelated populations SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(A) SNP(a) diseased non- diseased SNP(Z) SNP(z) diseased non- diseasedgenome-wide
  • 6.
    The road toGWAS...
  • 7.
    A new paradigmof GWAS for discovery of G in P: Human Genome Project to GWAS Sequencing of the genome 2001 HapMap project: http://hapmap.ncbi.nlm.nih.gov/ Characterize common variation 2001-current day High-throughput variant assay < $99 for ~1M variants Measurement tools ~2003 (ongoing) ARTICLES Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls The Wellcome Trust Case Control Consortium* There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ,2,000 individuals for each of 7 major diseases and a shared set of ,3,000 controls. Case-control comparisons identified 24 independent association signals at P , 5 3 1027 : 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a 25 27 Vol 447|7 June 2007|doi:10.1038/nature05911 Nature 2008 Comprehensive, high-throughput analyses GWAS
  • 8.
    Number of rawpublications with subject of “GWAS” 0 1000 2000 3000 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Year NumberofPublications'GWAS' pubmed MeSH terms: human + GWAS
  • 9.
    Number of rawpublications with subject of “GWAS” 0 1000 2000 3000 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Year NumberofPublications'GWAS' pubmed MeSH terms: human + GWAS Risch + Merikangas linkage vs. association human genome sequenced GWAS age-related macular degeneration mega-meta-GWAS WTCCC GWAS is relevant today (even with NGS) around the corner
  • 10.
  • 11.
    Geneticists have madesubstantial progress in identifying the genetic basis of many human diseases, at least those with conspicuous deter- minants.ThesesuccessesincludeHuntington's disease, Alzheimer's disease, and some forms of breast cancer. However, the detection of ge- netic factors for complex diseases-such as schizophrenia, bipolardisorder, anddiabetes- has been far more complicated. There have been numerous reports of genes or loci that might underlie these disorders, butfew ofthese findings have been replicated. The modest na- ture ofthe gene effectsforthese disorders likely explains the contradictory and inconclusive claims about their identification. Despite the small effects of such genes, the magnitude of theirattributable risk (theproportion ofpeople affectedduetothem) maybelargebecause they are quite frequent in the population, making them ofpublic health significance. Has the genetic study ofcomplex disorders reached its limits? The persistent lack of replicability of these reports of linkage be- tween various loci and complex diseases might imply that it has. We argue below that age analysis we have chosen for this argu- ment is a popular current paradigm in which pairs of siblings, both with the disease, are examined for sharing of alleles at multiple sites in the genome defined by genetic mark- ers. The more often the affected siblings share the same allele at a particular site, the more likely the site is close to the disease gene. Using the formulas in (1), we calculate the expected proportion Yofalleles shared by a pair ofaffected siblings for the best possible case-that is, a closely linked marker locus (recombination fraction 0 = 0) that is fully informative (heterozygosity = 1) (2)-as 1 +W wherew= pq(y-1)2 2+w (py+q)2 If there is no linkage of a marker at a particular site to the disease, the siblings would be expected to share alleles 50% ofthe time; that is, Y would equal 0.5. 
Values of Y for various values ofp and y are given in the third column of the table. For an allele of moderate frequency (p is 0.1 to 0.5) that con- linkage analysis for about 2 or less will ne because the numbe (more than -2500) able. Although testsof est effect are of low above example, direc a disease locus itself To illustrate this poi sion/disequilibrium t In this test, transmis at a locus from heter affected offspring is e lian inheritance, all a chance ofbeing tran eration. In contrast, associated with dise mitted more often th For this approach, with multiple affect just on single affect parents. For the same can calculate the pr parents as pq(y + 1 the probability for a transmit the high ris Association tests ca pairs of affected sibl associatedwithdiseas over 50% is the same the probability ofpar creased at lowvalues the probability ofpar creased. The formula The Future of Genetic Studies of Complex Human Diseases Neil Risch and Kathleen Merikangas onimm, 0In"a0,"a, Geneticists have made substantial progress in identifying the genetic basis of many human diseases, at least those with conspicuous deter- minants.ThesesuccessesincludeHuntington's disease, Alzheimer's disease, and some forms of breast cancer. However, the detection of ge- netic factors for complex diseases-such as schizophrenia, bipolardisorder, anddiabetes- has been far more complicated. There have been numerous reports of genes or loci that might underlie these disorders, butfew ofthese findings have been replicated. The modest na- ture ofthe gene effectsforthese disorders likely explains the contradictory and inconclusive claims about their identification. Despite the small effects of such genes, the magnitude of theirattributable risk (theproportion ofpeople affectedduetothem) maybelargebecause they are quite frequent in the population, making them ofpublic health significance. Has the genetic study ofcomplex disorders reached its limits? 
The persistent lack of replicability of these reports of linkage be- tween various loci and complex diseases might imply that it has. We argue below that age analysis we have chosen for this ar ment is a popular current paradigm in whi pairs of siblings, both with the disease, examined for sharing of alleles at multip sites in the genome defined by genetic mar ers. The more often the affected sibli share the same allele at a particular site, t more likely the site is close to the dise gene. Using the formulas in (1), we calcul the expected proportion Yofalleles shared a pair ofaffected siblings for the best possi case-that is, a closely linked marker lo (recombination fraction 0 = 0) that is fu informative (heterozygosity = 1) (2)-as 1 +W wherew= pq(y-1)2 2+w (py+q)2 If there is no linkage of a marker at particular site to the disease, the sibli would be expected to share alleles 50% oft time; that is, Y would equal 0.5. Values o for various values ofp and y are given in t third column of the table. For an allele moderate frequency (p is 0.1 to 0.5) that co The Future of Genetic Studies of Complex Human Diseases Neil Risch and Kathleen Merikangas Science, 1996 A new paradigm is needed for discovery!
  • 12.
    How does aGWAS work?
  • 13.
    Single nucleotide polymorphisms(SNPs): How many SNPs are in the human genome? >3,000,000,000 bases in human genome SNPs appear ~1000 bases ~3,000,000 SNPs 40-60% have minor allele frequency <5% GWAS focus on frequency >5% HapMap Consortium, 2010
  • 14.
    Can’t measure everything: TagSNPs and Linkage Disequilibrium (LD) LD = co-occurance of SNPs in a contiguous region Bush and Moore, 2012
  • 15.
    The phenomenon ofLD makes GWAS possible: How and why?: Indirect association additional studies to map the precise location of the influential SNP. Conceptually, the end result of GWAS under the common disease/common var- needed to capture the variation African genome. It is important to note that t ogy for measuring genomic Figure 3. Indirect Association. Genotyped SNPs often lie in a region of high linka will be statistically associated with disease as a surrogate for the disease SNP throu doi:10.1371/journal.pcbi.1002822.g003 Bush and Moore, 2012 LD blocks
  • 16.
    Can’t measure everything: TagSNPs and Linkage Disequilibrium. Tag SNPs are common proxies for other SNPs. 500K - 1M per chip. LD block: 2 alleles are correlated because they are inherited together. [Figure, panels a-d: -log10[P] for SNPs across LD blocks near SLC30A8, IDE, EXT2, and ALX4] Sladek et al, 2007
  • 17.
    Digitizing SNPs: e.g., Illumina Infinium Array (images: www.lifa-core.de/, illumina.com)
  • 18.
    Assessing Thousands of Factors Simultaneously: Data-driven search for differences in SNP frequencies. ~100,000 - ~1,000,000 association tests. [Sequence diagram: disease cases and healthy controls carry GCAGGTACATG...GGTA... or GCAGGTACACG...GGTA...; after grouping, the disease cases all carry GCAGGTACATG...GGTA...]
  • 19.
    Associating One SNP with Disease: Case-Control Study Design. SNP (A/a) ? Disease. 2x2 table: columns A, a; rows diseased (cases), non-diseased (controls)
  • 20.
    Associating One SNP with Disease: What is an “Odds Ratio”? SNP (A/a) ? Disease. 2x2 table (columns A, a): diseased (cases) c, d; non-diseased (controls) x, y. Chi-squared test. Odds Ratio a vs A: odds of disease with allele a vs. odds of disease with allele A. 1: equal odds (no difference). >1: increased odds (increased risk). <1: decreased odds (decreased risk)
  • 21.
    Associating One SNP with Disease: Calculating the Odds Ratio. SNP (A/a) ? Disease. 2x2 table (columns A, a): diseased (cases) c, d; non-diseased (controls) x, y. Chi-squared test. Odds with allele a = [d/(d+y)]/[y/(d+y)] = d/y. Odds with allele A = [c/(c+x)]/[x/(c+x)] = c/x. Odds Ratio a vs A = (d/y)/(c/x) = (d/c)/(y/x) = dx/cy. How would you interpret an OR of 2?
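The odds ratio and the chi-squared test on the 2x2 table can be sketched in a few lines. The cell names c, d, x, y follow the slide; the function names and the example counts are hypothetical, and the p-value uses the 1-degree-of-freedom identity P = erfc(sqrt(chi2 / 2)).

```python
import math


def odds_ratio(c, d, x, y):
    """Odds ratio for allele a vs. A: (d/y) / (c/x) = d*x / (c*y)."""
    return (d * x) / (c * y)


def chi_squared(c, d, x, y):
    """Pearson chi-squared statistic (1 df, no continuity correction) for the 2x2 table."""
    n = c + d + x + y
    return n * (c * y - d * x) ** 2 / ((c + d) * (x + y) * (c + x) * (d + y))


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Hypothetical allele counts: diseased (A=100, a=200), non-diseased (A=200, a=200).
    c, d, x, y = 100, 200, 200, 200
    chi2 = chi_squared(c, d, x, y)
    print(odds_ratio(c, d, x, y))      # 2.0: odds of disease are doubled with allele a
    print(round(chi2, 2), p_value_1df(chi2))
```

An OR of 2 here means the odds of being a case are twice as high for allele a as for allele A; it is not the same as a doubled risk unless the disease is rare.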
  • 22.
    Associating One SNP with Disease: Cohort Study Design. SNP (A/a) ? Disease vs. SNP (A/a) ? Non-diseased. • Direct measure of risk vs. odds ratio • Need to wait! • If incidence is low, N needs to be large! Cox survival regression. Relative Risk
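To make the contrast between a direct measure of risk and an odds ratio concrete, here is a small sketch with invented cohort counts (30 of 1000 carriers and 15 of 1000 non-carriers develop disease). The function names and numbers are assumptions for illustration only.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of disease risks (incidence proportions) between two groups."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


def cohort_odds_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of disease odds between the same two groups."""
    odds_exposed = cases_exposed / (n_exposed - cases_exposed)
    odds_unexposed = cases_unexposed / (n_unexposed - cases_unexposed)
    return odds_exposed / odds_unexposed


if __name__ == "__main__":
    # Hypothetical cohort followed over time.
    rr = relative_risk(30, 1000, 15, 1000)
    odds = cohort_odds_ratio(30, 1000, 15, 1000)
    # For a rare outcome the two measures are close; they diverge as incidence grows.
    print(round(rr, 3), round(odds, 3))
```

The low incidence in this toy cohort also shows why N needs to be large: only 45 of 2000 participants ever become cases.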
  • 23.
    Models to associate genotypes with disease: Examples for a case-control study. Genotypes: Aa AA AA aa Aa Aa aa Aa. Diseased (ND=4), Non-diseased (NC=4)
  • 24.
    Models to associate genotypes with disease: Examples for a case-control study. Genotypes: Aa AA AA aa Aa Aa aa Aa. Diseased (ND=4), Non-diseased (NC=4). Allele counts (A, a): diseased 6, 2; non-diseased 2, 6. OR A (vs a), OR a (vs A)
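The allelic counts on this slide (6 and 2 among the diseased, 2 and 6 among the non-diseased) can be reproduced by counting alleles in each group. The assignment of genotypes to groups below is an assumption chosen to match those counts, since the transcript does not preserve which individual belongs to which group.

```python
def allele_counts(genotypes):
    """Count (A, a) alleles in a list of genotype strings such as 'AA', 'Aa', 'aa'."""
    joined = "".join(genotypes)
    return joined.count("A"), joined.count("a")


if __name__ == "__main__":
    # Assumed grouping, consistent with the slide's allele counts.
    diseased = ["AA", "AA", "Aa", "Aa"]
    non_diseased = ["Aa", "Aa", "aa", "aa"]

    case_A, case_a = allele_counts(diseased)          # (6, 2)
    ctrl_A, ctrl_a = allele_counts(non_diseased)      # (2, 6)

    or_A_vs_a = (case_A / case_a) / (ctrl_A / ctrl_a)  # (6/2) / (2/6) = 9
    or_a_vs_A = 1 / or_A_vs_a                          # reciprocal, 1/9
    print(or_A_vs_a, or_a_vs_A)
```

The two odds ratios are reciprocals of each other, so the choice of reference allele changes the number but not the strength of the association.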
  • 25.
    Models to associate genotypes with disease: Genotypic Test (“2 or 1 df test”). Genotypes: Aa AA AA aa Aa Aa aa Aa. Diseased (ND=4), Non-diseased (NC=4). Genotype counts (AA, Aa, aa): diseased 2, 2, 0; non-diseased 0, 2, 2. OR: AA (vs. Aa), aa (vs. Aa)
  • 26.
    Associating One SNP with Quantitative Trait (e.g., height, weight, cholesterol). [Plots: trait (height) by genotype for SNP rs1234 (GG, GC, CC) and for SNP rs123456 (CC, CT, TT)]
  • 27.
    Associating One SNP with Quantitative Trait: Linear Regression and Additive Risk Model. y = α + βx + ε. For SNP rs123456: height = α + βx, where x = 0 if individual is CC, x = 1 if individual is CT, x = 2 if individual is TT. T = risk allele. α: intercept. β: change in height for 1 risk allele. [Plot: height by genotype CC (0), CT (1), TT (2)]
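The additive model can be fit with ordinary least squares after coding each genotype as the number of risk alleles. This is a minimal sketch with invented height values; the function names are hypothetical and a real analysis would also report a standard error and p-value for the slope.

```python
def risk_allele_dosage(genotype, risk_allele="T"):
    """Additive coding: number of copies of the risk allele (0, 1, or 2)."""
    return genotype.count(risk_allele)


def fit_additive(dosages, trait):
    """Least-squares fit of trait = alpha + beta * dosage; returns (alpha, beta)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


if __name__ == "__main__":
    # Hypothetical genotypes and trait values for six individuals.
    genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]
    height = [60.0, 62.0, 70.0, 72.0, 80.0, 82.0]

    x = [risk_allele_dosage(g) for g in genotypes]   # [0, 0, 1, 1, 2, 2]
    alpha, beta = fit_additive(x, height)
    print(alpha, beta)   # intercept 61.0, slope 10.0 per copy of T
```

The slope is the estimated change in the trait for each additional copy of the risk allele, which is exactly the β on the slide.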
  • 28.
    Prototypical “Manhattan plot” to visualize associations. ~100,000 - ~1,000,000 association tests. [Plot: -log10(P) by chromosome (1-22, X); inset: genotype table (AA, Aa, aa) by diseased / non-diseased] Science, 2007; Nature, Vol 447, 7 June 2007
  • 29.
    [Figure: -log10(P) by chromosome (1-22, X) for Bipolar disorder, Coronary artery disease, Crohn’s disease, Hypertension, Rheumatoid arthritis, Type 1 diabetes, and Type 2 diabetes] Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases -log10 of the trend test P value for quality-control-positive SNPs, excluding ... Chromosomes are shown in alternating colours for clarity, with P values < 1 x 10^-5 highlighted in green. All panels are truncated at ...
  • 30.
    Type I Error: False Positives! What is a p-value? The probability of obtaining a result at least as extreme as the one observed if there is no difference (H0). Many tests: some can be significant (low p-value) by chance! 100 tests at a p-value of 0.05... how many would be significant by chance? Bonferroni “correction”: divide the 0.05 significance level by the number of tests, e.g., 1M SNPs: 0.05/(1x10^6) = 5x10^-8
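The arithmetic behind the multiple-testing problem is short enough to check directly. The sketch below uses hypothetical function names; the family-wise error rate calculation additionally assumes the tests are independent, which is only an approximation for SNPs in LD.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of tests with p < alpha when every null hypothesis is true."""
    return n_tests * alpha


def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate near alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    print(expected_false_positives(100))          # about 5 of 100 tests by chance
    print(round(family_wise_error_rate(100), 3))  # about 0.994
    print(bonferroni_threshold(1_000_000))        # 5e-08 for 1M SNPs
```

The last value is the conventional genome-wide significance threshold quoted on the slide.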
  • 31.
    QQplot: Distribution of observed p-values vs. H0 p-values. [Histogram of runif(10000): p-values under H0 (random uniform distribution)] [Histogram of gwas$P.value: p-values of GWAS in Total Cholesterol, Global Lipids Consortium, 2012]
  • 32.
    QQplot: Distribution of observed p-values vs. H0 p-values. [Histogram of gwas$P.value: p-values of GWAS in Total Cholesterol]
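A QQ plot pairs each observed p-value with the value expected at the same rank under H0, where p-values are uniform on (0, 1). The sketch below computes the plotted points on the -log10 scale; the function name and the (rank + 0.5)/n plotting positions are assumptions (one common convention among several), and the simulated p-values stand in for real GWAS output.

```python
import math
import random


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p), largest values first."""
    n = len(p_values)
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    # Under H0 the i-th smallest of n uniform p-values is roughly (i + 0.5) / n.
    expected = [-math.log10((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))


if __name__ == "__main__":
    random.seed(0)
    # Simulated null p-values in (0, 1]; analogous to runif(10000) on the slide.
    null_p = [1.0 - random.random() for _ in range(10000)]
    points = qq_points(null_p)
    # Under H0 the points should hug the diagonal (observed close to expected).
    for expected, observed in points[:3]:
        print(round(expected, 2), round(observed, 2))
```

Points that rise above the diagonal at the upper end suggest true associations; a departure along the whole line suggests systematic inflation such as population stratification.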
  • 33.
    Which diseases show evidence of association? Examining the QQplot of test statistics in WTCCC. [Article excerpt] ...sent study cannot provide conclusive exclusion of any given gene. This is the consequence of several factors including: less-than-complete coverage of common variation genome-wide on the Affymetrix chip; poor coverage (by design) of rare variants, including many structural variants (thereby reducing power to detect rare, penetrant, alleles); difficulties with defining the full genomic extent of the gene of interest; and, despite the sample size, relatively low power to detect, at levels of ... already allow us, for selected diseases, to highlight pathways and mechanisms of particular interest. Naturally, extensive resequencing and fine-mapping work, followed by functional studies will be required before such inferences can be translated into robust statements about the molecular and physiological mechanisms involved. We turn now to a discussion of the main findings for each disease, focusing here only on the most significant and interesting results ... [Plots: observed test statistic vs. expected chi-squared value for BD, CAD, CD, HT, RA, T1D, T2D] Figure 3 | Quantile-quantile plots for seven genome-wide scans. For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency >1% and missing data rate <1%. SNPs that ... 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and ...
  • 34.
    Observational associations do not equal causation...
  • 35.
    Ice Cream → Drowning? Confounding bias. What is a confounder? Summer! A confounder is correlated with both the “risk” factor and the disease, leading to invalid inference. Common source of bias in observational studies (e.g., case-control, cohort, etc.)
  • 36.
    Population Stratification: A source of possible confounding in GWAS. SNP ? Disease, with race/ethnicity as a possible confounder. Ancestry is correlated with allele frequency and disease. GWAS are done on specific populations separately (most have been done in populations of European ancestry).
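A toy calculation shows how stratification can create an association out of nothing. In the invented counts below, the allele has no effect within either population (OR = 1), but one population has both a higher allele frequency and more cases, so pooling the two yields a spurious OR above 1. All names and numbers are illustrative assumptions.

```python
def odds_ratio(case_a, case_A, ctrl_a, ctrl_A):
    """Odds ratio for allele a vs. A from allele counts in cases and controls."""
    return (case_a * ctrl_A) / (case_A * ctrl_a)


if __name__ == "__main__":
    # (case_a, case_A, ctrl_a, ctrl_A) allele counts per population (hypothetical).
    population_1 = (80, 20, 40, 10)   # allele a common, many cases sampled
    population_2 = (10, 40, 20, 80)   # allele a rare, few cases sampled

    pooled = tuple(p1 + p2 for p1, p2 in zip(population_1, population_2))

    print(odds_ratio(*population_1))  # 1.0: no association within population 1
    print(odds_ratio(*population_2))  # 1.0: no association within population 2
    print(odds_ratio(*pooled))        # 2.25: spurious association after pooling
```

This is the reason analyses are run within ancestry groups, or adjusted for ancestry, rather than on a naively pooled sample.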
  • 37.
    Mediation: SNPs indicative of a mediator factor? Example: FTO and Type 2 Diabetes. FTO → Body Mass → Diabetes? Association between FTO and Type 2 Diabetes via BMI? ... or does FTO have an independent role in Type 2 Diabetes...?
  • 38.
  • 39.
    PLINK: (Standard) Whole Genome Analysis Software. http://pngu.mgh.harvard.edu/~purcell/plink/ • cited >9000 times since 2007 • allele frequency • linkage disequilibrium (LD) • data manipulation/filtering • association: allelic, genotypic models • chi-square • logistic • linear
  • 40.
    Examples: GWASs in Type 2 Diabetes
  • 41.
    Type 2 Diabetes Mellitus: A complex, multifactorial disease • Insulin production vs. use • beta-cell function • insulin sensitivity (BMI) • Moves glucose from blood into cells • Complications arise due to glucose in blood, hyperglycemia • diagnosed by blood glucose levels. CDC, family history: 25%; body weight, diet, lifestyle, age
  • 42.
    ARTICLES A genome-wide associationstudy identifies novel risk loci for type 2 diabetes Robert Sladek1,2,4 , Ghislain Rocheleau1 *, Johan Rung4 *, Christian Dina5 *, Lishuang Shen1 , David Serre1 , Philippe Boutin5 , Daniel Vincent4 , Alexandre Belisle4 , Samy Hadjadj6 , Beverley Balkau7 , Barbara Heude7 , Guillaume Charpentier8 , Thomas J. Hudson4,9 , Alexandre Montpetit4 , Alexey V. Pshezhetsky10 , Marc Prentki10,11 , Barry I. Posner2,12 , David J. Balding13 , David Meyre5 , Constantin Polychronakos1,3 & Philippe Froguel5,14 Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of which were hitherto unknown. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935 single-nucleotide polymorphisms in a French case–control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing b-cells, and two linkage disequilibrium blocks that contain genes potentially involved in b-cell development or function (IDE–KIF11–HHEX and EXT2–ALX4). These associations explain a substantial portion of disease risk and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits. The rapidly increasing prevalence of type 2 diabetes mellitus (T2DM) is thought to be due to environmental factors, such as increased availabil- ity of food and decreased opportunity and motivation for physical activity, acting on genetically susceptible individuals. 
The heritability of T2DM is one of the best established among common diseases and, consequently, genetic risk factors for T2DM have been the subject of intense research1 . Although the genetic causes of many monogenic forms of diabetes (maturity onset diabetes in the young, neonatal mito- chondrial and other syndromic types of diabetes mellitus) have been elucidated, few variants leading to common T2DM have been clearly identified and individually confer only a small risk (odds ratio < 1.1– 1.25) of developing T2DM1 . Linkage studies have reported many T2DM-linked chromosomal regions and have identified putative, cau- sative genetic variants in CAPN10 (ref. 2), ENPP1 (ref. 3), HNF4A (refs 4, 5) and ACDC (also called ADIPOQ)6 . In parallel, candidate-gene studieshavereportedmanyT2DM-associatedloci,withcodingvariants in the nuclear receptor PPARG (P12A)7 and the potassium channel KCNJ11 (E23K)8 being among the very few that havebeen convincingly replicated. The strongest known (odds ratio < 1.7) T2DM association9 was recently mapped to the transcription factor TCF7L2 and has been consistently replicated in multiple populations10–20 . Subjects and study design The recent availability of high-density genotyping arrays, which com- bine the power of association studies with the systematic nature of a genome-wide search, led us to undertake a two-stage, genome-wide association study to identify additional T2DM susceptibility loci (Supplementary Fig. 1). In the first stage of this study, we obtained genotypes for 392,935 single-nucleotide polymorphisms (SNPs) in 1,363 T2DM cases and controls (Supplementary Table 1). In order to enrich for risk alleles21 , the diabetic subjects studied in stage 1 were selected to have at least one affected first degree relative and age at onset under 45 yr (excluding patients with maturity onset diabetes in the young). 
Furthermore, in order to decrease phenotypic hetero- geneity and to enrich for variants determining insulin resistance and b-cell dysfunction through mechanisms other than severe obesity, we initially studied diabetic patients with a body mass index (BMI) ,30 kg m22 . Control subjects were selected to have fasting blood glucose ,5.7 mmol l21 in DESIR, a large prospective cohort for the study of insulin resistance in French subjects22 . Genotypes for each study subject were obtained using two plat- forms: Illumina Infinium Human1 BeadArrays, which assay 109,365 SNPs chosen using a gene-centred design; and Human Hap300 BeadArrays, which assay 317,503 SNPs chosen to tag haplotype blocks identified by the Phase I HapMap23 . Of the 409,927 markers that passed quality control (Supplementary Tables 2 and 3), geno- types were obtained for an average of 99.2% (Human1) and 99.4% (Hap300) of markers for each subject with a reproducibility of .99.9% (both platforms). Forty-three subjects were removed from analysis because of evidence of intercontinental admixture (Sup- plementary Fig. 3) and an additional four because their genotype- determined gender disagreed with clinical records. In total, T2DM association was tested for 100,764 (Human1) and 309,163 (Hap300) SNPs representing 392,935 unique loci (Fig. 1). Because of unequal male/female ratios in our cases and controls, we analysed the 12,666 sex-chromosome SNPs separately for each gender. *These authors contributed equally to this work. 1 Departments of Human Genetics, 2 Medicine and 3 Pediatrics, Faculty of Medicine, McGill University, Montreal H3H 1P3, Canada. 4 McGill University and Genome Quebec Innovation Centre, Montreal H3A 1A4, Canada. 5 CNRS 8090-Institute of Biology, Pasteur Institute, Lille 59019 Cedex, France. 6 Endocrinology and Diabetology, University Hospital, Poitiers 86021 Cedex, France. 7 INSERM U780-IFR69, Villejuif 94807, France. 
8 Endocrinology-Diabetology Unit, Corbeil-Essonnes Hospital, Corbeil-Essonnes 91100, France. 9 Ontario Institute for Cancer Research, Toronto M5G 1L7, Canada. 10 Montreal Diabetes Research Center, Montreal H2L 4M1, Canada. 11 Molecular Nutrition Unit and the Department of Nutrition, University of Montreal and the Centre Hospitalier de l’Universite´ de Montre´al, Montreal H3C 3J7, Canada. 12 Polypeptide Hormone Laboratory and Department of Anatomy and Cell Biology, Montreal H3A 2B2, Canada. 13 Department of Epidemiology & Public Health, Imperial College, St Mary’s Campus, Norfolk Place, London W2 1PG, UK. 14 Section of Genomic Medicine, Imperial College London W12 0NN, and Hammersmith Hospital, Du Cane Road, London W12 0HS, UK. 881 Nature©2007 Publishing Group Nature, 2/2007 References and Notes 1. B. G. Richmond, D. S. Strait, Nature 404, 382 (2000). 2. J. Kingdon, Lowly Origins (Princeton Univ. Press, Princeton, NJ, 2003). 3. C. V. Ward, M. G. Leakey, A. Walker, Evol. Anthropol. 7, 197 (1999). 4. Y. Haile-Selassie, Nature 412, 178 (2001). 5. T. D. White et al., Nature 440, 883 (2006). 6. K. Kovarovic, P. Andrews, J. Hum. Evol., in press (available at http://dx.doi.org./doi:10.1016/j.jhevol.2007.01.001; doi: 10.1016/j.jhevol.2007.01.001). 7. N. Patterson, D. J. Richter, S. Gnerre, E. S. Lander, D. Reich, Nature 441, 1103 (2006). 8. K. D. Hunt et al., Primates 37, 363 (1996). 9. J. G. Fleagle et al., Symp. Zool. Soc. London 48, 359 (1981). 10. R. H. Crompton et al., Cour. Forsch-Inst. Senckenb. 243, 115 (2003). 11. J. T. Stern, Yrb. Phys. Anthropol. 19, 59 (1975). 12. S. K. S. Thorpe, R. H. Crompton, Am. J. Phys. Anthropol. 131, 384 (2006). 13. K. D. Hunt, J. Hum. Evol. 26, 183 (1994). 15. E. Larney, S. Larsen, Am. J. Phys. Anthropol. 125, 42 (2004). 16. S. K. S. Thorpe, R. H. Crompton, Am. J. Phys. Anthropol. 127, 58 (2005). 17. S. K. S. Thorpe, R. H. Crompton, M. M. Gunther, R. F. Ker, R. McN. Alexander, Am. J. Phys. Anthropol. 110, 179 (1999). 18. R. McN. 
Alexander, Principles of Animal Locomotion (Princeton Univ. Press, Princeton, NJ, 2003). 19. C. V. Ward, Yrbk. Phys. Anthropol. 45, 185 (2002). 20. R. W. Wrangham, N. L. Conklin-Brittain, K. D. Hunt, Int. J. Primatol. 19, 949 (1998). 21. H. Pontzer, R. W. Wrangham, J. Hum. Evol. 46, 317 (2004). 22. R. C. Payne et al., J. Anat. 208, 709 (2006). 23. M. Pickford, B. Senut, B. Gommery, in Late Cenozoic Environments and Hominid Evolution: a Tribute to Bill Bishop, P. Andrews, P. Banham, Eds. (Geological Society, London, 1999), pp. 27–38. 24. N. M. Young, L. MacLatchy, J. Hum. Evol. 46, 163 (2004). 25. D. Gommery, B. Senu, M. Pickford, E. Musiime, Ann. Paléontol. 88, 167 (2002). 26. C. V. Ward, in Handbook of Paleoanthropology Vol. 2: Primate Evolution and Human Origins, W. Henke, I. Tattersall, Eds. (Springer, Heidelberg, Germany, 2007), pp. 1011–1030. N. Ogihara, M. Nakatsukasa, Eds. (Springer, Heidelberg, Germany, 2006), pp. 199–208. 28. C. P. E. Zollikofer et al., Nature 434, 755 (2005). 29. M. Pickford, Anthropologie 69, 191 (2005). 30. We thank the Indonesian Institute of Science, Indonesian Nature Conservation Service, and Leuser Development Programme for granting permission and giving support for research in the Leuser Ecosystem. R. McN. Alexander, T. M. Blackburn, S. Burtles. J. Rees, N. Jeffery, E. E. Vereecke, A. Walker, A. Wilson, and B. Wood commented on the manuscript. R. Savage developed the animation (fig. S1). Studies of captive animals were hosted by the North of England Zoological Society. This research was supported by grants from the Leverhulme Trust, the Royal Society, the L.S.B. Leakey Foundation, and the Natural Environment Research Council. 
Supporting Online Material www.sciencemag.org/cgi/content/full/316/5829/1328/DC1 Table S1 Movies S1 to S3 5 February 2007; accepted 18 April 2007 10.1126/science.1140799 Genome-Wide Association Analysis Identifies Loci for Type 2 Diabetes and Triglyceride Levels Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University, and Novartis Institutes for BioMedical Research*† New strategies for prevention and treatment of type 2 diabetes (T2D) require improved insight into disease etiology. We analyzed 386,731 common single-nucleotide polymorphisms (SNPs) in 1464 patients with T2D and 1467 matched controls, each characterized for measures of glucose metabolism, lipids, obesity, and blood pressure. With collaborators (FUSION and WTCCC/UKT2D), we identified and confirmed three loci associated with T2D—in a noncoding region near CDKN2A and CDKN2B, in an intron of IGF2BP2, and an intron of CDKAL1—and replicated associations near HHEX and in SLC30A8 found by a recent whole-genome association study. We identified and confirmed association of a SNP in an intron of glucokinase regulatory protein (GCKR) with serum triglycerides. The discovery of associated variants in unsuspected genes and outside coding regions illustrates the ability of genome-wide association studies to provide potentially important clues to the pathogenesis of common diseases. T ype 2 diabetes, obesity, and cardiovascular risk factors are caused by a combination of genetic susceptibility, environment, be- havior, and chance. Whole-genome association studies (WGAS) offer a new approach to gene discovery unbiased with regard to presumed functions or locations of causal variants. This approach is based on Fisher’s theory for additive effects at common alleles (1); human heterozy- to purifying selection, and has been made pos- sible by genomic advances such as the human genome sequence, SNP and HapMap databases, and genotyping arrays (3). 
We studied 1464 patients with T2D and 1467 controls from Finland and Sweden, each characterized for 18 clinical traits: anthropomet- ric measures, glucose tolerance and insulin se- cretion, lipids and apolipoproteins, and blood applying stringent quality-control filters, high- quality genotypes for 386,731 common SNPs were obtained (4). To extend the set of putative causal alleles tested for association, we devel- oped 284,968 additional multimarker (haplo- type) tests based on these SNP genotypes (5, 6). The 671,699 allelic tests capture (correlation co- efficient r2 ≥ 0.8) 78% of common SNPs in HapMap CEU (3). Each SNP and haplotype test was assessed for association to T2D and each of 18 traits with the software package PLINK (http://pngu.mgh. harvard.edu/purcell/plink/). For T2D, a weighted meta-analysis was used to combine results for the population-based and family-based subsam- ples (4). For quantitative traits, multivariable linear or logistic regression with or without co- variates was performed (4). Association results for each SNP, haplotype test, and phenotype are available (www.broad.mit.edu/diabetes/). In genome-wide analysis involving hundreds of thousands of statistical tests, modest levels of bias imposed on the null distribution can over- whelm a small number of true results. We used three strategies to search for evidence of sys- tematic bias from unrecognized population struc- ture, the analytical approach, and genotyping artifacts (7, 8). First, we examined the distribu- tion of P-values in the population-based sam- ple, observing a close match to that expected for a null distribution (genomic inflation factor lGC = 1.05 for T2D). Second, we calculated G. Brice,6 B. Bullman,7 J. Campbell,8 B. Castle,9 R. Cetnarsyj,8 C. Chapman,10 C. Chu,11 N. Coates,12 T. Cole,10 R. Davidson,4 A. Donaldson,13 H. Dorkins,3 F. Douglas,2 D. Eccles,9 R. Eeles,1 F. Elmslie,6 D. G. Evans,7 S. Goff,6 S. Goodman,5 D. Goudie,2 J. Gray,15 L. Greenhalgh,16 H. Gregory,17 S. 
V. Hodgson,6 T. Homfray,6 R. S. Houlston,1 L. Izatt,18 L. Jackson,18 L. Jeffers,19 V. Johnson-Roffey,12 F. Kavalier,18 C. Kirk,19 F. Lalloo,7 C. Langman,18 I. Locke,1 M. Longmuir,4 J. Mackay,20 A. Magee,19 S. Mansour,6 Z. Miedzybrodzka,17 J. Miller,11 P. Morrison,19 V. Murday,4 J. Paterson,21 G. Pichert,18 M. Porteous,8 N. Rahman,6 M. Rogers,15 S. Rowe,22 S. Shanley,1 A. Saggar,6 G. Scott,2 L. Side,23 L. Snadden,4 M. Steel,2 M. Thomas,5 S. Thomas,1 1 Clinical Genetics Service, Royal Marsden Hospital, Downs Road, Sutton, Surrey, SM2 5PT, UK. 2 Department of Clinical Genetics, Ninewells Hospital, Dundee, DD1 9SY, UK. 3 Medical and Community Genetics, Kennedy-Galton Centre, Level 8V, Northwick Park and St. Mark’s NHS Trust, Watford Rd, Harrow, HA1 3UJ, UK. 4 Institute of Medical Genetics, Yorkhill NHS Trust, Dalnair Street, Glasgow, G3 8SJ, UK. 5 Clinical Genetics Department, Royal Devon and Exeter Hospital (Heavitree), Gladstone Road, Exeter, EX1 2ED, UK. 6 Department of Clinical Genetics, St. George’s Hospital Medical School, Jenner Wing, Cranmer Terrace, London, SW17 0RE, UK. 7 Department of Medical Genetics, St. Mary’s Hospital, Hathersage Road, Manchester, M13 0JH, UK. 8 South East of Scotland Clinical Genetics Service, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK. 9 Department of Medical Genetics, The Princess Anne Hospital, Coxford Road, Southampton, S016 5YA, UK. 10 Clinical Genetics Unit, Birmingham Women’s Hospital, Metchley Park Road, Edgbaston, Birmingham, B15 2TG, UK. 11 Yorkshire Regional Genetic Service, Department of Clinical Genetics, Cancer Genetics Building, St. James University Hospital, Beckett Street, Leeds, LS9 7TF, UK. 12 Department of Clinical Genetics, Leicester Royal Infirm- ary, Leicester, LE1 5WW, UK. 13 Department of Clinical Genetics, St Michael’s Hospital, Southwell Street, Bristol, BS2 8EG, UK. 14 Institute of Human Genetics, International Centre for Life, Central Parkway, Newcastle upon Tyne, NE1 3BZ, UK. 
15 Institute of Medical Genetics, University Hospital of Wales, Heath Park, Cardiff, CF14 4XW, UK. 16 Department of Clinical Genetics, Alder Hey Children’s Hospital, Eaton Road, Liverpool L12 2AP, UK. 17 Clinical Genetics Centre, Argyll House, Foresterhill, Aberdeen, AB25 2ZR, UK. 18 Clinical Genetics, 7th Floor New Guy’s House, Guy’s UK. 19 Clinical Belvoir Park H 20 Clinical and Health, 30 G 21 Department Trust, Box 13 22 Department of Chester Ho 23 Department Road, Headin Supporting www.sciencema Materials and Figs. S1 to S8 Tables S1 to S References 9 March 2007 Published onli 10.1126/scien Include this in A Genome-Wide Association Study of Type 2 Diabetes in Finns Detects Multiple Susceptibility Variants Laura J. Scott,1 Karen L. Mohlke,2 Lori L. Bonnycastle,3 Cristen J. Willer,1 Yun Li,1 William L. Duren,1 Michael R. Erdos,3 Heather M. Stringham,1 Peter S. Chines,3 Anne U. Jackson,1 Ludmila Prokunina-Olsson,3 Chia-Jen Ding,1 Amy J. Swift,3 Narisu Narisu,3 Tianle Hu,1 Randall Pruim,4 Rui Xiao,1 Xiao-Yi Li,1 Karen N. Conneely,1 Nancy L. Riebow,3 Andrew G. Sprau,3 Maurine Tong,3 Peggy P. White,1 Kurt N. Hetrick,5 Michael W. Barnhart,5 Craig W. Bark,5 Janet L. Goldstein,5 Lee Watkins,5 Fang Xiang,1 Jouko Saramies,6 Thomas A. Buchanan,7 Richard M. Watanabe,8,9 Timo T. Valle,10 Leena Kinnunen,10,11 Gonçalo R. Abecasis,1 Elizabeth W. Pugh,5 Kimberly F. Doheny,5 Richard N. Bergman,9 Jaakko Tuomilehto,10,11,12 Francis S. Collins,3 * Michael Boehnke1 * Identifying the genetic variants that increase the risk of type 2 diabetes (T2D) in humans has been a formidable challenge. Adopting a genome-wide association strategy, we genotyped 1161 Finnish T2D cases and 1174 Finnish normal glucose-tolerant (NGT) controls with >315,000 single-nucleotide polymorphisms (SNPs) and imputed genotypes for an additional >2 million autosomal SNPs. 
We carried out association analysis with these SNPs to identify genetic variants that predispose to T2D, compared our T2D association results with the results of two similar studies, and genotyped 80 SNPs in an additional 1215 Finnish T2D cases and 1258 Finnish NGT controls. We identify T2D-associated variants in an intergenic region of chromosome 11p12, contribute to the identification of T2D-associated variants near the genes IGF2BP2 and CDKAL1 and the ria (8). We ciation with the log-odd (8). We ob versus 31.6 P values < against the with a large consistent w SNPs that also sugges trols by birt successful; genomic co Analysi allowed us variation in portion, w (8, 13) that equilibrium Centre d’E (Utah resid 1 Department Genetics, Uni USA. 2 Depar Science, 6/2007 Study design: Richa Saxena1–6 and Valeriya Lyssenko7 (Team Leaders), Peter Almgren,7 Paul I. W. de Bakker,1–6 Noël P. Burtt,1 Jose C. Florez,1–6 Hong Chen,8 Joanne Meyer,8 Joel N. Hirschhorn,1,6,9–11 Mark J. Daly,1–3,5 Thomas E. Hughes,8 Leif Groop,7,12 David Altshuler1–6 (Chair) Clinical characterization and phenotypes: Valeriya Lyssenko7 and Richa Saxena1–6 (Team Leaders), Peter Almgren,7 Kristin Ardlie,1 Kristina Bengtsson Boström,13 Noël P. Burtt,1 Hong Chen,8 Jose C. Florez,1–6 Bo Isomaa,14,15 Sekar Kathiresan,1,3,5 Guillaume Lettre,1,6,9–11 Ulf Lindblad,16 Helen N. Lyon,1,6,9–11 Olle Melander,7 Christopher Newton-Cheh,1–3,5 Peter Nilsson,17 Marju Orho- Melander,7 Lennart Råstam,16 Elizabeth K. Speliotes,1,3,6,9–11 Marja-Riitta Taskinen,12 Tiinamaija Tuomi,12,15 Benjamin F. Voight,1–3,5 David Altshuler,1–6 Joel N. Hirschhorn,1,6,9–11 Thomas E. 
Hughes,8 Leif Groop7,12 (Chair) DNA sample QC and diabetes replication genotyping: Candace Guiducci1 and Valeriya Lyssenko7 (Team Leaders), Anna Berglund,7 Joyce Carlson,18 Lauren Gianniny,1 Rachel Hackett,1 Liselotte Hall,18 Johan Holmkvist,7 Esa Laurila,7 Marju Orho-Melander,7 Marketa Sjögren,7 Maria Sterner,18 Aarti Surti1 Margareta Svensson,7 Malin Svensson,7 Ryan Tewhey,1 Noël P. Burtt1 (Chair) Whole genome scan genotyping: Brendan Blumenstiel1 (Team Leader), Melissa Parkin,1 Matthew DeFelice,1 Candace Guiducci,1 Ryan Tewhey,1 Rachel Barry,1 Wendy Brodeur,1 Noël P. Burtt,1 Jody Camarata,1 Nancy Chia,1 Mary Fava,1 John Gibbons,1 Bob Handsaker,1 Claire Healy,1 Kieu Nguyen,1 Casey Gates,1 Carrie Sougnez,1 Diane Gage,1 Marcia Nizzari,1 David Altshuler,1–6 Stacey B. Gabriel1 (Chair) GCKR replication genotyping and analysis (Malmö Diet and Cancer Study): Sekar Kathiresan1,3,5 (Team Leader), Candace Guiducci,1 Aarti Surti,1 Noël P. Burtt,1 Olle Melander,7 Marju Orho-Melander7 (Chair) Statistical analysis: Benjamin F. Voight1–3,5 and Paul I. W. de Bakker1–6 (Team Leaders), Richa Saxena,1–6 Valeriya Lyssenko,7 Peter Almgren,7 Noël P. Burtt,1 Hong Chen,8 Gung-Wei Chirn,8 Qicheng Ma,8 Hemang Parikh,7 Delwood Richardson,8 Darrell Ricke,8 Jeffrey J. Roix,8 Leif Groop,7,12 Shaun Purcell,1,2 David Altshuler,1–6 Mark J. Daly1–3,5 (Chair) 1 Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA 02142, USA. 2 Center for Human Genetic Research, Massachusetts General Hospital, Boston, MA 02114, USA. 3 Department of Medicine, Mas- sachusetts General Hospital, Boston, MA 02114, USA. 4 Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114, USA. 5 Department of Medicine, Harvard Medical School, Boston, MA 02115, USA. 6 Depart- ment of Genetics, Harvard Medical School, Boston, MA 02115, USA. 
7 Department of Clinical Sciences, Diabetes and Endocrinology Research Unit, University Hospital Malmö, Lund University, Malmö, Sweden. 8 Diabetes and Metabolism Disease Area, Novartis Institutes for BioMedical Research, 100 Technology Square, Cambridge, MA 02139, USA. 9 Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA. 10 Division of Endocrinology, Children’s Hospital, Boston, MA 02115, USA. 11 Division of Genetics, Children’s Hospital, Boston, MA 02115, USA. 12 Department of Medicine, Helsinki University Hospital, University of Helsinki, Helsinki, Finland. 13 Skaraborg Institute, Skövde, Sweden. 14 Malmska Municipal Health Center and Hospital, Jakobstad, Finland. 15 Folkhälsan Research Center, Helsinki, Finland. 16 Department of Clinical Sciences, Community Medicine Research Unit, University Hospital Malmö, Lund University, Malmö, Sweden. 17 Department of Clinical Sciences, Medicine Research Unit, University Hospital Malmö, Lund University, Malmö, Sweden. 18 Clinical Chemistry, University Hospital Malmö, Lund University, Malmö, Sweden. 19 Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02115, USA.
Supporting Online Material
www.sciencemag.org/cgi/content/full/1142358/DC1
Materials and Methods
Figs. S1 and S2
Tables S1 to S6
References
9 March 2007; accepted 20 April 2007
Published online 26 April 2007; 10.1126/science.1142358
Include this information when citing this paper.
Replication of Genome-Wide Association Signals in UK Samples Reveals Risk Loci for Type 2 Diabetes
Eleftheria Zeggini,1,2* Michael N. Weedon,3,4* Cecilia M. Lindgren,1,2* Timothy M. Frayling,3,4* Katherine S. Elliott,2 Hana Lango,3,4 Nicholas J. Timpson,2,5 John R. B. Perry,3,4 Nigel W. Rayner,1,2 Rachel M. Freathy,3,4 Jeffrey C. Barrett,2 Beverley Shields,4 Andrew P. Morris,2 Sian Ellard,4,6 Christopher J. Groves,1 Lorna W. Harries,4 Jonathan L. Marchini,7 Katharine R. Owen,1 Beatrice Knight,4 Lon R. Cardon,2 Mark Walker,8 Graham A. Hitman,9 Andrew D. Morris,10 Alex S. F. Doney,10 The Wellcome Trust Case Control Consortium (WTCCC),† Mark I. McCarthy,1,2‡§ Andrew T. Hattersley3,4‡
The molecular mechanisms involved in the development of type 2 diabetes are poorly understood. Starting from genome-wide genotype data for 1924 diabetic cases and 2938 population controls generated by the Wellcome Trust Case Control Consortium, we set out to detect replicated diabetes association signals through analysis of 3757 additional cases and 5346 controls and by integration of our findings with equivalent data from other international consortia. We detected diabetes susceptibility loci in and around the genes CDKAL1, CDKN2A/CDKN2B, and IGF2BP2 and confirmed the recently described associations at HHEX/IDE and SLC30A8. Our findings provide insight into the genetic architecture of type 2 diabetes, emphasizing the contribution of
Here, we describe how integration of data from the WTCCC scan and our own replication studies with similar information generated by the Diabetes Genetics Initiative (DGI) (6) and the Finland–United States Investigation of NIDDM Genetics (FUSION) (7) has identified several additional susceptibility variants for T2D. In the WTCCC study, analysis of 490,032 autosomal SNPs in 16,179 samples yielded 459,448 SNPs that passed initial quality control (5). We considered only the 393,453 autosomal SNPs with minor allele frequency (MAF) exceeding 1% in both cases and controls and no extreme departure from Hardy-Weinberg equilibrium (P < 10⁻⁴ in cases or controls) (8). This T2D-specific data set shows no evidence of substantial confounding from population substructure and genotyping biases (8).
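The two SNP-level filters quoted above (MAF exceeding 1% in both groups, and no Hardy-Weinberg departure with P < 10⁻⁴ in either group) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names and genotype counts are hypothetical, and the Hardy-Weinberg check is a 1-df chi-square goodness-of-fit test, which may differ from the test the authors actually used (their reference 8 is not reproduced in this deck).

```python
import math

def minor_allele_frequency(n_AA, n_Aa, n_aa):
    """Frequency of the rarer allele, computed from genotype counts."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    return min(p, 1 - p)

def hwe_p_value(n_AA, n_Aa, n_aa):
    """Chi-square (1 df) test of Hardy-Weinberg proportions (an assumed choice of test)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(case_counts, control_counts, maf_min=0.01, hwe_min=1e-4):
    """Keep a SNP only if both groups pass the MAF and HWE filters."""
    for counts in (case_counts, control_counts):
        if minor_allele_frequency(*counts) <= maf_min:
            return False
        if hwe_p_value(*counts) < hwe_min:
            return False
    return True

# Hypothetical genotype counts (AA, Aa, aa) for one SNP.
print(passes_qc((250, 500, 250), (360, 480, 160)))  # both groups in HWE
print(passes_qc((500, 0, 500), (250, 500, 250)))    # cases have no heterozygotes
```

A SNP with no heterozygotes among cases, as in the second call, is the kind of pattern that usually signals a genotype-calling problem rather than biology, which is why such filters come before any association test.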
To distinguish true associations from those reflecting fluctuations under the null or residual errors arising from aberrant allele calling, we first submitted putative signals from the WTCCC study to additional quality control, including cluster-plot visualization and validation genotyping on
  • 43.
    ARTICLES
    A genome-wide association study identifies novel risk loci for type 2 diabetes
    Robert Sladek1,2,4, Ghislain Rocheleau1*, Johan Rung4*, Christian Dina5*, Lishuang Shen1, David Serre1, Philippe Boutin5, Daniel Vincent4, Alexandre Belisle4, Samy Hadjadj6, Beverley Balkau7, Barbara Heude7, Guillaume Charpentier8, Thomas J. Hudson4,9, Alexandre Montpetit4, Alexey V. Pshezhetsky10, Marc Prentki10,11, Barry I. Posner2,12, David J. Balding13, David Meyre5, Constantin Polychronakos1,3 & Philippe Froguel5,14
    Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of which were hitherto unknown. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935 single-nucleotide polymorphisms in a French case–control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell development or function (IDE–KIF11–HHEX and EXT2–ALX4). These associations explain a substantial portion of disease risk and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.
    The rapidly increasing prevalence of type 2 diabetes mellitus (T2DM) is thought to be due to environmental factors, such as increased availability of food and decreased opportunity and motivation for physical activity, acting on genetically susceptible individuals. The heritability of T2DM is one of the best established among common diseases and, consequently, genetic risk factors for T2DM have been the subject of intense research1. Although the genetic causes of many monogenic forms of diabetes (maturity onset diabetes in the young, neonatal mitochondrial and other syndromic types of diabetes mellitus) have been elucidated, few variants leading to common T2DM have been clearly identified and individually confer only a small risk (odds ratio < 1.1–1.25) of developing T2DM1. Linkage studies have reported many T2DM-linked chromosomal regions and have identified putative, causative genetic variants in CAPN10 (ref. 2), ENPP1 (ref. 3), HNF4A (refs
    genotypes for 392,935 single-nucleotide polymorphisms (SNPs) in 1,363 T2DM cases and controls (Supplementary Table 1). In order to enrich for risk alleles21, the diabetic subjects studied in stage 1 were selected to have at least one affected first degree relative and age at onset under 45 yr (excluding patients with maturity onset diabetes in the young). Furthermore, in order to decrease phenotypic heterogeneity and to enrich for variants determining insulin resistance and β-cell dysfunction through mechanisms other than severe obesity, we initially studied diabetic patients with a body mass index (BMI) <30 kg m⁻². Control subjects were selected to have fasting blood glucose <5.7 mmol l⁻¹ in DESIR, a large prospective cohort for the study of insulin resistance in French subjects22. Genotypes for each study subject were obtained using two plat-
    Sladek, 2007
    How many SNPs (p-value?)
    European-based; N ~ 1000
    cases: high fasting blood glucose/non-obese
    controls: non-obese
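One way to approach the slide's question "How many SNPs (p-value?)" is a Bonferroni correction over the 392,935 SNPs tested. The sketch below is only an illustration of that arithmetic: the 0.05 family-wise error rate is an assumed convention, and Sladek et al. did not necessarily select markers with this threshold (the abstract says the most significant markers were fast-tracked to a second cohort).

```python
# Bonferroni-style per-SNP threshold for a genome-wide scan.
n_snps = 392_935   # SNPs tested in stage 1 (from the abstract)
alpha = 0.05       # assumed family-wise error rate

bonferroni_threshold = alpha / n_snps
print(f"per-SNP threshold: {bonferroni_threshold:.2e}")  # 1.27e-07

# Expected number of SNPs passing a nominal P < 0.05 purely by chance,
# if every SNP were null and the tests were independent.
expected_false_positives = alpha * n_snps
print(f"expected chance hits at P < 0.05: {expected_false_positives:.0f}")
```

The second number is the reason a nominal 0.05 cutoff is unusable at this scale: tens of thousands of SNPs would cross it by chance alone.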
  • 44.
    Human Hap300 chip, showing no T2DM association in stage 1 (P > 0.01) and separated by at least 100 kb. Using the first principal component as a covariate for ancestry differences between cases and controls, we tested for association between rs932206 and disease status. Our result suggests that this apparent association is largely
    BMI on the association between marker and disease, as it is asymptotically equivalent to the Armitage trend test used to detect association in stages 1 and 2. None of the associations (Supplementary Table 7) was substantially changed by considering the effects of these covariates.
    Figure 1 panels: one plot per chromosome (1–22 and X).
    Figure 1 | Graphical summary of stage 1 association results. T2DM association was determined for SNPs on the Human1 and Hap300 chips. The x axis represents the chromosome position from pter; the y axis shows −log10[pMAX], the P-value obtained by the MAX statistic, for each SNP (Note the different scale on the y axis of the chromosome 10 plot.). SNPs that passed the cutoff for a fast-tracked second stage are highlighted in red.
    ©2007 Nature Publishing Group
    Sladek, 2007
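The excerpt above mentions the Armitage trend test as the test used to detect association. A minimal sketch of the Cochran-Armitage trend statistic for one SNP is given here, assuming additive weights (0, 1, 2) for the three genotypes; the function name and the genotype counts are hypothetical, and the paper's MAX statistic (which also considers other genetic models) is not reproduced.

```python
import math

def armitage_trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    case_counts / control_counts: counts for genotypes with 0, 1, 2 copies
    of the tested allele. Returns (chi-square with 1 df, p-value).
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n = sum(totals)
    n_cases = sum(case_counts)
    n_controls = n - n_cases

    sum_w_cases = sum(w * r for w, r in zip(weights, case_counts))
    sum_w_total = sum(w * t for w, t in zip(weights, totals))
    sum_w2_total = sum(w * w * t for w, t in zip(weights, totals))

    trend = n * sum_w_cases - n_cases * sum_w_total
    spread = n * sum_w2_total - sum_w_total ** 2
    chi2 = n * trend ** 2 / (n_cases * n_controls * spread)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 df
    return chi2, p_value

# Hypothetical counts: genotypes with 0, 1, 2 copies of the tested allele.
chi2, p = armitage_trend_test((300, 500, 200), (400, 450, 150))
print(f"chi2 = {chi2:.2f}, -log10(p) = {-math.log10(p):.1f}")
```

The last line prints −log10(p), which is the quantity plotted on the y axis of a Manhattan plot such as Figure 1. Swapping the weights to (0, 1, 1) or (0, 0, 1) gives dominant and recessive versions of the same test.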
  • 45.
    Identification of four novel T2DM loci
    Our fast-track stage 2 genotyping confirmed the reported association for rs7903146 (TCF7L2) on chromosome 10, and in addition identified significant associations for seven SNPs representing four new T2DM loci (Table 1). In all cases, the strongest association for the MAX statistic (see Methods) was obtained with the additive model. The most significant of these corresponds to rs13266634, a non-synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkage disequilibrium block on chromosome 8, containing only the 3′ end of this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressed solely in the secretory vesicles of β-cells and is thus implicated in the final stages of insulin biosynthesis, which involve co-crystallization
    Table 1 | Confirmed association results
    SNP | Chromosome | Position (nucleotides) | Risk allele | Major allele | MAF (case) | MAF (ctrl) | Odds ratio (het) | Odds ratio (hom) | PAR | λs | Stage 2 pMAX | Stage 2 pMAX (perm) | Stage 1 pMAX | Stage 1 pMAX (perm) | Nearest gene
    rs7903146 | 10 | 114,748,339 | T | C | 0.406 | 0.293 | 1.65 ± 0.19 | 2.77 ± 0.50 | 0.28 | 1.0546 | 1.5 × 10⁻³⁴ | <1.0 × 10⁻⁷ | 3.2 × 10⁻¹⁷ | <3.3 × 10⁻¹⁰ | TCF7L2
    rs13266634 | 8 | 118,253,964 | C | C | 0.254 | 0.301 | 1.18 ± 0.25 | 1.53 ± 0.31 | 0.24 | 1.0089 | 6.1 × 10⁻⁸ | 5.0 × 10⁻⁷ | 2.1 × 10⁻⁵ | 1.8 × 10⁻⁵ | SLC30A8
    rs1111875 | 10 | 94,452,862 | G | G | 0.358 | 0.402 | 1.19 ± 0.19 | 1.44 ± 0.24 | 0.19 | 1.0069 | 3.0 × 10⁻⁶ | 7.4 × 10⁻⁶ | 9.1 × 10⁻⁶ | 7.3 × 10⁻⁶ | HHEX
    rs7923837 | 10 | 94,471,897 | G | G | 0.335 | 0.377 | 1.22 ± 0.21 | 1.45 ± 0.25 | 0.20 | 1.0065 | 7.5 × 10⁻⁶ | 2.2 × 10⁻⁵ | 3.4 × 10⁻⁶ | 2.5 × 10⁻⁶ | HHEX
    rs7480010 | 11 | 42,203,294 | G | A | 0.336 | 0.301 | 1.14 ± 0.13 | 1.40 ± 0.25 | 0.08 | 1.0041 | 1.1 × 10⁻⁴ | 2.9 × 10⁻⁴ | 1.5 × 10⁻⁵ | 1.2 × 10⁻⁵ | LOC387761
    rs3740878 | 11 | 44,214,378 | A | A | 0.240 | 0.272 | 1.26 ± 0.29 | 1.46 ± 0.33 | 0.24 | 1.0046 | 1.2 × 10⁻⁴ | 2.8 × 10⁻⁴ | 1.8 × 10⁻⁵ | 1.3 × 10⁻⁵ | EXT2
    rs11037909 | 11 | 44,212,190 | T | T | 0.240 | 0.271 | 1.27 ± 0.30 | 1.47 ± 0.33 | 0.25 | 1.0045 | 1.8 × 10⁻⁴ | 4.5 × 10⁻⁴ | 1.8 × 10⁻⁵ | 1.3 × 10⁻⁵ | EXT2
    rs1113132 | 11 | 44,209,979 | C | C | 0.237 | 0.267 | 1.15 ± 0.27 | 1.36 ± 0.31 | 0.19 | 1.0044 | 3.3 × 10⁻⁴ | 8.1 × 10⁻⁴ | 3.7 × 10⁻⁵ | 2.9 × 10⁻⁵ | EXT2
    Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allele frequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (λs) were estimated using stage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higher frequency in controls; pMAX, P-value of the MAX statistic from the χ² distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX and pMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.
    Figure 2 (panels a, b): y axis −log10[P]; gene labels SLC30A8, IDE, KIF11, HHEX.
    NATURE | Vol 445 | 22 February 2007
    Sladek, 2007
    How would you interpret the p-values? Odds ratios?
    Confirmed 8 SNPs with N ~ 1000
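As a starting point for the slide's question about interpreting odds ratios, the allelic odds ratio can be computed directly from the case and control minor allele frequencies in Table 1. The sketch below uses the rs7903146 (TCF7L2) row; the function name is hypothetical, and the allelic odds ratio is a simpler quantity than the genotype-specific (het, hom) odds ratios the table reports, so the two need not agree in general.

```python
def allelic_odds_ratio(freq_cases, freq_controls):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    odds_cases = freq_cases / (1 - freq_cases)
    odds_controls = freq_controls / (1 - freq_controls)
    return odds_cases / odds_controls

# rs7903146 (TCF7L2): MAF 0.406 in cases, 0.293 in controls (Table 1).
or_tcf7l2 = allelic_odds_ratio(0.406, 0.293)
print(round(or_tcf7l2, 2))  # 1.65
```

Read this as: a chromosome drawn from a case has about 1.65 times the odds of carrying the T allele compared with a chromosome drawn from a control. For this SNP the value happens to sit close to the heterozygote odds ratio in the table, which is what an approximately multiplicative (additive on the log-odds scale) effect would produce.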
  • 46.
    Scaling up discovery by combining populations: meta-analyses
  • 47.
    g the Diabetes Genetics nvestigation of NIDDM nd (iv) the Framingham omponent studies (n = ry Table 1 online. aring, the four consortia n 10 and 20 SNPs promi- their individual, interim, mentary Table 2 online). oci with consistent effects dies. Two of these repre- 6PC2 and GCK. In addi- nerated evidence for an NPs around the MTNR1B rs1387153, P = 2.2 × 10⁻¹¹; DFS: rs10830963, 5.8 × 10⁻⁴, for the most ch analysis). The associa- d on formal meta-analysis r exclusion of individuals = 1.1 × 10⁻⁵⁷; rs4607517 NR1B), P = 3.2 × 10⁻⁵⁰; pplementary Table 3 and ent efforts to harmonize
    (including the additional data from the WTCCC, DGI and FUSION scans)10 (Supplementary Note). We found strong evidence that the minor G allele of rs10830963 was associated with increased risk of T2D (odds ratio = 1.09 (1.05–1.12), P = 3.3 × 10⁻⁷; Fig. 2 and Supplementary Table 6 online). The possibility that the fasting glucose association might
    Study ID | OR (95% CI) | Weight (%)
    DGI | 1.12 (0.96, 1.30) | 4.61
    FUSION | 1.20 (1.03, 1.39) | 4.89
    WTCCC | 1.07 (0.95, 1.20) | 8.03
    deCODE | 1.14 (1.03, 1.27) | 9.58
    KORA | 1.00 (0.84, 1.19) | 3.53
    Rotterdam | 1.17 (1.04, 1.30) | 8.75
    CCC | 1.07 (0.88, 1.31) | 2.69
    ADDITION/ELY | 1.16 (1.02, 1.33) | 6.04
    Norfolk | 1.00 (0.90, 1.10) | 10.56
    UKT2DGC | 1.03 (0.96, 1.10) | 23.18
    OxGN/58BC | 0.91 (0.75, 1.10) | 2.85
    FUSION Stage 2 | 1.15 (1.02, 1.30) | 7.41
    METSIM | 1.16 (1.03, 1.30) | 7.90
    Overall (I² = 26.6%, P = 0.176) | 1.09 (1.05, 1.12) | 100.00
    Meta-analysis P value = 3.3 × 10⁻⁷
    x axis (odds ratio): 0.722, 1, 1.39
    Figure 2 Association of rs10830963 with type 2 diabetes (T2D) in 13 case-control studies.
    VOLUME 41 | NUMBER 1 | JANUARY 2009 NATURE GENETICS
    Meta-analysis of SNP rs10830963: Combining findings from multiple cohorts
    Prokopenko, 2009
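The forest plot above combines 13 per-study odds ratios into one overall estimate. A minimal sketch of how such a pooled value can be obtained is shown here, assuming a fixed-effect inverse-variance model on the log odds ratio scale and recovering each standard error from the printed 95% confidence interval. The slide does not state which model the authors used, so this is an illustration rather than their method, and because the inputs are rounded to two decimals the result only approximates the published figures.

```python
import math

# (OR, lower 95% limit, upper 95% limit) for rs10830963, as listed in Figure 2.
studies = [
    (1.12, 0.96, 1.30), (1.20, 1.03, 1.39), (1.07, 0.95, 1.20),
    (1.14, 1.03, 1.27), (1.00, 0.84, 1.19), (1.17, 1.04, 1.30),
    (1.07, 0.88, 1.31), (1.16, 1.02, 1.33), (1.00, 0.90, 1.10),
    (1.03, 0.96, 1.10), (0.91, 0.75, 1.10), (1.15, 1.02, 1.30),
    (1.16, 1.03, 1.30),
]

def fixed_effect_meta(studies, z_crit=1.96):
    """Inverse-variance fixed-effect pooling of odds ratios."""
    weights, betas = [], []
    for odds_ratio, lower, upper in studies:
        beta = math.log(odds_ratio)
        # Width of the CI on the log scale is 2 * z_crit * SE.
        se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
        betas.append(beta)
        weights.append(1 / se ** 2)
    total_weight = sum(weights)
    beta_pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se_pooled = 1 / math.sqrt(total_weight)
    z = beta_pooled / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    ci = (math.exp(beta_pooled - z_crit * se_pooled),
          math.exp(beta_pooled + z_crit * se_pooled))
    percent_weights = [100 * w / total_weight for w in weights]
    return math.exp(beta_pooled), ci, p_value, percent_weights

pooled_or, (ci_low, ci_high), p_value, percent_weights = fixed_effect_meta(studies)
print(f"pooled OR = {pooled_or:.2f} ({ci_low:.2f}, {ci_high:.2f}), p = {p_value:.1e}")
```

Under these assumptions the pooled estimate lands close to the overall row of the plot, and the largest percentage weight goes to the study with the narrowest confidence interval. That is the point of the slide: no single cohort is decisive, but the combined evidence is.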
  • 48.
    ARTICLES
    By combining genome-wide association data from 8,130 individuals with type 2 diabetes (T2D) and 38,987 controls of European descent and following up previously unidentified meta-analysis signals in a further 34,412 cases and 59,925 controls, we identified 12 new T2D association signals with combined P < 5 × 10⁻⁸. These include a second independent signal at the KCNQ1 locus; the first report, to our knowledge, of an X-chromosomal association (near DUSP9); and a further instance of overlap between loci implicated in monogenic and multifactorial forms of diabetes (at HNF1A). The identified loci affect both beta-cell function and insulin action, and, overall, T2D association signals show evidence of enrichment for genes involved in cell cycle regulation. We also show that a high proportion of T2D susceptibility loci harbor independent association signals influencing apparently unrelated complex traits.
    Type 2 diabetes (T2D) is characterized by insulin resistance and deficient beta-cell function1. The escalating prevalence of T2D and the limitations of currently available preventative and therapeutic options highlight the need for a more complete understanding of T2D pathogenesis. To date, approximately 25 genome-wide significant common variant associations with T2D have been described, mostly through genome-wide association (GWA) analyses2–13. The identities of the variants and genes mediating the susceptibility effects at most of these signals have yet to be established, and the known variants account for less than 10% of the overall estimated genetic contribution to T2D predisposition. Although some of the unexplained heritability will reflect variants poorly captured by existing GWA platforms, we reasoned that an expanded meta-analysis of existing GWA data would
    the inverse-variance method (Online Methods, Fig. 1, Supplementary Tables 1 and 2 and Supplementary Note). We observed only modest genomic control inflation (λgc = 1.07), suggesting that the observed results were not due to population stratification. After removing SNPs within established T2D loci (Supplementary Table 3), the resulting quantile-quantile plot was consistent with a modest excess of disease associations of relatively small effect (Supplementary Note). Weak evidence for association at HLA variants strongly associated with autoimmune forms of diabetes (Supplementary Table 3 and Supplementary Note) suggested some case admixture involving subjects with type 1 diabetes or latent autoimmune diabetes of adulthood; however, failure to detect T2D associations at other non-HLA type 1 diabetes susceptibility loci (for example, INS, PTPN22 and
    Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis
    Voight, 2010
    Meta-analyses for T2D: N>40K and 90K identifies >30 loci among 2,400,000 SNPs
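The genomic control inflation factor quoted above (1.07) is commonly defined as the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). A minimal sketch of that calculation from a list of p-values follows; the function name and the simulated p-values are assumptions for illustration, not data from the study.

```python
from statistics import NormalDist, median

NULL_MEDIAN_CHI2 = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_inflation(p_values):
    """Genomic control lambda from two-sided p-values (each must be > 0)."""
    normal = NormalDist()
    # Convert each p-value back to a 1-df chi-square statistic: z squared.
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / NULL_MEDIAN_CHI2

# Hypothetical, perfectly uniform p-values: lambda should be about 1.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation(null_p), 2))
```

Values near 1 indicate that the bulk of test statistics behave as expected under the null. Values well above 1 suggest systematic inflation, for example from population stratification, although with large samples and many true small effects some inflation can also reflect real polygenic signal.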
  • 49.
    ARTICLES
    13 autosomal loci exceeded the threshold for genome-wide significance (P ranging from 2.8 × 10⁻⁸ to 1.4 × 10⁻²²) with allele-specific odds
    (r² < 0.05), and conditional analyses (see below) establish these SNPs as independent (Fig. 2 and Supplementary Table 4). Further analysis
    Figure 1 legend: Locus established previously; Locus identified by current study; Locus not confirmed by current study; Suggestive statistical association (P < 1 × 10⁻⁵); Association in identified or established region (P < 1 × 10⁻⁴).
    Figure 1 panels: Unconditional analysis (top) and Conditional analysis (bottom); x axis: Chromosome (1–22, X); y axis: −log10(P).
    Labelled loci: BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals*), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, PPAR
    Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10⁻⁵), whereas secondary signals close to already confirmed T2D loci are shown in purple (P < 10⁻⁴).
    Meta-analyses for T2D: N>40K and 90K identifies >30 loci among 2,400,000 SNPs
  • 50.
    Regional association plots (stage 1). Each panel plots −log10(P-value) (left axis) and recombination rate in cM/Mb (right axis) against position on the chromosome (Mb); points are coloured by r² with the index SNP (0.0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, 0.8–1.0, or missing).
    SLC30A8 Region: index SNP rs3802177; chromosome 8, 117–120 Mb.
    CDKN2A/B Region: index SNP rs10965250; chromosome 9, 21–24 Mb.
    CDC123/CAMK1D Region: index SNP rs12779790.
    HHEX/IDE Region: index SNP rs5015480.
    Not in a gene... In a gene...
    ~90% of GWAS hits are non-coding!
  • 51.
    ~90% of GWAS hits are non-coding!
    Stamatoyannopoulos, Science 2012
    RESEARCH ARTICLE: Systematic Localization of Common Disease-Associated Variation in Regulatory DNA
    Matthew T. Maurano, Richard Humbert, Eric Rynes, Robert E. Thurman, Eric Haugen, Hao Wang, Alex P. Reynolds, Richard Sandstrom, Hongzhu Qu, Jennifer Brody, Anthony Shafer, Fidencio Neri, Kristen Lee, Tanya Kutyavin, Sandra Stehling-Sun, Audra K. Johnson, Theresa K. Canfield, Erika Giste, Morgan Diegel, Daniel Bates, R. Scott Hansen, Shane Neph, Peter J. Sabo, Shelly Heimfeld, Antony Raubitschek, Steven Ziegler, Chris Cotsapas, Nona Sotoodehnia, Ian Glass, Shamil R. Sunyaev, Rajinder Kaul, John A. Stamatoyannopoulos
    Abstract: Genome-wide association studies have identified many noncoding variants associated with common diseases and traits. We show that these variants are concentrated in regulatory DNA marked by deoxyribonuclease I (DNase I) hypersensitive sites (DHSs). Eighty-eight percent of such DHSs are active during fetal development and are enriched in variants associated with gestational exposure–related phenotypes. We identified distant gene targets for hundreds of variant-containing DHSs that may explain phenotype associations. Disease-associated variants systematically perturb transcription factor recognition sequences, frequently alter allelic chromatin states, and form regulatory networks. We also demonstrated tissue-selective enrichment of more weakly disease-associated variants within DHSs and the de novo identification of pathogenic cell types for Crohn’s disease, multiple sclerosis, and an electrocardiogram trait, without prior knowledge of physiological mechanisms. Our results suggest pervasive involvement of regulatory DNA variation in common human disease and provide pathogenic insights into diverse disorders.
    Excerpt: Disease- and trait-associated genetic variants are rapidly being identified with genome-wide association studies (GWAS) and related strategies (1). To date, hundreds of GWAS have been conducted, spanning diverse diseases and quantitative phenotypes (2) (fig. S1A). However, the majority (~93%) of disease- and trait-associated variants emerging from these studies lie within noncoding sequence (fig. S1B), complicating their functional evaluation. Several lines of evidence suggest the involvement of a proportion of such variants in transcriptional regulatory mechanisms, including modulation of promoter and enhancer elements (3–6) and enrichment within expression quantitative trait loci (eQTL) (3, 7, 8).
    Human regulatory DNA encompasses a variety of cis-regulatory elements within which the cooperative binding of transcription factors creates focal alterations in chromatin structure. Deoxyribonuclease I (DNase I) hypersensitive sites (DHSs) are sensitive and precise markers of this actuated regulatory DNA, and DNase I mapping has been instrumental in the discovery and census of human cis-regulatory elements (9). We performed DNase I mapping genome-wide (10) in 349 cell and tissue samples, including 85 cell types studied under the ENCODE Project (10) and 264 samples studied under the Roadmap Epigenomics Program (11). These encompass several classes [text truncated in the source]. In total, we identified 3,899,693 distinct DHS positions along the genome (collectively spanning 42.2%), each of which was detected in one or more cell or tissue types (median = 5).
    Disease- and trait-associated variants are concentrated in regulatory DNA. We examined the distribution of 5654 noncoding genome-wide significant associations [5134 unique single-nucleotide polymorphisms (SNPs); fig. S1 and table S2] for 207 diseases and 447 quantitative traits (2) with the deep genome-scale maps of regulatory DNA marked by DHSs. This revealed a collective 40% enrichment of GWAS SNPs in DHSs (fig. S1C, P < 10^-55, binomial, compared to the distribution of HapMap SNPs). Fully 76.6% of all noncoding GWAS SNPs either lie within a DHS (57.1%, 2931 SNPs) or are in complete linkage disequilibrium (LD) with SNPs in a nearby DHS (19.5%, 999 SNPs) (Fig. 1A) (12). To confirm this enrichment, we sampled variants from the 1000 Genomes Project (13) with the same genomic feature localization (intronic versus intergenic), distance from the nearest transcriptional start site, and allele frequency in individuals of European ancestry. We confirmed significant enrichment both for SNPs within DHSs (P < 10^-59, simulation) and also including variants in complete LD (r^2 = 1) with SNPs in DHSs (P < 10^-37, simulation) (fig. S2).
    In total, 47.5% of GWAS SNPs fall within gene bodies (fig. S1B); however, only 10.9% of intronic GWAS SNPs within DHSs are in strong LD (r^2 ≥ 0.8) with a coding SNP, indicating that the vast majority of noncoding genic variants are not simply tagging coding sequence. Analogously, only 16.3% of GWAS variants within coding sequences are in strong LD with variants in DHSs. SNPs on widely used genotyping arrays (e.g., Affymetrix) were modestly enriched within DHSs (fig. S2), possibly due to selection of SNPs with robust experimental performance in genotyping assays. However, we found no evidence for sequence composition bias (table S3). To further examine the enrichment of GWAS SNPs in regulatory DNA, we systematically classified all noncoding GWAS SNPs by the quality [text truncated in the source]
  • 52.
    “There have been few, if any, similar bursts of discovery in the history of medical research.” David Hunter and Peter Kraft, NEJM, 2007
  • 53.
    Common claims discussed in regard to GWAS:
    Despite issues, yielded many discoveries vs. cost
    Five years of GWAS Discovery (Visscher, 2012)
    Excerpt (truncated in the source): "[...] to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits, between [...]"
    Figure 1. GWAS Discoveries over Time. Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10^-8 are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r^2 > 0.8 estimated from the entire HapMap samples are excluded.
    ~500,000 SNP chips x ~$500/chip = $250M
    $250M / ~2000 loci = $125K/locus
    Candidate genes: >$250M!
    Cost comparisons: 100 NIH R01s; fighter jet; Hadron Collider: $9B
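The cost arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch that uses the slide's rounded figures; the variable names are illustrative and not from the original deck.

```python
# Back-of-the-envelope GWAS cost per discovered locus,
# using the rounded figures quoted on the slide.
chips = 500_000        # ~500,000 SNP chips genotyped
cost_per_chip = 500    # ~$500 per chip
loci = 2_000           # ~2,000 associated loci

total_cost = chips * cost_per_chip      # dollars
cost_per_locus = total_cost / loci      # dollars per locus

print(f"total cost: ${total_cost / 1e6:.0f}M")
print(f"cost per locus: ${cost_per_locus / 1e3:.0f}K")
```

With these inputs the totals match the slide ($250M overall, $125K per locus); changing any of the three rough inputs shows how sensitive the per-locus figure is.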
  • 54.
    P = G + E
    Phenotype: Type 2 Diabetes, Cancer, Alzheimer’s, Gene expression
    Genome: Variants
    Environment: Infectious agents, Nutrients, Pollutants, Drugs
    Complex traits are a function of genes and environment...
  • 55.
    Nothing comparable to elucidate E influence!
    We lack high-throughput methods and data to discover new E in P…
    E: ???
  • 56.
    A similar paradigm for discovery should exist for E! Why?
  • 57.
  • 58.
    H^2 = σ^2_G / σ^2_P
    Heritability (H^2) is the share of phenotypic variability attributed to genetic variability in a population.
    Indicator of the proportion of phenotypic differences attributed to G.
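As a minimal sketch of the ratio above, the snippet below computes H^2 from variance components. The variance values are invented for illustration, and the decomposition Var(P) = Var(G) + Var(E) assumes no gene-environment covariance or interaction, which is a simplification of the P = G + E picture used in these slides.

```python
def broad_sense_heritability(var_g: float, var_p: float) -> float:
    """Return H^2 = Var(G) / Var(P)."""
    if var_p <= 0:
        raise ValueError("phenotypic variance must be positive")
    if not 0 <= var_g <= var_p:
        raise ValueError("genetic variance must lie between 0 and Var(P)")
    return var_g / var_p


# Hypothetical variance components (not taken from any study).
var_g = 30.0              # variance attributed to genetic differences
var_e = 70.0              # variance attributed to environment
var_p = var_g + var_e     # assumes G and E are independent and additive

h2 = broad_sense_heritability(var_g, var_p)
print(f"H^2 = {h2:.2f}")
```

Under the same additive assumption, the remainder 1 - H^2 is the share attributed to E.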
  • 59.
    Height is an example of a heritable trait:
    Francis Galton shows how it's done (1887)
    “mid-height of 205 parents described 60% of variability of 928 offspring”
  • 60.
    [Figure: bar chart of heritability estimates, Heritability: Var(G)/Var(Phenotype), on a 0–100 scale (source: SNPedia.com). Traits as listed on the chart: Eye color, Hair curliness, Type-1 diabetes, Height, Schizophrenia, Epilepsy, Graves' disease, Celiac disease, Polycystic ovary syndrome, Attention deficit hyperactivity disorder, Bipolar disorder, Obesity, Alzheimer's disease, Anorexia nervosa, Psoriasis, Bone mineral density, Menarche (age at), Nicotine dependence, Sexual orientation, Alcoholism, Lupus, Rheumatoid arthritis, Crohn's disease, Migraine, Thyroid cancer, Autism, Blood pressure (diastolic), Body mass index, Depression, Coronary artery disease, Insomnia, Menopause (age at), Heart disease, Prostate cancer, QT interval, Breast cancer, Ovarian cancer, Hangover, Stroke, Asthma, Blood pressure (systolic), Hypertension, Osteoarthritis, Parkinson's disease, Longevity, Type-2 diabetes, Gallstone disease, Testicular cancer, Cervical cancer, Sciatica, Bladder cancer, Colon cancer, Lung cancer, Leukemia, Stomach cancer.]
    G estimates for complex traits are low and variable: massive opportunity for high-throughput E discovery
    Type 2 Diabetes (25%)
    Heart Disease (25-30%)
    Autism (50%???)
  • 61.
    [Figure: the SNPedia.com heritability bar chart from the previous slide, Heritability: Var(G)/Var(Phenotype) on a 0–100 scale, for the same traits from Eye color through Stomach cancer.]
    G estimates for complex traits are low and variable: massive opportunity for high-throughput E discovery
    σ^2_E: Exposome!
  • 62.
    Meta-analysis of the heritability of human traits based on fifty years of twin studies
    Tinca J C Polderman, Beben Benyamin, Christiaan A de Leeuw, Patrick F Sullivan, Arjen van Bochoven, Peter M Visscher & Danielle Posthuma
    Nature Genetics, 2015
    Abstract: Despite a century of research on complex traits in humans, the relative importance and specific nature of the influences of genes and environment on human traits remain controversial. We report a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all published twin studies of complex traits. Estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. For a majority (69%) of traits, the observed twin correlations are consistent with a simple and parsimonious model where twin resemblance is solely due to additive genetic variation. The data are inconsistent with substantial influences from shared environment or non-additive genetic variation. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. All the results can be visualized using the MaTCH webtool.
    Excerpt: Insight into the nature of observed variation in human traits is important in medicine, psychology, social sciences and evolutionary biology. It has gained new relevance with both the ability to map genes for human traits and the availability of large, collaborative data sets to do so on an extensive and comprehensive scale. Individual differences in human traits have been studied for more than a century, yet the causes of variation in human traits remain uncertain and controversial. Specifically, the partitioning of observed variability into underlying genetic and environmental sources and the relative importance of additive and non-additive genetic variation are continually debated [1–5]. Recent results from large-scale genome-wide association studies (GWAS) show that many genetic variants contribute to the variation in complex traits and that effect sizes are typically small [6,7]. However, the sum of the variance explained by the detected variants is much smaller than the reported heritability of the trait [4,6–10]. This ‘missing heritability’ has led some investigators to conclude that non-additive variation must be important [4,11]. Although the presence of gene-gene interaction has been demonstrated empirically [5,12–17], little is known about its relative contribution to observed variation [18].
    In this study, our aim is twofold. First, we analyze empirical estimates of the relative contributions of genes and environment for virtually all human traits investigated in the past 50 years. Second, we assess empirical evidence for the presence and relative importance of non-additive genetic influences on all human traits studied. We rely on classical twin studies, as the twin design has been used widely to disentangle the relative contributions of genes and environment, across a variety of human traits. The classical twin design is based on contrasting the trait resemblance of monozygotic and dizygotic twin pairs. Monozygotic twins are genetically identical, and dizygotic twins are genetically full siblings. We show that, for a majority of traits (69%), the observed statistics are consistent with a simple and parsimonious model where the observed variation is solely due to additive genetic variation. The data are inconsistent with a substantial influence from shared environment or non-additive genetic variation. We also show that estimates of heritability cluster strongly within functional domains, and across all traits the reported heritability is 49%. Our results are based on a meta-analysis of twin correlations and reported variance components for 17,804 traits from 2,748 publications including 14,558,903 partly dependent twin pairs, virtually all twin studies of complex traits published between 1958 and 2012. This study provides the most comprehensive analysis of the causes of individual differences in human traits thus far and will guide future gene-mapping efforts. [text truncated in the source]
    17,804 traits of the phenome
    2,748 publications
    14,558,903 twin pairs
    Average H^2 (genome): 0.49
    Exposome may play an equal role.
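The excerpt describes the classical twin design as a contrast between monozygotic (MZ) and dizygotic (DZ) twin resemblance. One common textbook simplification of that contrast is Falconer's formula; it is not named on the slide and is not necessarily the model fitted in the meta-analysis, so treat the sketch below as an assumption-laden illustration. The twin correlations are invented.

```python
def falconer(r_mz: float, r_dz: float):
    """Falconer-style variance decomposition from twin correlations.

    h2: additive genetic share, 2 * (r_MZ - r_DZ)
    c2: shared-environment share, 2 * r_DZ - r_MZ
    e2: non-shared environment (plus measurement error), 1 - r_MZ
    """
    h2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return h2, c2, e2


# Hypothetical twin correlations (not values from the meta-analysis).
r_mz, r_dz = 0.6, 0.3
h2, c2, e2 = falconer(r_mz, r_dz)
print(f"h2={h2:.2f}  c2={c2:.2f}  e2={e2:.2f}")
```

When r_MZ is exactly twice r_DZ, as in this toy input, the shared-environment term is zero, which is the pattern the excerpt calls consistent with purely additive genetic variation. The three shares always sum to 1 by construction.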
  • 63.
    Explaining the other 50%: A new data-driven paradigm for robust discovery of E via EWAS and the exposome
    what to measure?
    how to measure?
    how to analyze in relation to health?
    “A more comprehensive view of environmental exposure is needed ... to discover major causes of diseases...”
    Wild, 2005; Rappaport and Smith, 2010, 2011; Buck-Louis and Sundaram, 2012; Miller and Jones, 2014; Patel CJ and Ioannidis JPA, 2014
    [Figure: "Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum." Labels: internal chemical environment (xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora); external environment (radiation, diet, pollution, infections, drugs, life-style, stress); chemical classes (reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins).]
    Excerpt from the PERSPECTIVES article shown on the slide (www.sciencemag.org, downloaded October 21, 2010; left margin cropped in the source): "[...] critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly [...]ruptors and can be measured through serum [...]some (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).
    Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples. With successful characterization of both [...]"
  • 64.
    We still cannot “query” the environment like the genome...
  • 65.
    Connecting Environmental Exposure with Disease: Missing the “System” of Exposures?
    [Diagram: 2x2 table of exposure (E+, E-) against disease status (diseased, non-diseased), marked "?"]
    Exposed to many things, but do not assess the multiplicity.
    Fragmented literature of associations.
    Challenge to discover E associated with disease.
  • 66.
    A maze of associations is one way to a fragmented literature and Vibration of Effects
    Young, 2011
    Excerpt (left margin cropped in the source): "[...]e modelling problem is akin to – but less well [...]sed and more poorly understood than – [...]e testing. For example, consider the use [...]r regression to adjust the risk levels of [...] treatments to the same background level [...] There can be many covariates, and [...]t of covariates can be in or out of the [...] With ten covariates, there are over 1000 [...] models. Consider a maze as a metaphor [...] modelling (Figure 3). The red line traces the [...] path out of the maze. The path through [...] maze looks simple, once it is known. [...] ways in the literature for dealing with model selection, so we propose a new, composite [...]
    2. Publication bias. [...] is general recognition that a paper [...] much better chance of acceptance if [...]hing new is found. This means that, for [...]ation, the claim in the paper has to [...]sed on a p-value less than 0.05. From [...]g’s point of view (5), this is quality by [...]tion. The journals are placing heavy [...]ce on a statistical test rather than [...]nation of the methods and steps that [...]o a conclusion. As to having a p-value [...] than 0.05, some might be tempted to [...] the system (10) through multiple testing, multiple modelling or unfair treatment of [...] or some combination of the three that [...] to a small p-value. Researchers can be creative in devising a plausible story to [...] statistical finding. [...] The data cleaning team creates a modelling data set and a holdout set and [...]"
    Figure 3 (annotated "P < 0.05"). The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value. © ktsdesign – Fotolia
    Adjustment sets compared (JCE, 2015): univariate; sex; sex & age; sex & race; sex & race & age
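The excerpt's claim that ten covariates give over 1000 possible models follows from each covariate being either in or out of the model, so there are 2^10 candidate adjustment sets. A small sketch that enumerates them; the covariate names are placeholders, not variables from any study cited here.

```python
from itertools import combinations

# Ten hypothetical adjustment covariates.
covariates = ["sex", "age", "race", "ses", "bmi",
              "smoking", "alcohol", "activity", "diet", "education"]


def all_adjustment_sets(names):
    """Yield every subset of covariates, from the unadjusted model to the full model."""
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            yield subset


models = list(all_adjustment_sets(covariates))
print(len(models))        # number of candidate models
print(models[0])          # empty tuple: the unadjusted ("univariate") model
print(models[1])          # adjusted for one covariate only
```

Fitting the same exposure-outcome association under each of these adjustment sets, and looking at how the estimate and p-value move, is the idea behind the "vibration of effects" mentioned on the slide.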
  • 67.
    Example of fragmentation: Is everything we eat associated with cancer?
    Schoenfeld and Ioannidis, AJCN (2012)
    50 random ingredients from Boston Cooking School Cookbook
    Any associated with cancer?
    Of 50, 40 were studied in relation to cancer risk
    Weak statistical evidence: non-replicated, inconsistent effects, non-standardized
    FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studies [...] outliers are not shown (effect estimates >10).
  • 68.
    Connecting Environmental Exposure with Disease: Missing the “System” of Exposures?
    [Diagram: 2x2 table of exposure (E+, E-) against disease status (diseased, non-diseased), marked "?"]
    Exposed to many things, but do not assess the multiplicity.
    Fragmented literature of associations.
    Challenge to discover E associated with disease.
  • 69.
    Environment-Wide Association Studies (EWAS): A GWAS-like study for the environment
    What specific environmental “loci” are associated with disease?
    [Figure: GWAS Manhattan plot (Nature, Vol 447, 7 June 2007) showing -log10(P), scale 0–15, by chromosome (1–22, X), alongside an EWAS analogue with environmental categories on the axis: Vitamins (β-carotene), Metals (lead), Organophosphate Pesticides, Hydrocarbons (2-hydroxyfluorene). Each [factor] is compared between cases and controls.]
  • 70.
    ... but there is no “microarray” for environmental exposure...
  • 71.
    Gold standard for breadth of human exposure information: National Health and Nutrition Examination Survey [1]
    since the 1960s
    now biennial: 1999 onwards
    10,000 participants per survey
    >250 exposures (serum + urine)
    GWAS chip
    >85 quantitative clinical traits (e.g., serum glucose, lipids, BMI)
    Death index linkage (cause of death)
    [1] http://www.cdc.gov/nchs/nhanes.htm
    Excerpt from the NHANES brochure shown on the slide: The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics. Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor. All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.
    Survey Operations: Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish). An advanced computer system using high-end servers, desktop PCs, and wide-area networking collects and processes all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public. In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey. NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.
    Uses of the Data: Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs. Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey. NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.
    Health topics listed (end of list; earlier items are cut off in the source):
    • Physical fitness and physical functioning
    • Reproductive history and sexual behavior
    • Respiratory disease (asthma, chronic bronchitis, emphysema)
    • Sexually transmitted diseases
    • Vision
  • 72.
    Gold standard for breadth of human exposure information: National Health and Nutrition Examination Survey
    Nutrients and Vitamins: vitamin D, carotenes
    Infectious Agents: hepatitis, HIV, Staph. aureus
    Plastics and consumables: phthalates, bisphenol A
    Physical Activity: steps
    Pesticides and pollutants: atrazine; cadmium; hydrocarbons
    Drugs: statins; aspirin
  • 73.
    EWAS Approach for Discovery
    Training Survey or Cohort:
    Classify diseased/non-diseased participants, e.g., diabetics and non-diabetics
    Environmental factors (bisphenol A, PCB199, β-carotene, cotinine, ...): log transformed & z-standardized; reference groups: “negative”
    Regression, for each factor: disease on z_factor (coefficient β_factor), adjusted for other risk factors (age, sex, race, socioeconomic status, ...)
    Significance tests (p-values): p-value(β_factor)
    bisphenol A | 0.8
    PCB199 | 0.1
    β-carotene | 0.01
    cotinine | 0.03
    ... | ...
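The steps on this slide can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the published pipeline: it fits an unadjusted one-predictor logistic regression per factor by Newton-Raphson and uses a Wald test with a normal approximation, whereas the slide's regressions are adjusted for age, sex, race, socioeconomic status, and other risk factors. The function names, factor names, and data are all invented.

```python
import math
from statistics import NormalDist, mean, pstdev


def log_z(values):
    """Log-transform exposure levels, then z-standardize (mean 0, SD 1)."""
    logs = [math.log(v) for v in values]
    m, s = mean(logs), pstdev(logs)
    return [(x - m) / s for x in logs]


def _score_and_information(b0, b1, z, y):
    """Gradient and 2x2 information matrix of the logistic log-likelihood."""
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * zi))) for zi in z]
    w = [pi * (1.0 - pi) for pi in p]
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * zi for yi, pi, zi in zip(y, p, z))
    h00 = sum(w)
    h01 = sum(wi * zi for wi, zi in zip(w, z))
    h11 = sum(wi * zi * zi for wi, zi in zip(w, z))
    return g0, g1, h00, h01, h11


def logistic_beta(z, y, iters=25):
    """Fit logit P(disease) = b0 + b1*z; return (b1, two-sided Wald p-value)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0, g1, h00, h01, h11 = _score_and_information(b0, b1, z, y)
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + I^{-1} * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    _, _, h00, h01, h11 = _score_and_information(b0, b1, z, y)
    se = math.sqrt(h00 / (h00 * h11 - h01 * h01))   # SE of b1 from I^{-1}
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(b1 / se)))
    return b1, p_value


def ewas(factors, disease):
    """Test each environmental factor separately; return {factor: (beta, p)}."""
    return {name: logistic_beta(log_z(levels), disease)
            for name, levels in factors.items()}


# Toy data (invented): 12 participants, 1 = diseased, 0 = non-diseased.
disease = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
factors = {
    "factor_A": [1, 1.5, 2, 2.5, 3, 4, 5, 6, 8, 10, 12, 16],
    "factor_B": [4, 2, 3, 5, 1, 6, 2.5, 3.5, 1.5, 4.5, 2, 5.5],
}
results = ewas(factors, disease)
for name, (beta, p) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: beta per 1 SD = {beta:.2f}, p = {p:.3f}")
```

Because every factor is put on the same log, z-standardized scale, the coefficients are comparable across exposures measured in different units, which is what makes a ranked table of p-values like the one on the slide meaningful.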
  • 74.
    EWAS Approach for Discovery
    False Discovery Rate Estimation: the expected rate of false positives
    FDR(α) = (# false positives ≤ α) / (# findings ≤ α)
    # false positives (α)? “Shuffle” (permute) diseased and non-diseased participants (cases, controls), re-run EWAS, repeat many times
    Example: 50 false positives ≤ 0.05 / 100 findings ≤ 0.05 = 0.5
    FDR (p-value):
    bisphenol A | 1
    PCB199 | 0.4
    β-carotene | 0.1
    cotinine | 0.2
    ... | ...
    Validation Survey or Cohort: p-value < 0.05 in test survey?
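The permutation recipe on this slide can be sketched as two small functions: one that shuffles case/control labels, and one that turns the null p-values from re-run scans into an FDR estimate at a threshold α. The names are illustrative and the p-values below are invented to reproduce the slide's 50 / 100 = 0.5 example; in a real analysis each entry of `null_p_runs` would come from re-running the whole EWAS on one shuffled label vector.

```python
import random


def permute_labels(disease, rng):
    """Return a shuffled copy of the case/control labels.

    Shuffling breaks any true link between disease status and exposures,
    so p-values from a scan on shuffled labels reflect chance alone.
    """
    shuffled = list(disease)
    rng.shuffle(shuffled)
    return shuffled


def estimate_fdr(alpha, observed_p, null_p_runs):
    """FDR(alpha) = mean # null p-values <= alpha per run / # observed p-values <= alpha."""
    n_findings = sum(p <= alpha for p in observed_p)
    if n_findings == 0:
        return 0.0
    false_per_run = [sum(p <= alpha for p in run) for run in null_p_runs]
    expected_false = sum(false_per_run) / len(false_per_run)
    return min(1.0, expected_false / n_findings)


rng = random.Random(0)
labels = [1, 1, 1, 0, 0, 0, 0, 0]
shuffled = permute_labels(labels, rng)   # same labels, new order

# Invented p-values for 250 factors: 100 observed findings at alpha = 0.05,
# and an average of 50 chance findings across three permuted re-runs.
observed_p = [0.01] * 100 + [0.50] * 150
null_p_runs = [
    [0.01] * 50 + [0.50] * 200,
    [0.01] * 40 + [0.50] * 210,
    [0.01] * 60 + [0.50] * 190,
]
fdr = estimate_fdr(0.05, observed_p, null_p_runs)
print(f"FDR at alpha=0.05: {fdr}")
```

Evaluating `estimate_fdr` at each factor's own p-value gives a per-factor FDR column like the one tabulated on the slide.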
  • 75.
    Exposome factors associated with Type 2 Diabetes?
  • 76.
    EWAS in Type 2 Diabetes: Searching >250 exposures for associations with FBG > 125 mg/dL
    PLoS ONE, 2010
    Outcome: Fasting Blood Glucose ≥ 126 mg/dL?
    Adjusted for: BMI, SES, ethnicity, age, sex
    OR: Δ 1 SD of exposure
    N=500-2000 per cohort
    [Figure: Manhattan-style plot of -log10(p-value), scale 0–2, by exposure category: acrylamide; allergen test; bacterial infection; cotinine; dialkyl; dioxins; furans (dibenzofuran); heavy metals; hydrocarbons; latex; nutrients (carotenoid, minerals, vitamins A, B, C, D, E); PCBs; perchlorate; pesticides (atrazine, chlorophenol, organochlorine, organophosphate, pyrethroid); phenols; phthalates; phytoestrogens; polybrominated ethers; polyfluorochemicals; viral infection; volatile compounds. Cohort markers: 1999-2000, 2001-2002, 2003-2004, 2005-2006.]
    FDR (α < 0.02) ~ 10%; “replicated” factors
    Novel Findings: heptachlor epoxide, γ-tocopherol
    Known Associations: β-carotene, vitamin D
    Interesting Patterns: pesticides, PCBs
    Heptachlor Epoxide OR=3.2, 1.8
    PCB170 OR=4.5, 2.3
    γ-tocopherol (vitamin E) OR=1.8, 1.6
    β-carotene OR=0.6, 0.6
    What model is used to test for association?
    Compare vs. GWAS?
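The slide asks what model is used. A binary outcome (fasting blood glucose at or above 126 mg/dL) reported with odds ratios per 1 SD of exposure is what a logistic regression would produce, so the sketch below assumes that model; the slide itself does not state it. It shows the case definition and the conversion between a logistic coefficient β and an odds ratio, OR = exp(β × ΔSD). Only the first odds ratio of each pair on the slide is used, and the helper names are illustrative.

```python
import math

FBG_THRESHOLD_MG_DL = 126   # case definition quoted on the slide


def is_case(fasting_glucose_mg_dl: float) -> bool:
    """Classify a participant as diabetic by fasting blood glucose."""
    return fasting_glucose_mg_dl >= FBG_THRESHOLD_MG_DL


def odds_ratio(beta: float, delta_sd: float = 1.0) -> float:
    """Odds ratio for a change of delta_sd standard deviations in exposure."""
    return math.exp(beta * delta_sd)


def log_odds(odds_ratio_value: float) -> float:
    """Logistic coefficient (per 1 SD) implied by a reported odds ratio."""
    return math.log(odds_ratio_value)


# Odds ratios quoted on the slide (first value of each pair).
reported = {"heptachlor epoxide": 3.2, "beta-carotene": 0.6}
for name, value in reported.items():
    beta = log_odds(value)
    direction = "higher" if beta > 0 else "lower"
    print(f"{name}: OR={value} -> beta={beta:.2f} ({direction} odds per 1 SD)")
```

An odds ratio below 1, as for β-carotene, corresponds to a negative coefficient, meaning higher exposure goes with lower odds of being classified as diabetic.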
  • 77.
    Exposome factors associated with serum lipids?
    Triglycerides, LDL-Cholesterol, HDL-Cholesterol
  • 78.
    EWAS on Serum Lipid Levels: Triglycerides, LDL-Cholesterol, HDL-Cholesterol
    Risk factors for coronary heart disease (CHD)
    Targets for intervention (i.e., statins)
    Influenced by smoking, physical activity, diet, genetics [1]
    LDL-C Δ1%: 1% increased risk for CHD [2]
    HDL-C Δ1%: 2% decreased risk for CHD [3]
    Triglycerides: higher risk for CHD
    1. Teslovich et al. Nature (2010)
    2. Grundy et al. ATVB (2004)
    3. Gotto et al. JACC (2004)
  • 79.
    EWAS in HDL-C: 17 Validated Factors
    IJE 2012.
    Outcome: log10(HDL-C), adjusted for BMI, SES, ethnicity, age, age^2, sex
    N=1000-3000
    FDR < 5%
    [Figure: Manhattan-style plot by exposure category with cohort markers for 1999-2000, 2001-2002, 2003-2004, 2005-2006. Labeled categories: vitamins A, B, C, D, E; minerals; carotenes; cotinine; heavy metals; organochlorine pesticides; hydrocarbons.]
  • 80.
    EWAS in Triglycerides and LDL-C
    22 factors: organochlorine pesticides, polychlorinated biphenyls, carotenoids, vitamin E, vitamin A
    8 factors: carotenoids, vitamin E, vitamin A
    IJE 2012.
  • 81.
    Effect Sizes For Validated Factors: HDL-C
    % change per Δ 1 SD in exposure
    17 validated factors (pollutants, nutrient factors)
    Table columns: survey | N | P-value | FDR | Effect (mg/dL)
    IJE 2012.
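Because the HDL-C outcome was modeled as log10(HDL-C) against z-standardized exposures, a coefficient β can be read as a percent change per 1 SD of exposure: % change = (10^β - 1) × 100. The sketch below works through that conversion; the coefficients and the 50 mg/dL baseline are hypothetical values chosen for illustration, not results from the IJE 2012 paper.

```python
def percent_change_per_sd(beta_log10: float) -> float:
    """Percent change in the outcome for a 1 SD increase in exposure,
    given a coefficient from a regression on log10(outcome)."""
    return (10.0 ** beta_log10 - 1.0) * 100.0


def effect_in_mg_dl(beta_log10: float, baseline_mg_dl: float) -> float:
    """Approximate absolute change at a given baseline level."""
    return baseline_mg_dl * (10.0 ** beta_log10 - 1.0)


# Hypothetical coefficients and baseline (not taken from the paper).
baseline_hdl = 50.0
for label, beta in [("nutrient-like factor", 0.02), ("pollutant-like factor", -0.02)]:
    pct = percent_change_per_sd(beta)
    mg = effect_in_mg_dl(beta, baseline_hdl)
    print(f"{label}: beta={beta:+.2f} -> {pct:+.2f}% ({mg:+.2f} mg/dL at {baseline_hdl:.0f} mg/dL)")
```

Note the asymmetry: equal and opposite coefficients on the log scale do not give equal and opposite percent changes, which is worth keeping in mind when comparing positive and negative effects in mg/dL.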
  • 82.
    How do effect sizes compare between GWAS and EWAS?
    GWAS: Teslovich, 2010 (ARTICLES, Nature, Vol 466, 5 August 2010)
    EWAS: table columns survey | N | P-value | FDR | Effect (mg/dL)
    Excerpt (truncated in the source): "Previous studies have suggested sex-specific heritability of lipid traits (15). A key challenge in addressing this issue is evaluating enough [...] three types of human tissue samples from liver (960 samples), omental fat (741 samples) and subcutaneous fat (609 samples). We [...]"
    Table 1 | Meta-analysis of plasma lipid concentrations in >100,000 individuals of European descent.
    Locus | Chr | Lead SNP | Lead trait | Other traits | Alleles/MAF | Effect size | P | eQTL/CAD (Y flags) | Ethnic
    LDLRAP1 | 1 | rs12027135 | TC | LDL | T/A/0.45 | -1.22 | 4 × 10^-11 | Y | +++?
    PABPC4 | 1 | rs4660293 | HDL | | A/G/0.23 | -0.48 | 4 × 10^-10 | Y | ++++
    PCSK9 | 1 | rs2479409 | LDL | TC | A/G/0.30 | +2.01 | 2 × 10^-28 | | ++++
    ANGPTL3 | 1 | rs2131925 | TG | TC, LDL | T/G/0.32 | -4.94 | 9 × 10^-43 | Y | ++++
    EVI5 | 1 | rs7515577 | TC | | A/C/0.21 | -1.18 | 3 × 10^-8 | | +++?
    SORT1 | 1 | rs629301 | LDL | TC | T/G/0.22 | -5.65 | 1 × 10^-170 | Y Y | ++++
    ZNF648 | 1 | rs1689800 | HDL | | A/G/0.35 | -0.47 | 3 × 10^-10 | | +++-
    MOSC1 | 1 | rs2642442 | TC | LDL | T/C/0.32 | -1.39 | 6 × 10^-13 | | +++?
    GALNT2 | 1 | rs4846914 | HDL | TG | A/G/0.40 | -0.61 | 4 × 10^-21 | | ++++
    IRF2BP2 | 1 | rs514230 | TC | LDL | T/A/0.48 | -1.36 | 5 × 10^-14 | | +++?
    APOB | 2 | rs1367117 | LDL | TC | G/A/0.30 | +4.05 | 4 × 10^-114 | | ++++
     | | rs1042034 | TG | HDL | T/C/0.22 | -5.99 | 1 × 10^-45 | | +-++
    GCKR | 2 | rs1260326 | TG | TC | C/T/0.41 | +8.76 | 6 × 10^-133 | Y | ++++
    ABCG5/8 | 2 | rs4299376 | LDL | TC | T/G/0.30 | +2.75 | 2 × 10^-47 | | ++++
    RAB3GAP1 | 2 | rs7570971 | TC | | C/A/0.34 | +1.25 | 2 × 10^-8 | | +-??
    COBLL1 | 2 | rs10195252 | TG | | T/C/0.40 | -2.01 | 2 × 10^-10 | Y | ++++
     | | rs12328675 | HDL | | T/C/0.13 | +0.68 | 3 × 10^-10 | | ++?+
    IRS1 | 2 | rs2972146 | HDL | TG | T/G/0.37 | +0.46 | 3 × 10^-9 | Y Y | ++++
    RAF1 | 3 | rs2290159 | TC | | G/C/0.22 | -1.42 | 4 × 10^-9 | | +++?
    MSL2L1 | 3 | rs645040 | TG | | T/G/0.22 | -2.22 | 3 × 10^-8 | | ++-+
    KLHL8 | 4 | rs442177 | TG | | T/G/0.41 | -2.25 | 9 × 10^-12 | | ++++
    SLC39A8 | 4 | rs13107325 | HDL | | C/T/0.07 | -0.84 | 7 × 10^-11 | Y | +-?-
    ARL15 | 5 | rs6450176 | HDL | | G/A/0.26 | -0.49 | 5 × 10^-8 | | -??+
    MAP3K1 | 5 | rs9686661 | TG | | C/T/0.20 | +2.57 | 1 × 10^-10 | | ++++
    HMGCR | 5 | rs12916 | TC | LDL | T/C/0.39 | +2.84 | 9 × 10^-47 | | +++?
    TIMD4 | 5 | rs6882076 | TC | LDL, TG | C/T/0.35 | -1.98 | 7 × 10^-28 | | +++?
    MYLIP | 6 | rs3757354 | LDL | TC | C/T/0.22 | -1.43 | 1 × 10^-11 | | +--+
    HFE | 6 | rs1800562 | LDL | TC | G/A/0.06 | -2.22 | 6 × 10^-10 | | ++?+
    HLA | 6 | rs3177928 | TC | LDL | G/A/0.16 | +2.31 | 4 × 10^-19 | Y | +++?
     | | rs2247056 | TG | | C/T/0.25 | -2.99 | 2 × 10^-15 | | +++-
    C6orf106 | 6 | rs2814944 | HDL | | G/A/0.16 | -0.49 | 4 × 10^-9 | Y | +++-
     | | rs2814982 | TC | | C/T/0.11 | -1.86 | 5 × 10^-11 | Y | --+?
    FRK | 6 | rs9488822 | TC | LDL | A/T/0.35 | -1.18 | 2 × 10^-10 | Y | +++?
    CITED2 | 6 | rs605066 | HDL | | T/C/0.42 | -0.39 | 3 × 10^-8 | | ++-+
    LPA | 6 | rs1564348 | LDL | TC | T/C/0.17 | -0.56 | 2 × 10^-17 | Y | ++?+
     | | rs1084651 | HDL | | G/A/0.16 | +1.95 | 3 × 10^-8 | | ++?+
    DNAH11 | 7 | rs12670798 | TC | LDL | T/C/0.23 | +1.43 | 9 × 10^-10 | | +++?
    NPC1L1 | 7 | rs2072183 | TC | LDL | G/C/0.25 | +2.01 | 3 × 10^-11 | | +-+?
    TYW1B | 7 | rs13238203 | TG | | C/T/0.04 | -7.91 | 1 × 10^-9 | | +???
    MLXIPL | 7 | rs17145738 | TG | HDL | C/T/0.12 | -9.32 | 6 × 10^-58 | Y | ++++
    KLF14 | 7 | rs4731702 | HDL | | C/T/0.48 | +0.59 | 1 × 10^-15 | Y | ++++
    PPP1R3B | 8 | rs9987289 | HDL | TC, LDL | G/A/0.09 | -1.21 | 6 × 10^-25 | Y | ++++
    PINX1 | 8 | rs11776767 | TG | | G/C/0.37 | +2.01 | 1 × 10^-8 | | -+++
    NAT2 | 8 | rs1495741 | TG | TC | A/G/0.22 | +2.85 | 5 × 10^-14 | Y | -+++
    LPL | 8 | rs12678919 | TG | HDL | A/G/0.12 | -13.64 | 2 × 10^-115 | Y | ++++
    CYP7A1 | 8 | rs2081687 | TC | LDL | C/T/0.35 | +1.23 | 2 × 10^-12 | | +++?
    TRPS1 | 8 | rs2293889 | HDL | | G/T/0.41 | -0.44 | 6 × 10^-11 | | ++++
     | | rs2737229 | TC | | A/C/0.30 | -1.11 | 2 × 10^-8 | | ++-?
    TRIB1 | 8 | rs2954029 | TG | TC, LDL, HDL | A/T/0.47 | -5.64 | 3 × 10^-55 | Y | ++++
    PLEC1 | 8 | rs11136341 | LDL | TC | A/G/0.40 | +1.40 | 4 × 10^-13 | | ++++
    TTC39B | 9 | rs581080 | HDL | TC | C/G/0.18 | -0.65 | 3 × 10^-12 | | +-++
  • 83.
    How do effect sizes compare between GWAS and EWAS?
    GWAS: [Table image, partly overlaid: Teslovich, 2010 (NATURE|Vol 466|5 August 2010), Table 1 | Meta-analysis of plasma lipid concentrations in >100,000 individuals of European descent. Columns: Locus, Chr, Lead SNP, Lead trait, Other traits, Alleles/MAF, Effect size, P, eQTL, CAD, Ethnic.]
    EWAS: [Table image with columns: survey, N, P-value, FDR, Effect (mg/dL).]
  • 84.
    EWAS uncovers persistent pollutants in people with Type 2 Diabetes, Higher Lipids: How are these factors linked with these diseases?
    •organochlorine pesticides
    •polychlorinated biphenyls
    •dibenzofurans
    •dioxins
    •found all over the world
    •persist in food chain
    Porta et al, Environ Int 2008
    •arteriosclerosis
    •T2D/insulin resistance
    Porta et al, Lancet, 2006; Lee et al, Diabetes Care, 2006; Lee et al, Diabetologia, 2007; Everett et al, Environ Res, 2010; Lind et al, EHP, 2011 (Korea, Japan, Europe)
    capacitors, adhesives
  • 85.
    How can we study the elusive environment in larger scale for biomedical discovery?
    •longitudinal/linkable data & biorepositories
    •data mining and informatics to tackle complexity: what causes what? confounding
    •evaluate new ‘omics technologies: high-throughput, non-targeted metabolomics
    JAMA, 2014; JECH, 2014
    [Article image: VIEWPOINT (Opinion), "Studying the Elusive Environment in Large Scale." Chirag J. Patel, PhD, Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts. John P. A. Ioannidis, MD, DSc, Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California; Department of Statistics, Stanford University School of Humanities and Sciences, Stanford, California; and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California. Excerpt as visible on the slide; "[...]" marks text cut off in the image.]
    It is possible that more than 50% of complex disease risk is attributed to differences in an individual’s environment [1]. Air pollution, smoking, and diet are documented environmental factors affecting health, yet these factors are but a fraction of the “exposome,” the totality of the exposure load occurring throughout a person’s lifetime [1]. Investigating one or a handful of exposures at a time has led to a highly fragmented literature of epidemiologic associations. Much of that literature is not reproducible, and selective reporting may be a major reason for the lack of reproducibility. A new model is required to discover environmental exposures associated with disease while mitigating possibilities of selective reporting.
    To remedy the lack of reproducibility and concerns of validity, multiple personal exposures can be assessed simultaneously in terms of their association with a condition or disease of interest; the strongest associations can then be tentatively validated in independent data sets (eg, as done in references 2 and 3) [2,3]. The main advantages of this process include the ability to search the list of exposures and adjust for multiplicity systematically and report all the probed associations instead of only the most significant results. The term “environment-wide association studies” (EWAS) has been used to describe this approach (an analogy to genome-wide association studies). For example, Wang et al [4] screened more than 2000 chemicals in serum to discover endogenous exposures associated with risk for cardiovascular disease.
    There are notable hurdles in analyzing “big” environ[...] the EWAS vantage point, intervening on β-carotene (Figure, D) seems a futile exercise given its complex relationship with other nutrients and pollutants.
    Given this complexity, how can studies of environmental risk move forward? First, EWAS analyses should be applied to multiple data sets, and consistency can be formally examined for all assessed correlations. Second, the temporal relationship between exposure and changes in health parameters may offer helpful hints about which of the signals are more than simple correlations. Third, standardized adjusted analyses, in which adjustments are performed systematically and in the same way across multiple data sets, may also help. This is in stark contrast with the current model, whereby most epidemiologic studies use single data sets without replication as well as non–time-dependent assessments, and reported adjustments are markedly different across reports and data sets, even those performed by the same team (different approaches increase validity but must be reconciled and assimilated).
    However, eventually for most environmental correlates, there may be unsurpassable difficulty establishing potential causal inferences based on observational data alone. Factors that seem protective may sometimes be tested in randomized trials. The complexity of the multiple correlations also highlights the challenge that intervening to modify 1 putative risk factor also may inadvertently affect multiple other correlated factors. Even when a seemingly simple intervention is tested in randomized trials (affecting a single risk factor among the [...]
    High-throughput ascertainment of endogenous indicators of environmental exposure that may reflect the exposome increasingly attract attention, and their performance needs to be carefully evaluated. These include chemical detection of indicators of exposure through [...]
    [...] US federally funded gene expression experiment data be [...] ited in public repositories such as the Gene Expression Omni[...] repository has been instrumental in development of techno[...] measurement of gene expression, data standardization, an[...]
    Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004. Panels: A, Serum cotinine; B, Serum total mercury; C, Serum cadmium; D, Serum trans-β-carotene. Total correlations per panel, in the order printed: 37, 42, 68, 68. Legend: negative correlation, positive correlation; infectious agents, pollutants, nutrients and vitamins, demographic attributes.
    Each correlation interdependency globe includes 317 environmental exposures represented by the nodes around the periphery of the globe. Pairwise correlations are depicted by edges (lines) between the node of interest (arrowhead) and other nodes. Correlations with absolute values exceeding 0.2 are shown (strong [...] The size of each node is proportional to the number of edges for a node, and the thickness of each edge indicates the magnitude of the correlation.
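The correlation interdependency globes described in the figure caption can be sketched in a few lines: compute pairwise correlations among exposures and keep an edge wherever the absolute correlation exceeds 0.2. This is only an illustration under stated assumptions. The exposure names and simulated values are hypothetical (not NHANES data), and plain Pearson correlation is assumed because the caption does not name the correlation measure.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(exposures, threshold=0.2):
    """Return {name: [(other, rho), ...]} for all pairs with |rho| > threshold."""
    names = list(exposures)
    edges = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = pearson(exposures[a], exposures[b])
            if abs(rho) > threshold:
                edges[a].append((b, rho))
                edges[b].append((a, rho))
    return edges


# Hypothetical toy data: a shared source (smoking) drives two exposures,
# while the third exposure is independent noise.
rng = random.Random(0)
smoking = [rng.gauss(0, 1) for _ in range(2000)]
exposures = {
    "cotinine": [s + rng.gauss(0, 0.5) for s in smoking],
    "cadmium": [0.6 * s + rng.gauss(0, 1) for s in smoking],
    "unrelated": [rng.gauss(0, 1) for _ in range(2000)],
}

edges = correlation_edges(exposures)
for name, linked in edges.items():
    # In the globes, node size is proportional to the number of edges.
    print(name, len(linked), [(other, round(rho, 2)) for other, rho in linked])
```

With real survey data one would also need to handle missing values and survey weights, which this sketch ignores.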
  • 86.
    There is no “microarray” for E...
  • 87.
    NIH National Institute of Environmental Health Sciences: $34M in FY 2015 for new technologies for ascertaining the exposome in children
    http://grants.nih.gov/grants/guide/rfa-files/RFA-ES-15-010.html
    [Diagram: Exposome Laboratory Network, with four "E Laboratory" nodes and one "E Data Center".]
    E Data Center:
    •Data repository
    •Analytic ecosystem
    •Data standards
  • 88.
    Possibilities of discovery with the exposome: How do we proceed?
    •longitudinal/linkable data & biorepositories
    •data mining and informatics to tackle complexity: what causes what? confounding
    •evaluate new ‘omics technologies: metabolomics
    JAMA, 2014; JECH, 2014
    [Article image: VIEWPOINT (Opinion), "Studying the Elusive Environment in Large Scale," Chirag J. Patel, PhD, and John P. A. Ioannidis, MD, DSc, including its Figure, "Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004."]
  • 89.
    Accelerating discoveries with publicly-accessible, population-scale data: a dbGaP for environmental exposures?
    758,000 individuals
    >400 studies
    >>1B data points (genotypes and phenotypes)
    controlled-access (by application)
  • 90.
    NHANES exposome browser
    40K participants
    >1000 indicators of exposure
    Data and API available now: http://nhanes.hms.harvard.edu
    BD2K Patient-Centered Information Commons
    with Paul Avillach, Michael McDuffie, Jeremy Easton-Marks, Cartik Saravanamuthu and the BD2K PIC-SURE team
  • 91.
    Possibilities of discovery with the exposome: How do we proceed?
    •longitudinal/linkable data & biorepositories
    •data mining and informatics to tackle complexity: what causes what? confounding
    •evaluate new ‘omics technologies: metabolomics
    JAMA, 2014; JECH, 2014
    [Article image: VIEWPOINT (Opinion), "Studying the Elusive Environment in Large Scale," Chirag J. Patel, PhD, and John P. A. Ioannidis, MD, DSc, including its Figure, "Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and Nutrition Examination Survey (NHANES) Participants, 2003-2004."]
  • 92.
    Complexity of exposome-phenome associations: Many more potential biases vs. GWAS
    Reverse causality: Could the disease “lead” to exposure?
    [Diagram: γ-tocopherol ? low HDL. Tocopherol (vitamin E) supplements for CHD individuals?]
    Confounding bias: Ice cream and drowning deaths
    [Diagram: mercury and HDL-C. Nodes: fish consumption, mercury, high HDL; confounders ??]
    Independence of association: Web of exposure of the exposome?
    [Diagram: β-carotene, hydrocarbons, γ-tocopherol, linked by correlations (ρ).]
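To make the confounding example concrete, here is a small simulation. It assumes the diagram means that fish consumption raises both mercury and HDL-C while mercury has no direct effect on HDL-C. All effect sizes are invented for illustration, and adjusting by residualizing both variables on the confounder is one simple choice made for this sketch, not a method taken from the slides.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(1)
n = 5000
# Hypothetical standardized values; fish consumption is the confounder.
fish = [rng.gauss(0, 1) for _ in range(n)]
mercury = [0.8 * f + rng.gauss(0, 1) for f in fish]  # fish -> mercury
hdl = [0.5 * f + rng.gauss(0, 1) for f in fish]      # fish -> HDL-C, no mercury term

# Crude association: mercury looks "linked" to HDL-C.
crude = pearson(mercury, hdl)
# Adjusted association: correlate what is left after removing fish consumption.
adjusted = pearson(residuals(mercury, fish), residuals(hdl, fish))

print(f"crude correlation:    {crude:.3f}")
print(f"adjusted correlation: {adjusted:.3f}")
```

In this toy setup the crude correlation is clearly positive and the adjusted one is close to zero, which is the signature of confounding. In real data the confounders are rarely all measured, which is why the slide lists them with question marks.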
  • 93.
    Longitudinal Study: “Gold Standard” for Validation
    •exposure changing through time
    •reverse causality bias
    •compute disease risk
    [Diagram: timeline (time) from Exposure to Disease?, with Disease Risk ranging from [low] to [high].]
  • 94.
    EWAS to search for exposures and behaviors associated with all-cause mortality.
    Data: NHANES 1999-2004 with National Death Index linked mortality; 246 behaviors and exposures (serum/urine/self-report)
    Discovery: NHANES 1999-2001; N=330 to 6008 (26 to 655 deaths); ~5.5 years of follow-up; Cox proportional hazards on baseline exposure and time to death; false discovery rate < 5%
    Replication: NHANES 2003-2004; N=177 to 3258 (20 to 202 deaths); ~2.8 years of follow-up; p < 0.05
    IJE, 2013
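The two-cohort design on this slide can be sketched as a two-stage filter on per-exposure p-values. The sketch assumes the p-values have already been computed elsewhere (for example from one Cox model per exposure) and uses the Benjamini-Hochberg procedure for the false discovery rate. That choice is an assumption, since the slide only states "False discovery rate < 5%". The exposure names and p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def two_stage_ewas(discovery_p, replication_p, fdr=0.05, alpha=0.05):
    """Keep exposures with q < fdr in discovery and p < alpha in replication."""
    names = list(discovery_p)
    qvalues = benjamini_hochberg([discovery_p[name] for name in names])
    discovered = [name for name, q in zip(names, qvalues) if q < fdr]
    # An exposure missing from the replication cohort cannot be validated.
    return [name for name in discovered if replication_p.get(name, 1.0) < alpha]


# Hypothetical p-values for five exposures in each cohort.
discovery_p = {
    "exposure_a": 0.0001,
    "exposure_b": 0.004,
    "exposure_c": 0.04,
    "exposure_d": 0.20,
    "exposure_e": 0.75,
}
replication_p = {
    "exposure_a": 0.01,
    "exposure_b": 0.30,
    "exposure_c": 0.02,
    "exposure_d": 0.04,
    "exposure_e": 0.50,
}

validated = two_stage_ewas(discovery_p, replication_p)
print(validated)
```

Note that an exposure with a small replication p-value is still dropped if it did not pass the discovery threshold, so the replication cohort is only used to confirm, never to discover.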
  • 95.
    All-cause mortality: 253 exposure/behavior associations in survival
    [Figure: volcano plot. x-axis: Adjusted Hazard Ratio (ticks from 0.4 to 2.8); y-axis: -log10(pvalue) (ticks from 0 to 8).]
    Labeled exposures and behaviors: 1 Physical Activity; 2 Does anyone smoke in home?; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene (11)
    Labeled sociodemographic factors: 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic (69)
    Legend: sociodemographics (age, sex, income, education, race/ethnicity, occupation) [in red]; FDR < 5%; replicated factor
    IJE, 2013
  • 96.
    EWAS (re)-identifies factors associated with all-cause mortality: Volcano plot of 200 associations
    [Figure: volcano plot. x-axis: Adjusted Hazard Ratio (ticks from 0.4 to 2.8); y-axis: -log10(pvalue) (ticks from 0 to 8).]
    Sociodemographic factors (age, sex, income, education, race/ethnicity, occupation) [in red]: age (10 years); income (quintile 1); income (quintile 2); income (quintile 3); male; black
    Exposures and behaviors: any one smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*
    *derived from METs per activity and categorized by Health.gov guidelines
    R2 ~ 2%
  • 97.
    Correlation Structure of the Exposome? Analogy: “Linkage Disequilibrium”
    Identification of four novel T2DM loci. Our fast-track stage 2 genotyping confirmed the reported association for rs7903146 (TCF7L2) on chromosome 10, and in addition identified significant associations for seven SNPs representing four new T2DM loci (Table 1). In all cases, the strongest association for the MAX statistic (see Methods) was obtained with the additive model. The most significant of these corresponds to rs13266634, a non-synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkage disequilibrium block on chromosome 8, containing only the 3′ end of this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressed solely in the secretory vesicles of β-cells and is thus implicated in the final stages of insulin biosynthesis, which involve co-crystallization
    Table 1 | Confirmed association results
    SNP | Chromosome | Position (nucleotides) | Risk allele | Major allele | MAF (case) | MAF (ctrl) | Odds ratio (het) | Odds ratio (hom) | PAR | λs | Stage 2 pMAX | Stage 2 pMAX (perm) | Stage 1 pMAX | Stage 1 pMAX (perm) | Nearest gene
    rs7903146 | 10 | 114,748,339 | T | C | 0.406 | 0.293 | 1.65 ± 0.19 | 2.77 ± 0.50 | 0.28 | 1.0546 | 1.5 × 10^-34 | <1.0 × 10^-7 | 3.2 × 10^-17 | <3.3 × 10^-10 | TCF7L2
    rs13266634 | 8 | 118,253,964 | C | C | 0.254 | 0.301 | 1.18 ± 0.25 | 1.53 ± 0.31 | 0.24 | 1.0089 | 6.1 × 10^-8 | 5.0 × 10^-7 | 2.1 × 10^-5 | 1.8 × 10^-5 | SLC30A8
    rs1111875 | 10 | 94,452,862 | G | G | 0.358 | 0.402 | 1.19 ± 0.19 | 1.44 ± 0.24 | 0.19 | 1.0069 | 3.0 × 10^-6 | 7.4 × 10^-6 | 9.1 × 10^-6 | 7.3 × 10^-6 | HHEX
    rs7923837 | 10 | 94,471,897 | G | G | 0.335 | 0.377 | 1.22 ± 0.21 | 1.45 ± 0.25 | 0.20 | 1.0065 | 7.5 × 10^-6 | 2.2 × 10^-5 | 3.4 × 10^-6 | 2.5 × 10^-6 | HHEX
    rs7480010 | 11 | 42,203,294 | G | A | 0.336 | 0.301 | 1.14 ± 0.13 | 1.40 ± 0.25 | 0.08 | 1.0041 | 1.1 × 10^-4 | 2.9 × 10^-4 | 1.5 × 10^-5 | 1.2 × 10^-5 | LOC387761
    rs3740878 | 11 | 44,214,378 | A | A | 0.240 | 0.272 | 1.26 ± 0.29 | 1.46 ± 0.33 | 0.24 | 1.0046 | 1.2 × 10^-4 | 2.8 × 10^-4 | 1.8 × 10^-5 | 1.3 × 10^-5 | EXT2
    rs11037909 | 11 | 44,212,190 | T | T | 0.240 | 0.271 | 1.27 ± 0.30 | 1.47 ± 0.33 | 0.25 | 1.0045 | 1.8 × 10^-4 | 4.5 × 10^-4 | 1.8 × 10^-5 | 1.3 × 10^-5 | EXT2
    rs1113132 | 11 | 44,209,979 | C | C | 0.237 | 0.267 | 1.15 ± 0.27 | 1.36 ± 0.31 | 0.19 | 1.0044 | 3.3 × 10^-4 | 8.1 × 10^-4 | 3.7 × 10^-5 | 2.9 × 10^-5 | EXT2
    Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allele frequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (λs) were estimated using stage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higher frequency in controls; pMAX, P-value of the MAX statistic from the χ² distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX and pMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.
    [Figure, panels a-d: -log10[P] for SNPs in regions labeled SLC30A8, IDE, HHEX, KIF11, EXT2, ALX4, and LOC387761.]
    Sladek et al., Nature 445 (22 February 2007)
    Correlation between occurrence of genetic loci: in GWAS, allows one to trace to the “causal” locus. Independence of association: how to untangle the “web” of exposure? (ρ among, e.g., β-carotene, hydrocarbons, γ-tocopherol)
  • 98.
    Interdependencies of the exposome: Correlation globes paint a complex view of exposure. For each pair of E: Spearman ρ (575 factors: 81,937 correlations); permuted data to produce “null ρ”; sought replication in > 1 cohort. Globe legend: red: positive ρ; blue: negative ρ; thickness: |ρ|. Pac Symp Biocomput 2015; JECH 2015
  • 99.
    Interdependencies of the exposome: Correlation globes paint a complex view of exposure. For each pair of E: Spearman ρ (575 factors: 81,937 correlations); permuted data to produce “null ρ”; sought replication in > 1 cohort. Effective number of variables: 500 (10% decrease). Globe legend: red: positive ρ; blue: negative ρ; thickness: |ρ|. Pac Symp Biocomput 2015; JECH 2015
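The pairwise step (Spearman ρ for a pair of exposures, compared against a "null ρ" obtained from permuted data) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the function names, the two-sided comparison on |ρ|, and the toy data are hypothetical, and the permutation scheme used in the cited papers may differ.

```python
import random


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided p-value: share of shuffles with |rho| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy measurements for two hypothetical exposures in eight participants.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
rho = spearman(x, y)
p_value = permutation_pvalue(x, y)
print(round(rho, 3), p_value)
```

Repeating this for every pair of factors gives the edges of a correlation globe: sign of ρ for color, |ρ| for thickness.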
  • 100.
    Estimating the LD of the exposome: Diabetes vs. death have distinct globes (POPs vs. smoking?)... [Figure: correlation globes for diabetes and for all-cause mortality.] Pac Symp Biocomput. 2015
  • 101.
    Browse these and 82 other phenotype-exposome globes! http://www.chiragjpgroup.org/exposome_correlation
  • 102.
    What nodes have the most connections (“hubs”)? What factor(s) is (are) correlated with many other exposures? Sex, age, and income.
  • 103.
    Lower income associated with 43 of 330 (>13%) exposures and biomarkers in the US population. Higher income: lower levels of biomarkers. (Another 23 associated with higher levels; 66 of 330 = 20% in total.) AJE, 2015
    [Figure: volcano plot of overall income/poverty ratio effects; x-axis: effect size per 1 SD of income/poverty ratio; y-axis: -log10(p-value); validated results highlighted.] Labeled points: pulse rate; eosinophils number; lymphocyte number; monocyte number; segmented neutrophils number; blood 2,5-dimethylfuran; cadmium; lead; cotinine; C-reactive protein; Floor, GFAAS; protoporphyrin; glycohemoglobin; glucose, plasma; γ-tocopherol; hepatitis A antibody; homocysteine; herpes I; herpes II; red cell distribution width; alkaline phosphatase; globulin; glucose, serum; gamma glutamyl transferase; triglycerides; blood benzene; blood 1,4-dichlorobenzene; blood ethylbenzene; blood styrene; blood toluene; blood m-/p-xylene; white blood cell count; mono-benzyl phthalate; 3-fluorene; 2-fluorene; 3-phenanthrene; 2-phenanthrene; 1-pyrene; cadmium, urine; albumin, urine; lead, urine.
  • 104.
    EWAS: Possible to accelerate the pace of discovery of exposures
    • generalizable, comprehensive, transparent, and systematic study of environment
    • Created hypotheses for T2D, CVD, death, and others
    • What is LD of the environment?
    • Needles among needles
    • Confounding, reverse causality...
    [Figure: Manhattan-style plot of -log10(p-value) by exposure category: acrylamide, allergen test, bacterial infection, cotinine, dialkyl, dioxins, furans (dibenzofuran), heavy metals, hydrocarbons, latex, nutrients (carotenoid, minerals, vitamins A, B, C, D, E), PCBs, perchlorate, pesticides (atrazine, carbamate, chlorophenol, organochlorine, organophosphate, pyrethroid), phenols, phthalates, phytoestrogens, polybrominated ethers, polyfluorochemicals, viral infection, volatile compounds.]
    HDL-C: 1-10 mg/dL; T2D: OR ~2-3; mortality: HR ~1.5-2
  • 105.
    Can exposure enable re-classification of phenotypes?
  • 106.
    “The use of multiple molecular parameters to characterize disease [P] may lead to a more accurate and fine-grained classification of disease [P]…” Committee on A Framework for Developing a New Taxonomy of Disease, Board on Life Sciences, Division on Earth and Life Studies, NRC, National Academy of Sciences, 2011. “Multiple molecular parameters” must include E!
  • 107.
    An icon for “precision medicine”? Linnaeus: classification of phenotypes (P) for treatment and prevention (18th century). Signs (signa), symptoms; essensia (essence of symptoms; e.g., inflammation); causa (what caused the disease; e.g., pathogen). Related diseases: common cause and treatment. Class 5: MENTALES (mental disturbances); Order 1: IDEALIS (faulty judgment); Order 2: IMAGINI (imagination disorder); Order 3: PATHETICI (irregular desires); L-5-3: CITTA (eat the inedible); L-5-3: TARANTISMUS (dancing via tarantula bite). Cogn Behav Neurol 2012
  • 108.
    Classification of phenotypes (P) and disease today via the International Classification of Diseases
  • 109.
    We are many phenotypes simultaneously: Can we better categorize these P? Body measures: body mass index, height. Blood pressure & fitness: systolic BP, diastolic BP, pulse rate, VO2 max. Metabolic: glucose, LDL-cholesterol, triglycerides. Inflammation: C-reactive protein, white blood cell count. Kidney function: creatinine, sodium, uric acid. Liver function: aspartate aminotransferase, gamma glutamyltransferase. Aging: telomere length.
  • 110.
    EWAS-derived phenotype-exposure association map: A 2-D view of phenotype-exposure associations for re-classification. [Figure: schematic map; example phenotypes: glucose, BMI, height, cholesterol; example exposures: PCB170, β-carotene, folate.] http://bit.ly.com/pemap
  • 111.
    Creation of a phenotype-exposure association map: A 2-D view of 83 phenotype by 252 exposure associations. 252 biomarkers of exposure × 83 clinical trait phenotypes (~21K regressions) in NHANES 1999-2000, 2001-2002, 2005-2006; adjusted by age, age², sex, race, income, chronic disease; significant (FDR < 5%) associations replicated in 2003-2004. Association size: > 0 or < 0. Clusters of exposures associated with clusters of phenotypes? With Hugues Aschard, JP Ioannidis.
  • 112.
    EWAS-derived phenotype-exposure association map: A 2-D view of connections between P and E. [Figure: heatmap of phenotypes versus exposures; color key with histogram of counts: association size from -0.4 (-) to 0.4 (+).] http://bit.ly.com/pemap
  • 113.
    EWAS-derived phenotype-exposure association map: A 2-D view of connections between P and E. [Figure: the same heatmap of phenotypes versus exposures (association size from -0.4 to 0.4), with annotated clusters: nutrients; BMI, weight, BMD; metabolic; renal function; PCBs; blood parameters; hydrocarbons.] http://bit.ly.com/pemap
  • 114.
    Toward a phenotype-exposure association map: (Re)-categorizing phenotypes with E. [Figure: dendrogram of phenotypes clustered by exposure associations; axis: distance (0 to 7); leaves labeled by category (liver, kidney, immunological, body measures, cancer, blood, heart, bone, blood pressure, metabolic, nutrition). Highlighted clusters: inflammation; adiposity; kidney function; metabolic traits.]
  • 115.
    Toward a phenotype-exposure association map: (Re)-categorizing phenotypes with E. [Figure: the same dendrogram of phenotypes (axis: distance, 0 to 7). Highlighted: “bad” cholesterol; “good” cholesterol.]
  • 116.
    Toward a phenotype-exposure association map: (Re)-categorizing phenotypes with E. [Figure: the same dendrogram of phenotypes (axis: distance, 0 to 7). Highlighted: height + BMD.]
  • 117.
  • 118.
    1 to 66 exposures identified for 81 phenotypes. Additive effect of E factors: describe < 20% of variability in P (on average: 8%). σ²E? [Figure: bar chart of R² × 100 for each phenotype.]
  • 119.
    Emerging technologies to ascertain the exposome will enable biomedical discovery. High-throughput E standards: mitigate the fragmented literature of associations. Confounding, reverse causality: how to handle at large dimension? E.g., EWASs in complex disease through the life course. Enable more precise definitions of P.
  • 120.
    ...but what about interaction between these factors? Does a combination of genetic and environmental factors impart different risk for disease than either alone? P = G x E. Complex traits are a function of genes and environment...
  • 121.
    Gene-Environment Interactions: the effect of a combination of G and E differs from that of the variant or factor alone. Find additional disease risk (variance). Posit biological mechanisms. Example: bladder cancer, NAT2 variant (G+/G-) by smoking (E+/E-), cancer vs. non-cancer. Analytically complex: • How do you select which G and E to test? • Need a lot of samples (power!) Few studies exist that measure G & E together. Environmental Toxicology. 2012; Bioinformatics. 2012; Curr Op Env Health (in press)
  • 122.
    Why not investigate genes and environment simultaneously: Analytic complexity and large numbers of interactions! G genetic variants and E exposures = G × E possible pairs. Example: 10 genetic variants (e.g., rs13266634 (SLC30A8), rs1801282 (PPARγ), rs7903146 (TCF7L2), ...) × 10 exposures (e.g., smoke?, vitamin E, radiation, ..., pesticide, vitamin C) = 100 possible interactions. Bioinformatics. 2012; Curr Op Env Health (in press)
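The counting argument can be made concrete with a few lines of code. The variant and exposure names below are placeholders, not identifiers from the study; the 18 × 5 case uses the numbers of GWAS loci and EWAS factors given later in the deck.

```python
from itertools import product

# Placeholder names for 10 variants and 10 exposures (hypothetical).
variants = [f"variant_{i}" for i in range(1, 11)]
exposures = [f"exposure_{j}" for j in range(1, 11)]

# Every (variant, exposure) pair is one candidate interaction test.
pairs = list(product(variants, exposures))
n_pairs = len(pairs)


def n_interactions(n_variants, n_exposures):
    """Number of pairwise G x E tests for the given numbers of variants and exposures."""
    return n_variants * n_exposures


print(n_pairs)                # 10 variants x 10 exposures
print(n_interactions(18, 5))  # 18 GWAS loci x 5 EWAS factors
```

Because the count is a product, it grows quickly: restricting both lists to factors with established main effects, as the following slides do, keeps the number of tests (and the multiple-testing burden) small.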
  • 124.
    Combining EWAS and GWAS: Select pairs by their main effects. EWAS findings [Figure: Manhattan-style plot of -log10(p-value) by exposure category], e.g., γ-tocopherol, β-carotene, heptachlor, PCB170 (ex: PLOS ONE (2010)) + GWAS findings [Figure: -log10(P) for established and newly identified T2D loci], e.g., rs7903146 (TCF7L2), rs13266634 (SLC30A8), rs1801282 (PPARG) (ex: Voight et al., Nature Genetics (2010); WTCCC, Nature (2007); Sladek et al., Nature (2007)). Human Genetics. 2013
  • 125.
    Prototype G-EWAS Methodology: GxE in association to T2D
    18 GWAS loci × 5 EWAS factors (total: 90)
    GWAS loci: rs10923931 (NOTCH2), rs7903146 (TCF7L2), rs13266634 (SLC30A8), rs7901695 (TCF7L2), rs2383208 (Unknown), rs1260326 (GCKR), rs780094 (GCKR), rs2237895 (KCNQ1), rs10811661 (Unknown), rs4712523 (CDKAL1), rs4607103 (Unknown), rs1111875 (Unknown), rs7578597 (THADA), rs4402960 (IGF2BP2), rs1801282 (PPARG), rs12779790 (Unknown), rs8050136 (FTO), rs864745 (JAZF1)
    EWAS factors: γ-tocopherol, cis-β-carotene, trans-β-carotene, PCB170, heptachlor
    Logistic regression: diabetes defined as fasting blood glucose ≥ 126 mg/dL; adjusted for age, BMI, sex, race. Example: logit(diabetes) modeled on rs13266634 (0, 1, or 2 risk alleles) and z(γ-tocopherol).
    Bonferroni correction: number of effective tests [1] ≅ 80; α = 0.05/80 = 0.0006
    False discovery rate: parametric bootstrap of null model [2]
    1.) Nyholt. AJHG 2004 2.) Bůžková et al. Annals of Human Genetics. 2010
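One way to write down the model the slide describes is a logistic regression with a product term. The slide names only the outcome, the genotype coding, the standardized exposure, and the covariates; the explicit product term and the coefficient symbols below are assumptions that follow the standard parametrization of a multiplicative interaction.

```latex
\operatorname{logit}\Pr(\mathrm{T2D}=1)
  = \beta_0 + \beta_G\,G + \beta_E\,z(E) + \beta_{G\times E}\,G\,z(E)
    + \boldsymbol{\gamma}^{\top}\mathbf{c},
\qquad G \in \{0,1,2\}
```

Here $G$ is the number of risk alleles, $z(E)$ is the exposure in SD units, and $\mathbf{c}$ holds age, BMI, sex, and race. Under this form, the interaction is tested through $\beta_{G\times E}$, and with about 80 effective tests the Bonferroni threshold is $\alpha = 0.05/80 \approx 0.0006$, as stated on the slide.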
  • 126.
    Per-risk allele OR for rs13266634 (SLC30A8), Stratified by E
    Increase or decrease up to 30-40% vs. marginal effect! (marginal OR = 1.1)
    Adjusted for race, sex, BMI, age
    trans-β-carotene, per risk allele OR (95% CI): low (-1SD) 1.8 [1.3, 2.6]; mean 1.1 [0.79, 1.5]; high (+1SD) 0.65 [0.4, 1.1]; p-value: 5e-05; N (cases): 1702 (164); FDR = 2%
    γ-tocopherol, per risk allele OR (95% CI): low (-1SD) 0.82 [0.52, 1.3]; mean 1.1 [0.87, 1.5]; high (+1SD) 1.6 [1.3, 2]; p-value: 0.0094; N (cases): 2925 (274); FDR = 18%
    Human Genetics. 2013
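A stratified per-allele odds ratio of this kind can be read off a logistic model with a product term. The sketch below assumes the usual form logit(diabetes) = β0 + βG·G + βE·z(E) + βGxE·G·z(E) + covariates, which the slide implies but does not spell out. The coefficients are illustrative only: they are back-calculated from the rounded trans-β-carotene odds ratios on the slide and are not the study's fitted values. The name `per_allele_or` is hypothetical.

```python
import math

# Illustrative coefficients only (assumption): back-calculated from the rounded
# slide values OR = 1.1 at mean exposure and OR = 1.8 at -1 SD of trans-beta-carotene.
BETA_G = math.log(1.1)          # log-odds per risk allele at mean exposure (z = 0)
BETA_GXE = math.log(1.1 / 1.8)  # change in per-allele log-odds per +1 SD of exposure


def per_allele_or(z_e: float, beta_g: float = BETA_G, beta_gxe: float = BETA_GXE) -> float:
    """Per-risk-allele odds ratio at exposure z-score z_e under a G x E product term."""
    return math.exp(beta_g + beta_gxe * z_e)


for label, z in (("low (-1SD)", -1.0), ("mean", 0.0), ("high (+1SD)", 1.0)):
    print(f"trans-beta-carotene {label}: per-allele OR = {per_allele_or(z):.2f}")
```

With these illustrative inputs the high-exposure value comes out near 0.67, close to the 0.65 reported on the slide; the gap is expected because the inputs are rounded.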
  • 127.
    It is possible to detect GxE by combining EWAS and GWAS
    Detected interaction effect changes between EWAS and GWAS factors. What is the biological mechanism of interaction? Need to replicate these results in diverse populations. Re-capture GWAS “investment” by considering prevalent E-factors?!
  • 128.
    Possible to utilize the XWAS approach for general-purpose discovery…
  • 129.
    PheWAS: dissecting the shared genetic architecture (pleiotropy) of disease!
    PheWAS: Phenome-wide association study. Denny et al, Nature Biotech 2013
    [Figure, panel c: PheWAS plot for rs4977574 (CDKN2BAS), the MI GWAS SNP; −log10(P) by phenotype category (infectious, neoplastic, psychiatric, neurologic, cardiovascular, pulmonary, digestive, genitourinary, dermatologic, musculoskeletal, injuries, symptoms and signs, hematopoietic, endocrine and metabolic). Labeled phenotypes: coronary atherosclerosis, ischemic heart disease, chronic ischemic heart disease, angina pectoris, occlusion & stenosis of precerebral arteries, hemorrhoids, intermediate coronary syndrome, myocardial infarction, polyneuropathy in diabetes, type 2 diabetic nephropathy.]
    [Figure, panel d: PheWAS plot for rs660895 (HLA-DRB1), the RA GWAS SNP; same axes. Labeled phenotypes: type 1 diabetic ketoacidosis, type 1 diabetes, type 2 diabetes, arteritides, giant cell arteritis, infectious conjunctivitis, visual field defects, viral pneumonia, nasal polyps, rheumatoid arthritis, shock, type 1 diabetes nephropathy, polyneuropathy in diabetes, type 1 diabetic neuropathy.]
    From the figure caption: each panel represents 1,358 phenotypes tested against a particular SNP using logistic regression, adjusted for age, sex, study site and the first three principal components. Phenotypes are grouped along the x axis by category. The upper red lines indicate P = 4.6 × 10−6 (FDR = 0.1), the lower blue lines indicate P = 0.05, and the dashed lines are a Bonferroni correction (P = 0.05/1,358). Panel (b) shows rs2853676 in TERT (glioma), panel (c) rs4977574 (previously associated with myocardial infarction), and panel (d) rs660895 near HLA-DRB1 (rheumatoid arthritis). Results and plots are available at http://phewascatalog.org/.
    From the paper text: “Our study replicated the association between rheumatoid arthritis and rs660895 near HLA-DRB1 (Fig. 3d; OR = 1.56, P = 6.7 × 10−8). This SNP was also strongly associated with type 1 diabetes (OR = 1.44, P = 7.1 × 10−8) ...”
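The dashed Bonferroni line in the PheWAS panels is the same calculation applied across phenotypes rather than SNP-exposure pairs. This is a small sketch, assuming only the numbers quoted on the slide (1,358 phenotypes and the two reported p-values for rs660895); the variable and function names are hypothetical.

```python
import math

N_PHENOTYPES = 1358                  # phenotypes tested per SNP, from the caption
bonferroni_p = 0.05 / N_PHENOTYPES   # dashed line in the figure


def neg_log10(p: float) -> float:
    """Convert a p-value to the -log10 scale used on the plot's y axis."""
    return -math.log10(p)


# p-values for rs660895 quoted on the slide
reported = {"rheumatoid arthritis": 6.7e-8, "type 1 diabetes": 7.1e-8}

# Keep only the associations that clear the Bonferroni line
passing = {name: neg_log10(p) for name, p in reported.items() if p < bonferroni_p}

print(f"Bonferroni line at -log10(P) = {neg_log10(bonferroni_p):.2f}")
for name, height in passing.items():
    print(f"{name}: -log10(P) = {height:.2f}")
```

Both quoted associations sit well above the Bonferroni line on the −log10 scale.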
  • 130.
    MWAS: Medication-wide association study
    Ryan, PB., CPT 2013 (www.nature.com/psp)
    [Figure: MWAS of acute myocardial infarction in MarketScan CCAE (OMOP). p-value (p_full, log scale from 1.0E-001 to 1.0E-012) for each drug, grouped by ATC level 3 class and colored by ATC level 1 class (alimentary tract and metabolism; antiinfectives for systemic use; antiparasitic products; blood and blood forming organs; cardiovascular system; dermatologicals; genito-urinary system and sex hormones; musculo-skeletal system; nervous system; respiratory system; sensory organs; systemic hormonal preparations). Point shape by ground truth (0 or 1). Horizontal lines: P < 0.05 and Bonferroni adjustment.]
  • 131.
    In conclusion: on GWAS and EWAS
    GWAS has been unparalleled in biological discovery... ... coupled with EWAS, will lead to precise and personal medicine.
    [Figure: EWAS Manhattan plot, −log10(p-value) by environmental factor category]
    [Figure: excerpt from a GWAS review article] “... to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits, between 10% and 20% of genetic variance has been accounted for (Table 1). In comparison to the pre-GWAS era, the proportion of genetic variation accounted for by newly discovered variants that are segregating in the population is large. It is clear that for most complex traits that have been investigated by GWAS, multiple identified loci have genome-wide statistical significance, and thus it is likely that there are (many) other loci that have not been identified because of a lack of statistical significance (false negatives). ...”
    Figure 1. GWAS Discoveries over Time. Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10−8 are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r2 > 0.8 estimated from the entire HapMap samples are excluded.
    Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size. (A) Selected quantitative traits. (B) Selected diseases. The coordinates are on the log scale. ...
  • 132.
    Thanks...
    Harvard HMS: Isaac Kohane, Susanne Churchill, Stan Shaw, Nathan Palmer, Jenn Grandfield, Sunny Alvear, Michal Preminger
    Harvard Chan: Hugues Aschard, Francesca Dominici
    Stanford: John Ioannidis, Atul Butte (UCSF)
    U Queensland: Jian Yang, Peter Visscher
    Cochrane: Belinda Burford
    CDC/NCHS: Ajay Yesupriya
    Imperial: Ioanna Tzoulaki, Paul Elliott
    Lund (Sweden): Jan Sundquist, Kristina Sundquist
    Chirag Lakhani, Adam Brown, Arjun Manrai, Erik Corona, Nam Pho
    NIH Common Fund, Big Data to Knowledge
    Chirag J Patel, chirag@hms.harvard.edu, @chiragjp, www.chiragjpgroup.org