Repurposing large datasets to dissect exposomic contributions in health and disease

Repurposing large datasets to dissect
exposomic (and genomic) contributions in
health and disease
Chirag J Patel

CDC Office of Public Health Genomics

2/22/16
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org

P = G + E
Type 2 Diabetes

Cancer

Alzheimer’s

Gene expression
Phenotype Genome
Variants
Environment
Infectious agents

Nutrients

Pollutants

Drugs

We are great at G investigation!
over 2400

Genome-wide Association Studies (GWAS)

https://www.ebi.ac.uk/gwas/
G

Nothing comparable to elucidate E influence!
E: ???
We lack high-throughput methods
and data to discover new E in P…

A similar paradigm for discovery should exist

for E!
Why?

$H^2 = \sigma^2_G / \sigma^2_P$
Heritability ($H^2$) is the share of phenotypic variance in a
population that is attributed to genetic variance.
It indicates the proportion of phenotypic differences attributed to G.
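The variance ratio above can be illustrated with a small simulation. This is a minimal sketch, assuming a purely additive model P = G + E with independent G and E; the variances and sample size are illustrative choices, not estimates from any study.

```python
import random
import statistics

def heritability(g_values, p_values):
    """H^2 = Var(G) / Var(P): share of phenotypic variance due to genetic variance."""
    return statistics.pvariance(g_values) / statistics.pvariance(p_values)

# Hypothetical simulation: P = G + E with independent G and E.
rng = random.Random(0)
n = 20000
g = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Var(G) = 1 (assumed)
e = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Var(E) = 1 (assumed)
p = [gi + ei for gi, ei in zip(g, e)]

h2 = heritability(g, p)
print(f"H^2 = {h2:.2f}")  # near 0.5 when Var(G) equals Var(E)
```

When G and E contribute equal variance, about half of the phenotypic variance remains for E, which is the situation the later slides argue for.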

Height is an example of a heritable trait:

Francis Galton shows how it's done (1887)
“mid-height of 205 parents
described 60% of variability of 928
offspring”

Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Polycystic ovary syndrome
Attention deficit hyperactivity disorder
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Bone mineral density
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Rheumatoid arthritis
Crohn's disease
Migraine
Thyroid cancer
Autism
Blood pressure, diastolic
Body mass index
Depression
Coronary artery disease
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Blood pressure, systolic
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
Heritability (%, axis 0 to 100): Var(G)/Var(Phenotype). Source: SNPedia.com
G estimates for burdensome diseases are low and variable:
massive opportunity for high-throughput E discovery
Type 2 Diabetes
Heart Disease
Autism (50%???)

G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
$\sigma^2_E$: Exposome!

Despite a century of research on complex traits in humans, the
relative importance and specific nature of the influences of
genes and environment on human traits remain controversial.
We report a meta-analysis of twin correlations and reported
variance components for 17,804 traits from 2,748 publications
including 14,558,903 partly dependent twin pairs, virtually
all published twin studies of complex traits. Estimates of
heritability cluster strongly within functional domains,
and across all traits the reported heritability is 49%. For a
majority (69%) of traits, the observed twin correlations are
consistent with a simple and parsimonious model where twin
resemblance is solely due to additive genetic variation. The
data are inconsistent with substantial influences from shared
environment or non-additive genetic variation. This study
provides the most comprehensive analysis of the causes of
individual differences in human traits thus far and will guide
future gene-mapping efforts. All the results can be visualized
using the MaTCH webtool.
Specifically, the partitioning of observed variability into underlying
genetic and environmental sources and the relative importance of
additive and non-additive genetic variation are continually debated [1–5].
Recent results from large-scale genome-wide association studies
(GWAS) show that many genetic variants contribute to the variation
in complex traits and that effect sizes are typically small [6,7]. However,
the sum of the variance explained by the detected variants is much
smaller than the reported heritability of the trait [4,6–10]. This 'missing
heritability' has led some investigators to conclude that non-additive
variation must be important [4,11]. Although the presence of gene-gene
interaction has been demonstrated empirically [5,12–17], little is known
about its relative contribution to observed variation [18].
In this study, our aim is twofold. First, we analyze empirical estimates
of the relative contributions of genes and environment for
virtually all human traits investigated in the past 50 years. Second, we
assess empirical evidence for the presence and relative importance of
non-additive genetic influences on all human traits studied. We rely
on classical twin studies, as the twin design has been used widely
to disentangle the relative contributions of genes and environment,
across a variety of human traits. The classical twin design is based
on contrasting the trait resemblance of monozygotic and dizygotic
twin pairs. Monozygotic twins are genetically identical, and dizygotic
twins are genetically full siblings. We show that, for a majority of traits
(69%), the observed statistics are consistent with a simple and parsimonious
model where the observed variation is solely due to additive
genetic variation. The data are inconsistent with a substantial influence
from shared environment or non-additive genetic variation. We also
show that estimates of heritability cluster strongly within functional
domains, and across all traits the reported heritability is 49%. Our
results are based on a meta-analysis of twin correlations and reported
variance components for 17,804 traits from 2,748 publications including
14,558,903 partly dependent twin pairs, virtually all twin studies of
complex traits published between 1958 and 2012. This study provides
the most comprehensive analysis of the causes of individual differences
in human traits thus far and will guide future gene-mapping efforts. All
Meta-analysis of the heritability of human traits based on fifty years of twin studies
Tinca J C Polderman (1,10), Beben Benyamin (2,10), Christiaan A de Leeuw (1,3), Patrick F Sullivan (4–6),
Arjen van Bochoven (7), Peter M Visscher (2,8,11) & Danielle Posthuma (1,9,11)
(1) Department of Complex Trait Genetics, VU University, Center for Neurogenomics
and Cognitive Research, Amsterdam, the Netherlands. (2) Queensland Brain
Institute, University of Queensland, Brisbane, Queensland, Australia. (3) Institute
for Computing and Information Sciences, Radboud University Nijmegen,
Nijmegen, the Netherlands. (4) Center for Psychiatric Genomics, Department
of Genetics, University of North Carolina, Chapel Hill, North Carolina, USA.
(5) Department of Psychiatry, University of North Carolina, Chapel Hill, North
Carolina, USA. (6) Department of Medical Epidemiology and Biostatistics,
Karolinska Institutet, Stockholm, Sweden. (7) Faculty of Sciences, VU University,
Insight into the nature of observed variation in human traits is impor-
tant in medicine, psychology, social sciences and evolutionary biology.
It has gained new relevance with both the ability to map genes for
human traits and the availability of large, collaborative data sets to do
so on an extensive and comprehensive scale. Individual differences in
human traits have been studied for more than a century, yet the causes
of variation in human traits remain uncertain and controversial.
Nature Genetics, 2015
17,804 traits of the phenome
2,748 publications

14,558,903 twin pairs
Average H2 (genome): 0.49
Exposome may play an equal role.

It took a new paradigm of GWAS for discovery:
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:
http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-current day
High-throughput variant
assay
< $99 for ~1M variants
Measurement tools
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000 cases of seven common diseases and
3,000 shared controls
The Wellcome Trust Case Control Consortium
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the
identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip
500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major
diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at
P < 5 × 10^-7: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1
diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these
signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found
compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a
Nature, Vol 447, 7 June 2007, doi:10.1038/nature05911
WTCCC, Nature, 2007.
Comprehensive, high-throughput analyses
GWAS

Explaining the other 50%:
A big data-driven paradigm for robust discovery of
E in disease via EWAS and the exposome
what to measure? how to measure?
PERSPECTIVES
[Figure labels. External environment: radiation, diet, pollution, infections, drugs, life-style, stress. Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora. Exposome: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins.]
… critical entity for disease etiology (7). Recent discussion
has focused on whether and how to implement this vision
(8). Although fully characterizing human exposomes is
daunting, strategies can be developed for getting “snapshots”
of critical portions of a person’s exposome during
different stages of life. At one extreme is a “bottom-up”
strategy in which all chemicals in each external source
of a subject’s exposome are measured at each time point.
Although this approach would have the advantage of relating
important exposures to the air, water, or diet, it would
require enormous effort and would miss essential components
of the internal chemical environment due to such
factors as gender, obesity, inflammation, and stress. By
contrast, a “top-down” strategy would measure all chemicals
(or products of their downstream processing or
effects, so-called read-outs or signatures) in a subject’s
blood. This would require only a single blood specimen
at each time point and would relate directly …
… ruptors and can be measured through serum …
… some (telomere) length in peripheral blood mononuclear
cells responded to chronic psychological stress, possibly
mediated by the production of reactive oxygen species (15).
Characterizing the exposome represents a technological
challenge like that of the human genome project, which
began when DNA sequencing was in its infancy (16).
Analytical systems are needed to process small amounts of
blood from thousands of subjects. Assays should be
multiplexed for measuring many chemicals in each class of
interest. Tandem mass spectrometry, gene and protein
chips, and microfluidic systems offer the means to do this.
Platforms for high-throughput assays should lead to
economies of scale, again like those experienced by the
human genome project. And because exposome technologies
would provide feedback for therapeutic interventions and
personalized medicine, they should motivate the development
of commercial devices for screening important
environmental exposures in blood samples.
With successful characterization of both …
Characterizing the exposome. The exposome represents
the combined exposures from all sources that reach the
internal chemical environment. Toxicologically important
classes of exposome chemicals are shown. Signatures and
biomarkers can detect these agents in blood or serum.
“A more comprehensive view of
environmental exposure is
needed ... to discover major
causes of diseases...”
how to analyze in relation to health?
Wild, 2005

Rappaport and Smith, 2010, 2011

Buck-Louis and Sundaram 2012

Miller and Jones, 2014

Patel CJ and Ioannidis JPAI, 2014

What is a Genome-Wide Association Study (GWAS)?:
Data-driven search for G factors in P
[Figure: genome-wide Manhattan plot of −log10(P) (0 to 15) by chromosome (1 to 22, X), with a panel of observed test statistics; panels a and b. Source: Nature, Vol 447, 7 June 2007]
WTCCC, 2007
AA Aa aa
case
control
Robust, transparent, and comprehensive search for G in P
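
The case/control by genotype (AA, Aa, aa) comparison above can be sketched as a single-variant test. This is a minimal illustration, assuming a Pearson chi-square test on a 2 x 3 genotype table with non-zero counts in every column; the counts are invented, and a real GWAS repeats such a test across all variants with quality control and population-structure adjustment. The 5 × 10^-7 threshold is the one quoted in the WTCCC abstract.

```python
import math

def genotype_chi2(cases, controls):
    """Pearson chi-square test on a 2 x 3 table (rows: case/control; columns: AA, Aa, aa)."""
    table = [cases, controls]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom, the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value

# Hypothetical genotype counts for one variant.
chi2, p = genotype_chi2(cases=[300, 500, 200], controls=[400, 450, 150])
passes_threshold = p < 5e-7  # threshold quoted in the WTCCC abstract
print(f"chi2 = {chi2:.2f}, p = {p:.2e}, passes threshold: {passes_threshold}")
```

Applying one fixed, stringent threshold to every variant is what makes the search multiplicity-controlled rather than selective.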

comprehensive and transparent
multiplicity controlled
novel findings (and validated)
Patel CJ, Ioannidis JPAI, JAMA 2014
Patel CJ, Ioannidis JPAI, JECH 2014
Why carry out a Genome-Wide Association Study:
Analytically robust, transparent, and comprehensive

search for G in P

GWAS example
Example of the big data paradigm:

GWAS drives discovery of G in P
ARTICLES
[Figure: genome-wide Manhattan plots of –log10(P) by chromosome (1 to 22, X); upper panel: unconditional analysis, lower panel: conditional analysis]
Legend: locus established previously; locus identified by current study; locus not confirmed by current study
Labeled loci: BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, PPAR
Conditional analysis legend: suggestive statistical association (P < 1 × 10^–5); association in identified or established region (P < 1 × 10^–4)
Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-
analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those
taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and
should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously
established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered
conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10−5), whereas
secondary signals close to already confirmed T2D loci are shown in purple (P < 10−4).
Voight et al, Nature Genetics 2010

N=8K T2D, 39K Controls

Impossible to reach this scale in E based investigations

Connecting E with Disease:
Missing the “System” of Exposures?
2×2 table: exposure (E+, E-) by outcome (diseased, non-diseased)
?
We are exposed to many things, but studies do not assess the multiplicity.
Fragmented literature of associations.
Challenge to discover E associated with disease.

Examples of exposome-driven discovery machinery

Gold standard for breadth of exposure & behavior data:
National Health and Nutrition Examination Survey
Nutrients and Vitamins

vitamin D, carotenes
Infectious Agents

hepatitis, HIV, Staph. aureus
Plastics and consumables

phthalates, bisphenol A
Physical Activity

e.g., steps
Pesticides and pollutants

atrazine; cadmium; hydrocarbons
Drugs

statins; aspirin

What E are associated with all-cause mortality and

telomere length?

How does it work?:
Searching for exposures and behaviors associated with all-
cause mortality.
NHANES: 1999-2004
National Death Index linked mortality
246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)
~5.5 years of followup
Cox proportional hazards
baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)
~2.8 years of followup
p < 0.05
Int J Epidem. 2013
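
The two-stage screen above (FDR < 5% in the discovery cohort, then p < 0.05 in the replication cohort) can be sketched as follows. This is a minimal illustration, assuming the per-exposure p-values have already been produced by the Cox models; the exposure names and p-values are hypothetical, and the Benjamini-Hochberg step-up procedure is used here as one common way to control the FDR, which the slide does not specify.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices rejected at FDR level alpha (Benjamini-Hochberg step-up)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            largest_k = rank  # keep the largest rank that meets its threshold
    return set(order[:largest_k])

# Hypothetical p-values, one per exposure, from discovery and replication cohorts.
discovery_p = {"exposure_a": 0.0001, "exposure_b": 0.004, "exposure_c": 0.04,
               "exposure_d": 0.2, "exposure_e": 0.6}
replication_p = {"exposure_a": 0.01, "exposure_b": 0.3, "exposure_c": 0.02,
                 "exposure_d": 0.5, "exposure_e": 0.7}

names = list(discovery_p)
rejected = benjamini_hochberg([discovery_p[n] for n in names], alpha=0.05)
candidates = [names[i] for i in sorted(rejected)]                  # FDR < 5% in discovery
validated = [n for n in candidates if replication_p[n] < 0.05]     # p < 0.05 in replication
print(candidates, validated)
```

Only exposures that pass the discovery screen are carried into replication, so a small replication p-value alone does not count as a finding.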

[Figure: volcano plot of adjusted hazard ratio (x axis, 0.4 to 2.8) against -log10(pvalue) (y axis, 0 to 8)]
1 Physical Activity
2 Does anyone smoke in home?
3 Cadmium
4 Cadmium, urine
5 Past smoker
6 Current smoker
7 trans-lycopene
(11)
1 age (10 year increment)
2 SES_1
3 male
4 SES_0
5 black
6 SES_2
7 SES_3
8 education_hs
9 other_eth
10 mexican
11 occupation_blue_semi
12 education_less_hs
13 occupation_never
14 occupation_blue_high
15 occupation_white_semi
16 other_hispanic
(69)
EWAS in All-cause mortality:
253 exposure/behavior associations in survival
Multivariate Cox (age, sex, income, education, race/ethnicity, occupation [in
red])
FDR < 5%
sociodemographics
replicated factor
Int J Epidem. 2013

EWAS (re)-identifies factors associated with all-cause mortality:

Volcano plot of 200 associations
age (10 years)
income (quintile 2)
income (quintile 1)
male
black
income (quintile 3)
does anyone smoke in home?
Multivariate Cox (age, sex, income, education, race/ethnicity, occupation [in red])
serum and urine cadmium [1 SD]
past smoker?
current smoker?
serum lycopene [1 SD]
physical activity [low, moderate, high activity]*
*derived from METs per activity and categorized by Health.gov guidelines
R2 ~ 2%

452 associations in Telomere Length:
Polychlorinated biphenyls associated with longer telomeres?!
IJE, in press
[Figure: volcano plot of effect size (x axis, −0.2 to 0.2; shorter telomeres to the left, longer telomeres to the right) against −log10(pvalue) (y axis, 0 to 4)]
Labeled factors: PCBs, Trunk Fat, Alk. Phos, CRP, Cadmium, Cadmium (urine), cigs per day, retinyl stearate, VO2 Max, pulse rate
FDR<5%
R2 ~ 1%
adjusted by age, age^2, race, poverty, education, occupation
median N=3000; N range: 300-7000

Samples exposed to PCBs associated with difference in genes

implicated in telomere length GWAS?
Expression differences for 24 GWAS implicated genes
Queried the Gene Expression Omnibus for PCBs

Affymetrix human arrays (GPL570)

7 gene expression experiments on humans

52 exposed; 14 unexposed
Differential gene expression and a functional analysis of PCB-exposed children:
Understanding disease and disorder development
Sisir K. Dutta (a), Partha S. Mitra (a), Somiranjan Ghosh (a), Shizhu Zang (a), Dean Sonneborn (b),
Irva Hertz-Picciotto (b), Tomas Trnovec (c), Lubica Palkovicova (c), Eva Sovcikova (c),
Svetlana Ghimbovschi (d), Eric P. Hoffman (d)
(a) Molecular Genetics Laboratory, Howard University, Washington, DC, USA
(b) Department of Public Health Sciences, University of California Davis, Davis, CA, USA
(c) Slovak Medical University, Bratislava, Slovak Republic
(d) Center for Genetic Medicine, Children's National Medical Center, Washington, DC, USA
Article history: Received 20 December 2010; Accepted 10 July 2011
Abstract: The goal of the present study is to understand the probable molecular mechanism of toxicities and the
associated pathways related to observed pathophysiology in high PCB-exposed populations. We have
performed a microarray-based differential gene expression analysis of children (mean age 46.1 months) of …
Environment International 40 (2012) 143–154
IJE, in press

Suggestive, but need more N!
[Figure: volcano plot of log(difference) (x axis, −0.50 to 0.75) against −log10(pvalue) (y axis, 0 to 2)]
1555203_s_at (SLC44A4)
1555203_s_at (MYNN)
224206_x_at (MYNN)
Could PCBs influence expression of genes

implicated in telomere length GWAS?
myoneurin

bladder, leukemia, colorectal cancer GWASs

Studying the Elusive Environment in Large Scale
It is possible that more than 50% of complex disease risk
is attributed to differences in an individual’s environment [1].
Air pollution, smoking, and diet are documented environmental
factors affecting health, yet these factors are but
a fraction of the “exposome,” the totality of the exposure
load occurring throughout a person’s lifetime [1]. Investigating
one or a handful of exposures at a time has led to a
highly fragmented literature of epidemiologic associations.
Much of that literature is not reproducible, and selective
reporting may be a major reason for the lack of
reproducibility. A new model is required to discover
environmental exposures associated with disease while
mitigating possibilities of selective reporting.
To remedy the lack of reproducibility and concerns of
validity, multiple personal exposures can be assessed
simultaneously in terms of their association with a condition
or disease of interest; the strongest associations can
then be tentatively validated in independent data sets
(eg, as done in references 2 and 3) [2,3]. The main advantages
of this process include the ability to search the list
of exposures and adjust for multiplicity systematically and
report all the probed associations instead of only the most
significant results. The term “environment-wide association
studies” (EWAS) has been used to describe this approach
(an analogy to genome-wide association studies).
For example, Wang et al [4] screened more than 2000
chemicals in serum to discover endogenous exposures
associated with risk for cardiovascular disease.
There are notable hurdles in analyzing “big” environmental
data. These same problems affect epidemiology
of 1-risk-factor-at-a-time, but in EWAS their prevalence
becomes more clearly manifest at large scale. When study- …
… the EWAS vantage point, intervening on β-carotene
(Figure, D) seems a futile exercise given its complex
relationship with other nutrients and pollutants.
Given this complexity, how can studies of environmental
risk move forward? First, EWAS analyses should be applied
to multiple data sets, and consistency can be formally
examined for all assessed correlations. Second, the temporal
relationship between exposure and changes in health
parameters may offer helpful hints about which of the
signals are more than simple correlations. Third, standardized
adjusted analyses, in which adjustments are performed
systematically and in the same way across multiple data sets,
may also help. This is in stark contrast with the current
model, whereby most epidemiologic studies use single data
sets without replication as well as non–time-dependent
assessments, and reported adjustments are markedly different
across reports and data sets, even those performed by
the same team (different approaches increase validity but
must be reconciled and assimilated).
However, eventually for most environmental correlates,
there may be unsurpassable difficulty establishing
potential causal inferences based on observational
data alone. Factors that seem protective may sometimes
be tested in randomized trials. The complexity of
the multiple correlations also highlights the challenge
that intervening to modify 1 putative risk factor also may
inadvertently affect multiple other correlated factors.
Even when a seemingly simple intervention is tested in
randomized trials (affecting a single risk factor among the
many correlations), the intervention is not really simple.
In essence what is tested are multiple perturbations of
factors correlated with the one targeted for interven- …
VIEWPOINT
Chirag J. Patel, PhD: Center for Biomedical Informatics, Harvard Medical School, Boston, Massachusetts.
John P. A. Ioannidis, MD, DSc: Stanford Prevention Research Center, Department of Health Research and Policy, Department of Medicine, Stanford University School of Medicine, Stanford, California, Department of Statistics, Stanford Humanities and Sciences, Stanford, California, and Meta-Research Innovation Center at Stanford (METRICS), Stanford, California.
JAMA, 2014
JECH, 2014
Pac Symp Biocomput, 2015
How can we study the elusive environment in larger scale for
biomedical discovery?
High-throughput ascertainment of endogenous indicators of environmental
exposure that may reflect the exposome increasingly attract attention,
and their performance needs to be carefully evaluated.
These include chemical detection of indicators of exposure through
metabolomics, proteomics, and biosensors [7]. Eventually, patterns of …
… US federally funded gene expression experiment data be d…
…ited in public repositories such as the Gene Expression Omnibus …
repository has been instrumental in development of technolo…
measurement of gene expression, data standardization, and …
of data for discovery. Just as with the Gene Expression Omnibus …
Figure. Correlation Interdependency Globes for 4 Environmental Exposures (Cotinine, Mercury, Cadmium, Trans-β-Carotene) in National Health and
Nutrition Examination Survey (NHANES) Participants, 2003-2004
Panels: A, Serum cotinine (37 total correlations); B, Serum total mercury (42 total correlations); C, Serum cadmium (68 total correlations); D, Serum trans-β-carotene (68 total correlations)
Legend: negative correlation, positive correlation; infectious agents, pollutants, nutrients and vitamins, demographic attributes
Each correlation interdependency globe includes 317 environmental exposures
represented by the nodes around the periphery of the globe. Pairwise correlations
are depicted by edges (lines) between the node of interest (arrowhead) and other
nodes. Correlations with absolute values exceeding 0.2 are shown (stronge…
The size of each node is proportional to the number of edges for a node, and
thickness of each edge indicates the magnitude of the correlation.
•bioinformatics to connect exposome with phenome
•new ‘omics technologies to measure the exposome
•dense correlations

•reverse causality
•confounding
•(longitudinal) publicly available data

Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ

Blue: negative ρ

thickness: |ρ|
for each pair of E:

Spearman ρ

(575 factors: 81,937 correlations)
permuted data to produce

“null ρ”

sought replication in > 1
cohort
Pac Symp Biocomput. 2015

JECH. 2015
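
The per-pair procedure above (Spearman ρ, with permuted data producing a "null ρ") can be sketched for a single pair of exposures. This is a minimal illustration in plain Python, assuming two simulated exposure vectors; the sample size, number of permutations, and the two-sided permutation p-value are illustrative choices rather than the settings used in the cited work.

```python
import random

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_test(x, y, n_perm=500, seed=0):
    """Compare the observed rho with 'null rho' values from shuffled y."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(spearman(x, y_perm)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical pair of correlated exposures.
rng = random.Random(1)
exposure_1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
exposure_2 = [v + rng.gauss(0.0, 1.0) for v in exposure_1]
rho, p_perm = permutation_test(exposure_1, exposure_2)
print(f"rho = {rho:.2f}, permutation p = {p_perm:.3f}")
```

Repeating this for every pair of factors yields the edges drawn in the globes; a rank-based correlation is used here because exposure measurements are often skewed.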

Effective number of
variables:

500 (10% decrease)

Telomere Length All-cause mortality
http://bit.ly/globebrowse
Telomeres vs. all-cause mortality

BD2K Patient-Centered Information Commons
Integrated repositories of individual-level information
PI: Isaac Kohane
http://pic-sure.org

with Paul Avillach, Michael McDuffie, Jeremy Easton-Marks,

Cartik Saravanamuthu and the BD2K PIC-SURE team
NHANES 1999-2006

API available now

http://bit.ly/nhanes_pici
BD2K Patient-Centered Information Commons
NHANES exposome browser

THE PRECISION MEDICINE INITIATIVE
WHAT IS IT?
Precision medicine is an emerging approach for disease
prevention and treatment that takes into account people’s
individual variations in genes, environment, and lifestyle.
The Precision Medicine Initiative will generate the
scientific evidence needed to move the concept of
precision medicine into clinical practice.
WHY NOW?
The time is right because of:
Sequencing
of the human
genome
Improved
technologies for
biomedical analysis
New tools
for using large
datasets
NEAR TERM GOALS
Intensify efforts to apply precision medicine to cancer.
http://www.nih.gov/precisionmedicine

Committee on A Framework for Developing a
New Taxonomy of Disease
Board on Life Sciences
Division on Earth and Life Studies
NRC, National Academy of Sciences 2011
The use of multiple molecular parameters to
characterize disease [P] may lead to a more
accurate and fine-grained classification of
disease [P]…
“multiple molecular parameters” must include E!

P
We are many phenotypes simultaneously:

Can we better categorize these P?
Body Measures

Body Mass Index

Height
Blood pressure & fitness

Systolic BP

Diastolic BP

Pulse rate

VO2 Max
Metabolic

Glucose

LDL-Cholesterol

Triglycerides
Inflammation

C-reactive protein

white blood cell count
Kidney function

Creatinine

Sodium

Uric Acid
Liver function

Aspartate aminotransferase

Gamma glutamyltransferase
Aging

Telomere length

Creation of a phenotype-exposure association map:
A 2-D view of 83 phenotype by 252 exposure associations
> 0
< 0
Association Size:
Clusters of exposures associated with clusters of phenotypes?
252 biomarkers of exposure × 83 clinical trait phenotypes

NHANES 1999-2000, 2001-2002, 2005-2006

~21K regressions: replicated significant (FDR < 5%) in 2003-2004

adjusted by age, age², sex, race, income, chronic disease

Hugues Aschard, JP Ioannidis
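The scale of the scan described above (252 exposures × 83 phenotypes) and the FDR < 5% filter can be sketched as follows. The slide does not name the FDR procedure; Benjamini-Hochberg is assumed here purely for illustration, and `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the test passes BH at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            max_k = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected


# One regression per exposure-phenotype pair, as on the slide.
n_exposures, n_phenotypes = 252, 83
n_tests = n_exposures * n_phenotypes  # 20,916, i.e. the "~21K regressions"

# Toy p-values (made up) to show the call.
toy_flags = benjamini_hochberg([0.001, 0.8, 0.015, 0.025, 0.5])
```

In the design on the slide, associations passing the filter in the discovery surveys would then be checked again in the held-out 2003-2004 survey.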
83 phenotypes
252 exposures

Alpha-carotene
Alcohol
VitaminEasalpha-tocopherol
Beta-carotene
Caffeine
Calcium
Carbohydrate
Cholesterol
Copper
Beta-cryptoxanthin
Folicacid
Folate,DFE
Foodfolate
Dietaryfiber
Iron
Energy
Lycopene
Lutein+zeaxanthin
MFA16:1
MFA18:1
MFA20:1
Magnesium
Totalmonounsaturatedfattyacids
Moisture
Niacin
PFA18:2
PFA18:3
PFA20:4
PFA22:5
PFA22:6
Totalpolyunsaturatedfattyacids
Phosphorus
Potassium
Protein
Retinol
SFA4:0
SFA6:0
SFA8:0
SFA10:0
SFA12:0
SFA14:0
SFA16:0
SFA18:0
Selenium
Totalsaturatedfattyacids
Totalsugars
Totalfat
Theobromine
VitaminA,RAE
Thiamin
VitaminB12
Riboflavin
VitaminB6
VitaminC
VitaminK
Zinc
NoSalt
OrdinarySalt
a-Carotene
VitaminB12,serum
trans-b-carotene
cis-b-carotene
b-cryptoxanthin
Folate,serum
g-tocopherol
Iron,FrozenSerum
CombinedLutein/zeaxanthin
trans-lycopene
Folate,RBC
Retinylpalmitate
Retinylstearate
Retinol
VitaminD
a-Tocopherol
Daidzein
o-Desmethylangolensin
Equol
Enterodiol
Enterolactone
Genistein
EstimatedVO2max
PhysicalActivity
Doesanyonesmokeinhome?
Total#ofcigarettessmokedinhome
Cotinine
CurrentCigaretteSmoker?
Agelastsmokedcigarettesregularly
#cigarettessmokedperdaywhenquit
#cigarettessmokedperdaynow
#dayssmokedcigsduringpast30days
Avg#cigarettes/dayduringpast30days
Smokedatleast100cigarettesinlife
Doyounowsmokecigarettes...
numberofdayssincequit
Usedsnuffatleast20timesinlife
drink5inaday
drinkperday
days5drinksinyear
daysdrinkinyear
3-fluorene
2-fluorene
3-phenanthrene
1-phenanthrene
2-phenanthrene
1-pyrene
3-benzo[c]phenanthrene
3-benz[a]anthracene
Mono-n-butylphthalate
Mono-phthalate
Mono-cyclohexylphthalate
Mono-ethylphthalate
Mono-phthalate
Mono--hexylphthalate
Mono-isobutylphthalate
Mono-n-methylphthalate
Mono-phthalate
Mono-benzylphthalate
Cadmium
Lead
Mercury,total
Barium,urine
Cadmium,urine
Cobalt,urine
Cesium,urine
Mercury,urine
Iodine,urine
Molybdenum,urine
Lead,urine
Platinum,urine
Antimony,urine
Thallium,urine
Tungsten,urine
Uranium,urine
BloodBenzene
BloodEthylbenzene
Bloodo-Xylene
BloodStyrene
BloodTrichloroethene
BloodToluene
Bloodm-/p-Xylene
1,2,3,7,8-pncdd
1,2,3,7,8,9-hxcdd
1,2,3,4,6,7,8-hpcdd
1,2,3,4,6,7,8,9-ocdd
2,3,7,8-tcdd
Beta-hexachlorocyclohexane
Gamma-hexachlorocyclohexane
Hexachlorobenzene
HeptachlorEpoxide
Mirex
Oxychlordane
p,p-DDE
Trans-nonachlor
2,5-dichlorophenolresult
2,4,6-trichlorophenolresult
Pentachlorophenol
Dimethylphosphate
Diethylphosphate
Dimethylthiophosphate
PCB66
PCB74
PCB99
PCB105
PCB118
PCB138&158
PCB146
PCB153
PCB156
PCB157
PCB167
PCB170
PCB172
PCB177
PCB178
PCB180
PCB183
PCB187
3,3,4,4,5,5-hxcb
3,3,4,4,5-pncb
3,4,4,5-tcb
Perfluoroheptanoicacid
Perfluorohexanesulfonicacid
Perfluorononanoicacid
Perfluorooctanoicacid
Perfluorooctanesulfonicacid
Perfluorooctanesulfonamide
2,3,7,8-tcdf
1,2,3,7,8-pncdf
2,3,4,7,8-pncdf
1,2,3,4,7,8-hxcdf
1,2,3,6,7,8-hxcdf
1,2,3,7,8,9-hxcdf
2,3,4,6,7,8-hxcdf
1,2,3,4,6,7,8-hpcdf
Measles
Toxoplasma
HepatitisAAntibody
HepatitisBcoreantibody
HepatitisBSurfaceAntibody
HerpesII
Albumin, urine
Uric acid
Phosphorus
Osmolality
Sodium
Potassium
Creatinine
Chloride
Total calcium
Bicarbonate
Blood urea nitrogen
Total protein
Total bilirubin
Lactate dehydrogenase LDH
Gamma glutamyl transferase
Globulin
Alanine aminotransferase ALT
Aspartate aminotransferase AST
Alkaline phosphotase
Albumin
Methylmalonic acid
PSA. total
Prostate specific antigen ratio
TIBC, Frozen Serum
Red cell distribution width
Red blood cell count
Platelet count SI
Segmented neutrophils percent
Mean platelet volume
Mean cell volume
Mean cell hemoglobin
MCHC
Hemoglobin
Hematocrit
Ferritin
Protoporphyrin
Transferrin saturation
White blood cell count
Monocyte percent
Lymphocyte percent
Eosinophils percent
C-reactive protein
Segmented neutrophils number
Monocyte number
Lymphocyte number
Eosinophils number
Basophils number
mean systolic
mean diastolic
60 sec. pulse:
60 sec HR
Total Cholesterol
Triglycerides
Glucose, serum
Insulin
Homocysteine
Glucose, plasma
Glycohemoglobin
C-peptide: SI
LDL-cholesterol
Direct HDL-Cholesterol
Bone alkaline phosphotase
Trunk Fat
Lumber Pelvis BMD
Lumber Spine BMD
Head BMD
Trunk Lean excl BMC
Total Lean excl BMC
Total Fat
Total BMD
Weight
Waist Circumference
Triceps Skinfold
Thigh Circumference
Subscapular Skinfold
Recumbent Length
Upper Leg Length
Standing Height
Head Circumference
Maximal Calf Circumference
Body Mass Index
[Color key and histogram: association size (value) from -0.4 to 0.4; count from 0 to 150]
http://bit.ly.com/pemap
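[Heatmap annotations: phenotypes on one axis, exposures on the other; association sign + or -. Labeled clusters: nutrients; BMI, weight, BMD; metabolic; renal function; PCBs; metabolic; blood parameters; hydrocarbons]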
Creation of a phenotype-exposure association map:
A 2-D view of connections between P and E

Body Mass Index
Waist circumference
Trunk fat
Total fat
Weight
Total lean fat
Thigh circumference
Calf circumference
Trunk Lean
Skinfold
CRP
Trans-b-carotene
a-carotene
cis-b-carotene
b-cryptoxanthin
lutein/zeaxanthin
VitaminD
Magnesium
Folate
Vo2Max
PCB180
Cotinine
100cigs
Ciginlast30
Cadmium
Benzene
Toluene
Smokeinhome?
Styrene
Currentsmoker
3-fluorene
2-fluorene
Segmented neutrophils
Monocyte number
Lymphocyte number
Eosinophils number
Basophils number
Homocysteine
Hemoglobin
Pulse rate
http://bit.ly.com/pemap
EWAS-derived phenotype-exposure association map:
Zooming in to WBC and BMI phenotype clusters
[Figure: phenotype-exposure association heatmap, repeated from the earlier map slide with the same exposure and phenotype axis labels; color key: association size from -0.4 to 0.4, sign + or -]

Toward a phenotype-exposure association map:
(Re)-categorizing phenotypes with E
[Dendrogram of phenotypes; distance axis from 7 to 0. Leaf labels:]
liver:Albumin
kidney:Bicarbonate
immunological:Basophils percent
immunological:Lymphocyte percent
immunological:Eosinophils percent
kidney:Phosphorus
liver:Total protein
liver:Aspartate aminotransferase AST
liver:Alanine aminotransferase ALT
body measures:Head Circumference
body measures:Recumbent Length
liver:Lactate dehydrogenase LDH
cancer:Prostate specific antigen ratio
cancer:PSA, free
blood:Transferrin saturation
liver:Total bilirubin
heart:Direct HDL-Cholesterol
immunological:Monocyte percent
bone:Head BMD
body measures:Standing Height
body measures:Upper Leg Length
bone:Total BMD
bone:Lumber Spine BMD
bone:Lumber Pelvis BMD
heart:Triglycerides
heart:LDL-cholesterol
heart:Total Cholesterol
blood:MCHC
blood:TIBC, Frozen Serum
blood:Hematocrit
blood:Hemoglobin
kidney:Potassium
blood:Mean cell hemoglobin
blood:Mean cell volume
kidney:Uric acid
kidney:Blood urea nitrogen
kidney:Total calcium
kidney:Creatinine
blood:Ferritin
blood:Red blood cell count
body measures:Weight
blood:Segmented neutrophils percent
body measures:Total Lean excl BMC
body measures:Trunk Lean excl BMC
body measures:Body Mass Index
body measures:Waist Circumference
body measures:Triceps Skinfold
body measures:Maximal Calf Circumference
body measures:Thigh Circumference
liver:Gamma glutamyl transferase
blood pressure:60 sec. pulse:
metabolic:Insulin
body measures:Total Fat
body measures:Trunk Fat
body measures:Subscapular Skinfold
blood pressure:mean systolic
immunological:C-reactive protein
liver:Globulin
immunological:Monocyte number
immunological:Segmented neutrophils number
immunological:Lymphocyte number
immunological:White blood cell count
immunological:Basophils number
immunological:Eosinophils number
blood:Mean platelet volume
heart:Homocysteine
nutrition:Methylmalonic acid
kidney:Osmolality
kidney:Chloride
kidney:Sodium
kidney:Albumin, urine
blood pressure:60 sec HR
cancer:PSA. total
blood:Platelet count SI
blood:Protoporphyrin
blood:Red cell distribution width
bone:Bone alkaline phosphotase
liver:Alkaline phosphotase
blood pressure:mean diastolic
metabolic:C-peptide: SI
metabolic:Glycohemoglobin
metabolic:Glucose, plasma
metabolic:Glucose, serum
Highlighted clusters: inflammation, adiposity, kidney function, metabolic traits

[Figure: the same phenotype dendrogram (distance axis from 7 to 0), highlighting the “bad” cholesterol and “good” cholesterol clusters]

[Figure: the same phenotype dendrogram (distance axis from 7 to 0), highlighting the height + BMD cluster]

Triglycerides
Total Cholesterol
LDL-cholesterol
Trunk Fat
Albumin, urine
Insulin
Total Fat
Head Circumference
Blood urea nitrogen
Albumin
Homocysteine
C-peptide: SI
C-reactive protein
Body Mass Index
Ferritin
Thigh Circumference
Total calcium
Total bilirubin
Mean cell volume
Uric acid
Protoporphyrin
Hemoglobin
Total protein
Waist Circumference
Hematocrit
Weight
Standing Height
1/Creatinine
Creatinine
Trunk Lean excl BMC
Methylmalonic acid
Triceps Skinfold
Lymphocyte number
Total Lean excl BMC
TIBC, Frozen Serum
Phosphorus
Lumber Pelvis BMD
Glycohemoglobin
Globulin
Chloride
Bicarbonate
60 sec. pulse:
Upper Leg Length
Total BMD
Potassium
Glucose, serum
Glucose, plasma
Lumber Spine BMD
Platelet count SI
MCHC
Osmolality
Monocyte number
mean systolic
Lymphocyte percent
Recumbent Length
Eosinophils number
Monocyte percent
Head BMD
mean diastolic
60 sec HR
Basophils number
Sodium
PSA, free
Eosinophils percent
PSA. total
Basophils percent
[Bar chart axis: R² × 100, from 0 to 40]
1 to 66 exposures identified for 81 phenotypes

Additive effect of E factors:

Describe < 20% of variability in P
(On average: 8%)
$\sigma^2_E$?
Recall: Avg($h^2$) = 50%

Long road ahead to capture $\sigma^2_P$
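One way to read this comparison, written out under the additive model P = G + E from the opening slides. The absence of G-E covariance or interaction is an assumption made here, not something stated on the slide, and $R^2_E$ is a label introduced only for this note to denote the variance explained by the measured exposures:

```latex
\sigma^2_P = \sigma^2_G + \sigma^2_E,
\qquad
H^2 = \frac{\sigma^2_G}{\sigma^2_P} \approx 0.50 \ \text{(average)},
\qquad
R^2_E = \frac{\hat{\sigma}^2_{E,\text{measured}}}{\sigma^2_P} \approx 0.08 \ \text{(average)}
```

Under that reading, the measured exposures account for far less of $\sigma^2_P$ than the average heritability does, which is the gap the slide points to.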

Connecting Environmental Exposure with Disease:
Missing the “System” of Exposures?
E+ E-
diseased
non-
diseased
?
Exposed to many things, but do not assess the multiplicity.
Fragmented literature of associations.
Challenge to discover E associated with disease.

Example of fragmentation:
Is everything we eat associated with cancer?
Schoenfeld and Ioannidis, AJCN 2012
50 random ingredients from
Boston Cooking School
Cookbook
Any associated with cancer?
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studies…
outliers are not shown (effect estimates >10).
Of 50, 40 studied in cancer risk
Weak statistical evidence:

non-replicated

inconsistent effects

non-standardized

[Clipped excerpt from an article page shown on the slide. Legible points: multiple modelling is akin to, but less well understood than, multiple testing; many covariates can be in or out of a regression model, and with ten covariates there are over 1000 models; on publication bias, a paper has a much better chance of acceptance if something new is found, so a claim typically has to rest on a p-value less than 0.05 (P < 0.05).]
Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are
included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific
term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one
can work towards a suitably small p-value. © ktsdesign – Fotolia
A maze of associations is one way to a fragmented
literature and Vibration of Effects
Young, 2011
univariate
sex
sex & age
sex & race
sex & race & age
JCE, 2015

Distribution of associations and p-values due to model choice:
Estimating the Vibration of Effects (or Risk)
Variable of Interest
e.g., 1 SD of log(serum Vitamin D)
Adjusting Variable Set
n=13
All-subsets Cox regression
$2^{13} + 1 = 8{,}193$ models
SES [3rd tertile]
education [>HS]
race [white]
body mass index [normal]
total cholesterol
any heart disease
family heart disease
any hypertension
any diabetes
any cancer
current/past smoker [no smoking]
drink 5/day
physical activity
Data Source
NHANES 1999-2004
417 variables of interest
time to death
N ≥ 1000 (≥ 100 deaths)
effect sizes
p-values
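The all-subsets design above can be sketched in a few lines. The Cox regression itself is omitted. Two things here are assumptions rather than statements from the slide: that RHR is the ratio of the 99th to the 1st percentile hazard ratio, and that RP is the difference between the 99th and 1st percentile of −log10(p). The helper names are hypothetical, and the enumeration yields the 2^13 subsets only; the one additional model counted on the slide is not interpreted here.

```python
import math
from itertools import combinations

# The 13 adjusting variables listed on the slide (labels shortened).
ADJUSTERS = [
    "SES", "education", "race", "body mass index", "total cholesterol",
    "any heart disease", "family heart disease", "any hypertension",
    "any diabetes", "any cancer", "smoking", "drink 5/day",
    "physical activity",
]


def all_subsets(items):
    """Yield every subset of items, from the empty set to the full set."""
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield combo


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def vibration_summary(hazard_ratios, p_values):
    """Return (RHR, RP) under the percentile definitions assumed above."""
    rhr = percentile(hazard_ratios, 99) / percentile(hazard_ratios, 1)
    neg_log_p = [-math.log10(p) for p in p_values]
    rp = percentile(neg_log_p, 99) - percentile(neg_log_p, 1)
    return rhr, rp


n_models = sum(1 for _ in all_subsets(ADJUSTERS))  # 2**13 adjustment sets
```

In use, each subset would define one Cox model for the variable of interest, and the resulting hazard ratios and p-values would be passed to `vibration_summary`.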
[Figure: −log10(p-value) across models for Vitamin D (1SD(log)). Numbered points mark the median p-value/HR for models with k adjusting variables; lines are percentile indicators (1, 50, 99); panel callouts A to E. RHR = 1.14; RPvalue = 4.68]
JCE, 2015
[Figure: −log10(p-value) versus hazard ratio, with numbered points k = 0 to 13 and percentile indicators (1, 50, 99). Vitamin D panel (hazard ratio axis 0.64 to 0.76): RHR = 1.14, RP = 4.68. Thyroxine (1SD(log)) panel (hazard ratio axis 0.75 to 0.90): RHR = 1.15, RP = 2.90]

The Vibration of Effects:
Vitamin D and Thyroxine and attenuated risk in mortality
JCE, 2015

JCE, 2015
Janus (two-faced) risk profile
Risk and significance depend on modeling scenario!
The Vibration of Effects: beware of the Janus effect

(both risk and protection?!)
“risk” “protection”
“significant”
Britannica.com

http://bit.ly/effectvibration

Emerging technologies to ascertain exposome will enable
biomedical discovery
High-throughput E data standards & exposome:

mitigate fragmented literature of associations
Confounding, reverse causality:

how to handle at large dimension?
e.g., EWASs in telomere length and mortality

and 81 quantitative phenotypes
Prioritize biological and epidemiological studies.

New ways of measuring P are here now!

Can we use them to assess E (and G)?

physical activity monitors

(Fitbit)
smart devices

(iOS)
personal E sensors

(exposome band?!)
propeller health

Now possible to consent thousands of people at the push
of a button! http://researchkit.org

Possible to survey P of diabetics consented through
ResearchKit?
Adam Brown
Stanley Shaw (MGH)

Dennis Ausiello (MGH)
http://bit.ly/glucosuccess

Demographics
age, sex, etc
Diabetes Indicators
Hemoglobin A1C

glucose (fasting, bedtime)
Passive Activity
Motion

Step count
N
~4000 diabetics

186K manual glucose entries

7.6M passive step count entries

Age (years): 43.6

Male %: 80%

Female %: 20%
Race (%):
White: 57%

Black: 7%

Hispanic: 11%

Other: 25%
Education (%):
Some High School: 2%

High School: 8%

Some college: 20%

2-year college: 10%

4 year college: 26%

Post-college: 32%
Mean Years Diabetic: 7.8
GlucoSuccess has captured a unique population quickly

(< 1 year of surveillance)
Comorbidities (CDC*)

Stroke: 2% (0.7%)

Heart Failure: 2% (1%)

High Blood Pressure: 47% (57%)

High Lipids: 36% (58%)

Kidney Disease: 4% (0.2%*)

Circulation problems: 8% (4%)

Eye problems: 9% (17%*)

*end-stage renal disease

*visual impairment
http://www.cdc.gov/diabetes
Body Mass Index: 31
Hemoglobin A1C: 7.7

Is step count on previous day associated with fasting
glucose the next day?:

mashing up 24K step counts with glucose (N=600)
10,000 steps ~ 1.5 mg/dL (random-effects linear model)

$p < 1 \times 10^{-16}$
y-axis: glucose, day N (mg/dL)
x-axis: steps (in 1000s), day N-1
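The "mash-up" of previous-day steps with next-day glucose can be sketched as below. This is not the random-effects model named on the slide: for a dependency-free illustration it swaps in a simpler within-person (demeaned) least-squares slope. The data layout, the function names, and the toy numbers, including the sign of the toy slope, are all made up.

```python
from datetime import date, timedelta


def lagged_pairs(steps, glucose):
    """Pair glucose on day N with steps on day N-1 for the same person.

    steps, glucose: dicts mapping (person_id, date) -> value.
    Returns a list of (person_id, steps_in_thousands, glucose_mg_dl).
    """
    pairs = []
    for (person, day), g in glucose.items():
        prev = (person, day - timedelta(days=1))
        if prev in steps:
            pairs.append((person, steps[prev] / 1000.0, g))
    return pairs


def within_person_slope(pairs):
    """Pooled slope after centering x and y on each person's own mean."""
    by_person = {}
    for person, x, y in pairs:
        by_person.setdefault(person, []).append((x, y))
    sxy = sxx = 0.0
    for obs in by_person.values():
        mx = sum(x for x, _ in obs) / len(obs)
        my = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mx) * (y - my)
            sxx += (x - mx) ** 2
    return sxy / sxx


# Toy data: glucose = personal baseline - 0.15 mg/dL per 1000 steps.
d = date(2015, 6, 1)
steps = {("a", d): 4000, ("a", d + timedelta(1)): 8000,
         ("a", d + timedelta(2)): 6000,
         ("b", d): 10000, ("b", d + timedelta(1)): 2000}
glucose = {("a", d + timedelta(1)): 119.4, ("a", d + timedelta(2)): 118.8,
           ("a", d + timedelta(3)): 119.1,
           ("b", d + timedelta(1)): 148.5, ("b", d + timedelta(2)): 149.7}

pairs = lagged_pairs(steps, glucose)
slope = within_person_slope(pairs)  # mg/dL per 1000 steps
```

Centering within each person removes differences in baseline glucose between participants, which is roughly the role the per-person random effect plays in the model on the slide.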

GlucoSuccess-like apps can enable longitudinal and
dynamic surveillance of P
However: population-level differences and generalizability

Possible to (re-)use high-throughput data (exposome, medical
claims, devices) to discover the role of E (and G) in P.
[Figures: Manhattan-style plot of −log10(p-value) grouped by exposure category (e.g., cotinine, heavy metals, nutrients, PCBs, pesticides, phthalates, volatile compounds); correlation globes for (A) serum cotinine and (B) serum total mercury, with panel counts as extracted: 37, 42, 68, and 68 total correlations]
Infectious
agents
Pollutants
Nutrients
and vitamins
Demographic
attributes
P = G + E

Harvard DBMI
Isaac Kohane

Susanne Churchill

Stan Shaw

Nathan Palmer

Jenn Grandfield

Sunny Alvear

Michal Preminger

Harvard Chan
Hugues Aschard

Francesca Dominici

Chirag J Patel

chirag@hms.harvard.edu

@chiragjp

www.chiragjpgroup.org
NIH Common Fund

Big Data to Knowledge
Acknowledgements
CDC
Marta Gwinn
Ridgely Green
Muin Khoury
Denise Lowe
Stanford
John Ioannidis

Atul Butte (UCSF)

U Queensland
Jian Yang

Peter Visscher

Cochrane
Belinda Burford
RagGroup
Chirag Lakhani
Adam Brown
Danielle Rasooly

Arjun Manrai

Erik Corona

Nam Pho
