P = G + E
Type 2 Diabetes

Cancer

Alzheimer’s

Gene expression
Phenotype Genome
Variants
Environment
Infectious agents

Diet + Nutrients

Pollutants

Drugs

We are great at G investigation!
2,940 (as of 6/1/17)
36,066 G-P associations

Genome-wide Association Studies (GWAS)

https://www.ebi.ac.uk/gwas/

G

Nothing comparable to elucidate E influence!
E: ???
We lack high-throughput methods and data to discover new E in P…

A similar paradigm for discovery should exist for E!
Why?

$$H^2 = \frac{\sigma^2_G}{\sigma^2_P}$$
Heritability (H²) is the proportion of phenotypic variability attributed to genetic variability in a population.
Indicator of the proportion of phenotypic differences attributed to G.
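One way to read this definition, under the simplifying assumption (not stated on the slide) that G and E contribute additively and independently, with no interaction or covariance term, is as a variance decomposition:

```latex
P = G + E
\;\Rightarrow\;
\sigma^2_P = \sigma^2_G + \sigma^2_E,
\qquad
H^2 = \frac{\sigma^2_G}{\sigma^2_P},
\qquad
\frac{\sigma^2_E}{\sigma^2_P} = 1 - H^2 .
```

Under that assumption, a trait with H² = 0.5 would leave 1 - 0.5 = 0.5 of its phenotypic variance to E.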

Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Polycystic ovary syndrome
Attention deficit hyperactivity disorder
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Bone mineral density
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Rheumatoid arthritis
Crohn's disease
Migraine
Thyroid cancer
Autism
Blood pressure, diastolic
Body mass index
Depression
Coronary artery disease
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Blood pressure, systolic
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
[Figure: heritability, Var(G)/Var(Phenotype), for the traits listed above on a 0-100% axis. Source: SNPedia.com]
G estimates for burdensome diseases are low and variable:
massive opportunity for high-throughput E discovery
Type 2 Diabetes
Heart Disease
Autism (50%???)

[Figure repeated: heritability, Var(G)/Var(Phenotype), by trait on a 0-100% axis. Source: SNPedia.com]
G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
$\sigma^2_E$: Exposome!

It took a new paradigm of GWAS for discovery:
Human Genome Project to GWAS
Sequencing of the genome: 2001
Characterize common variation: HapMap project (http://hapmap.ncbi.nlm.nih.gov/), 2001-current day
Measurement tools: high-throughput variant assay, < $99 for ~1M variants, ~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
The Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a [...]
Vol 447 | 7 June 2007 | doi:10.1038/nature05911
WTCCC, Nature, 2007.
Comprehensive, high-throughput analyses
GWAS

Explaining the other 50%:
A data-driven paradigm for robust discovery of E in disease via
EWAS and the exposome
what to measure? how to measure?
PERSPECTIVES
[Figure labels: Exposome. External environment: radiation, diet, pollution, infections, drugs, life-style, stress. Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora. Chemical classes: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins.]
[...] critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly [...]

[...]ruptors and can be measured through serum [...]

[...]some (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).

Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.

With successful characterization of both [...]
Characterizing the exposome. The exposome represents
the combined exposures from all sources that reach the
internal chemical environment. Toxicologically important
classes of exposome chemicals are shown. Signatures and
biomarkers can detect these agents in blood or serum.
“A more comprehensive view of
environmental exposure is
needed ... to discover major
causes of diseases...”
how to analyze in relation to health?
Wild, 2005, 2012

Rappaport and Smith, 2010, 2011

Buck-Louis and Sundaram 2012

Miller and Jones, 2014

Patel CJ and Ioannidis JPA, 2014

Promises and Challenges in creating a search engine for E
in P
High-throughput E = discovery!

systematic; reproducible

multiple hypothesis control

prioritization
[Figure repeated: heritability, Var(G)/Var(Phenotype), by trait on a 0-100% axis.]
Arjun Manrai
(Yuxia Cui, David Balshaw)
ARPH 2016

JAMA 2014

JECH 2014
$\sigma^2_E$: Exposome!

Examples of exposome-driven discovery machinery, or
EWASs

Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey (NHANES) (1)
since the 1960s
now biennial (two-year cycles): 1999 onwards
10,000 participants per survey
The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.

Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.

All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.

Survey Operations
Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).

An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.

In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.

NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.

Uses of the Data
Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.

Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey.

NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.

• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision
1 http://www.cdc.gov/nchs/nhanes.htm
>250 exposures (serum + urine)
GWAS chip
>85 quantitative clinical traits
(e.g., serum glucose, lipids, body
mass index)
Death index linkage (cause of
death)

Gold standard for breadth of exposure & behavior data:
National Health and Nutrition Examination Survey
Nutrients and vitamins: vitamin D, carotenes
Infectious agents: hepatitis, HIV, Staph. aureus
Plastics and consumables: phthalates, bisphenol A
Physical activity: e.g., steps
Pesticides and pollutants: atrazine; cadmium; hydrocarbons
Drugs: statins; aspirin

What E are associated with aging:

all-cause mortality and

telomere length?
Int J Epidem 2013
Int J Epidem 2016

How does it work?:
Searching for exposures and behaviors associated with all-
cause mortality.
NHANES: 1999-2004
National Death Index linked mortality
246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)
~5.5 years of followup
Cox proportional hazards
baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)
~2.8 years of followup
p < 0.05
Int J Epidem 2013
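The two-stage screen above (FDR < 5% in the discovery surveys, then p < 0.05 in the replication survey) can be sketched as follows. This is a minimal illustration, assuming the Benjamini-Hochberg step-up procedure as the FDR method (the slide does not name one) and assuming the Cox model p-values have already been computed elsewhere. The function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


def two_stage_replication(discovery_p, replication_p, fdr=0.05, rep_alpha=0.05):
    """Keep factors that pass FDR in discovery and p < rep_alpha in replication."""
    passed_discovery = benjamini_hochberg(discovery_p, fdr)
    return [d and (rp < rep_alpha)
            for d, rp in zip(passed_discovery, replication_p)]


# Hypothetical p-values for five exposures (not taken from the study).
discovery = [0.001, 0.008, 0.039, 0.041, 0.60]
replication = [0.20, 0.01, 0.01, 0.01, 0.01]
print(two_stage_replication(discovery, replication))
```

In this toy input only the second factor survives both stages: the first passes discovery but fails replication, and the remaining three do not pass the discovery FDR threshold.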

EWAS in all-cause mortality:
253 exposure/behavior associations in survival
[Figure: volcano plot; x-axis: adjusted hazard ratio (0.4 to 2.8); y-axis: -log10(p-value) (0 to 8).]
Legend: sociodemographics = age, sex, income, education, race/ethnicity, occupation [in red]; replicated factor; FDR < 5%
Labeled exposures and behaviors: 1 Physical Activity; 2 Does anyone smoke in home?; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene
Labeled sociodemographic factors: 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic
Int J Epidem 2013

[Figure repeated: volcano plot of adjusted hazard ratio (x-axis) against -log10(p-value) (y-axis), with a subset of the labeled points.]
EWAS identifies factors associated with all-cause mortality:

Volcano plot of 200 associations
Multivariate Cox (age, sex, income, education, race/ethnicity, occupation [in red])
Annotated sociodemographic factors: age (10 years); income (quintile 1); income (quintile 2); income (quintile 3); male; black
Annotated exposures and behaviors:
anyone smoke in home?
serum and urine cadmium [1 SD]
past smoker?
current smoker?
serum lycopene [1 SD]
physical activity [low, moderate, high activity]*
*derived from METs per activity and categorized by Health.gov guidelines
R² ~ 2%

A few more examples:

https://paperpile.com/shared/PtvEae
diabetes

preterm birth

income

blood pressure

lipids

kidney disease

telomere length

mortality

Promises and Challenges in creating a search engine for E
in P
High-throughput assays of E!

scalable and standard technologies
ARPH 2016

JAMA 2014

JECH 2014
Big data = big bias!
Confounding; reverse causality

Dense correlational web of E and P
Fragmented and small E-P associations

Influence of time and life-course

Challenge to scale absolute E due to heterogeneity
and large dynamic range.
Rappaport et al, EHP 2015
Untargeted
Targeted

• Getting cheaper, but still not “at scale”

• Relative, not absolute

• Identification of chemical analytes is an art

• Detection limits not low enough for E

Example of fragmentation:
Is everything we eat associated with cancer?
Schoenfeld and Ioannidis, AJCN 2012
50 random ingredients from
Boston Cooking School
Cookbook
Any associated with cancer?
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studies [...] outliers are not shown (effect estimates >10).
Of 50, 40 studied in cancer risk
Weak statistical evidence:

non-replicated

inconsistent effects

non-standardized

Are all the drugs we take associated with cancer?
Sci Reports 2016
Associated all (~500) drugs prescribed in
entire population of Sweden
(N=9M) with time to cancer
Assessed 2 modeling techniques (Cox and case-crossover)

any cancer: 141 (26%)
prostate: 56 (10%)
breast: 41 (7%)
colon: 14 (3%)
What drugs are associated with time to cancer?

Too many to be plausible (up to 26%!)
Sci Reports 2016
Modest
concordance
between Cox and
case-crossover:
12 out of 141!
Most correlations
small (HR < 1.1);
residual confounding?

Distribution of associations and p-values due to model choice:
Estimating the Vibration of Effects (or Risk)
Variable of Interest
e.g., 1 SD of log(serum Vitamin D)
Adjusting Variable Set
n=13
All-subsets Cox regression
2^13 + 1 = 8,193 models
SES [3rd tertile]
education [>HS]
race [white]
body mass index [normal]
total cholesterol
any heart disease
family heart disease
any hypertension
any diabetes
any cancer
current/past smoker [no smoking]
drink 5/day
physical activity
Data Source
NHANES 1999-2004
417 variables of interest
time to death
N≧1000 (≧100 deaths)
effect sizes
p-values
[Figure: vibration-of-effects plots; x-axis: hazard ratio; y-axis: −log10(p-value). Points are labeled by the number of adjusting variables k (0 to 13), showing the median p-value/HR for each k, with 1st, 50th, and 99th percentile indicators on both axes.]
Vitamin D (1 SD (log)): RHR = 1.14, RP = 4.68
Thyroxine (1 SD (log)): RHR = 1.15, RP = 2.90
JCE, 2015
http://bit.ly/effectvibration
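The all-subsets idea above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it enumerates every subset of the 13 adjusting variables and summarizes the spread of results. It assumes RHR is the ratio of the 99th to the 1st percentile hazard ratio and RP is the difference between the 99th and 1st percentile of −log10(p-value), which is how the percentile indicators on the plots read. The Cox fits themselves are assumed to be done elsewhere; `all_adjustment_sets`, `percentile`, and `vibration` are hypothetical helper names.

```python
from itertools import combinations
import math

# The 13 adjusting variables listed on the slide (labels shortened).
ADJUSTERS = [
    "SES", "education", "race", "body mass index", "total cholesterol",
    "any heart disease", "family heart disease", "any hypertension",
    "any diabetes", "any cancer", "smoking", "drink 5/day",
    "physical activity",
]


def all_adjustment_sets(adjusters):
    """Enumerate every subset of the adjusters, including the empty set."""
    subsets = []
    for k in range(len(adjusters) + 1):
        subsets.extend(combinations(adjusters, k))
    return subsets


def percentile(values, q):
    """Percentile with linear interpolation; q is between 0 and 100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list")
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def vibration(hazard_ratios, pvalues):
    """Summarize the spread of estimates across models as (RHR, RP)."""
    rhr = percentile(hazard_ratios, 99) / percentile(hazard_ratios, 1)
    neg_log_p = [-math.log10(p) for p in pvalues]
    rp = percentile(neg_log_p, 99) - percentile(neg_log_p, 1)
    return rhr, rp


models = all_adjustment_sets(ADJUSTERS)
print(len(models))  # number of adjustment sets for 13 variables
```

The enumeration yields the 2^13 = 8,192 subsets; the slide's count of 8,193 includes one further model that is not enumerated here. Each subset would be passed to a Cox regression for the variable of interest, and the resulting hazard ratios and p-values fed to `vibration`.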

Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ

Blue: negative ρ

thickness: |ρ|
for each pair of E:

Spearman ρ

(575 factors: 81,937 correlations)
permuted data to produce

“null ρ”

sought replication in > 1
cohort
Pac Symp Biocomput. 2015

JECH. 2015
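The pairwise procedure above (Spearman ρ for each pair of exposures, compared against a permuted "null ρ") can be sketched for a single pair. This is a minimal pure-Python illustration under stated assumptions: ties receive average ranks, the null is built by shuffling one variable, and the p-value is two-sided. The function names, the number of permutations, and the seed are hypothetical choices, not details from the cited papers.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided p-value for rho against a null built by shuffling y."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Running this over every pair of factors, and keeping only pairs that replicate in more than one cohort, would follow the outline on the slide; the multiple-testing correction across pairs is left out of this sketch.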

Effective number of variables: 500 (10% decrease)

Does my single association between E and P matter?

Does my association between E and P matter in the
entire possible space of associations?
ARPH 2017
Hum Genet 2012
JECH 2014
Curr Epidemiol Rep 2017
Curr Env Health Rep 2016
[Figure: grid with E exposure factors (1 … e) on one axis and P phenotypic factors (1 … p) on the other.]
E exposure factors
P phenotypic factors
which ones to test?

all?

the ones in blue?

E times P possibilities!

how to detect signal from noise?
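To give a sense of scale for the questions above, the size of the search space and a per-test significance threshold can be computed directly. This is a small sketch that uses a Bonferroni correction purely as an illustration (the analyses in this deck report FDR instead), and the counts in the example are hypothetical round numbers.

```python
def search_space(n_exposures, n_phenotypes, alpha=0.05):
    """Return (number of E x P tests, Bonferroni per-test threshold)."""
    n_tests = n_exposures * n_phenotypes
    return n_tests, alpha / n_tests


# Hypothetical counts: 500 exposures tested against 200 phenotypes.
n_tests, threshold = search_space(500, 200)
print(n_tests, threshold)
```

With these illustrative counts there are 100,000 candidate associations, so a single association at p = 0.05 is far from notable on its own.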

P
Scaling up the search in multiple (m=157) phenotypes:

does my single association between E and P matter?
Body Measures

Body Mass Index

Height
Blood pressure & fitness

Systolic BP

Diastolic BP

Pulse rate

VO2 Max
Metabolic

Glucose

LDL-Cholesterol

Triglycerides
Inflammation

C-reactive protein

white blood cell count
Kidney function

Creatinine

Sodium

Uric Acid
Liver function

Aspartate aminotransferase

Gamma glutamyltransferase
Aging

Telomere length

Time to death
Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier

Creation of a phenotype-exposure association map:
A 2-D view of 209 phenotype by 514 exposure associations
> 0
< 0
Association Size:
504 E exposure and diet indicators × 209 clinical trait phenotypes

NHANES 1999-2000, 2001-2002, 2005-2006, …, 2011-2012 (8)

Median N: 150-5000 per survey

~83,092 E-P associations!
significant associations (FDR < 5%)

adjusted by age, age², sex, race, income

Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier
209 phenotypes
514 exposures

Alpha-carotene
Alcohol
VitaminEasalpha-tocopherol
Beta-carotene
Caffeine
Calcium
Carbohydrate
Cholesterol
Copper
Beta-cryptoxanthin
Folicacid
Folate,DFE
Foodfolate
Dietaryfiber
Iron
Energy
Lycopene
Lutein+zeaxanthin
MFA16:1
MFA18:1
MFA20:1
Magnesium
Totalmonounsaturatedfattyacids
Moisture
Niacin
PFA18:2
PFA18:3
PFA20:4
PFA22:5
PFA22:6
Totalpolyunsaturatedfattyacids
Phosphorus
Potassium
Protein
Retinol
SFA4:0
SFA6:0
SFA8:0
SFA10:0
SFA12:0
SFA14:0
SFA16:0
SFA18:0
Selenium
Totalsaturatedfattyacids
Totalsugars
Totalfat
Theobromine
VitaminA,RAE
Thiamin
VitaminB12
Riboflavin
VitaminB6
VitaminC
VitaminK
Zinc
NoSalt
OrdinarySalt
a-Carotene
VitaminB12,serum
trans-b-carotene
cis-b-carotene
b-cryptoxanthin
Folate,serum
g-tocopherol
Iron,FrozenSerum
CombinedLutein/zeaxanthin
trans-lycopene
Folate,RBC
Retinylpalmitate
Retinylstearate
Retinol
VitaminD
a-Tocopherol
Daidzein
o-Desmethylangolensin
Equol
Enterodiol
Enterolactone
Genistein
EstimatedVO2max
PhysicalActivity
Doesanyonesmokeinhome?
Total#ofcigarettessmokedinhome
Cotinine
CurrentCigaretteSmoker?
Agelastsmokedcigarettesregularly
#cigarettessmokedperdaywhenquit
#cigarettessmokedperdaynow
#dayssmokedcigsduringpast30days
Avg#cigarettes/dayduringpast30days
Smokedatleast100cigarettesinlife
Doyounowsmokecigarettes...
numberofdayssincequit
Usedsnuffatleast20timesinlife
drink5inaday
drinkperday
days5drinksinyear
daysdrinkinyear
3-fluorene
2-fluorene
3-phenanthrene
1-phenanthrene
2-phenanthrene
1-pyrene
3-benzo[c]phenanthrene
3-benz[a]anthracene
Mono-n-butylphthalate
Mono-phthalate
Mono-cyclohexylphthalate
Mono-ethylphthalate
Mono-phthalate
Mono--hexylphthalate
Mono-isobutylphthalate
Mono-n-methylphthalate
Mono-phthalate
Mono-benzylphthalate
Cadmium
Lead
Mercury,total
Barium,urine
Cadmium,urine
Cobalt,urine
Cesium,urine
Mercury,urine
Iodine,urine
Molybdenum,urine
Lead,urine
Platinum,urine
Antimony,urine
Thallium,urine
Tungsten,urine
Uranium,urine
BloodBenzene
BloodEthylbenzene
Bloodo-Xylene
BloodStyrene
BloodTrichloroethene
BloodToluene
Bloodm-/p-Xylene
1,2,3,7,8-pncdd
1,2,3,7,8,9-hxcdd
1,2,3,4,6,7,8-hpcdd
1,2,3,4,6,7,8,9-ocdd
2,3,7,8-tcdd
Beta-hexachlorocyclohexane
Gamma-hexachlorocyclohexane
Hexachlorobenzene
HeptachlorEpoxide
Mirex
Oxychlordane
p,p-DDE
Trans-nonachlor
2,5-dichlorophenolresult
2,4,6-trichlorophenolresult
Pentachlorophenol
Dimethylphosphate
Diethylphosphate
Dimethylthiophosphate
PCB66
PCB74
PCB99
PCB105
PCB118
PCB138&158
PCB146
PCB153
PCB156
PCB157
PCB167
PCB170
PCB172
PCB177
PCB178
PCB180
PCB183
PCB187
3,3,4,4,5,5-hxcb
3,3,4,4,5-pncb
3,4,4,5-tcb
Perfluoroheptanoicacid
Perfluorohexanesulfonicacid
Perfluorononanoicacid
Perfluorooctanoicacid
Perfluorooctanesulfonicacid
Perfluorooctanesulfonamide
2,3,7,8-tcdf
1,2,3,7,8-pncdf
2,3,4,7,8-pncdf
1,2,3,4,7,8-hxcdf
1,2,3,6,7,8-hxcdf
1,2,3,7,8,9-hxcdf
2,3,4,6,7,8-hxcdf
1,2,3,4,6,7,8-hpcdf
Measles
Toxoplasma
HepatitisAAntibody
HepatitisBcoreantibody
HepatitisBSurfaceAntibody
HerpesII
Albumin, urine
Uric acid
Phosphorus
Osmolality
Sodium
Potassium
Creatinine
Chloride
Total calcium
Bicarbonate
Blood urea nitrogen
Total protein
Total bilirubin
Lactate dehydrogenase LDH
Gamma glutamyl transferase
Globulin
Alanine aminotransferase ALT
Aspartate aminotransferase AST
Alkaline phosphotase
Albumin
Methylmalonic acid
PSA. total
Prostate specific antigen ratio
TIBC, Frozen Serum
Red cell distribution width
Red blood cell count
Platelet count SI
Segmented neutrophils percent
Mean platelet volume
Mean cell volume
Mean cell hemoglobin
MCHC
Hemoglobin
Hematocrit
Ferritin
Protoporphyrin
Transferrin saturation
White blood cell count
Monocyte percent
Lymphocyte percent
Eosinophils percent
C-reactive protein
Segmented neutrophils number
Monocyte number
Lymphocyte number
Eosinophils number
Basophils number
mean systolic
mean diastolic
60 sec. pulse:
60 sec HR
Total Cholesterol
Triglycerides
Glucose, serum
Insulin
Homocysteine
Glucose, plasma
Glycohemoglobin
C-peptide: SI
LDL-cholesterol
Direct HDL-Cholesterol
Bone alkaline phosphotase
Trunk Fat
Lumber Pelvis BMD
Lumber Spine BMD
Head BMD
Trunk Lean excl BMC
Total Lean excl BMC
Total Fat
Total BMD
Weight
Waist Circumference
Triceps Skinfold
Thigh Circumference
Subscapular Skinfold
Recumbent Length
Upper Leg Length
Standing Height
Head Circumference
Maximal Calf Circumference
Body Mass Index
[Figure: heatmap of phenotypes by exposures. Color key and histogram: association value from -0.4 to 0.4 (negative to positive), count from 0 to 150. Annotated clusters: nutrients; BMI, weight, BMD; metabolic; renal function; PCBs; blood parameters; hydrocarbons.]
EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E
R²: ~1-40% (average of 20%)

83,092 total associations between E and P

12,237 significant associations (6%, in yellow):

Average association size: 0.6% for 1 SD change in E
percent change for 1 SD increase (color scale: -6% to 7%)

[Figure repeated: phenotype-exposure association heatmap, with a subset of the exposure and phenotype labels listed above and the same color key.]
EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E:
does my correlation matter?

EWAS-derived phenotype-exposure association map:
[Figure repeated: phenotype-exposure association heatmap, with a subset of the exposure and phenotype labels listed above and the same color key.]

EWAS-derived phenotype-exposure association map:
[Figure repeated: phenotype-exposure association heatmap, with highlighted exposures:]
total sugar
polyunsaturated fats
Vitamin D
Cotinine

High-throughput data analytics to mitigate analytical challenges of
exposome-based research:

Consider multiplicity of hypotheses and correlational web!
Does my correlation matter?
How does my new correlation (ρ) compare to the family of correlations?

What is the total variance explained (σ²_E)?
saturated fatty acids and HbA1c: 0.5%

does it matter? (i.e., 1.2% is average!)
ARPH 2016
JAMA 2014
JECH 2015
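The question "how does my new correlation compare to the family of correlations?" can be sketched in a few lines. This is only an illustration, not the analysis behind the slide: the function names are hypothetical, the family of values is made up, and variance explained is taken as the squared Pearson correlation of a single exposure with a phenotype.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def variance_explained(x, y):
    """Share of variance in y explained by one linear predictor x (rho squared)."""
    return pearson_r(x, y) ** 2

def rank_in_family(new_value, family):
    """Fraction of the family of values that the new value meets or exceeds."""
    return sum(v <= new_value for v in family) / len(family)

# Hypothetical family of variance-explained values from other
# exposure-phenotype pairs (made-up numbers, for illustration only).
family_r2 = [0.001, 0.004, 0.012, 0.020, 0.030]
new_r2 = 0.005  # a finding of 0.5% variance explained

print(rank_in_family(new_r2, family_r2))
```

Placing one finding in the distribution of all findings, rather than reading it alone, is what makes a value such as 0.5% interpretable.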
Be explicit about the number of hypotheses tested
False discovery rate;

family-wise error rate;

Report database size!
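A minimal sketch of the two error-rate controls named above, assuming a plain list of p-values from one family of tests. Bonferroni is used here as one common way to control the family-wise error rate and Benjamini-Hochberg as one common way to control the false discovery rate; the slide does not say which procedures the author uses, and the p-values below are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """Family-wise error rate control: reject where p <= alpha / m."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha=0.05):
    """False discovery rate control (Benjamini-Hochberg step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose sorted p-value is <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

# Made-up p-values for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("tests in the family:", len(pvals))  # report the number of hypotheses
print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(pvals)))
```

Both procedures need the total number of tests as an input, which is one reason to report the database size alongside the findings.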
[Figure: matrix of E exposure factors by P phenotypic factors, with cells numbered 1 through p.]

http://chiragjpgroup.org/exposome-analytics-course
Nam Pho
Please contact me for help or project ideas!

[Figure: heritability, Var(G)/Var(Phenotype), by trait on a 0 to 100 scale, listed from eye color to stomach cancer. Source: SNPedia.com]
Using high-throughput tools and data (e.g., the exposome) will
enhance discovery of the role of E (and G) in P.

In conclusion:

Data-science-inspired approaches to ascertain the exposome and
genome will enable biomedical discovery
Dense correlations, confounding, reverse causality:

how to assess in high dimensions?
Understand interacting G and E for causation
Mitigate the fragmented literature of associations.
[Excerpt from a clipped article page, September 2011, p. 119; the left-hand column is cut off in the source.]

There are no principled ways in the literature for dealing with model selection, so we propose a new, composite strategy. Following Deming, it is based not upon the workers – the researchers – but on the production system managers – the funding agencies and the editors of the journals where the claims are reported.

We propose a multi-step strategy to help bring observational studies under control (see Table 2). The main technical idea is to split the data into two data sets, a modelling data set and a holdout data set. The main operational idea is to require the journal to accept or reject the paper based on an analysis of the modelling data set without knowing the results of applying the methods used for the modelling set on the holdout set and to publish an addendum to the paper giving the results of the analysis of the holdout set. We now cover the steps, one by one.
1 The data collection and clean-up should be done by a group separate from the analysis group. There can be a temptation on the part of the analyst to do some exploratory data analysis during the data clean-up. Exploratory analysis could lead to model selection bias.
2 The data cleaning team creates a modelling data set and a holdout set and gives the modelling data set, less the item to be predicted, to the analyst for examination.
[Figure 3 image: a maze; label in the image: P < 0.05]
Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value. © ktsdesign – Fotolia
Table 2. Steps 0–7 can be used to help bring the observational study process into control. Currently researchers analysing observational data sets are under no effective oversight

Step  Process / Action
0     Data are made publicly available
1     Data cleaning and analysis separate
2     Split sample: A, modelling; and B, holdout (testing)
3     Analysis plan is written, based on modelling data only
4     Written protocol, based on viewing predictor variables of A
5     Analysis of A only data set
6     Journal accepts paper based on A only
7     Analysis of B data set gives Addendum
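Step 2 of the table, the split into a modelling set A and a holdout set B, can be sketched as follows. This is an illustration under stated assumptions rather than the excerpt's procedure: the excerpt gives no split fraction, so the 50/50 default, the fixed seed, and the function name split_sample are all hypothetical.

```python
import random

def split_sample(records, holdout_fraction=0.5, seed=0):
    """Randomly split records into (modelling set A, holdout set B).

    holdout_fraction and seed are illustrative assumptions; a fixed seed
    makes the split reproducible so that it can be audited later.
    """
    rng = random.Random(seed)
    shuffled = list(records)  # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    holdout = shuffled[:n_holdout]
    modelling = shuffled[n_holdout:]
    return modelling, holdout

# Hypothetical participant identifiers standing in for real records.
participants = list(range(100))
set_a, set_b = split_sample(participants)

# Every record lands in exactly one of the two sets.
print(len(set_a), len(set_b))
```

In the workflow of the table, only set A would go to the analyst (steps 3 to 6), and set B would stay untouched until the addendum analysis in step 7.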
EWASs in aging: mortality and quantitative traits
[Figure: scatter plots with -log10(p-value) on the vertical axis (0 to 8) and a horizontal axis running from 0.4 to 2.8. Labeled points, first legend: 1 Physical Activity; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene. Labeled points, second legend: 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 13 occupation_never; 16 other_hispanic.]
?

Harvard DBMI
Susanne Churchill

Nathan Palmer

Sophia Mamousette

Sunny Alvear

Michal Preminger
Chirag J Patel

chirag@hms.harvard.edu

@chiragjp

www.chiragjpgroup.org
NIH Common Fund

Big Data to Knowledge
Acknowledgements
RagGroup
Nam Pho
Jake Chung
Kajal Claypool
Arjun Manrai
Chirag Lakhani
Adam Brown

Danielle Rasooly

Alan LeGoallec

Sivateja Tangirala
Amar Dhand
Center for Complex Networks
Mentioned Collaborators
Isaac Kohane

John Ioannidis

Dennis Bier

Hugo Aschard

Search engine for E NEU network science 080817
