Japanese Environmental Children's Study and Data-driven E

Building a search engine to ﬁnd
environmental factors associated with
disease and health
Chirag J Patel

IEA-WCE 2017 Symposium

Saitama, Japan

8/20/17
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org

P = G + EType 2 Diabetes

Cancer

Alzheimer’s

Gene expression
Phenotype Genome
Variants
Environment
Infectious agents

Diet + Nutrients

Pollutants

Drugs

We are great at G investigation!
2,940 (as of 6/1/17)
36,066 G-P associations

Genome-wide Association Studies (GWAS)

https://www.ebi.ac.uk/gwas/

G

Nothing comparable to elucidate E inﬂuence!
E: ???
We lack high-throughput methods
and data to discover new E in P…

A similar paradigm for discovery should exist

for E!
Why?

σ2
G
σ2P
H2 =
Heritability (H2) is the range of phenotypic
variability attributed to genetic variability in a
population
Indicator of the proportion of phenotypic
diﬀerences attributed to G.

Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Polycystic ovary syndrome
Attention deficit hyperactivity disorder
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Bone mineral density
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Rheumatoid arthritis
Crohn's disease
Migraine
Thyroid cancer
Autism
Blood pressure, diastolic
Body mass index
Depression
Coronary artery disease
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Blood pressure, systolic
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
0 25 50 75 100
Heritability: Var(G)/Var(Phenotype)
Source: SNPedia.com
G estimates for burdensome diseases are low and variable:
massive opportunity for high-throughput E discovery
Type 2 Diabetes
Heart Disease
Autism (50%???)

Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Crohn's disease
Migraine
Thyroid cancer
Autism
Body mass index
Depression
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
0 25 50 75 100
Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com
G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
σ2
E : Exposome!

It took a new paradigm of GWAS for discovery:
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:
http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-current day
High-throughput variant
assay
< $99 for ~1M variants
Measurement tools
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000
cases of seven common diseases and
3,000 shared controls
The Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the
identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip
500K Mapping Array Set) undertaken in the British population, which has examined ,2,000 individuals for each of 7 major
diseases and a shared set of ,3,000 controls. Case-control comparisons identified 24 independent association signals at
P , 5 3 1027
: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1
diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these
signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found
compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a
25 27
Vol 447|7 June 2007|doi:10.1038/nature05911
WTCCC, Nature, 2008.
Comprehensive, high-throughput analyses
GWAS

What is a Genome-Wide Association Study (GWAS)?:
Data-driven search for G factors in P
evolut
partic
eases;
tase 1)
well a
biolog
The
captur
implem
STRU
revert
subset
librium
clearly
−log10(P)
0
5
10
15
Chromosome
22
X
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
80
60
40
100
rvedteststatistic
a
b
NATURE|Vol 447|7 June 2007
WTCCC Nature, 2007
AA Aa aa
case
control
Robust, transparent, and comprehensive search for G in P

evolu
parti
eases
tase 1
well
biolo
Th
captu
imple
STRU
rever
subse
libriu
clearl
−log10(P)
0
5
10
15
Chromosome
22
X
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
80
60
40
100
ervedteststatistic
a
b
NATURE|Vol 447|7 June 2007
comprehensive
and transparent
multiplicity
controlled
novel
ﬁndings
(and validated)
JAMA 2014
JECH 2014
Why carry out a Genome-Wide Association Study:
Analytically robust, transparent, and comprehensive

search for G in P

Promises and Challenges in creating a search engine for
identifying E in P
Studying the Elusive Environment in Large Scale
Itispossiblethatmorethan50%ofcomplexdiseaserisk
isattributedtodifferencesinanindividual’senvironment.1
Airpollution,smoking,anddietaredocumentedenviron-
mental factors affecting health, yet these factors are but
a fraction of the “exposome,” the totality of the exposure
loadoccurringthroughoutaperson’slifetime.1
Investigat-
ing one or a handful of exposures at a time has led to a
highly fragmented literature of epidemiologic associa-
tions. Much of that literature is not reproducible, and se-
lectivereportingmaybeamajorreasonforthelackofre-
producibility. A new model is required to discover
environmental exposures associated with disease while
mitigating possibilities of selective reporting.
Toremedythelackofreproducibilityandconcernsof
validity, multiple personal exposures can be assessed si-
multaneously in terms of their association with a condi-
tion or disease of interest; the strongest associations can
then be tentatively validated in independent data sets
(eg, as done in references 2 and 3).2,3
The main advan-
tages of this process include the ability to search the list
ofexposuresandadjustformultiplicitysystematicallyand
reportalltheprobedassociationsinsteadofonlythemost
significant results. The term “environment-wide associa-
tion studies” (EWAS) has been used to describe this ap-
proach (an analogy to genome-wide association stud-
ies).Forexample,Wangetal4
screenedmorethan2000
chemicalsinserumtodiscoverendogenousexposuresas-
sociated with risk for cardiovascular disease.
Therearenotablehurdlesinanalyzing“big”environ-
mental data. These same problems affect epidemiology
the EWAS vantage point, intervening on β-carotene
(Figure, D) seems a futile exercise given its complex rela-
tionship with other nutrients and pollutants.
Giventhiscomplexity,howcanstudiesofenvironmen-
talriskmoveforward?First,EWASanalysesshouldbeap-
pliedtomultipledatasets,andconsistencycanbeformally
examinedforallassessedcorrelations.Second,thetempo-
ral relationship between exposure and changes in health
parametersmayofferhelpfulhintsaboutwhichofthesig-
nalsaremorethansimplecorrelations.Third,standardized
adjustedanalyses,inwhichadjustmentsareperformedsys-
tematicallyandinthesamewayacrossmultipledatasets,
may also help. This is in stark contrast with the current
model,wherebymostepidemiologicstudiesusesingledata
setswithoutreplicationaswellasnon–time-dependentas-
sessments,andreportedadjustmentsaremarkedlydiffer-
entacrossreportsanddatasets,eventhoseperformedby
thesameteam(differentapproachesincreasevaliditybut
mustbereconciledandassimilated).
However, eventually for most environmental cor-
relates,theremaybeunsurpassabledifficultyestablish-
ing potential causal inferences based on observational
data alone. Factors that seem protective may some-
times be tested in randomized trials. The complexity of
the multiple correlations also highlights the challenge
thatinterveningtomodify1putativeriskfactoralsomay
inadvertently affect multiple other correlated factors.
Even when a seemingly simple intervention is tested in
randomizedtrials(affectingasingleriskfactoramongthe
manycorrelations),theinterventionisnotreallysimple.
VIEWPOINT
Chirag J. Patel, PhD
Center for Biomedical
Informatics, Harvard
Medical School,
Boston, Massachusetts.
John P. A. Ioannidis,
MD, DSc
Stanford Prevention
Research Center,
Department of Health
Research and Policy,
Department of
Medicine, Stanford
University School of
Medicine, Stanford,
California, Department
of Statistics, Stanford
University School of
Humanities and
Sciences, Stanford,
California, and
Meta-Research
Innovation Center at
Stanford (METRICS),
Stanford, California.
Opinion
JAMA 2014
ARPH 2016
JECH 2014

Promises and Challenges in creating a search engine for E
in P
High-throughput E = discovery!

systematic; reproducible

multiple hypothesis control

prioritization
Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Crohn's disease
Migraine
Thyroid cancer
Autism
Body mass index
Depression
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
0 25 50 75 100
Arjun Manrai
(Yuxia Cui, David Balshaw)
ARPH 2016

JAMA 2014

JECH 2014
σ2
E : Exposome!

Examples of exposome-driven discovery machinery, or
EWASs

Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey1
since the 1960s
now biannual: 1999 onwards
10,000 participants per survey
The sample for the survey is selected to represent
the U.S. population of all ages. To produce reli-
able statistics, NHANES over-samples persons 60
and older, African Americans, and Hispanics.
Since the United States has experienced dramatic
growth in the number of older people during this
century, the aging population has major impli-
cations for health care needs, public policy, and
research priorities. NCHS is working with public
health agencies to increase the knowledge of the
health status of older Americans. NHANES has a
primary role in this endeavor.
All participants visit the physician. Dietary inter-
views and body measurements are included for
everyone. All but the very young have a blood
sample taken and will have a dental screening.
Depending upon the age of the participant, the
rest of the examination includes tests and proce-
dures to assess the various aspects of health listed
above. In general, the older the individual, the
more extensive the examination.
Survey Operations
Health interviews are conducted in respondents’
homes. Health measurements are performed in
specially-designed and equipped mobile centers,
which travel to locations throughout the country.
The study team consists of a physician, medical
and health technicians, as well as dietary and health
interviewers. Many of the study staff are
bilingual (English/Spanish).
An advanced computer system using high-
end servers, desktop PCs, and wide-area
networking collect and process all of the
NHANES data, nearly eliminating the need
for paper forms and manual coding operations.
This system allows interviewers to use note-
book computers with electronic pens. The staff
at the mobile center can automatically transmit
data into data bases through such devices as
digital scales and stadiometers. Touch-sensi-
tive computer screens let respondents enter
their own responses to certain sensitive ques-
tions in complete privacy. Survey information
is available to NCHS staff within 24 hours of
collection, which enhances the capability of
collecting quality data and increases the speed
with which results are released to the public.
In each location, local health and government
officials are notified of the upcoming survey.
Households in the study area receive a letter
from the NCHS Director to introduce the
survey. Local media may feature stories about
the survey.
NHANES is designed to facilitate and en-
courage participation. Transportation is provided
to and from the mobile center if necessary.
Participants receive compensation and a report
of medical findings is given to each participant.
All information collected in the survey is kept
strictly confidential. Privacy is protected by
public laws.
Uses of the Data
Information from NHANES is made available
through an extensive series of publications and
articles in scientific and technical journals. For
data users and researchers throughout the world,
survey data are available on the internet and on
easy-to-use CD-ROMs.
Research organizations, universities, health
care providers, and educators benefit from
survey information. Primary data users are
federal agencies that collaborated in the de-
sign and development of the survey. The
National Institutes of Health, the Food and
Drug Administration, and CDC are among the
agencies that rely upon NHANES to provide
data essential for the implementation and
evaluation of program activities. The U.S.
Department of Agriculture and NCHS coop-
erate in planning and reporting dietary and
nutrition information from the survey.
NHANES’ partnership with the U.S. Environ-
mental Protection Agency allows continued
study of the many important environmental
influences on our health.
• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bron-
chitis, emphysema)
• Sexually transmitted diseases
• Vision
1 http://www.cdc.gov/nchs/nhanes.htm
>250 exposures (serum + urine)
GWAS chip
>85 quantitative clinical traits
(e.g., serum glucose, lipids, body
mass index)
Death index linkage (cause of
death)

Gold standard for breadth of exposure & behavior data:
National Health and Nutrition Examination Survey
Nutrients and Vitamins

vitamin D, carotenes
Infectious Agents

hepatitis, HIV, Staph. aureus
Plastics and consumables

phthalates, bisphenol A
Physical Activity

e.g., stepsPesticides and pollutants

atrazine; cadmium; hydrocarbons
Drugs

statins; aspirin

What E are associated with aging:

all-cause mortality and

telomere length?
Int J Epidem 2013
Int J Epidem 2016

How does it work?:
Searching for exposures and behaviors associated with all-
cause mortality.
NHANES: 1999-2004
National Death Index linked mortality
249 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)
~5.5 years of followup
Cox proportional hazards
baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)
~2.8 years of followup
p < 0.05
Int J Epidem 2013
?
Variance explained (R2):
Proportion of variance in death correlated with E

How does it work?:
Discriminating signal from noise using family-wise error rate with
the False Discovery Rate
Benjamini and Hochberg, J R Stat Soc B 1993

Adjusted Hazard Ratio
-log10(pvalue)
0.4 0.6 0.8 1.0 1.2 1.4 1.6 2.0 2.4 2.8
02468
1
2
3
4
5
67
1 Physical Activity
2 Does anyone smoke in home?
3 Cadmium
4 Cadmium, urine
5 Past smoker
6 Current smoker
7 trans-lycopene
(11) 1
2
3 4
5 6
78
9
10 1112
13 14
1516
1 age (10 year increment)
2 SES_1
3 male
4 SES_0
5 black
6 SES_2
7 SES_3
8 education_hs
9 other_eth
10 mexican
11 occupation_blue_semi
12 education_less_hs
13 occupation_never
14 occupation_blue_high
15 occupation_white_semi
16 other_hispanic
(69)
EWAS in all-cause mortality:
249 exposure/behavior associations in survival
FDR < 5%
sociodemographics
replicated factor
Int J Epidem 2013

-log10(pvalue)
0.4 0.6 0.8 1.0 1.2 1.4 1.6 2.0 2.4 2.8
02468
1
2
3
4
5
67
1 Physical Activity
3 Cadmium
4 Cadmium, urine
5 Past smoker
6 Current smoker
7 trans-lycopene
(11) 1
2
3 4
5 6
78
9
10 1112
13 14
1516
2 SES_1
3 male
4 SES_0
5 black
6 SES_2
7 SES_3
8 education_hs
9 other_eth
10 mexican
13 occupation_never
16 other_hispanic
(69)
EWAS identiﬁes factors associated with all-cause mortality:

Volcano plot of 249 associations
age (10 years)
income (quintile 2)
income (quintile 1)
male
black income (quintile 3)
any one smoke in home?
Multivariate cox (age, sex, income, education, race/ethnicity, occupation [in red])
serum and urine cadmium
[1 SD]
past smoker?
current smoker?serum lycopene
[1SD]
physical activity
[low, moderate, high activity]*
*derived from METs per activity and categorized by Health.gov guidelines R2 ~ 2%

452 associations in Telomere Length:
Polychlorinated biphenyls associated with longer telomeres?!
Int J Epidem 2016
0
1
2
3
4
−0.2 −0.1 0.0 0.1 0.2
effect size
−log10(pvalue)
PCBs
FDR<5%
Trunk Fat
Alk. PhosCRP
Cadmium
Cadmium (urine)cigs per day
retinyl stearate
R2 ~ 1%
VO2 Maxpulse rate
shorter telomeres longer telomeres
adjusted by age, age2, race, poverty, education, occupation
median N=3000; N range: 300-7000
2-8 years

20 more examples:

https://paperpile.com/shared/PtvEae
diabetes

preterm birth

income

blood pressure

lipids

kidney disease

telomere length

mortality

E+ E-
diseased
non-
diseased
?
Versus
It is possible to capture E in high-throughput to create
biomedical hypotheses using tools such as EWAS
candidates comprehensive

Promises and Challenges in creating a search engine for E
in P
High-throughput assays of E!

scalable and standard technologies
ARPH 2016

JAMA 2014

JECH 2014
Big data = big bias!
Confounding; reverse causality

Dense correlational web of E and P
Fragmented and small E-P associations

Inﬂuence of time and life-course
Arjun Manrai
(Yuxia Cui, David Balshaw)

Challenge to scale absolute E due to heterogeneity
and large dynamic range.
Rappaport et al, EHP 2015
Untargeted
Targeted

Example of fragmentation:
Is everything we eat associated with cancer?
Schoenfeld and Ioannidis, AJCN 2012
50 random ingredients from
Boston Cooking School
Cookbook
Any associated with cancer?
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with $10 studie
outliers are not shown (effect estimates .10).
Of 50, 40 studied in cancer risk
Weak statistical evidence:
non-replicated

inconsistent eﬀects

non-standardized

Are all the drugs we take associated with cancer?
Sci Reports 2016
Associated all (~500) drugs prescribed in
entire population of Sweden
(N=9M) with time to cancer
Assessed 2 modeling techniques (Cox and case-crossover)

any cancer:
141 (26%)

prostate:
56 (10%)
breast:
41 (7%)

colon:
14 (3%)
What drugs are associated with time to cancer?

Too many to be plausible (up to 26%!)
Sci Reports 2016
Modest
concordance
between Cox and
case-crossover:
12 out of 141!
Most correlations
small (HR < 1.1);
residual confounding?

Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ

Blue: negative ρ

thickness: |ρ|
for each pair of E:

Spearman ρ

(575 factors: 81,937 correlations)
permuted data to produce

“null ρ”

sought replication in > 1
cohort
Pac Symp Biocomput. 2015

JECH. 2015

Red: positive ρ

Blue: negative ρ

thickness: |ρ|
for each pair of E:

Spearman ρ

(575 factors: 81,937 correlations)
Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
permuted data to produce

“null ρ”

sought replication in > 1
cohort
Pac Symp Biocomput. 2015

JECH. 2015
Eﬀective number of
variables:

500 (10% decrease)

Dense correlational globes:

Factors such as shared environment, income, and
sex inﬂuence co-variation!
BioArxiv (https://doi.org/10.1101/175513
AJE, 2015

JECH, 2014
Jake Chung, PhD
Germaine Buck Louis, PhD
(NICHD)

Browse these and 82 other phenotype-exposome globes!
http://www.chiragjpgroup.org/exposome_correlation

Does my single association between E and P matter?

Does my association between E and P matter in the
entire possible space of associations?
ARPH 2017
Hum Genet 2012
JECH 2014
Curr Epidemiol Rep 2017
Curr Env Health Rep 2016
p-2
20
8
1
6
18
7
10
…
p-1
12
2
5
9
4
16
21
11
13
3
17
14
19
p
15
6 101 1482 7 …1311 12 e54 153 9 e-1e-2
…
…
…
…
E exposure factors
Pphenotypicfactors
which ones to test?

all?

the ones in blue?

E times P possibilities!

how to detect signal from noise?

P
Scaling up the search in multiple phenotypes:

does my single association between E and P matter?
Body Measures

Body Mass Index

Height
Blood pressure & ﬁtness

Systolic BP

Diastolic BP

Pulse rate

VO2 Max
Metabolic

Glucose

LDL-Cholesterol

Triglycerides
Inﬂammation

C-reactive protein

white blood cell count
Kidney function

Creatinine

Sodium

Uric Acid
Liver function

Aspartate aminotransferase

Gamma glutamyltransferase
Aging

Telomere length

Time to death
Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier

Creation of a phenotype-exposure association map:
A 2-D view of 209 phenotype by 514 exposure associations
> 0
< 0
Association Size:
504 E exposure and diet indicators × 209 clinical trait phenotypes

NHANES 1999-2000, 2001-2002, 2005-2006, …, 2011-2012 (8)

Median N: 150-5000 per survey

~83,092 E-P associations!
signiﬁcant associations (FDR < 5%)

adjusted by age, age2, sex, race, income

Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier
209phenotypes
514 exposures

83,092 total associations between E and P

12,237 signiﬁcant associations (6%, in yellow):

Average association size: <1% for 1SD change in E
7%
percent change for 1 SD increase
-6%

Alpha-carotene
Alcohol
VitaminEasalpha-tocopherol
Beta-carotene
Caffeine
Calcium
Carbohydrate
Cholesterol
Copper
Beta-cryptoxanthin
Folicacid
Folate,DFE
Foodfolate
Dietaryfiber
Iron
Energy
Lycopene
Lutein+zeaxanthin
MFA16:1
MFA18:1
MFA20:1
Magnesium
Totalmonounsaturatedfattyacids
Moisture
Niacin
PFA18:2
PFA18:3
PFA20:4
PFA22:5
PFA22:6
Totalpolyunsaturatedfattyacids
Phosphorus
Potassium
Protein
Retinol
SFA4:0
SFA6:0
SFA8:0
SFA10:0
SFA12:0
SFA14:0
SFA16:0
SFA18:0
Selenium
Totalsaturatedfattyacids
Totalsugars
Totalfat
Theobromine
VitaminA,RAE
Thiamin
VitaminB12
Riboflavin
VitaminB6
VitaminC
VitaminK
Zinc
NoSalt
OrdinarySalt
a-Carotene
VitaminB12,serum
trans-b-carotene
cis-b-carotene
b-cryptoxanthin
Folate,serum
g-tocopherol
Iron,FrozenSerum
CombinedLutein/zeaxanthin
trans-lycopene
Folate,RBC
Retinylpalmitate
Retinylstearate
Retinol
VitaminD
a-Tocopherol
Daidzein
o-Desmethylangolensin
Equol
Enterodiol
Enterolactone
Genistein
EstimatedVO2max
PhysicalActivity
Doesanyonesmokeinhome?
Total#ofcigarettessmokedinhome
Cotinine
CurrentCigaretteSmoker?
Agelastsmokedcigarettesregularly
#cigarettessmokedperdaywhenquit
#cigarettessmokedperdaynow
#dayssmokedcigsduringpast30days
Avg#cigarettes/dayduringpast30days
Smokedatleast100cigarettesinlife
Doyounowsmokecigarettes...
numberofdayssincequit
Usedsnuffatleast20timesinlife
drink5inaday
drinkperday
days5drinksinyear
daysdrinkinyear
3-fluorene
2-fluorene
3-phenanthrene
1-phenanthrene
2-phenanthrene
1-pyrene
3-benzo[c]phenanthrene
3-benz[a]anthracene
Mono-n-butylphthalate
Mono-phthalate
Mono-cyclohexylphthalate
Mono-ethylphthalate
Mono-phthalate
Mono--hexylphthalate
Mono-isobutylphthalate
Mono-n-methylphthalate
Mono-phthalate
Mono-benzylphthalate
Cadmium
Lead
Mercury,total
Barium,urine
Cadmium,urine
Cobalt,urine
Cesium,urine
Mercury,urine
Iodine,urine
Molybdenum,urine
Lead,urine
Platinum,urine
Antimony,urine
Thallium,urine
Tungsten,urine
Uranium,urine
BloodBenzene
BloodEthylbenzene
Bloodo-Xylene
BloodStyrene
BloodTrichloroethene
BloodToluene
Bloodm-/p-Xylene
1,2,3,7,8-pncdd
1,2,3,7,8,9-hxcdd
1,2,3,4,6,7,8-hpcdd
1,2,3,4,6,7,8,9-ocdd
2,3,7,8-tcdd
Beta-hexachlorocyclohexane
Gamma-hexachlorocyclohexane
Hexachlorobenzene
HeptachlorEpoxide
Mirex
Oxychlordane
p,p-DDE
Trans-nonachlor
2,5-dichlorophenolresult
2,4,6-trichlorophenolresult
Pentachlorophenol
Dimethylphosphate
Diethylphosphate
Dimethylthiophosphate
PCB66
PCB74
PCB99
PCB105
PCB118
PCB138&158
PCB146
PCB153
PCB156
PCB157
PCB167
PCB170
PCB172
PCB177
PCB178
PCB180
PCB183
PCB187
3,3,4,4,5,5-hxcb
3,3,4,4,5-pncb
3,4,4,5-tcb
Perfluoroheptanoicacid
Perfluorohexanesulfonicacid
Perfluorononanoicacid
Perfluorooctanoicacid
Perfluorooctanesulfonicacid
Perfluorooctanesulfonamide
2,3,7,8-tcdf
1,2,3,7,8-pncdf
2,3,4,7,8-pncdf
1,2,3,4,7,8-hxcdf
1,2,3,6,7,8-hxcdf
1,2,3,7,8,9-hxcdf
2,3,4,6,7,8-hxcdf
1,2,3,4,6,7,8-hpcdf
Measles
Toxoplasma
HepatitisAAntibody
HepatitisBcoreantibody
HepatitisBSurfaceAntibody
HerpesII
Albumin, urine
Uric acid
Phosphorus
Osmolality
Sodium
Potassium
Creatinine
Chloride
Total calcium
Bicarbonate
Blood urea nitrogen
Total protein
Total bilirubin
Lactate dehydrogenase LDH
Gamma glutamyl transferase
Globulin
Alanine aminotransferase ALT
Aspartate aminotransferase AST
Alkaline phosphotase
Albumin
Methylmalonic acid
PSA. total
Prostate specific antigen ratio
TIBC, Frozen Serum
Red cell distribution width
Red blood cell count
Platelet count SI
Segmented neutrophils percent
Mean platelet volume
Mean cell volume
Mean cell hemoglobin
MCHC
Hemoglobin
Hematocrit
Ferritin
Protoporphyrin
Transferrin saturation
White blood cell count
Monocyte percent
Lymphocyte percent
Eosinophils percent
C-reactive protein
Segmented neutrophils number
Monocyte number
Lymphocyte number
Eosinophils number
Basophils number
mean systolic
mean diastolic
60 sec. pulse:
60 sec HR
Total Cholesterol
Triglycerides
Glucose, serum
Insulin
Homocysteine
Glucose, plasma
Glycohemoglobin
C-peptide: SI
LDL-cholesterol
Direct HDL-Cholesterol
Bone alkaline phosphotase
Trunk Fat
Lumber Pelvis BMD
Lumber Spine BMD
Head BMD
Trunk Lean excl BMC
Total Lean excl BMC
Total Fat
Total BMD
Weight
Waist Circumference
Triceps Skinfold
Thigh Circumference
Subscapular Skinfold
Recumbent Length
Upper Leg Length
Standing Height
Head Circumference
Maximal Calf Circumference
Body Mass Index
-0.4 -0.2 0 0.2 0.4
Value
050100150
Color Key
and Histogram
Count
phenotypes
exposures
+- EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E

Alpha-carotene
Alcohol
Beta-carotene
Caffeine
Calcium
Carbohydrate
Cholesterol
Copper
Beta-cryptoxanthin
Folicacid
Folate,DFE
Foodfolate
Dietaryfiber
Iron
Energy
Lycopene
Lutein+zeaxanthin
MFA16:1
MFA18:1
MFA20:1
Magnesium
Moisture
Niacin
PFA18:2
PFA18:3
PFA20:4
PFA22:5
PFA22:6
Phosphorus
Potassium
Protein
Retinol
SFA4:0
SFA6:0
SFA8:0
SFA10:0
SFA12:0
SFA14:0
SFA16:0
SFA18:0
Selenium
Totalsugars
Totalfat
Theobromine
VitaminA,RAE
Thiamin
VitaminB12
Riboflavin
VitaminB6
VitaminC
VitaminK
Zinc
NoSalt
OrdinarySalt
a-Carotene
VitaminB12,serum
trans-b-carotene
cis-b-carotene
b-cryptoxanthin
Folate,serum
g-tocopherol
Iron,FrozenSerum
trans-lycopene
Folate,RBC
Retinylpalmitate
Retinylstearate
Retinol
VitaminD
a-Tocopherol
Daidzein
Equol
Enterodiol
Enterolactone
Genistein
EstimatedVO2max
PhysicalActivity
Cotinine
drink5inaday
drinkperday
days5drinksinyear
daysdrinkinyear
3-fluorene
2-fluorene
3-phenanthrene
1-phenanthrene
2-phenanthrene
1-pyrene
3-benz[a]anthracene
Mono-phthalate
Mono-ethylphthalate
Mono-phthalate
Mono-phthalate
Cadmium
Lead
Mercury,total
Barium,urine
Cadmium,urine
Cobalt,urine
Cesium,urine
Mercury,urine
Iodine,urine
Molybdenum,urine
Lead,urine
Platinum,urine
Antimony,urine
Thallium,urine
Tungsten,urine
Uranium,urine
BloodBenzene
BloodEthylbenzene
Bloodo-Xylene
BloodStyrene
BloodToluene
Bloodm-/p-Xylene
1,2,3,7,8-pncdd
1,2,3,7,8,9-hxcdd
1,2,3,4,6,7,8-hpcdd
1,2,3,4,6,7,8,9-ocdd
2,3,7,8-tcdd
Hexachlorobenzene
HeptachlorEpoxide
Mirex
Oxychlordane
p,p-DDE
Trans-nonachlor
Pentachlorophenol
Dimethylphosphate
Diethylphosphate
PCB66
PCB74
PCB99
PCB105
PCB118
PCB138&158
PCB146
PCB153
PCB156
PCB157
PCB167
PCB170
PCB172
PCB177
PCB178
PCB180
PCB183
PCB187
3,3,4,4,5,5-hxcb
3,3,4,4,5-pncb
3,4,4,5-tcb
2,3,7,8-tcdf
1,2,3,7,8-pncdf
2,3,4,7,8-pncdf
1,2,3,4,7,8-hxcdf
1,2,3,6,7,8-hxcdf
1,2,3,7,8,9-hxcdf
2,3,4,6,7,8-hxcdf
1,2,3,4,6,7,8-hpcdf
Measles
Toxoplasma
HepatitisAAntibody
HerpesII
Albumin, urine
Uric acid
Phosphorus
Osmolality
Sodium
Potassium
Creatinine
Chloride
Total calcium
Bicarbonate
Blood urea nitrogen
Total protein
Total bilirubin
Globulin
Albumin
Methylmalonic acid
PSA. total
TIBC, Frozen Serum
Platelet count SI
Mean cell volume
MCHC
Hemoglobin
Hematocrit
Ferritin
Protoporphyrin
Monocyte percent
Lymphocyte percent
Eosinophils percent
C-reactive protein
Monocyte number
Lymphocyte number
Eosinophils number
Basophils number
mean systolic
mean diastolic
60 sec. pulse:
60 sec HR
Total Cholesterol
Triglycerides
Glucose, serum
Insulin
Homocysteine
Glucose, plasma
Glycohemoglobin
C-peptide: SI
LDL-cholesterol
Trunk Fat
Lumber Pelvis BMD
Lumber Spine BMD
Head BMD
Trunk Lean excl BMC
Total Lean excl BMC
Total Fat
Total BMD
Weight
Waist Circumference
Triceps Skinfold
Thigh Circumference
Recumbent Length
Upper Leg Length
Standing Height
Head Circumference
Body Mass Index
-0.4 -0.2 0 0.2 0.4
Value
050100150
Color Key
and Histogram
Count
phenotypes
exposures
+-
nutrients
BMI,weight,
BMD
metabolic
renalfunction
pcbs
metabolic
bloodparameters
hydrocarbons
EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E
R2: ~1-40% (average of 20%)

Alpha-carotene
Alcohol
Beta-carotene
Caffeine
Calcium
Carbohydrate
Cholesterol
Copper
Beta-cryptoxanthin
Folicacid
Folate,DFE
Foodfolate
Dietaryfiber
Iron
Energy
Lycopene
Lutein+zeaxanthin
MFA16:1
MFA18:1
MFA20:1
Magnesium
Moisture
Niacin
PFA18:2
PFA18:3
PFA20:4
PFA22:5
PFA22:6
Phosphorus
Potassium
Protein
Retinol
SFA4:0
SFA6:0
SFA8:0
SFA10:0
SFA12:0
SFA14:0
SFA16:0
SFA18:0
Selenium
Totalsugars
Totalfat
Theobromine
VitaminA,RAE
Thiamin
VitaminB12
Riboflavin
VitaminB6
VitaminC
VitaminK
Zinc
NoSalt
OrdinarySalt
a-Carotene
VitaminB12,serum
trans-b-carotene
cis-b-carotene
b-cryptoxanthin
Folate,serum
g-tocopherol
Iron,FrozenSerum
trans-lycopene
Folate,RBC
Retinylpalmitate
Retinylstearate
Retinol
VitaminD
a-Tocopherol
Daidzein
Equol
Enterodiol
Enterolactone
Genistein
EstimatedVO2max
PhysicalActivity
Cotinine
drink5inaday
drinkperday
days5drinksinyear
daysdrinkinyear
3-fluorene
2-fluorene
3-phenanthrene
1-phenanthrene
2-phenanthrene
1-pyrene
3-benz[a]anthracene
Mono-phthalate
Mono-ethylphthalate
Mono-phthalate
Mono-phthalate
Cadmium
Lead
Mercury,total
Barium,urine
Cadmium,urine
Cobalt,urine
Cesium,urine
Mercury,urine
Iodine,urine
Molybdenum,urine
Lead,urine
Platinum,urine
Antimony,urine
Thallium,urine
Tungsten,urine
Uranium,urine
BloodBenzene
BloodEthylbenzene
Bloodo-Xylene
BloodStyrene
BloodToluene
Bloodm-/p-Xylene
1,2,3,7,8-pncdd
1,2,3,7,8,9-hxcdd
1,2,3,4,6,7,8-hpcdd
1,2,3,4,6,7,8,9-ocdd
2,3,7,8-tcdd
Hexachlorobenzene
HeptachlorEpoxide
Mirex
Oxychlordane
p,p-DDE
Trans-nonachlor
Pentachlorophenol
Dimethylphosphate
Diethylphosphate
PCB66
PCB74
PCB99
PCB105
PCB118
PCB138&158
PCB146
PCB153
PCB156
PCB157
PCB167
PCB170
PCB172
PCB177
PCB178
PCB180
PCB183
PCB187
3,3,4,4,5,5-hxcb
3,3,4,4,5-pncb
3,4,4,5-tcb
2,3,7,8-tcdf
1,2,3,7,8-pncdf
2,3,4,7,8-pncdf
1,2,3,4,7,8-hxcdf
1,2,3,6,7,8-hxcdf
1,2,3,7,8,9-hxcdf
2,3,4,6,7,8-hxcdf
1,2,3,4,6,7,8-hpcdf
Measles
Toxoplasma
HepatitisAAntibody
HerpesII
Albumin, urine
Uric acid
Phosphorus
Osmolality
Sodium
Potassium
Creatinine
Chloride
Total calcium
Bicarbonate
Blood urea nitrogen
Total protein
Total bilirubin
Globulin
Albumin
Methylmalonic acid
PSA. total
TIBC, Frozen Serum
Platelet count SI
Mean cell volume
MCHC
Hemoglobin
Hematocrit
Ferritin
Protoporphyrin
Monocyte percent
Lymphocyte percent
Eosinophils percent
C-reactive protein
Monocyte number
Lymphocyte number
Eosinophils number
Basophils number
mean systolic
mean diastolic
60 sec. pulse:
60 sec HR
Total Cholesterol
Triglycerides
Glucose, serum
Insulin
Homocysteine
Glucose, plasma
Glycohemoglobin
C-peptide: SI
LDL-cholesterol
Trunk Fat
Lumber Pelvis BMD
Lumber Spine BMD
Head BMD
Trunk Lean excl BMC
Total Lean excl BMC
Total Fat
Total BMD
Weight
Waist Circumference
Triceps Skinfold
Thigh Circumference
Recumbent Length
Upper Leg Length
Standing Height
Head Circumference
Body Mass Index
-0.4 -0.2 0 0.2 0.4
Value
050100150
Color Key
and Histogram
Count
phenotypes
exposures
+- EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E:
does my correlation matter?

EWAS-derived phenotype-exposure association map:
A 2-D view of connections between P and E:
does my correlation matter?
Alpha-carotene
Alcohol
Beta-carotene
Caffeine
Calcium
Carbohydrate
Cholesterol
Copper
Beta-cryptoxanthin
Folicacid
Folate,DFE
Foodfolate
Dietaryfiber
Iron
Energy
Lycopene
Lutein+zeaxanthin
MFA16:1
MFA18:1
MFA20:1
Magnesium
Moisture
Niacin
PFA18:2
PFA18:3
PFA20:4
PFA22:5
PFA22:6
Phosphorus
Potassium
Protein
Retinol
SFA4:0
SFA6:0
SFA8:0
SFA10:0
SFA12:0
SFA14:0
SFA16:0
SFA18:0
Selenium
Totalsugars
Totalfat
Theobromine
VitaminA,RAE
Thiamin
VitaminB12
Riboflavin
VitaminB6
VitaminC
VitaminK
Zinc
NoSalt
OrdinarySalt
a-Carotene
VitaminB12,serum
trans-b-carotene
cis-b-carotene
b-cryptoxanthin
Folate,serum
g-tocopherol
Iron,FrozenSerum
trans-lycopene
Folate,RBC
Retinylpalmitate
Retinylstearate
Retinol
VitaminD
a-Tocopherol
Daidzein
Equol
Enterodiol
Enterolactone
Genistein
EstimatedVO2max
PhysicalActivity
Cotinine
drink5inaday
drinkperday
days5drinksinyear
daysdrinkinyear
3-fluorene
2-fluorene
3-phenanthrene
1-phenanthrene
2-phenanthrene
1-pyrene
3-benz[a]anthracene
Mono-phthalate
Mono-ethylphthalate
Mono-phthalate
Mono-phthalate
Cadmium
Lead
Mercury,total
Barium,urine
Cadmium,urine
Cobalt,urine
Cesium,urine
Mercury,urine
Iodine,urine
Molybdenum,urine
Lead,urine
Platinum,urine
Antimony,urine
Thallium,urine
Tungsten,urine
Uranium,urine
BloodBenzene
BloodEthylbenzene
Bloodo-Xylene
BloodStyrene
BloodToluene
Bloodm-/p-Xylene
1,2,3,7,8-pncdd
1,2,3,7,8,9-hxcdd
1,2,3,4,6,7,8-hpcdd
1,2,3,4,6,7,8,9-ocdd
2,3,7,8-tcdd
Hexachlorobenzene
HeptachlorEpoxide
Mirex
Oxychlordane
p,p-DDE
Trans-nonachlor
Pentachlorophenol
Dimethylphosphate
Diethylphosphate
PCB66
PCB74
PCB99
PCB105
PCB118
PCB138&158
PCB146
PCB153
PCB156
PCB157
PCB167
PCB170
PCB172
PCB177
PCB178
PCB180
PCB183
PCB187
3,3,4,4,5,5-hxcb
3,3,4,4,5-pncb
3,4,4,5-tcb
2,3,7,8-tcdf
1,2,3,7,8-pncdf
2,3,4,7,8-pncdf
1,2,3,4,7,8-hxcdf
1,2,3,6,7,8-hxcdf
1,2,3,7,8,9-hxcdf
2,3,4,6,7,8-hxcdf
1,2,3,4,6,7,8-hpcdf
Measles
Toxoplasma
HepatitisAAntibody
HerpesII
Albumin, urine
Uric acid
Phosphorus
Osmolality
Sodium
Potassium
Creatinine
Chloride
Total calcium
Bicarbonate
Blood urea nitrogen
Total protein
Total bilirubin
Globulin
Albumin
Methylmalonic acid
PSA. total
TIBC, Frozen Serum
Platelet count SI
Mean cell volume
MCHC
Hemoglobin
Hematocrit
Ferritin
Protoporphyrin
Monocyte percent
Lymphocyte percent
Eosinophils percent
C-reactive protein
Monocyte number
Lymphocyte number
Eosinophils number
Basophils number
mean systolic
mean diastolic
60 sec. pulse:
60 sec HR
Total Cholesterol
Triglycerides
Glucose, serum
Insulin
Homocysteine
Glucose, plasma
Glycohemoglobin
C-peptide: SI
LDL-cholesterol
Trunk Fat
Lumber Pelvis BMD
Lumber Spine BMD
Head BMD
Trunk Lean excl BMC
Total Lean excl BMC
Total Fat
Total BMD
Weight
Waist Circumference
Triceps Skinfold
Thigh Circumference
Recumbent Length
Upper Leg Length
Standing Height
Head Circumference
Body Mass Index
-0.4 -0.2 0 0.2 0.4
Value
050100150
Color Key
and Histogram
Count
phenotypes
exposures
+-

High-throughput data analytics to mitigate analytical challenges of
exposome-based research:

Consider multiplicity of hypotheses and correlational web!
Does my correlation matter?
How does my new correlation
compare to the family of correlations?

What is the total variance
explained(σ2
E)?
saturated fatty acids and BMI: 0.5%

does it matter? (i.e., 1.2% is average!)
ρ
ARPH 2016
JAMA 2014
JECH 2015
Explicit in number of hypotheses
tested
False discovery rate;

family-wise error rate;

Report database size!
p-2
20
8
1
6
18
7
10
…
p-1
12
2
5
9
4
16
21
11
13
3
17
14
19
p
15
6 101 1482 7 …1311 12 e54 153 9 e-1e-2
…
…
…
…
E exposure factors
Pphenotypicfactors

Bottom line: high-throughput E research will enable
discovery to explain missing variation in P!
1.) Find elusive E in P and
explain variation of disease risk
2.) Consideration of totality of
evidence:
3.) Reproducible research and increase data literacy.
Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Crohn's disease
Migraine
Thyroid cancer
Autism
Body mass index
Depression
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
0 25 50 75 100
σ2
E : Exposome!

1.) Find elusive E in P and explain
variation of disease risk
evidence:
3.) Reproducible research and increase data literacy.
ρ

1.) Find elusive E in P and explain
variation of disease risk
evidence:
3.) Reproducible research and increase data
literacy.

http://chiragjpgroup.org/exposome-analytics-course
Nam Pho
Please contact me for help or project ideas!

Designing a new children’s study:

(1) Increase sample sizes and make data publicly available
(2) Measure G to discover role of E in P

Designing a new children’s study

(1) Increase sample sizes and make data publicly available
N=500,000 N=500,000
Generate wide interest and visibility

Enhance reproducibility (decrease false positives)

Designing a new children’s study:

(2) Measure G to discover role of E in P
Patel CJ et al, CEBP 2017
Gibson, G. Nature Reviews Genetics 2008
biological function
Gage S et al. PLoS Genetics 2016
GWAS and mendelian randomization
gene-by-environment interactions

In conclusion:

Data science inspired approaches to ascertain exposome and
genome will enable biomedical discovery
Dense correlations, confounding, reverse causality:

how to assess at high dimension?
Understand interacting G and E for causation
Mitigate fragmented literature of associations.
september2011 119
ng
akin to – but less well
ore poorly understood than –
For example, consider the use
n to adjust the risk levels of
o the same background level
n be many covariates, and
iates can be in or out of the
ovariates, there are over 1000
onsider a maze as a metaphor
ure 3). The red line traces the
f the maze. The path through
simple, once it is known.
near regression model, terms
nd taken out of a regression
get a p-value smaller than
can be frozen and the model
after the fact. It is easy to
on of multiple testing and
g can lead to a very large
he example of bisphenol A in
h large search spaces can give
ve p-values somewhere within
ly, authors and consumers are
caught in the headlights and
ue as indicating a real effect.
ed? A new, combined
by now that more than small-
e needed. The entire system
studies and the claims that
hem is no longer functional,
urpose. What can be done to
stem? There are no principled
ways in the literature for dealing with model
selection, so we propose a new, composite
strategy. Following Deming, it is based not
upon the workers – the researchers – but on
the production system managers – the funding
agencies and the editors of the journals where
the claims are reported.
We propose a multi-step strategy to help
bring observational studies under control (see
Table 2). The main technical idea is to split the
data into two data sets, a modelling data set
and a holdout data set. The main operational
idea is to require the journal to accept or reject
the paper based on an analysis of the modelling
data set without knowing the results of applying
the methods used for the modelling set on the
holdout set and to publish an addendum to the
paper giving the results of the analysis of the
holdout set. We now cover the steps, one by one.
1 The data collection and clean-up should
be done by a group separate from the
analysis group. There can be a tempta-
tion on the part of the analyst to do some
exploratory data analysis during the data
clean up. Exploratory analysis could lead
to model selection bias.
tion bias
l recognition that a paper
ter chance of acceptance if
s found. This means that, for
claim in the paper has to
-value less than 0.05. From
of view5
, this is quality by
journals are placing heavy
statistical test rather than
the methods and steps that
sion. As to having a p-value
some might be tempted to
m10
through multiple testing,
ing or unfair treatment of
mbination of the three that
p-value. Researchers can be
devising a plausible story to
finding.
2 The data cleaning team creates a
modelling data set and a holdout set and
gives the modelling data set, less the
item to be predicted, to the analyst for
examination.
P < 0.05
Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are
included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific
term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one
can work towards a suitably small p-value. © ktsdesign – Fotolia
Table 2. Steps 0–7 can be used to help bring the
observational study process into control. Currently
researchers analysing observational data sets are
under no effective oversight
Step Process / Action
0 Data are made publicly available
1 Data cleaning and analysis separate
2 Split sample: A, modelling; and B,
holdout (testing)
3 Analysis plan is written, based on
modelling data only
4 Written protocol, based on viewing
predictor variables of A
5 Analysis of A only data set
6 Journal accepts paper based on A only
7 Analysis of B data set gives Addendum
EWASs in aging: mortality and quantitative traits
-log10(pvalue)
0.4 0.6 0.8 1.0 1.2 1.4 1.6 2.0 2.4 2.8
02468
1
2
3
4
5
67
1 Physical Activity
3 Cadmium
4 Cadmium, urine
5 Past smoker
6 Current smoker
7 trans-lycopene
(11) 1
2
3 4
5 6
78
9
10 1112
13 14
1516
2 SES_1
3 male
4 SES_0
5 black
6 SES_2
7 SES_3
8 education_hs
9 other_eth
10 mexican
13 occupation_never
16 other_hispanic
(69)

Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Crohn's disease
Migraine
Thyroid cancer
Autism
Body mass index
Depression
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
0 25 50 75 100
Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com
Use high-throughput tools and data (e.g., exposome) will
enhance discovery of the role of E (and G) in P.

raj
(ex oﬃcio)
chirag “the better” adam grace danielle yeran
sivateja jake kajalalan
RagGroup Data Science Team:

2 post-docs, 3 PhD, 2 MS, 1 HS, 2 visiting
PhD: systems biology, integrative genomics

MS: statistics (HSPH)

Post-docs: biology, medicine, and mathematics
nam

Harvard DBMI
Susanne Churchill

Nathan Palmer

Sophia Mamousette

Sunny Alvear

Michal Preminger
Chirag J Patel

chirag@hms.harvard.edu

@chiragjp

www.chiragjpgroup.org
NIH Common Fund

Big Data to Knowledge
Acknowledgements
RagGroup
Nam Pho
Jake Chung
Kajal Claypool
Arjun Manrai
Chirag Lakhani
Adam Brown

Danielle Rasooly

Alan LeGoallec

Sivateja Tangirala
IEA-WCE 2017 symposium
Shoji Nakayama
Junya Kasamatsu
Ministry of Environment (Japan)
Mentioned Collaborators
Isaac Kohane

John Ioannidis

Dennis Bier

Hugo Aschard

Japanese Environmental Children's Study and Data-driven E

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Japanese Environmental Children's Study and Data-driven E

Similar to Japanese Environmental Children's Study and Data-driven E (20)

Recently uploaded

Recently uploaded (20)

Japanese Environmental Children's Study and Data-driven E