Methods to enhance the validity of precision guidelines emerging from big data
1. Methods to enhance the validity of
precision guidelines emerging from big data
Chirag J Patel
Lorenzini Foundation; Venice, Italy
06/16/16
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
2. Data streams in public health are getting large!
Capacity to measure and compute becoming high-throughput and cheaper.
3. Data streams in public health are getting large!
Capacity to measure and compute becoming high-throughput and cheaper.
N=500,000
1M genetic variants
1000s of phenotypes
4. Data streams in public health are getting large!
…and alternative data sets (e.g., EMR) are
omnipresent!
image: Stan Shaw (MGH)
5. And new concepts for discovery:
The exposome, an analog of the genome!
what to measure? how to measure?
[Figure (Rappaport and Smith, Science 2010) — Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. External environment: radiation, diet, pollution, infections, drugs, lifestyle, stress. Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora. Toxicologically important classes of exposome chemicals include reactive electrophiles, metals, endocrine disrupters, immune modulators, and receptor-binding proteins. Signatures and biomarkers can detect these agents in blood or serum.]
how to analyze in health?
Wild, 2005
Rappaport and Smith, 2010
Buck-Louis and Sundaram 2012
Miller and Jones, 2014
Patel CJ and Ioannidis JPA, 2014
6. Can we use these big data sources for discovery?
Or, for building guidelines?
7. Many challenges exist in the use of big data for
discovery, guideline development, and causal
research
From Khoury MJ and Ioannidis JPA, "Big data meets public health: Human well-being could benefit from large-scale data if large-scale noise is minimized," Science, 2014:

In 1854, as cholera swept through London, John Snow, the father of modern epidemiology, painstakingly recorded the locations of affected homes. After long, laborious work, he implicated the Broad Street water pump as the source of the outbreak, even without knowing that a Vibrio organism caused cholera. "Today, Snow might have crunched Global Positioning System information and disease prevalence data, solving the problem within hours" (1). That is the potential impact of "Big Data" on the public's health. But the promise of Big Data is also accompanied by claims that "the scientific method itself is becoming obsolete" (2), as next-generation computers, such as IBM's Watson (3), sift through the digital world to provide predictive models based on massive information. Separating the true signal from the gigantic amount of noise is neither easy nor straightforward, but it is a challenge that must be tackled if information is ever to be translated into societal well-being.

The term "Big Data" refers to volumes of large, complex, linkable information (4). […] For nongenomic associations, false alarms due to confounding variables or other biases are possible even with very large-scale studies, extensive replication, and very strong signals (9). Big Data's strength is in finding associations, not in showing whether these associations have meaning. Finding a signal is only the first step.

Even John Snow needed to start with a plausible hypothesis to know where to look, i.e., choose what data to examine. If all he had was massive amounts of data, he might well have ended up with a correlation as spurious as the honey bee–marijuana connection. Crucially, Snow "did the experiment." He removed the handle from the water pump and dramatically reduced the spread of cholera, thus moving from correlation to causation and effective intervention.

How can we improve the potential for Big Data to improve health and prevent disease? One priority is that a stronger epidemiological foundation is needed. Big Data analysis is currently largely based on convenient samples of people or information available on the Internet. When associations are probed between perfectly measured data (e.g., a genome sequence) and poorly measured data (e.g., administrative […]
8. Many challenges exist in the use of big data for
discovery, guideline development, and causal research
Thousands of hypotheses are possible.
Multiplicity of hypotheses.
Big data are observational.
Multiplicity of biases:
confounding, selection, and reverse causality.
Millions of analytic scenarios are possible.
Multiplicity of analytic methods.
9. Big data offers a multiplicity of possible hypotheses!
A few examples from cohort studies
JECH, 2014
10. Big Data offers a multiplicity of possible hypotheses!
Example: cohort database of E exposures and P phenotypes
Hum Genet 2012
JECH 2014
Curr Env Health Rep 2016
[Schematic: a participants × factors data matrix, with E exposure factor columns (e-1, e-2, …, e) and P phenotypic factor columns (p-1, p-2, …, p); a handful of exposure-phenotype cells are highlighted in blue]
E exposure factors
P phenotypic factors
which ones to test?
all?
the ones in blue?
E times P possibilities!
how to detect signal from noise?
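One systematic answer to "which ones to test?" is: all of them, with the multiplicity accounted for explicitly. Below is a minimal Python sketch of such an E × P scan, assuming a pandas DataFrame with one row per participant; the column lists are hypothetical placeholders, and Spearman correlation stands in for whatever association model suits the data.

```python
# Minimal sketch: test every exposure-phenotype pair (E x P tests) and
# control the false discovery rate across the whole grid of hypotheses.
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def exposome_scan(df, exposure_cols, phenotype_cols, fdr=0.05):
    records = []
    for p in phenotype_cols:          # P phenotypic factors
        for e in exposure_cols:       # E exposure factors
            sub = df[[e, p]].dropna()
            rho, pval = stats.spearmanr(sub[e], sub[p])
            records.append((e, p, rho, pval, len(sub)))
    res = pd.DataFrame(records,
                       columns=["exposure", "phenotype", "rho", "pval", "n"])
    # Benjamini-Hochberg across ALL E x P tests -- report the database size!
    reject, qvals, _, _ = multipletests(res["pval"], alpha=fdr, method="fdr_bh")
    res["qval"], res["fdr_significant"] = qvals, reject
    return res.sort_values("qval")
```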
11. Big Data offers a multiplicity of possible hypotheses!
… that depends on the domain (type of measure)!
JECH, 2014
National Health and Nutrition Examination
Survey (NHANES)
12. Big Data = Big Bias:
Confounding, reverse causality, and what
causes what
13. Interdependencies of the variables:
Correlation globes paint a complex view of exposure and
behavior
Red: positive ρ
Blue: negative ρ
thickness: |ρ|
for each pair of E:
Spearman ρ
(575 factors: 81,937 correlations)
permuted data to produce
“null ρ”
sought replication in > 1
cohort
JAMA 2014
Pac Symp Biocomput. 2015
JECH. 2015
National Health and Nutrition Examination
Survey (NHANES)
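A minimal sketch of the correlation-globe computation, assuming a pandas DataFrame of exposure measurements (one row per participant; all names are placeholders): Spearman ρ for every pair of exposures, with column-wise permutation to produce an empirical "null ρ" distribution.

```python
# Minimal sketch: pairwise Spearman correlations among exposures, plus a
# permutation null obtained by shuffling each column independently.
from itertools import combinations
import numpy as np
import pandas as pd
from scipy import stats

def pairwise_rho(df):
    rows = []
    for a, b in combinations(df.columns, 2):
        sub = df[[a, b]].dropna()
        rho, _ = stats.spearmanr(sub[a], sub[b])
        rows.append((a, b, rho))
    return pd.DataFrame(rows, columns=["e1", "e2", "rho"])

def null_rho(df, n_perm=100, seed=0):
    rng = np.random.default_rng(seed)
    nulls = []
    for _ in range(n_perm):
        # Independent shuffles break all real inter-exposure structure.
        shuffled = pd.DataFrame({c: rng.permutation(df[c].values)
                                 for c in df.columns})
        nulls.append(pairwise_rho(shuffled)["rho"].values)
    return np.concatenate(nulls)  # empirical distribution of "null rho"
```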
15. How to enhance the validity of precision guidelines
emerging from big data?
1.) Test systematically, address multiplicity, and
replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase
data literacy
16. Test systematically and replicate.
Examples: “environment-wide” or “nutrient-wide”
association studies
17. A search engine for robust, reproducible genotype-
phenotype associations…
[Figure 1: Genome-wide Manhattan plots (-log10(P) by chromosome) for the DIAGRAM+ stage 1 meta-analysis. Labeled loci include BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, and PPARG. The top panel summarizes the unconditional meta-analysis: previously established loci are denoted in red, loci identified by the current study in green, and the ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals were chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs. Newly discovered conditional signals (outside established loci) are denoted with an orange dot where they show suggestive significance (P < 1 × 10−5); secondary signals close to already confirmed T2D loci are shown in purple (P < 1 × 10−4).]
Voight et al, Nature Genetics 2010
N=8K T2D, 39K Controls
GWAS in Type 2 Diabetes
A prime example of systematic associations:
Genome-wide association studies (GWASs)
18. The same can be achieved with non-genetic factors.
Example: Exposures and behaviors in mortality.
19. Searching for 246 exposures and behaviors associated with all-cause mortality.
NHANES 1999-2004, with National Death Index-linked mortality
246 behaviors and exposures (serum/urine/self-report)
Discovery, NHANES 1999-2001: N = 330 to 6,008 (26 to 655 deaths); ~5.5 years of follow-up; Cox proportional hazards (baseline exposure and time to death); false discovery rate < 5%
Replication, NHANES 2003-2004: N = 177 to 3,258 (20 to 202 deaths); ~2.8 years of follow-up; p < 0.05
Int J Epidem. 2013
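A minimal sketch of this survival scan, assuming the lifelines library, hypothetical column names (followup_yrs, died), and numeric or dummy-coded adjusters: one Cox proportional hazards model per exposure, the same adjustment set throughout, and FDR control across all exposures in the discovery set.

```python
# Minimal sketch: one Cox model per exposure (baseline exposure -> time to
# death), identical adjusters in each model, FDR < 5% across the scan.
import pandas as pd
from lifelines import CoxPHFitter
from statsmodels.stats.multitest import multipletests

ADJUSTERS = ["age", "sex", "ses", "education"]  # hypothetical adjustment set

def mortality_scan(df, exposure_cols, fdr=0.05):
    rows = []
    for e in exposure_cols:
        sub = df[[e, "followup_yrs", "died"] + ADJUSTERS].dropna()
        cph = CoxPHFitter()
        cph.fit(sub, duration_col="followup_yrs", event_col="died")
        term = cph.summary.loc[e]  # fitted coefficient row for the exposure
        rows.append((e, term["exp(coef)"], term["p"], len(sub)))
    res = pd.DataFrame(rows, columns=["exposure", "HR", "pval", "n"])
    reject, qvals, _, _ = multipletests(res["pval"], alpha=fdr, method="fdr_bh")
    res["qval"], res["fdr_significant"] = qvals, reject
    return res.sort_values("pval")
```

Replication would rerun the same loop in the independent survey cycle and require p < 0.05 for the factors that passed the discovery FDR threshold.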
20. Searching >250 environmental and behavioral factors in all-cause mortality
[Volcano plot: adjusted hazard ratio (x-axis) vs. -log10(p-value) (y-axis), with the FDR < 5% threshold marked and replicated factors highlighted]
Replicated factors: physical activity; does anyone smoke in home?; cadmium (serum); cadmium (urine); past smoker; current smoker; trans-lycopene
Sociodemographics: age (10-year increment), SES quintiles, male, black, Mexican, other Hispanic, other ethnicity, education (high school, less than high school), occupation categories
Int J Epidem. 2013
21. Searching >250 environmental and behavioral factors in all-cause mortality
[Same volcano plot, annotated: age (10 years); income quintiles 1-3; male; black; anyone smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*]
*Derived from METs per activity and categorized by Health.gov guidelines
R2 ~ 2%
22. Searching 82 dietary factors in blood pressure:
INTERMAP and NHANES
[Plot of association sizes, FDR < 5%]
R2 ~ 7%
Tzoulaki et al., A Nutrient-Wide Association Study. Circulation, 2012
23. Testing all associations systematically:
Consideration of multiplicity of hypotheses and correlational web!
Explicit in number of hypotheses
tested
False discovery rate;
family-wise error rate;
Report database size!
Does my correlation matter? How does my new
correlation compare to the family of correlations?
e.g., ρ = 0.17 (carotene and diabetes):
is the average ρ much less than 0.17? Greater?
JAMA 2014
JECH 2015
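To make "does my correlation matter?" concrete, one can ask where a new ρ falls within the empirical family of correlations already computed from the database (e.g., the 81,937 pairwise ρ values above). A minimal sketch, with a hypothetical function name:

```python
# Minimal sketch: percentile of a new correlation within the family of
# |rho| values from the database (a higher percentile = more notable).
import numpy as np

def family_percentile(new_rho, family_rhos):
    fam = np.abs(np.asarray(family_rhos, dtype=float))
    fam = fam[~np.isnan(fam)]
    return float(np.mean(fam <= abs(new_rho)))

# e.g., is rho = 0.17 (carotene and diabetes) large relative to the family?
# family_percentile(0.17, globe["rho"])  # `globe` from the pairwise scan above
```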
24. Consideration of a multitude of modeling scenarios.
Example: Vibration of Effects, the empirical
distribution of effect sizes due to model choice
31. The Vibration of Effects: beware of the Janus effect
(both risk and protection?!)
Janus (two-faced) risk profile:
risk and significance depend on the modeling scenario!
[Plot: effect estimates spanning "protection" to "risk," some crossing the "significant" threshold; Janus image: Brittanica.com]
JCE, 2015
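A minimal sketch of a vibration-of-effects computation in the spirit of the JCE 2015 paper: refit the same exposure-outcome Cox model under every subset of candidate adjusters and record how the hazard ratio and p-value move. Column names and the adjuster list are hypothetical placeholders; with k adjusters this fits 2^k models, so keep k modest.

```python
# Minimal sketch: enumerate all adjustment-variable subsets, refit the Cox
# model each time, and summarize the spread ("vibration") of the estimates.
from itertools import combinations
import pandas as pd
from lifelines import CoxPHFitter

def vibration_of_effects(df, exposure, adjusters,
                         duration_col="followup_yrs", event_col="died"):
    results = []
    for k in range(len(adjusters) + 1):
        for combo in combinations(adjusters, k):
            sub = df[[exposure, duration_col, event_col] + list(combo)].dropna()
            cph = CoxPHFitter()
            cph.fit(sub, duration_col=duration_col, event_col=event_col)
            term = cph.summary.loc[exposure]
            results.append({"adjusters": combo,
                            "HR": term["exp(coef)"],
                            "p": term["p"]})
    voe = pd.DataFrame(results)
    # Janus profile: the same exposure looks protective (HR < 1) under some
    # models and harmful (HR > 1) under others.
    janus = bool(voe["HR"].min() < 1.0 < voe["HR"].max())
    return voe, janus
```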
33. Accessible analytics tools and computer
infrastructure exist to enable reproducible research
“Ability to recompute data analytic
results given an observed dataset and
knowledge of the pipeline…”
Leek and Peng, PNAS 2015
(1) Raw data available
(2) Analytics code and documentation are available
(3) Correct analysis methodology
(4) Trained data analysts to execute research
37. In conclusion: Big data promises a multitude of ways
to discover precision guidelines
Thousands of hypotheses are possible.
Multiplicity of hypotheses.
Big Data are observational.
Multiplicity of biases: confounding, selection, and reverse causality.
Millions of analytic scenarios are possible.
Multiplicity of analytic methods.
38. To enhance the validity of big data results, we must:
1.) Test systematically, address multiplicity, and
replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase
data literacy.
39. Harvard DBMI
Isaac Kohane
Susanne Churchill
Stan Shaw
Jenn Grandfield
Sunny Alvear
Michal Preminger
Harvard Chan
Hugues Aschard
Francesca Dominici
Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
NIH Common Fund
Big Data to Knowledge
Acknowledgements
Stanford
John Ioannidis
Atul Butte (UCSF)
U Queensland
Jian Yang
Peter Visscher
Cochrane
Belinda Burford
RagGroup
Chirag Lakhani
Adam Brown
Danielle Rasooly
Arjun Manrai
Erik Corona
Nam Pho
Dennis Bier
Emanuela Folco
Elena Colombo
Lorenzini Foundation